Thank you, and I'm very grateful for the invitation and for the opportunity to talk during the siesta slot, or the Wi-Fi slot, whatever. I should start by saying that this is joint work with a number of people, including Cristóbal Guzmán; Vincent Roulet, who is in the room and is going to get the laser pointer in a second if I can find him, where are you, ah, out there; Nicolas Boumal, who is now at Princeton; and Martin Jaggi, who is now at EPFL. The title changed, but the content remains the same, and the content is a lot simpler than what the title suggests.

So the idea in general is that when you get a complexity bound, it looks roughly like this: you get a certain dependence on the problem dimension and a certain dependence on the target precision. That's a generic bound for a convex optimization problem; in this case you're lucky because the exponent is not too high, etc. But this is roughly what it looks like. If you're lucky, it looks a little bit more like this: now you have a Lipschitz constant, so at least you have some dependence on the problem structure. But clearly one important thing is missing from this complexity bound, and that is the data. The complexity bound you get on the convex optimization problem you're trying to solve only very loosely depends on the problem data. And if you want to study computational trade-offs, or computational complexity versus statistical performance trade-offs in general, the fact that your bound on the complexity of the problem you're studying is so coarse, and so loosely dependent on the structure of your data, is a big issue. Note that this is of course only an upper bound to begin with, so it's hard to make comparisons between the complexity of various optimization problems using only upper bounds. It's even worse, because this upper bound is very loose and, in particular, only loosely dependent on the problem data. And so the general idea behind the two quick
results I'm going to present today is to get closer to data-driven complexity bounds, and hopefully to get closer to something that would be tight, data-driven complexity bounds on optimization problems. If you want to start making comparisons between the complexity of various optimization problems, you need something more than upper bounds: you need tight bounds on the complexity. And if you want to compare problems based on their structure, you need bounds that depend on the problem structure and on the data in a meaningful way. So the material I'm going to present today is an effort in that direction. Of course it's a broad topic, and the two results I'm going to present today are just very specialized results in this direction. But that's the general idea: you want data-driven complexity bounds that hopefully are tight, and that allow you to quantify in a much finer way what happens on the computational side when you're considering and comparing optimization problems coming from statistics, for example.

So again, if you're studying the complexity versus statistical performance trade-off, it helps if you have a fine description of the complexity side, and at this point we don't. You can talk a lot about this trade-off, but at least on the complexity side, our measures of problem complexity, and the way they relate to the problem structure, are really approximate right now, pretty far from the truth, and they suffer from a number of clear deficiencies. Among these deficiencies is affine invariance. It makes sense that if you're going to describe the complexity of a problem, the bound you use to describe this complexity should not be affected by elementary transformations of your problem. In particular, if you make a simple affine change of coordinates, the complexity of your problem should not change; just a renormalization should not have any impact on the
complexity of an optimization problem. Yes? OK, so I guess you're implementing this on machines, and machines are finite precision, so obviously there are limits to what I'm saying. But on the other hand, I see what you mean, good point. It's a matter of robustness: if your problem can be fixed by a simple affine change of coordinates, perhaps your method should be smart enough to recognize it. That's setting the bar a bit high, I guess, maybe, but at least as far as I'm concerned, your algorithm should be robust enough to recognize that the problem you're facing numerically is simply a conditioning issue, and the algorithm itself should be robust to these conditioning issues. The condition number is not affine invariant, but I mean conditioning in a more abstract sense. No, but that's precisely the point: it's a conditioning issue, you can fix it by just changing coordinates. What I mean by affine invariance is that if you make an affine change of coordinates, you don't expect your problem to change in nature. In particular, you haven't done anything to your data; if you're looking at statistics, you're not manufacturing data. It's a simple preprocessing step you could do, etc. So this is something that should be within reach, basically. But I agree that it's partially subjective; there's no imperative reason for this to be true. But if it can be true, we all agree it's a good thing. OK? Is that OK? Feel free to question my motivations even more deeply.

OK, so affine invariance is a good thing, and we want more good things in optimization; that's sort of the essence of the domain. And so that's the first point: I'm going to start by talking about affine invariance.
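To make the motivation concrete, here is a tiny numerical sketch (my own illustration, not an example from the talk): gradient descent on a quadratic slows down dramatically under a simple rescaling of one coordinate, while the Newton step is unaffected by any such affine change of coordinates.

```python
import numpy as np

# Minimize f(x) = 0.5 x^T D x for a diagonal D, before and after rescaling one
# coordinate by 10^6. The rescaling changes nothing about the underlying data,
# yet gradient descent slows to a crawl, while the Newton step -H^{-1} grad f
# solves the quadratic exactly either way.
def grad_descent_suboptimality(diag, iters=100):
    D = np.asarray(diag, dtype=float)
    x = np.ones_like(D)
    step = 1.0 / D.max()                 # standard 1/L step size
    for _ in range(iters):
        x = x - step * (D * x)           # gradient of 0.5 x^T D x is D x
    return 0.5 * np.sum(D * x * x)       # suboptimality f(x) - f*, since f* = 0

well = grad_descent_suboptimality([1.0, 1.0])    # well conditioned: solved exactly
ill = grad_descent_suboptimality([1.0, 1e6])     # rescaled: still far from optimal

def newton_suboptimality(diag):
    D = np.asarray(diag, dtype=float)
    x = np.ones_like(D)
    x = x - (D * x) / D                  # Newton step: Hessian is diag(D)
    return 0.5 * np.sum(D * x * x)       # zero after one step, any scaling

newton_ill = newton_suboptimality([1.0, 1e6])
```

After 100 iterations, gradient descent has made essentially no progress on the rescaled problem, while one Newton step solves both versions exactly.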
Can we get it, and if yes, what does it say? And then, separate but equal in importance: Renegar's condition number and compressed sensing. This goes in a slightly different direction, but in the same spirit. We'll see that in certain cases, like compressed sensing problems, the same quantity drives both computational complexity and statistical performance. So in this particular case we are lucky, because we have a single metric for both computational complexity and sparse recovery thresholds; in that case the problem is solved. In general, of course, it's still open.

OK, so, affine invariance. We're going to consider a very simple generic optimization problem: you minimize a convex function f over a simple compact convex set Q. You're going to assume here that f is convex and smooth, and I'm going to make all this precise later, and that Q is compact, convex, and simple; again, "simple" will be made precise later. That's the general problem we're interested in, and if the set over which we optimize is not too big, we have a complete theory for this problem: Newton's method. At each iteration, Newton's method does a simple thing: it takes a step in the direction given by the gradient, but it normalizes this step by the Hessian. And if you assume that the function is self-concordant, and that the set Q has a self-concordant barrier, you have a very explicit estimate on the complexity of minimizing a convex self-concordant function using Newton's method.
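A minimal sketch of the damped Newton step just described, with the standard backtracking line search parameters alpha and beta. The test function f(x) = c^T x - sum(log x_i) is a simple self-concordant example whose minimizer is x_i = 1/c_i; it is my own illustration, not an example from the talk.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.2, beta=0.5, tol=1e-10, max_iter=100):
    """Damped Newton's method: the step is the gradient normalized by the
    Hessian, with a backtracking line search (parameters alpha, beta)."""
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)      # Newton step
        lam2 = -g @ dx                   # squared Newton decrement
        if lam2 / 2.0 <= tol:            # stopping criterion on the decrement
            break
        t = 1.0                          # backtracking line search
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        x = x + t * dx
    return x

# Self-concordant test problem: minimize c^T x - sum(log x_i) over x > 0.
c = np.array([1.0, 2.0, 4.0])
f = lambda x: np.inf if np.any(x <= 0) else c @ x - np.sum(np.log(x))
grad = lambda x: c - 1.0 / x
hess = lambda x: np.diag(1.0 / x**2)
x_star = damped_newton(f, grad, hess, np.ones(3))   # converges to 1 / c
```

Returning infinity outside the domain makes the backtracking line search automatically reject steps that leave the positive orthant.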
If you haven't seen it before, the self-concordance assumption feels a bit bizarre, to say the least. But if you think about it for a second, you see that it basically does two things at the same time. If you read it in one direction, it's essentially a lower bound on the second derivative of the function, so it's a way of writing strong convexity while maintaining affine invariance in the condition. If you read it in the other direction, it becomes an upper bound on the absolute value of the third derivative of the function, and that's a way of saying that the second derivative will not change too much when you move from one point to another. Optimization in general is just about minimizing quadratic functions in the end, and that's exactly what Newton's method does. So it's kind of intuitive why this is the right set of assumptions: you're using a quadratic model to minimize your function, you want this quadratic model to remain accurate even away from your current point, and that's why you need to bound the third derivative. So this really looks bizarre, but in the end it's simply a way of combining an upper bound on the third derivative with a lower bound on the second derivative, and of doing so in a way that makes the condition invariant with respect to an affine change of coordinates in the point x; this is why you have the exponent 3/2 here. So it's bizarre, but everything has a meaning, and once you make that assumption, you focus on solving a barrier problem; of course, you need the barrier itself to be self-concordant. But basically you can then bound the number of iterations required to get a solution with precision epsilon by some coefficient times the difference between the value of the function at the initial point and its optimum, plus log log of one over epsilon iterations, and for all reasonable values of epsilon,
this log log term is basically three. And this coefficient here depends only on parameters that you set for the line search in the Newton step: you decide what the values of these parameters are, and alpha is typically something like 0.2 and beta 0.5, so this is a fixed quantity. So to summarize, the complexity of Newton's method is basically: the number of iterations you need to reach essentially any precision is bounded by something like 375 (f(x0) - f*) + 6. And that's clean enough for me, I don't know about you. OK, it's an upper bound, empirically valid up to a factor of, let's say, 300, but it's a constant. What's striking is what is not inside this bound. In particular, it's independent of the dimension n, which is pretty spectacular, and again this is empirically valid. And it's also affine invariant, of course, because it only depends on the values of the function. The only thing missing from this bound is the cost of solving the Newton system, the KKT system, but that's just linear algebra. So this is something you can measure very precisely from your data and the structure of your problem: conditioning plus problem structure, sparsity, block structure, etc. You can get a very fine estimate of the complexity of solving the KKT system using simple linear algebra arguments. But doesn't this depend on some conditioning? What do you mean?
The efficiency of solving the system here? It does, if you look at things in a slightly more precise way, and I will at the end of the talk. But overall, the number of iterations varies between 15 and 20, and the cost of every iteration, clearly, you can at least read directly from the data. So you can write the cost of solving the Newton step as a function of the problem structure, dimension, block structure, things like that, and the data conditioning. So you have a completely explicit, data-driven description of the complexity of your problem, and this is what you're looking for: it's affine invariant and it fits the structure of the data very well. In particular, if you want to compare two problems, you just compare their complexity estimates coming from the linear algebra side. So at least you have an upper bound that's useful for making comparisons between the complexity of various problems, which is something you don't have for first-order methods; that's my next point, basically. Yes? Good point, there should probably be a scaling factor somewhere, but it shouldn't make a difference. No, I don't remember exactly, but that's an issue, I know it's an issue, I can't tell you why.

OK, so this is the picture for Newton's method, and essentially the topic is completely closed. We know exactly how to bound the number of iterations the method goes through, and we know exactly how to measure the cost of these iterations. So you have a problem, and you can describe very precisely how much it's going to cost to solve it numerically using Newton's method, meaning that when you have two problems, you know which one is going to be more expensive than the other. So that's the beautiful world of small-scale problems.
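The point that the per-iteration cost is "just linear algebra", readable from the problem structure, can be made concrete. As a hedged sketch (my own example, not from the talk): if the Hessian happens to be diagonal plus low rank, as in many statistical problems, the Newton system can be solved in O(n m^2) instead of O(n^3) via the matrix inversion lemma.

```python
import numpy as np

def solve_diag_plus_lowrank(d, U, b):
    """Solve (diag(d) + U U^T) x = b via the matrix inversion lemma,
    in O(n m^2) for U of shape (n, m), instead of O(n^3) for a dense solve."""
    Dinv_b = b / d
    Dinv_U = U / d[:, None]
    m = U.shape[1]
    core = np.eye(m) + U.T @ Dinv_U      # only a small m x m system to factor
    return Dinv_b - Dinv_U @ np.linalg.solve(core, U.T @ Dinv_b)

rng = np.random.default_rng(0)
n, m = 500, 5
d = 1.0 + rng.random(n)                  # positive diagonal part of the Hessian
U = rng.standard_normal((n, m))          # low-rank part
b = rng.standard_normal(n)               # right-hand side (minus the gradient)

x = solve_diag_plus_lowrank(d, U, b)
x_dense = np.linalg.solve(np.diag(d) + U @ U.T, b)   # same Newton system, dense
```

Both solves return the same Newton step; the structured one is what lets you read the per-iteration cost directly off the problem data.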
Yes. So in addition to that, I've mentioned it several times, but Newton's method is affine invariant, meaning that if you apply an affine change of coordinates to your problem, the iterates of Newton's method applied to the transformed problem will simply be the affine transformations of the iterates of Newton's method on the original problem formulation. In particular, the number of iterations doesn't change, and the complexity bound doesn't change either, so you have an affine invariant complexity bound on an affine invariant procedure. And again, this is a very important feature. The problem is that at a certain point your problem becomes too big, and you can't even solve a single instance of the KKT system, let alone run several iterations of Newton's method. So beyond a certain scale, second-order information is out of reach: you cannot form the Hessian or solve the KKT system, and second-order information is what makes Newton's method affine invariant. So what people do is drop second-order information and use gradients instead, first-order information. But the question is now: can we get similarly clean complexity bounds for first-order methods, bounds that are tight enough to say something truly meaningful about the complexity of optimization problems, and that satisfy some of these invariance properties? And the answer is yes, in certain cases. Just to go back to an algorithm that is a big favorite among some members of the audience: if you look at Frank-Wolfe, for example, which is also known as the conditional gradient method, at each iteration you solve a linear minimization problem whose objective vector is given by the gradient. You get an extreme point of the feasible set, you take convex combinations of these extreme points, and magically you converge to the optimum. Somewhat more magically, you can show that the complexity of the Frank-Wolfe algorithm is bounded by four times C_f divided by the target
precision epsilon, where C_f is a bound on the curvature of your function. So, the good news about C_f: first, this complexity estimate is fairly simple in itself, but the additional piece of good news about this bound is that C_f is invariant with respect to an affine change of coordinates in your problem. So that's excellent news. The only issue is that in many cases the dependence on the target precision epsilon is suboptimal. In particular, when f has a Lipschitz continuous gradient, the lower bound can be as low as one over the square root of epsilon. So your bound is nice, but it's really far from being tight, and it's kind of useless if at some point you want, for example, to compare the complexity of several problems. Yes? Why is it actually affine invariant? I mean, if I stretch or rescale the function by some factor? It's this term here: if you plug it in here, you get affine invariance. The gradient alone seems dependent on the parametrization, but as soon as you take the full differential, yeah. I have another question: when you say tightness of the bound, do you mean tightness of the bound for the algorithm, or for the problem? So, the tightness: you have a certain number of facts about your problem, smoothness of the function, regularity of the set, etc. You have a black box, and then you open the black box a little bit, and you know a few things about your problem. So if you only have curvature, then things are a bit different, but if you know for example that f has a Lipschitz gradient, then you can do better in terms of epsilon; you need to know more about Q too. What I mean is that if this were universally optimal and affine invariant, we wouldn't have a problem. Yes, but what I mean is that in certain instances you know more than that, and in that case you have to use another algorithm if you want to get as close as possible to the lower bound, and this other method will not have an
affine invariant bound; that's what I mean. So of course this might be optimal in certain scenarios, but when you leave those scenarios you're in trouble, because suddenly the best option for solving your problem doesn't have a clean complexity bound. That's what I mean. OK, other questions?

OK, so at least we know it's possible for a first-order method to have an affine invariant bound. So it's not the fact that we are losing second-order information that makes it impossible to produce invariant bounds; it's really about algorithm design and the way we bound the complexity of the method. So the sort of classical method that reaches the lower complexity bounds for smooth optimization, for example, is Nesterov's method of 1983. The original paper was in a Euclidean setting, but in the general case the algorithm requires you to do two things; it's fully specified up to a certain number of choices. In particular, when you write down Nesterov's method for a specific optimization problem, the general one, not the 1983 one, you have to choose a norm. And that norm is important, because when you say that the gradient is Lipschitz continuous with constant L with respect to a norm, your choice of norm will affect L, and if you change L then you change your complexity bound. So you choose a norm, and that has an impact on complexity. And once you've chosen a norm, you also need to choose a prox function for the set Q, and you want that prox function to be strongly convex with respect to your choice of norm. And again, the strong convexity parameter will show up in the complexity bound, so how you choose the prox will have an impact on the problem complexity. And then, once you've done that, the algorithm basically computes a gradient, does two prox projection steps, averages the result, and iterates. And it will produce an epsilon solution in at
most the square root of 8 L d(x*) / (epsilon sigma) iterations. Now the dependence on epsilon is optimal, because it's in one over the square root of epsilon; however, the constants L and sigma used in the description of that bound are not invariant with respect to an affine change of coordinates. So you do a simple transformation of your problem, you rescale the parameters, and suddenly the complexity can vary massively. So when people say "optimal" here, what they really mean is optimal with respect to epsilon. In the general case, if you don't look carefully at how you choose your norm and your prox, there's no reason for the numerator here to be optimal. And not only that: this numerator can have a very significant impact on the complexity of your problem, and it also makes the complexity bound vary with affine changes of coordinates in the original problem. So this is something you would like to fix. In particular, what you want is a bound on the complexity of your problem that is not only optimal in epsilon but also in everything else: a complexity bound that is optimal up to a constant factor, just like what we got for Newton's method. And the problem is that you cannot write such a bound using the quantities L and sigma, and for that matter d(x*), because all these quantities vary with an affine change of coordinates of your problem. So we have to find something else to write this complexity bound with, and make it affine invariant; and we also have to check that this something else is optimal, in the sense that it matches the best possible bound on this numerator. And just to give you an example of what this choice of norm and prox function means, suppose you look at an example taken from a paper by Nemirovski, Juditsky, Lan, and Shapiro, where you have a simple matrix game problem, so it's an affine matrix game problem, and you have some smoothing
to do, etc., but the details don't matter too much here. If you pick a Euclidean setting, and in this case a squared Euclidean prox, you can bound the number of iterations by four times the spectral norm of the matrix A divided by N + 1. If, on the other hand, you think a little more carefully and pick the L1 norm and an entropic prox, then your bound becomes the square root of log n log m times the max norm, the maximum magnitude of the coefficients of the matrix A. So again, I'm using the same algorithm and getting the same dependence on the number of iterations N; however, my numerator changes very significantly. For example, when the matrix A is Bernoulli, the difference between these two terms is of order the square root of n. So the method may be optimal in epsilon, but if you don't pay attention to your choice of norm and choice of prox, you can still be a factor of square root of n away from the best possible complexity bound. And that's a big deal. So there's a question of invariance, and there's also a question of optimality: you want the best possible implementation of your optimal method, to make it fast. Capital N is the number of iterations? So after N iterations you reach that precision? No, N is the number of... oh wait, wait, no, OK. Whatever makes it correct: either this should be the precision and this the number of iterations, or the opposite, but not both at the same time, good point. But you see what I mean. Sorry, so this should be either epsilon or N, whatever you prefer. Let's move on.

So I hope I've convinced you that choosing the norm and choosing the prox has an impact and is an important step, and if you want things to behave properly, this should be done in a principled way. And here, in some sense, looking for affine invariant bounds is natural, because the optimal bounds are going to be affine invariant.
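As a rough numerical illustration of this gap (my own sketch, not from the talk): for a random sign ("Bernoulli") matrix, the Euclidean setup pays the spectral norm of A in its numerator, while the L1/entropy setup pays the max norm of A times the square root of log n log m.

```python
import numpy as np

# Compare the two numerators from the matrix-game example on a random +/-1 matrix.
rng = np.random.default_rng(0)
n = m = 1000
A = rng.choice([-1.0, 1.0], size=(m, n))

# Euclidean prox: the numerator is the spectral norm ||A||_2 ~ 2 sqrt(n).
euclidean_numerator = np.linalg.norm(A, 2)

# L1 norm + entropic prox: the numerator is ||A||_max sqrt(log n log m), here ~ log n.
entropy_numerator = np.abs(A).max() * np.sqrt(np.log(n) * np.log(m))

print(euclidean_numerator / entropy_numerator)  # gap of order sqrt(n) / polylog
```

Same algorithm, same dependence on the iteration count; the only difference is the choice of norm and prox, and the resulting gap grows like the square root of the dimension, up to logarithmic factors.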
Okay, so the optimal numerator in the complexity bound of the problem is going to be affine invariant, because you can always apply an affine transformation to your norm and renormalize all your coefficients. But that doesn't mean that if you get an affine invariant bound, it's going to be optimal; it's just that looking for affine invariance is probably a good way of choosing a norm and choosing a prox. You then have to check afterwards that what you're getting is indeed optimal, and this will turn out to be the case here. So think about it for a minute: if you want your choice of norm and your choice of prox to be invariant, they cannot come from outside your problem. If you suddenly decide that your norm has to be Euclidean, there's no way it will vary as your problem varies, because it doesn't depend on your problem. So you have to extract your norm and your prox from the problem itself, and there's a good way to start doing that in certain classical scenarios. One is when Q is a centrally symmetric convex set with non-empty interior: in that case Q is the unit ball of a norm, and the natural norm to pick out of your optimization problem is the one corresponding to Q, the Minkowski gauge of Q. That turns out to be a norm, good news. And what makes this a particularly good choice is that when you pick this Q-norm as the underlying norm defining your algorithm, the corresponding Lipschitz constant for the gradient will be invariant with respect to an affine change of coordinates. You apply an affine change of coordinates to your problem, and if you pick the new Q-norm corresponding to the transformed feasible set, the gradient will have the same Lipschitz constant with respect to the transformed norm as with respect to the original one. So choosing the norm corresponding to the feasible set Q makes your Lipschitz constant, and the strong convexity constant of the prox, invariant with respect to an affine change of
coordinates. So that seems to be a good option. Now the last thing that remains to be done is to choose the prox, and for that I need to start with two quick definitions. Yes? If you deviate from central symmetry, then you have to do a little bit more; I don't have a general answer. Well, if you take the simplex, it makes sense to symmetrize it, and then it becomes the L1 ball. Can you do anything other than the simplex? I wish I had a general answer in that case, but I don't at this point. So we can cover centrally symmetric sets, and everything that looks like, or can be turned into, a centrally symmetric set. You presented a matrix game in which Q was more or less a simplex? Yeah, so it turns out, I'll discuss in a few slides a case where we optimize over a simplex, and where, if you take the L1 ball, everything works out fine. Yeah, but in more interesting games? That works for matrix games, where you can symmetrize; if you take sequential games, the strategy profiles are more general polytopes, which might deviate very violently from a simplex. Is it open? I have no idea, I mean, I don't know. So far only the centrally symmetric case is easy; in the general case, I don't know. Yes? The definition of smoothness uses a squared norm; could you replace it by a Bregman divergence, to take the geometry into account, like the simplex and the entropy? For the simplex you can just take the L1 norm. But what if you allow yourself to measure the proximity of x and y with a more general function, like a Bregman divergence? Maybe you could go a little bit beyond the centrally symmetric case. Maybe, yes; it's just that I still need a norm in the end, and so the problem is how do I get one from a Bregman divergence?
Yeah, I see what you mean, good point, maybe. But I only have a proof of optimality in... so getting affine invariance is probably easy, but the ultimate prize is to get both affine invariance and optimality, and so far we only have optimality in the centrally symmetric case, because we only have lower bounds in the centrally symmetric case. That's the idea. So I think it's probably doable to come up with affine invariant procedures to construct a norm when your original set is not centrally symmetric, no big deal probably, but unfortunately it's not clear that you're going to get the optimal norm corresponding to that particular set, and I don't even know at this point what to look for. So maybe Bregman divergences give you additional flexibility, but I'm not sure how to use that flexibility yet. That's the idea.

OK, so at least we have a norm in the centrally symmetric case, and it's pretty obvious. What's a little bit less obvious is that now that we have this norm, it may not be smooth, etc., so we have to come up with a prox, and for the prox I need a few more definitions. The first is the Banach-Mazur distance between two norms, or two spaces, and this is just defined as the smallest product ab such that 1/b times the first norm is smaller than the second, which is itself smaller than a times the first norm. So this product ab measures the distortion between these two norms, and we're going to use it in our choice of the prox. The second definition is a bit more obscure and less classical, and it's called the regularity constant. It's taken from a paper by Juditsky and Nemirovski on large deviations of vector-valued martingales, so it doesn't really have anything to do with the optimization context, but it's a way of measuring the regularity of a Banach space. The regularity constant of a Banach space is the smallest constant Delta that does two things.
First, we want the prox function p squared divided by 2 to have a Lipschitz continuous gradient with respect to the norm p. So we look for a smooth norm p to construct our prox: we want that smooth norm to define a prox that has a Lipschitz continuous gradient with respect to the norm itself. And we also want this norm p to be not too far away from the original norm of the Banach space. So the original norm here may not be smooth, but we approximate it by a smooth norm p, and because the norm p is smooth, it allows us to derive a prox by just squaring it. And Delta will control two things at the same time: one, how far p is from the original norm, and two, how smooth the prox is. So there's a trade-off between the two; there's a square root here, and here Delta is a direct bound on the parameter mu. But by some happy accident, this is exactly the trade-off we need in our bound. So what you can show using this definition is that the complexity of Nesterov's method applied to this problem here can be bounded by the square root of 4 L_Q D_Q divided by epsilon, where L_Q is the Lipschitz constant of the gradient of f with respect to the norm induced by Q, and D_Q is the regularity constant of the Banach space R^n equipped with the dual norm of the Q-norm. And the good news is that these two terms, L_Q and D_Q, are invariant with respect to an affine change of coordinates in your problem. So at least we solved our first problem: in this complexity estimate we still have the optimal dependence on epsilon, and at least now we have affine invariance of the numerator with respect to an affine change of coordinates. Yes? Yeah, but you need to know that... here, you want to make it as constructive as possible, that's the idea. The infimum over all... so you would take the infimum over all affine changes of coordinates of L times d(x*)? Yeah, but how do you solve
that problem? No, but L_Q, well, I'll show you in a minute. You can do it for certain specific instances; in the general case you can't, but for certain specific instances it's easy, when Q is an LP ball for example. So, coming back to the simplex example: when you're optimizing over a simplex, it makes sense to symmetrize that set into an L1 ball, and when you do that, classically people take the L1 norm as the underlying norm and the entropy as the prox function, because the entropy turns out to be strongly convex with respect to L1, and the bound you get is the L1 Lipschitz constant of your function times log n, divided by epsilon. If you do the same thing, but using what is suggested by the previous construction, you don't select the L1 norm as the underlying norm; you take something that is the reciprocal of the 2 log n norm, don't ask me why, read the paper. But basically, this suggests another way of constructing a norm based on the geometry of your problem; you measure the Lipschitz continuity of your function with respect to that norm as well, and the complexity bound you get can be written 16 L1 log n divided by epsilon. Yes? I mean, if you multiply everything by two, shouldn't... I don't think that should make a difference. No, it has no impact on the objective value, yeah. Other questions? Yes. So the bound you just put on the previous slide, that was for running Nesterov's algorithm with this norm and this prox?
So with the norm induced by Q, and the prox which is induced by this definition of the regularity constant: p(x), exactly, p(x) squared divided by two. So if you can compute the Q-norm and you can compute p(x), then you have a specification of Nesterov's method which is affine invariant. And do you know that this is smaller than the inf over all affine transformations? That's the question I'm going to discuss next. So clearly, affine invariance is not sufficient to give you optimality of the numerator; in the end what you want is a complexity bound which is optimal up to an absolute constant, and we don't have that yet. Yes? There is a paper showing that indeed, on each specific problem, the lower bound is better than this, in the sense that... yes, I'm going to come to that; I haven't really digested that paper yet, so I'm not sure exactly what it means, but the lower bounds are pretty tight on this thing too, so I have doubts. Yes, but the lower bounds in this finite dimensional setting, we know that they hold only for the first iterations, right? After that we don't know, so the result may tell us that if the number of iterations becomes bigger than n, then another method is a better idea. So this only applies in the high-dimensional regime, where you're doing fewer iterations than the dimension. Clearly, but this is typically the regime you care about. OK, but I'm going to talk about other scenarios where the Nesterov bounds are not optimal. All of this depends on getting a little bit more information on the black box, so I'll come back to that in a minute.

OK, so before saying anything about optimality, this bound actually allows us to say something about easy and hard problems, if we read the bound and try to understand what it means from a geometrical point of view. So, on easy problems.
So, easy problems, where easy here means problems where L_Q is low. L_Q will be low if the norm is large in directions where the gradient is large, and that means, roughly, that the sub-level sets of f and Q are aligned. Geometrically, when the sub-level sets are aligned there is an affine change of coordinates that makes your problem roughly spherical, and optimizing a spherical quadratic over a sphere is about the simplest problem you can think of, so that makes sense. Second, if you think about ℓp spaces, the unit balls B_p have low regularity constants; the regularity constant of the ℓ1 ball is n, the worst case, but remember that here the regularity is defined on the dual space. What that means is that problems written over unit balls B_q, for q between 1 and 2, are easier than problems written over cubes, for example, just because of the way this bound works on ℓp balls.

But how good are these bounds? Nicely enough, they are affine invariant, but that doesn't mean the complexity bound is tight. In particular, the worst possible choice of the product L·d(x*)/sigma will also, nicely, be affine invariant, but we don't expect that bound to be really good: the inf is affine invariant, but the sup is affine invariant too, and you'd rather take the inf. So the question is: can we show optimality, at least in particular cases? It turns out that we can, in particular for ℓp balls. So now we focus on a very specific class of problems where Q is the ℓp ball. In that case the bound is the square root of L_p D_p divided by epsilon, the constants D_p can be computed explicitly, and the same goes for the corresponding norms.
They are simply ℓp norms. When p is between 2 and infinity, D_p is simply n^((p-2)/p), and when p is between 1 and 2 you can bound D_p pretty tightly by this quantity here, where c is an absolute constant. That means we can compare this upper bound with the lower bounds we have for optimizing smooth convex functions over ℓp balls. For p in the range [1, 2], the lower bound we have is L divided by epsilon log n, and our complexity bound is an absolute constant times L log n divided by epsilon.

Yes, that's essentially what you're doing when you define p(x). The original norm is directly derived from the set and may not be smooth; p(x) is there to produce a smooth approximation of the set, essentially. That's how it's constructed. It's not constructive in the sense that you don't manufacture the approximation, but the regularity constant is computed using it. So it turns out that here the bound is optimal up to a polylogarithmic factor. In the range p between 2 and infinity, the lower bound is given by this, our upper bound by this, and this is again optimal up to a polylog when k is proportional to n. By optimal here I mean optimal in both epsilon and L, which is a much stronger statement than optimality with respect to epsilon alone.

Q: Is k the number of iterations? A: There are annoying things that happen here: you have to show optimality over the entire range of values of epsilon and n, and in certain regimes it turns out that Frank–Wolfe is optimal when you're running a number of iterations proportional to n. But we can discuss this offline; I don't have time to mention the details here.
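Since the constants D_p for p between 2 and infinity are quoted explicitly as n^((p-2)/p), it is easy to see numerically how the regularity constant interpolates between the Euclidean ball (constant 1) and the cube (constant n). A small sketch of mine, taking the quoted formula at face value:

```python
import numpy as np

def D_p(p, n):
    # Regularity constant of the unit lp ball for p >= 2,
    # using the explicit formula quoted in the talk: n^((p-2)/p).
    if np.isinf(p):
        return float(n)          # p -> infinity limit: the cube
    return float(n) ** ((p - 2) / p)

n = 1000
for p in (2, 4, 10, np.inf):
    print(p, D_p(p, n))
```

So in dimension 1000 the Euclidean ball has constant 1 and the cube has constant 1000, which is the bound's way of saying that balls are easier to optimize over than cubes.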
So in particular, there are results telling you that if you have a little more information on the smoothness of your space and on the regularity of your function, then you can get much more precise estimates of the complexity. In that setting we are not yet able to produce affine invariant bounds that work like the bounds I detailed before, but we're close. That's probably too much information for today, though.

Okay, so that's it: we have affine invariant bounds on the complexity of Nesterov's method that give you at least a principled way of specifying the method, of choosing a norm and choosing a prox. These bounds are affine invariant and, in addition, on certain problem instances, for example when optimizing over ℓp balls, they also turn out to be optimal. By optimal we now mean optimal in the dependence on the precision epsilon and optimal in the dependence on the problem parameters, that is, on the dimension. And that's important.

Very briefly, in the few minutes I have left, I have a few slides about another scenario where we are lucky, and where the description of the problem complexity matches exactly the quantity we use to describe its statistical performance, even though there's no immediate connection with the topic
I just discussed. In the end the objective is exactly the same: describe the complexity of an optimization problem in a way that is data-driven and allows you to compare the complexity of optimization problems meaningfully. Here I have to specialize things a little again and focus on conic linear systems, the following primal-dual pair of alternative conic linear systems: in one you try to find a vector in the intersection of the cone and a subspace, and in the other you try to find a vector y such that minus A transpose y belongs to the dual cone. It turns out we have a very refined way of characterizing the complexity of solving such a pair of alternative conic linear systems, and this complexity description is written in terms of the distance to infeasibility. What is the distance to infeasibility? It's essentially the minimum perturbation you need to introduce in the matrix A to make your problem infeasible or, symmetrically, the minimum perturbation of A that makes the problem feasible. Forgetting the details, what distance to feasibility or infeasibility captures is that problems that are clearly feasible, or clearly infeasible, will be much easier to solve than problems near the region where feasibility, or infeasibility, starts to break down.

Q: Just a quick question: the norm there, is it the spectral norm, or can you choose? A: You can choose, but clearly here it's the spectral norm. You may want to pick another one;
that part is flexible. Okay, so what Renegar's condition number does is measure this distance to feasibility or infeasibility in a scale invariant way: it takes the distance to infeasibility and normalizes it by the spectral norm. So when rho, the distance to infeasibility, is very small, then relatively speaking this condition number is going to be really high. The idea of this condition number is to do for optimization problems, or at least in this case for alternative conic linear systems, what the traditional condition number does for linear systems. When you compute the condition number of a linear system, it tells you at the same time how hard the system is going to be to solve numerically and how stable the solution will be. The condition number defined by Renegar does exactly the same thing for optimization problems: it tells you first how hard the optimization problem is going to be to solve, and second how stable its solution will be.
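The analogy with linear systems can be made concrete: by the Eckart–Young theorem, the spectral-norm distance from a square matrix A to the nearest singular matrix is its smallest singular value, so the classical condition number is exactly "norm divided by distance to ill-posedness", the same shape as Renegar's definition. A quick numerical check of mine (matrix size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
U, s, Vt = np.linalg.svd(A)

# Eckart-Young: the nearest singular matrix to A in spectral norm is
# obtained by zeroing out the smallest singular value, at distance s[-1].
s_trunc = s.copy()
s_trunc[-1] = 0.0
A_sing = U @ np.diag(s_trunc) @ Vt

dist = np.linalg.norm(A - A_sing, 2)   # spectral-norm distance = s[-1]
kappa = s[0] / s[-1]                   # ||A||_2 / distance to singularity
print(dist, kappa)
```

Here kappa agrees with the usual spectral condition number, so "hard to solve" and "close to an ill-posed instance" are literally the same statement for linear systems, which is the property Renegar's construction exports to conic feasibility problems.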
Okay, so if you can compute it, of course, this is a very good data-driven description of the complexity of an optimization problem. There's a lot of work on bounding the complexity of classical optimization techniques using this condition number, and more refined analyses of barrier methods show, for example, that this condition number shows up in the bound on the number of iterations of interior-point methods, and that the complexity of solving the KKT system of equations varies with this condition number as well. So this turns out to be a good way of measuring the complexity of an optimization problem using its precise structure and the input data.

Now, when you focus on compressed sensing and sparse recovery problems, where this should be the ℓ1 norm here (I forgot a pretty important index), you can define in very precise terms the performance of the sparse recovery problem using conically restricted minimum singular values of the matrix A. These look very much like the quotient defining the classical singular values, except that you restrict the vector z to belong to a certain cone, the descent cone of the ℓ1 norm. These restricted singular values allow you to bound the performance of the sparse recovery algorithm. What's pretty striking is that we're lucky enough that these two quantities match, meaning that Renegar's condition number is actually equal to the conically restricted singular value controlling statistical recovery performance in compressed sensing problems. So the quantity that controls numerical performance on the optimization side is, in this case, also the quantity that controls sparse recovery performance on the statistical side. We can generalize this to a much broader class of recovery problems; more details are in the paper. And numerically this works.
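The conically restricted minimum singular value described here, the minimum of ||Az||/||z|| over z in the descent cone of the ℓ1 norm at a sparse point, can at least be probed by sampling directions inside the cone. This is a crude Monte Carlo sketch of mine, not the procedure from the paper, and sampling only yields an upper estimate of the restricted minimum; sizes, seed, and the sampling scheme are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 30, 60, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[:k] = 1.0                 # a k-sparse signal, support S = {0,...,k-1}
S = np.arange(k)

def sample_descent_dir():
    # Draw a random direction d in the descent cone of ||.||_1 at x,
    # i.e. directions with sign(x_S)' d_S + ||d_{S^c}||_1 <= 0.
    g = rng.standard_normal(n)
    d = np.zeros(n)
    d[S] = -np.sign(x[S]) * np.abs(g[S])   # strict descent on the support
    budget = np.abs(g[S]).sum()            # = -sign(x_S)' d_S
    d[k:] = g[k:] * (rng.uniform() * budget / np.abs(g[k:]).sum())
    return d

# Monte Carlo upper estimate of min ||Az|| / ||z|| over the descent cone;
# recovery conditions require this restricted minimum to be positive,
# even though the unrestricted minimum is zero here (A has a null space).
est = np.inf
for _ in range(2000):
    d = sample_descent_dir()
    est = min(est, np.linalg.norm(A @ d) / np.linalg.norm(d))
print(est)
```

The point of the restriction is visible in the comment: since m < n the matrix has a nontrivial null space, so the unrestricted minimum singular value is zero, and only the cone-restricted quantity can certify recovery.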
So this is an example where we can go as far as possible in reconciling complexity descriptions, or complexity bounds, with statistical performance metrics. What you see here is the classical phase transition in sparse recovery, the probability of recovery using various algorithms, and this is the phase transition in the condition number. It's not symmetric; it would be if we had symmetrized our definition of the condition number, and what happens on that side of the phase transition is not that interesting. The really interesting part is that this condition number decreases as you move away from the phase transition threshold, and that these two phase transitions are pretty well synchronized. Indeed, what you observe is that the complexity of the corresponding numerical methods has the same phase transition as you move away from the sparse recovery threshold, meaning that it's much harder numerically to solve sparse recovery problems near the transition threshold than far away from it. This is something that was observed empirically, starting with papers by Donoho around 2008 for example, but here it's made completely explicit by the concordance between the conditioning of the optimization problem on one side and the conically restricted singular values on the other.

I guess this is my cue to conclude before I get electrocuted. So, to sum up: two somewhat unrelated results, but the idea in both is to produce complexity estimates that are close enough to the data to say something about the complexity of one optimization problem versus another. In particular, we got a formulation of Nesterov's optimal algorithm whose complexity bounds are invariant with respect to an affine change of coordinates and turn out to be optimal, at least on ℓp balls, and a completely data-driven complexity measure for sparse recovery problems, which matches
exactly the quantities used to measure the statistical performance of these procedures.

So there's a long list of open problems, if you're interested. One is to show optimality of the product L_Q D_Q in the general case and, as was mentioned, to produce the Q norm when Q is not centrally symmetric, in which case the gauge is not obviously a norm; we don't know how to do that. There's also a key question about the lower bound on L_Q D_Q: is that lower bound tight? We'd like to know, because it would solve a lot of our problems; we don't know, and Nesterov's bet is that it's not, but you can bet against him. Then again, the best known choice for non-symmetric sets Q, which I've mentioned already. Also, more important in some sense, a systematic, tractable procedure for smoothing Q that preserves affine invariance; that would allow us to produce prox functions in cases where Q is not an ℓp ball. We don't know how to do that yet, and it makes the algorithm unimplementable in the general case, so that's still open at this point. Thank you.

Q: What about the spectral ball? A: Yes, because then for the spectral ball you know what D_Q is. Is it tight or not? I don't remember; hard to say. We could be optimistic and say yes, up to a polylog, and I guess that's the answer, but I'm not certain. Basically, the important part is that the reason we're able to check optimality here is that we know exactly what D_Q is for ℓp balls, and the same is true for the spectral norm and the other matrix norms; we know exactly what the D_Q constant is for those, so at least we have an explicit bound.

Q: Do you also have the same analysis for the linear convergence rate, in the strongly convex case? A: No, not yet. You could; I guess you get the strong convexity constant, but you'd have to check. I guess it would work, but I haven't checked the lower bounds; that's the point.
So I don't know exactly whether this is optimal or not, but it should be.

Q: But you can just run the algorithm, and it converges linearly. A: Yeah, everything depends on the ratio sigma over L; sigma is affine invariant, L is affine invariant, so the ratio should be affine invariant. What I haven't checked is the lower bounds, whether we have the same match there.

Q: I don't understand your motivation for obtaining these affine invariant bounds for Nesterov's method. Somehow you're assuming... you're not considering the cost of computing the proximal operator. A: No, that's true. On the motivation side, I think one way to answer is to flip the question. What you're really interested in is the optimal bound, period: not only the optimal dependence on epsilon, you also want the optimal dependence on the dimension. And you can only get that if you pick the best choice of norm and the best choice of prox. If you get the choice of norm and prox wrong, you can lose a pretty significant factor in your bound, a square root of n in the example I just mentioned.

Q: But aren't you implicitly assuming that all the proxes are equally costly? You're not taking that into account. A: That's true, but in this case they turn out to be cheap, for ℓp balls at least. I agree that you don't have a generic statement on the prox. But in a sense, when you compute a bound for Nesterov's method, you start from the assumption that your projection steps will be solvable for your given set Q; you make that assumption in Nesterov's method whatever happens. So here
it's the same thing.

Q: Sure, but if you want a measure which is really dependent on the data: from the statistical perspective, somehow problems with a high amount of correlation should be harder, problems with no correlation easier, and this might be reflected in the bound. A: I agree that there's a gap in the analysis, but the problem we decided to focus on right now is getting a bound that is optimal in both n and epsilon; right now we do not have that, by far. In some cases we know exactly how to pick the prox and it's routine: if you see ℓ1, you pick the entropy and everything's fine. But for a generic problem you're faced with a feasible set, no one tells you which prox you should pick, and you can really lose a factor of square root of n by not picking the right one. So what we wanted originally was a principled procedure to choose the norm and choose the prox, and it turns out that when you do that, what you get is by nature affine invariant: you should get something that looks like the infimum, over all affine changes of coordinates, of L·d/sigma, because otherwise you can always take an affine scaling of your norm, you see what I mean. So whatever you do, if you want to be optimal you need to be affine invariant, and it's easier to look first for affine invariance. That's why I talk so much about affine invariance; in the end the real goal is of course to find the optimal bound for Nesterov's method in a principled way, in a way that tells you exactly how to select the norm and the prox. I agree that this doesn't solve the problem of computing the corresponding prox and solving the projection steps, but I guess that will have to wait until we solve this problem first. So I agree with you, but my message is that getting the optimal bound is in itself not a resolved issue at this point.
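To close the loop on the sparse recovery discussion, the phase-transition behavior mentioned earlier can be reproduced in miniature. This is a minimal sketch, not the talk's actual experiment: it solves basis pursuit (minimize ||x||_1 subject to Ax = y) as a linear program via scipy.optimize.linprog, and the sizes, seed, and the two measurement counts are illustrative choices of mine.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    # min ||x||_1 s.t. Ax = y, written as an LP in (u, v) >= 0, x = u - v.
    m, n = A.shape
    res = linprog(c=np.ones(2 * n),
                  A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(4)
n, k = 40, 3
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

# With plenty of Gaussian measurements recovery should be exact;
# with very few, basis pursuit typically returns a different minimizer.
for m in (8, 30):
    A = rng.standard_normal((m, n))
    x_hat = basis_pursuit(A, A @ x_true)
    print(m, np.linalg.norm(x_hat - x_true))
```

Sweeping m between these extremes and averaging over random instances traces out the empirical phase transition, and the observation discussed in the talk is that solver effort peaks in the same region, which the condition-number analysis makes explicit.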