So, thanks to the organizers for the invitation. I'm impressed by Spencer's energy. You came from California; I'm hoping I don't fall asleep and that I match your energy. I'll be talking about some convergence guarantees for iterative algorithms in an empirical risk minimization setting. Everything I talk about today is joint work with my collaborators Ashwin, Christos and Mengqi, and most of this talk will be based on this first paper here, which was joint with Ashwin and Christos, although I'll draw some examples from the second paper. So just to jump right in, throughout the talk I'm going to assume that there's an underlying statistical model whereby you observe n pairs of data and responses, where the responses yi are formed by applying some potentially nonlinear function psi, which you may or may not know, to a suitably defined inner product between your data xi and a ground truth model theta star, which you do not know. And these observations may be corrupted by some additive noise epsilon i. Okay, so your goal is to estimate this ground truth model theta star. The way in which you do this is by employing empirical risk minimization, so in particular you pick an estimate theta hat which minimizes the empirical risk given here. And of course on the statistical side of things, the natural question to ask is how well this estimated model theta hat actually approximates the true model that you don't know, theta star. Now associated to this, there's a natural set of computational questions. In order to actually produce this empirical risk minimizer, one typically employs some sort of iterative algorithm to optimize the empirical risk. This can come in several flavors, for instance a first order method, like a gradient or a subgradient method, or a higher order method, for instance a Newton method, an alternating projection type method, and so on. There are many others in both of these categories.
And in this, on this computational side of things, the natural question to ask is given this kind of zoo of algorithms that are available to you, which algorithm should you actually pick to minimize the empirical risk that you're interested in. Now one guiding philosophy that might prove useful, although certainly not the only one, is one of iteration complexity. So you might ask how many iterations your method takes until it converges to a solution which achieves error epsilon. Now worst case efficiency estimates for these iteration complexities have been known for a very long time, but in the setting that we're interested in, when xi, our data, is random, it's also known that these worst case efficiency estimates tend to be overly conservative. So I've highlighted this paper here, I'm going to return to it in just a second. Before I do that, I also want to mention that in the setting when your empirical risk may be non-convex, it becomes particularly important to also understand whether your method converges globally or not. So here by globally, I mean if I just initialize my method at random, will it converge to some statistically useful solution or just some nonsense that's unhelpful? Okay, so returning to this paper that I mentioned just a minute ago, we've kind of situated ourselves between statistics and computation here, and a very natural first question to ask is one of optimality theory. So in particular, if I restrict my set of algorithms to some perhaps large class, and I give myself t iterations of a certain method, the first question you might ask is after these t iterations, what is statistically the best error that I can achieve? Do I have a non-trivial lower bound on that error? And in that class of algorithms is there a matching upper bound for this? Now somewhat surprisingly, the answer to these questions is yes, and that matching algorithm is Bayes AMP. 
And so I would refer to you both of these papers, they're very nice works, which give a proof of this fact, but today my interest is a bit more exploratory in nature. So I'm not going to be talking about optimality theory, rather what I'm interested in is I have a particular empirical risk that I'm trying to minimize, and I've a set of candidate algorithms that I might use to minimize this risk, and now I just want to meaningfully compare and contrast between these algorithms. The way in which I propose to do so is generating what I call some sharp trajectory analyses. So this cartoon picture here, I have three different algorithms. The brown algorithm at the bottom, this is converging the fastest in terms of iteration complexity, so it takes the fewest iterations to converge to the final statistical solution. This green algorithm in the middle converges a bit slower, and this purple algorithm converges, or we can't even tell what it converges, but if it does, much slower. But now the point is that you can couple a picture like this or a trajectory analysis like this with a tight understanding of the per iteration complexity of your method to make a more informed choice of which algorithm you might actually use. So for instance, this green algorithm here, the middle algorithm, this might have a much lower per iteration complexity than the brown algorithm, and you might say that, okay, five more iterations, or yeah, five more iterations, I can tolerate that. But even the purple algorithm that has a lower per iteration complexity, its total iteration complexity might be prohibitively large, so your algorithm of choice is the green method. So this is kind of where we'll be going today. In order to just make things a little bit more concrete, I'll take a running example, which is real phase retrieval. So this is the simplest setting in which I can state all of my results. So here, the statistical model sets the non-linearity psi as just the absolute value. 
And now throughout, I'll make a Gaussian covariate assumption. So my data points xi are just standard Gaussian. All of them are iid. My noise epsilon i is also Gaussian with variance sigma squared. And just for regularity purposes, my ground truth model theta star, I'll assume to lie on the unit sphere in dimension d. Now associated to this, I'll consider the empirical risk, which is just this nonlinear least squares loss. This is just the negative log likelihood corresponding to this statistical model, at least up to a constant. And the only thing that's important to note about this is that this loss is both non-convex as well as non-smooth. So from a worst case perspective, this is quite a difficult loss function to optimize, and we don't even know if globally we should be able to optimize it. Of course, this problem is well studied, so we do know that when the data xi is random, and in particular when the data is Gaussian, there are several methods which indeed converge quite quickly to the ground truth. And two of these methods will form my running examples for the algorithms today. So my prototypical first order method will be the subgradient method for this problem. Its iterates are given explicitly here. And my prototypical higher order method is alternating minimization, whose iterates are given explicitly here. Okay, so now at this point, I can state my final assumption, again, which I'll make throughout the talk, and which is perhaps my most stringent assumption: at each iteration of my method, I'm going to completely resample all of the data. So I'm going to take n fresh new samples at every iteration of the method. It's worth opening a small parenthesis and just situating this in the context of the other assumptions you might make. So on one end of the spectrum, you might make an online single sample type assumption.
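The slides with the explicit iterates are not part of this transcript, so the following is only a minimal sketch of the setup just described, assuming the standard forms of the two updates: a subgradient step on the loss (1/n) * sum_i (|<x_i, theta>| - y_i)^2, and an alternating minimization step that guesses the signs from the current iterate and then solves a least squares problem. The function names, the scaling of the step size, the warm start, and the problem sizes are all illustrative assumptions. Fresh data is drawn at every iteration, as in the resampling assumption.

```python
import numpy as np

def sample(n, theta_star, sigma, rng):
    """Draw n fresh observations y_i = |<x_i, theta*>| + eps_i, Gaussian x_i."""
    X = rng.standard_normal((n, theta_star.size))
    y = np.abs(X @ theta_star) + sigma * rng.standard_normal(n)
    return X, y

def subgradient_step(theta, X, y, eta):
    """One subgradient step on (1/n) * sum_i (|<x_i, theta>| - y_i)^2."""
    z = X @ theta
    return theta - (2.0 * eta / len(y)) * (X.T @ ((np.abs(z) - y) * np.sign(z)))

def alt_min_step(theta, X, y):
    """Guess the signs from the current iterate, then solve least squares."""
    target = np.sign(X @ theta) * y
    return np.linalg.lstsq(X, target, rcond=None)[0]

def run_resampled(step, theta0, theta_star, n, sigma, T, rng):
    """Run T iterations with n fresh samples per iteration; return the errors."""
    theta, errs = theta0.copy(), []
    for _ in range(T):
        X, y = sample(n, theta_star, sigma, rng)
        theta = step(theta, X, y)
        # theta* and -theta* are indistinguishable under the absolute value
        errs.append(min(np.linalg.norm(theta - theta_star),
                        np.linalg.norm(theta + theta_star)))
    return np.array(errs)

rng = np.random.default_rng(0)
d, lam, sigma, T = 50, 20, 0.01, 15          # illustrative sizes, n = lam * d
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
theta0 = theta_star + 0.3 * rng.standard_normal(d) / np.sqrt(d)  # warm start

err_am = run_resampled(alt_min_step, theta0, theta_star, lam * d, sigma, T, rng)
err_gd = run_resampled(lambda th, X, y: subgradient_step(th, X, y, 0.5),
                       theta0, theta_star, lam * d, sigma, T, rng)
print(err_am[-1], err_gd[-1])
```

Both runs use n * T samples in total, which is the resampled sample complexity discussed in the talk.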
So in this setting, at each iteration of your algorithm, you draw one fresh new sample or a constant number of fresh new samples, and you use this fresh sample, independent of everything else, to take a step of your method. And what you might benefit from analytically there is independence across your iterations. Okay? On the other extreme end, you might make a full batch type assumption where at each iteration of your algorithm, you use all of the samples available to you. In this case, you clearly lose independence across iterations because the data will be correlated in some non-trivial way with the iterates. But what you might gain is, because you're using all of the samples available to you, you might benefit from concentration of measure. Now, situated in between these is the strongest assumption you can make and the easiest to deal with analytically, which is the resampled assumption that I'm making. So this benefits from both worlds. You will benefit from concentration of measure and you'll get independence across iterates. So to be very clear, this is a strong assumption that I'm making here. And now, just so we're on the same page, let's understand what the overall sample complexities of these methods look like. For the full batch method, if I take n samples upfront, my sample complexity is just n. I use these at each iteration. For the single sample method, if I take capital T iterations of this method, my overall sample complexity is just however many iterations of the method I take. And in my resampled analysis, if I take capital T iterations of the method, my overall sample complexity will be little n times capital T, the number of iterations I take. Just so we have a concrete number in mind, in all of the examples that I'm going to give throughout the talk today, capital T will scale at most logarithmically in the dimension.
So you can keep in your head maybe that the overall sample complexity of all of the methods I'm talking about is on the order n log d, at most. Although this could be much larger for examples I don't cover. So now I can unveil the actual algorithms that I plotted in that toy picture from a few slides ago. So this brown curve corresponded to this alternating minimization method, the green curve in the middle to subgradient descent with step size one half, and the purple curve at the top to subgradient descent with a slightly larger step size of 0.95. So let me emphasize that these are completely deterministic predictions. They were made offline. I never saw any data to make these. And now what I'll do is I'll actually run these methods with the setting above the plot. So when I run these methods, the empirical iterates are marked in black. And the lines you can barely see are just the empirical trajectories of those empirical iterates. So what you can take away from this simulation is that these deterministic predictions that we plotted completely offline capture the trajectory of the algorithm quite well. It's hard to tell the difference between these two curves. And we'll use these deterministic trajectories to get some fine grained convergence results about the actual algorithms, the actual empirical iterations. So for instance, in the course of this talk, we'll show that this alternating projections method converges super linearly with exponent 3 halves, and that these subgradient methods with different step sizes enjoy linear convergence with different contraction factors, or different rates of convergence. And just to make clear, for basically the rest of my talk, I'm going to be explaining this picture, essentially stating theorems that explain it, not proving those theorems. Okay. So I just want to give one slightly more funky example.
So in all of the examples that I showed on the previous slide, all three methods were converging and converging quite quickly. This is the same problem. So all of the parameters are set to be the same thing. But I just take a slightly more aggressive step size. And in this case, the empirical iterates oscillate around some solution which is not statistically useful. So if you remember, on the previous slide the statistical error that these methods converged to was on the order 10 to the negative six. In this case, the statistical error that this method is oscillating around is much higher. It's on the order 10 to the negative one. But nonetheless, these deterministic predictions capture this oscillatory behavior and capture the point around which it's oscillating. Okay. So these predictions aren't just restricted to the actual convergence of the algorithms. Okay, so underlying these deterministic predictions is a notion of state evolution. So this terminology is familiar from approximate message passing and indeed is borrowed from approximate message passing, although we are not going to be talking about approximate message passing today. And it's stated in terms of two scalar components. So the first is a signal component alpha, which is just the inner product of your current iterate theta with the ground truth theta star, and the second is a perpendicular component beta, which is just the norm of the projection of your current iterate onto the subspace orthogonal to the ground truth. Now, in terms of these two scalar quantities, I can state in closed form what the updates for alternating minimization actually look like. So they're both stated in terms of the angle phi between your current iterate and the ground truth. The parallel component, the signal component alpha, takes the following form, and the perpendicular component is given on the bottom line.
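To make the two state components concrete, here is a small sketch that computes alpha and beta for a given iterate. The closed-form map on the slide is not in this transcript; the function `am_state_evolution` below is an assumed form for alternating minimization in real phase retrieval, derived from the Gaussian expectations under the model above (unit-norm theta star, noise level sigma, per-iteration ratio lam = n / d), so treat it as an illustration rather than a quotation of the paper.

```python
import math
import numpy as np

def state(theta, theta_star):
    """Return (alpha, beta): signal and perpendicular components of theta."""
    alpha = float(theta @ theta_star)                 # theta_star has unit norm
    beta = float(np.linalg.norm(theta - alpha * theta_star))
    return alpha, beta

def am_state_evolution(alpha, beta, lam, sigma):
    """One deterministic step for alternating minimization (assumed form)."""
    phi = math.atan2(beta, alpha)                     # angle to the ground truth
    a = 1.0 - (2.0 * phi - math.sin(2.0 * phi)) / math.pi
    b0 = (2.0 / math.pi) * math.sin(phi) ** 2
    # finite-sample correction; the max guards against tiny negative round-off
    extra = max(1.0 + sigma**2 - a**2 - b0**2, 0.0) / (lam - 1.0)
    return a, math.sqrt(b0**2 + extra)

d = 10
theta_star = np.zeros(d); theta_star[0] = 1.0
theta = np.zeros(d); theta[0], theta[1] = 0.8, 0.6
alpha, beta = state(theta, theta_star)
print(alpha, beta)
print(am_state_evolution(alpha, beta, lam=20.0, sigma=0.0))
```

At the ground truth with no noise, the map returns the state (1, 0), so theta star is a fixed point of the deterministic update.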
And just to get a little bit of an idea of what these look like, we can plot the gradient field corresponding to these updates. Now, in this gradient field, what you're really interested in converging to is when you take theta equal to theta star, you want your method to converge to the actual ground truth. So at theta star, the signal component is one, the perpendicular component is zero. So that corresponds to the bottom right portion of the gradient field. Now, at random initialization, so if you want to understand global convergence, your signal component is very close to one, sorry, very close to zero, and your perpendicular component is very close to one. So that corresponds to starting your method at the top left corner of this gradient field. So indeed, if you do this and you just plot what happens when you run this algorithm through the gradient field, it looks as follows. So it still converges quite quickly to the point you desire. And now I can state these state evolution updates in a slightly more general setting. So this is the most general setting for which I can state a unified set of theorems, but certainly not the most general setting that you can imagine. So this is in terms of a statistical model, which is exactly what I said in the first slide. And now I have to tell you what the algorithms I'll consider are. So there are two sets of algorithms I'll consider. The first is a first order method, where the only specificity of the method is in this non-linearity omega. So omega is the only part of this method, which specifies what algorithm you're running. And to be very clear, omega need not have any connection to the non-linearity psi. So you can inherently mis-specify the model by setting omega to be whatever you want it to be. Now the only class of higher order methods I'll consider are written in the following way here. So each iterate is formed by solving a least squares problem. 
And again the only specificity of the method is through this non-linear function omega, which is the target of the least squares problem at each step. Okay, so let's just see a few examples of what these look like. So turning back to our running example of phase retrieval, as we mentioned, the non-linearity psi is just the absolute value. And we can also look at subgradient descent and alternating minimization, our two running example algorithms, and their non-linearities omega are highlighted in the corresponding colors here. Okay, just to take one more example, we can also take psi to be a random function. So in this case, for mixtures of regressions, for each sample you have two linear models. One is theta star, the coefficients, one is negative theta star, and you randomly generate a response from one of these two linear models, but you don't know from which linear model it actually came from. So that's this statistical model. And I can consider at least three algorithms that fall under the purview of this framework. So an alternating minimization method similar to what was on the previous slide. A subgradient version, so a first order method corresponding to that. And in this case, expectation maximization is also a higher order method. Okay, so now as long as these two or these two assumptions are satisfied on the statistical side and the algorithmic side, I can state explicitly the state evolution update. So they're given in this table here, where first order methods are this first column, higher order methods are at second column, and the first row is my parallel component, second row is my perpendicular component. Now importantly, these are completely specified. The only thing that you need to compute when you have your algorithm or your model of interest is these expectations. 
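The two algorithm families and the examples above can be sketched as follows. The exact displays are on slides that are not in this transcript, so the scaling of the first order step and the specific choices of omega below are assumptions based on the standard forms of these algorithms (sign-guessing alternating minimization, and the usual EM update for a symmetric two-component mixture). All function names are hypothetical.

```python
import numpy as np

def first_order_step(theta, X, y, omega, eta):
    """theta - (2 * eta / n) * sum_i omega(<x_i, theta>, y_i) * x_i."""
    return theta - (2.0 * eta / len(y)) * (X.T @ omega(X @ theta, y))

def higher_order_step(theta, X, y, omega):
    """Least squares with targets omega(<x_i, theta>, y_i)."""
    return np.linalg.lstsq(X, omega(X @ theta, y), rcond=None)[0]

# Phase retrieval (psi is the absolute value)
def pr_subgrad(z, y): return (np.abs(z) - y) * np.sign(z)
def pr_alt_min(z, y): return np.sign(z) * y

# Symmetric mixture of two regressions: y = s * <x, theta*> + eps, s = +/-1
def mlr_alt_min(z, y): return np.sign(y * z) * y
def make_mlr_em(sigma): return lambda z, y: np.tanh(y * z / sigma**2) * y

rng = np.random.default_rng(0)
d, n, sigma, T = 50, 1000, 0.1, 15              # illustrative sizes
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

def run(omega):
    theta = theta_star + 0.3 * rng.standard_normal(d) / np.sqrt(d)  # warm start
    for _ in range(T):
        X = rng.standard_normal((n, d))          # fresh data at every step
        s = rng.choice([-1.0, 1.0], size=n)      # hidden mixture labels
        y = s * (X @ theta_star) + sigma * rng.standard_normal(n)
        theta = higher_order_step(theta, X, y, omega)
    return min(np.linalg.norm(theta - theta_star),
               np.linalg.norm(theta + theta_star))

err_am, err_em = run(mlr_alt_min), run(make_mlr_em(sigma))
print(err_am, err_em)
```

Only the function passed as `omega` changes between algorithms; the data model enters only through how `y` is generated.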
So the expectation of Z two times some random variable Omega, and the expectation of Z one times some random variable Omega, where this random variable Omega captures all of the problem parameters. So it's in terms of this nonlinear function omega, which specifies your algorithm, as well as this nonlinear function psi, which specifies your statistical model. And just perhaps to unpack these updates a little bit more, take lambda, which is the ratio of my per-iterate sample size n to the dimension d. If I take lambda to infinity, this recovers the infinite sample limit, the population update. And indeed, my parallel component, my signal component, remains unchanged regardless of whether I take the infinite sample limit or not. But in my perpendicular component, the last term in the square root disappears for both of these updates. And you can already see distinctions in convergence behavior just by doing this. So for instance, for this alternating projections method, the first term under the square root scales quartically in the previous component beta, and the second term scales cubically in the previous component beta. So if I take lambda to infinity, only the quartic term survives, and after the square root my next perpendicular component will decay quadratically in my previous perpendicular component. So this would imply a quadratic convergence rate, whereas for any finite lambda, at least locally, the dominating term is that cubic term, and the square root of a cubic term gives exponent three halves. That's precisely where this super linear convergence rate of three halves comes from in the alternating method. Okay, so to actually get these state evolution updates, we rely on two fairly simple ideas. The first is that even when the loss you're trying to optimize may be globally non-convex, oftentimes the algorithm which you employ to actually minimize that loss proceeds by making a convex surrogate, minimizing that convex surrogate, and then repeating.
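The quadratic versus three-halves distinction can be checked numerically. The sketch below uses an assumed closed form for the perpendicular update of alternating minimization in noiseless phase retrieval (it is not read off the slides), and estimates the local exponent as the slope of log(beta_next) against log(beta) between two small values of beta. The helper names and the two probe values are illustrative choices.

```python
import math

def am_perp_update(beta, lam, sigma=0.0, alpha=1.0):
    """Next perpendicular component under the assumed AM state evolution."""
    phi = math.atan2(beta, alpha)
    a = 1.0 - (2.0 * phi - math.sin(2.0 * phi)) / math.pi
    b0 = (2.0 / math.pi) * math.sin(phi) ** 2          # quartic after squaring
    # cubic-order term, divided by (lam - 1); vanishes when lam is infinite
    extra = max(1.0 + sigma**2 - a**2 - b0**2, 0.0) / (lam - 1.0)
    return math.sqrt(b0**2 + extra)

def local_exponent(lam, b_hi=1e-2, b_lo=1e-3):
    """Slope of log(beta_next) versus log(beta) between two small betas."""
    num = math.log(am_perp_update(b_hi, lam)) - math.log(am_perp_update(b_lo, lam))
    return num / (math.log(b_hi) - math.log(b_lo))

print(local_exponent(lam=20.0))       # close to 1.5 for finite lam
print(local_exponent(lam=math.inf))   # close to 2 in the population limit
```

With any positive sigma the additional sigma squared over (lam - 1) term sets a noise floor, so the slope estimate is only meaningful for beta well above that floor.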
So just to take a cartoon here, you're interested in minimizing this clearly non-convex loss in black, and your current iterate is theta t. You form some convex local surrogate at theta t, you minimize that convex local surrogate, that gives you your next iterate, and that's how you proceed in this method. And indeed, we can just quickly sanity check ourselves: this alternating projections method which I've been coming back to proceeds in exactly the same way. Each iterate is just solving a convex least squares problem. Okay, so the second observation is that in high dimensions, convex problems with Gaussian data admit exact analyses. So I've borrowed some plots from some of the early work on this, at least some of the early rigorous work. So there's this paper of Mohsen Bayati and Andrea Montanari, which studies the lasso; they plot their deterministic predictions and the empirical quantities of the error around that. Likewise, Christos Thrampoulidis, Samet Oymak and Babak Hassibi again studied the lasso using the CGMT. Their deterministic predictions from this method clearly adhere quite well to the empirical error. And finally, Noureddine El Karoui and his co-authors used the leave one out method to study robust regression. And you can't even see more than one plot here, but there are indeed two curves there. Okay, so if we take these two observations, we can essentially bootstrap them into our first theorem, which is just a one step update. So this theorem says: take your current iterate theta, this is just some arbitrary vector in R^d, and take one step of your method. So theta plus is one step of your method.
Now as long as this quantity lambda, which is the ratio of your per-iterate sample size n to d, and the number of samples you take at every iteration are large enough, then the empirical state evolution quantities, beta of theta plus and alpha of theta plus, concentrate around these deterministic quantities that I stated on the previous two slides, alpha-gor and beta-gor, and the fluctuations of these are on the order one over root n up to log factors for the parallel component, and one over n to the one fourth up to log factors for the perpendicular component. So just to maybe make this clear with the cartoon picture, you take an initial empirical point, so an empirical state, alpha of theta, beta of theta, and you compute a deterministic function of this. This gives you the blue filled-in dot here, which is a deterministic function of alpha of theta and beta of theta. Now you draw a small L infinity ball in R2 around this point, and your next empirical iterate is guaranteed to lie in that L infinity ball. Okay, so two comments are worth making about this theorem. The first is that this holds for both the first-order methods as well as the higher-order methods that I outlined a few slides ago. For first-order methods, to be clear, this is very straightforward to establish, so it's almost an exercise, and this is essentially because of the resampling assumption. For higher-order methods it's a bit more difficult and we need to resort to some machinery like the convex Gaussian min-max theorem, and that's indeed where this rate of one over n to the one fourth comes from. In later work we improved this to one over n to the one half, so this should match the right statistical rate, but I prefer to keep it as one over n to the one fourth, because this will hold in slightly more generality, and also this is how it's stated in the paper.
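A one-step check of this kind can be simulated directly. The sketch below takes a single alternating minimization step for phase retrieval with fresh Gaussian data and compares the empirical (alpha, beta) of the new iterate with a deterministic prediction. The prediction formula is an assumed closed form for this algorithm, not a quotation from the slides, and the sizes d, lam and sigma are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
d, lam, sigma = 200, 20, 0.1
n = lam * d
theta_star = np.zeros(d); theta_star[0] = 1.0
theta = np.zeros(d); theta[0], theta[1] = 0.8, 0.6    # alpha = 0.8, beta = 0.6

# Empirical step: fresh data, sign guess, least squares
X = rng.standard_normal((n, d))
y = np.abs(X @ theta_star) + sigma * rng.standard_normal(n)
theta_plus = np.linalg.lstsq(X, np.sign(X @ theta) * y, rcond=None)[0]
alpha_emp = float(theta_plus[0])                      # <theta_plus, theta_star>
beta_emp = float(np.linalg.norm(theta_plus[1:]))      # norm orthogonal to theta_star

# Deterministic prediction (assumed closed form for this algorithm)
phi = math.atan2(0.6, 0.8)
alpha_det = 1.0 - (2.0 * phi - math.sin(2.0 * phi)) / math.pi
b0 = (2.0 / math.pi) * math.sin(phi) ** 2
beta_det = math.sqrt(b0**2 + (1.0 + sigma**2 - alpha_det**2 - b0**2) / (lam - 1.0))

print(alpha_emp, alpha_det)
print(beta_emp, beta_det)
```

Repeating this with larger n at a fixed ratio lam should shrink the gap between the empirical and deterministic pairs, in line with the fluctuation rates quoted above.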
Now, this is nice, so a one step update is good, but I promised you at the beginning a trajectory analysis, and this is explicitly not a trajectory. What happens is I take an empirical quantity, I compute a deterministic function of it, but now I'm only telling you there's a small ball in R2 around this. I'm not telling you how to compute the next deterministic component from that small ball. Do you take the worst case point? Do you compute it from that filled-in blue spot? So this one step update is not a trajectory prediction yet. In order to make it a trajectory prediction we need what we call an envelope result. So here this is stated in terms of a two-dimensional state space operator, which we call S-gor. It takes a current state alpha, beta and it spits out the next deterministic state alpha-gor, beta-gor. Now you can state a deterministic trajectory using this map just by taking your first state alpha, beta and iterating this map t times. That gives you t deterministic states all computed from this first point, and what we'd like to establish is that if we start at this empirical state, this triangular mark, we can compute the filled-in blue states, these are all deterministic predictions, and hopefully our empirical iterations will lie in a small window around those.
And indeed this is what our envelope theorem says, so as long as our initial iterate theta is in some locally converging region g, so to be clear here the locally converging region g states that the signal component is large enough, so larger than some constant, and the ratio of the perpendicular component to the signal component is small enough, so smaller than some constant, then with high probability, which depends on the maximum number of iterations capital T that your method is going to take, uniformly over the number of iterations that you take, so uniformly over one to capital T, the error of your empirical, so your empirical error of your method is very close to the actual deterministic prediction, and this is upper bounded by the quantity one over n to the one fourth, so the important part here is that even as you take steps of this method, the fluctuation of your deterministic prediction, which you may have made 15 steps ago, this doesn't increase with the number of steps that you take, so it's held fixed over time. Yeah, no there's no T dependence, the only T dependence is here, there is a universal constant that I'm hiding there, but no no T, that's that's precisely the point. Yeah, I don't want this to increase. Yeah, why you can say this? Yeah, so the two ingredients behind this, I already told you the first part, which is this one step update, the second part is that these these maps have some some properties, so if you can establish that this map is essentially a contraction in the right norm, then because it's, so it comes from the one step update plus a contraction. No, you have to, this has to be established on it, so the contraction is the key point that you have to, it is, it boils down to two-dimensional calculus, but you have to do it on a case-by-case basis. 
So for all of the methods I mentioned it does hold, but I can't tell you that generically. There might be some general reason, our proof doesn't leverage that, but there might be; I can't say. No, the contraction is only at the level of the state evolution. Yeah, it's completely deterministic. Yeah, cool. So once you establish this kind of envelope result, this allows you to safely say, okay, my empirical iterates are tracked by these deterministic iterates, so now I need to establish some sort of convergence guarantee for this deterministic two-dimensional evolution. So we have this notion that we call sharp convergence. First, we say that the iterates enjoy (little c, capital C, t-naught, epsilon) linear convergence in the metric d if at time t plus 1 the distance is both upper and lower bounded by the distance at time t, on the lower side up to the constant little c, on the upper side up to the constant capital C, and this is up to an additive error of epsilon: epsilon over 2 on the lower bound and epsilon on the upper bound.
There's a similar notion of two-sided convergence for super linear methods, it's exactly the same except the distance at time t is raised at the power lambda, which you should think of as greater than 1, and the only other difference is that this constant capital C, this can take the value of 1 in the super linear convergence, but it has to be strictly bounded away from 1, bounded above by 1 for linear convergence, and let me mention that all of these constants that I've written here, little c capital C epsilon t not, these can all be problem dependent, so they can depend on n, they can depend on d, the noise variance sigma squared, the non-linearity psi omega, so all of the problem dependence is baked in there, and we'll go through some examples in a minute, but we can now revisit our two earlier questions, the first which was statistical, so we wondered how well the minimizers of the empirical risk ln actually estimate the ground truth theta star, and now once you've established this kind of guarantee, you get this for free, because this is in terms of parameter convergence, so your statistical error is just epsilon, and on the other side of the questions we asked, which are computational in nature and we wondered if we could compare between algorithms, you also get this because of the two sided nature of the convergence guarantee, so your method converges both at best linearly or super linearly, as well as at worst linearly and super linearly, okay? 
So just to give two examples to illustrate this, the first is alternating minimization for rank one matrix sensing, in this case the iterates enjoy linear convergence, where the convergence rate depends on the noise standard deviation sigma, as well as the sample size n and the dimension d, okay, so you can see it, this is perhaps not the best illustration because this method converges very quickly, and at most three iterations, but you can still see a distinction in the convergence rate as I vary this parameter lambda, which I remind you is my ratio of the number of samples n at each iteration to the dimension d, and I should also note that this shouldn't be thought of as something novel, so I can think of at least two works for different algorithms which get similar types of convergence rate guarantees, one of which is Mahdi's work, who's here, which will get a dimension dependence, or actually a much more fine-grained dependence, so you shouldn't think of this fine-grained dependence as something very novel. The second example is alternating minimization for mixtures of linear regressions, so in this case when there's no noise in the statistical model, this method enjoys super linear convergence with exponent three halves, but as soon as you have any amount of non-zero noise, locally these iterates enjoy linear convergence, so there's an exponential separation in the convergence behavior of the algorithm as soon as you change the noise level. 
So for the final convergence ingredient that I need, I told you that I'm interested in how these algorithms behave globally, so I need to tell you what happens if I randomly initialize the method. That's what this theorem does: if I randomly initialize my method on the unit sphere in dimension d, then as long as this ratio lambda, which is n over d, is at least log log in the dimension d, then after a number of iterations logarithmic in d, the empirical iterate lies in this locally converging region g that we know from the envelope theorem. So we can see what happens here in the gradient field. The initial signal component alpha of theta zero at random initialization is on the order 1 over root d, so this is in the top left corner, and now you can see in the actual algorithm there is some transient phase; this is the log d number of iterations before you start converging. So this is my conclusion slide, which is just wrapping up all three of these parts into a full theorem, which we state for each of the methods that I told you about. So the first part is this deterministic state evolution convergence. There's nothing random in this statement: this two dimensional state evolution map S-gor converges super linearly with exponent lambda equals three halves to the level epsilon equals sigma times root d over n. This level is the parametric rate, so this is the statistical rate that you would expect. The second part says that, now that we've established this deterministic guarantee on the state evolution convergence, we can transfer it into an actual convergence guarantee for the empirical iterates via this envelope guarantee, which is that uniformly over the number of iterations one through capital T these two errors are quite close, up to a fluctuation of one over n to the one fourth. And the final part is exactly what I stated on the previous slide, which is that if I randomly initialize my method I end up in this locally converging region.
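The transient phase from a random initialization can be illustrated on the deterministic map alone. The sketch below starts at the state (1 / sqrt(d), sqrt(1 - 1/d)), iterates an assumed closed-form state evolution for alternating minimization in phase retrieval, and counts the steps until the state enters a locally converging region. The thresholds defining that region (`alpha_min`, `ratio_max`) are hypothetical constants chosen for illustration, as are lam and sigma.

```python
import math

def am_map(alpha, beta, lam, sigma):
    """Assumed deterministic state evolution step for alternating minimization."""
    phi = math.atan2(beta, alpha)
    a = 1.0 - (2.0 * phi - math.sin(2.0 * phi)) / math.pi
    b0 = (2.0 / math.pi) * math.sin(phi) ** 2
    extra = max(1.0 + sigma**2 - a**2 - b0**2, 0.0) / (lam - 1.0)
    return a, math.sqrt(b0**2 + extra)

def iterations_to_region(d, lam=20.0, sigma=0.0,
                         alpha_min=0.5, ratio_max=0.2, t_max=200):
    """Steps until alpha >= alpha_min and beta / alpha <= ratio_max."""
    alpha, beta = 1.0 / math.sqrt(d), math.sqrt(1.0 - 1.0 / d)  # random-init scale
    for t in range(t_max + 1):
        if alpha >= alpha_min and beta / alpha <= ratio_max:
            return t
        alpha, beta = am_map(alpha, beta, lam, sigma)
    raise RuntimeError("did not reach the region within t_max steps")

counts = {d: iterations_to_region(d) for d in (10**2, 10**4, 10**6)}
print(counts)
```

Near the top left corner the ratio alpha over beta grows by a roughly constant factor per step under this map, which is why the count grows like log d rather than polynomially in d.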
Okay, so that's all I wanted to say, happy to take any questions. Great question. Okay, so this, so might be asked if empirically you observe any difference between resampling and not resampling or if this is just, yeah, this is just, it's technical, yeah. So there's two, I plot here one case for first order methods there is a difference. So for first order methods there is actually a fairly extreme difference which is that the convergence rate can be quite different between the two. They'll both converge linearly but the resampled method will all, its iteration complexity will always be faster and in a non-negotiable way. It's slightly different for higher order methods which is kind of interesting. So in this simulation here, or maybe, so you can say I'll go on this one. I plot three different things. So this blue curve, these blue triangular marks, these are resampling where I take the ratio at each iteration to be 20. So I think in this I fix D to be 600 and then at each iteration I took 20 times 600 number of samples. That's these blue triangular marks. These orange triangular marks, I fix the total sample complexity to be 20 times D and I reused all of those samples. So these orange triangular marks, they seem to adhere, they're not exactly the same as the prediction but they end at the same level. So the error achieved is the same. And at least the trajectory seemed to be quite similar but not exactly the same. And just for completeness, this bottom pink curve is if you match the sample complexity of the resampled method, so you take the sample complexity to be 160 times the dimension, then you get a much lower statistical error and it doesn't actually match. So my understanding is that, okay, this is something we can talk about also offline because this is ongoing work too. So I have a little bit more to say about this. But my understanding is at least that for higher order methods, this resampling assumption isn't horrible. It does actually reflect what. 
So it depends on the state evolution map. So we would expect, yeah. So the way that you prove something like this, at least here, is that with the state evolution very close to zero, it doesn't matter if it's at zero or not, but very close to zero, you need to get geometric increase in the signal component of the state evolution quantity. So for all of them, yeah, you need something non-asymptotic, you need both. So the state evolution component is completely deterministic. So you need that two-dimensional map. That doesn't depend on it. No, the non-asymptotic piece there is that you need your fluctuation to be on the order of one over root d. So what you need to establish is that your empirical iterates, indeed the signal component of your empirical iterate, increases geometrically. In order to do that, you show that your deterministic state evolution update increases geometrically and that the fluctuation of your method is on the order of one over root d, so that the dominating piece is coming from this deterministic update, and that essentially drives it. So nothing fancy. Yeah, exactly. Yeah, yeah, yeah. Yeah, it does, it does. They still get to the same point, so the eventual error is the same. But if I take lambda close to, let's say I take lambda to be 2.2. Two is exactly the information-theoretic threshold, so things will become a little bit crazy there, but let's say 2.2. This gap does widen, but it doesn't widen too much and they end up at the same point. So for first-order methods, yes. So for first-order methods, yeah, it can, and in fact there is a paper by Michael Celentano, Yuchen Wu, and Andrea Montanari which shows that for a large class of first-order methods, you can reduce any member of that class to AMP, and then you can analyze the state evolution from there. So for first-order methods, that completely captures it.
For higher-order methods, it's a little bit less clear. There is a way to write down a state evolution without resampling, but it doesn't go, you might be able to do it via AMP, but because of the nonlinear nature of the updates, nonlinear in the random matrix X, it's not entirely clear how to