Thanks. I actually also added the word "optimization", because the talk ended up focusing a lot on the intersection of learning and optimization. So anyway, thanks for the introduction and thanks for inviting me. This talk is based on several works, some of them together with my student Yossi Arjevani, as well as works with Nati Srebro and Tong Zhang.

When we talk about distributed learning and optimization, we're talking about a situation where we want to do some standard learning or optimization task, but where the data is partitioned across several machines. This can come in various forms, from several cores on the same CPU, through several nodes in some cluster, up to a huge computing grid with machines possibly in different parts of the world. Distributed learning and optimization has become very popular in recent years, and there are two main reasons to consider this setting. One of them is when we have lots of data. As we all know, we live in the era of big data, and sometimes the data is so large that we cannot fit it on a single machine, so we need to distribute the data across several machines and do the learning based on that. This views the distributed setting as a constraint. The other situation, where distribution is actually an opportunity, is when we don't have one machine but K machines, and hopefully we'll be able to solve the problem faster.

There are several challenges when we want to do learning and optimization in a distributed setting. The first one is communication. No matter which of the scenarios we're talking about, whether it's cores on the same CPU or a geographically distributed computing grid, communication is always much slower than local processing: it takes much more time to send data between machines than for each machine to do something locally on its own data, usually by orders of magnitude. So we generally want algorithms which are distributed and which communicate, but communicate as little as possible, because it's expensive. The second challenge is how we parallelize the computation. The difficulty here is that many standard learning and optimization algorithms are inherently sequential in nature. If we look, for instance, at stochastic gradient descent, it is based on sequentially taking an example, doing some update, then another example, then another update, and a priori it's not clear how to take such an algorithm and make it work in parallel. The third challenge is that we want the result to be accurate. Obviously we don't want to suffer because we distributed the computation, so the output quality should resemble what we could get with a non-distributed algorithm.

There are many ways to formalize this setting. The setting that we're going to focus on is where we want to do something akin to empirical risk minimization, and where the function that we want to optimize is convex. We basically want to minimize some function F, which we can write as an average of F_i's. Each of these F_i's represents the average loss on one of the machines, and we want to minimize the average. Each machine in turn has n data points; for simplicity we're assuming each machine has the same number of examples, but everything I say can be easily generalized. And the functions are generally convex.
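To make the setup concrete, the objective described above can be written as follows (a reconstruction in my own notation; the symbol m for the number of machines and the loss symbol ℓ are not from the slides):

```latex
\min_{w \in \mathcal{W}} \; F(w) = \frac{1}{m}\sum_{i=1}^{m} F_i(w),
\qquad
F_i(w) = \frac{1}{n}\sum_{j=1}^{n} \ell\!\left(w;\, z_{i,j}\right),
```

where z_{i,j} is the j-th data point held by machine i, and each F_i is convex.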
We can talk about situations similar to non-distributed optimization: we can divide this into scenarios where the functions are maybe strongly convex, or smooth, or both, and we'll discuss each of these scenarios later on. In terms of the communication, we generally assume that communication takes place in communication rounds where machines can broadcast to each other, and the amount of communication corresponds to sending order of d bits per machine per round, where d is the dimension. The idea is that the machines can send each other, say, vectors or gradients, but not things like Hessians, which are d-by-d matrices. This corresponds to a big-data, high-dimensional learning scenario where d can be very large, and sending, say, d-by-d matrices is not feasible.

The main question in this setting is how we can optimally trade off between these three requirements: how do we get an accurate solution with as little communication as possible and with as little runtime as possible, ideally getting speedups through parallelization? Notice that in this talk, when we're talking about accuracy, I'm going to focus on optimization error: our goal is to minimize this empirical risk function. You can also talk about other goals; for instance, you can assume that the data is sampled i.i.d. from some distribution and your goal is to minimize the expected loss, or risk, and that puts things in a slightly different perspective. Again, there are many ways and many settings one can talk about here, but this is the setting I will focus on. Okay?

So, starting to make things a bit more concrete: in order to discuss the different results, we need to make various assumptions on how the data was distributed to the machines in the first place. What can we say about the relation between the local data sets? One scenario is when we don't assume anything: the data was partitioned in some arbitrary way. Maybe one machine has the positive examples and another machine has the negative examples. This is one setting. At the other extreme, we may assume that the data was actually partitioned at random: we had a bunch of data points that were just assigned uniformly at random to the machines. Then the situation from the algorithm designer's perspective is potentially improved, because now there are stronger relationships between the data across machines. For instance, we have various concentration of measure effects: the values or the gradients of the local functions of the different machines are close to each other. As we'll see later, this is something that we can utilize. Another setting that is interesting to talk about, which in some sense generalizes the previous two, is the delta-related setting, where we assume that the values or the gradients (or Hessians) of the local functions are within delta of each other at any point. Again, if the data is partitioned at random, you really have this kind of situation where delta is pretty small, but you can also discuss more general things here: maybe there are statistical similarities between the data points, but maybe it wasn't exactly partitioned at random. So in some sense it lies between the arbitrary and random partition scenarios.

So basically what we're going to do in this talk is to discuss each of these three scenarios and discuss both upper and lower bounds, mainly in terms of the amount of communication, though I'll also discuss the runtime.
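As a rough formalization of the delta-related setting (my reconstruction; the underlying papers state it in terms of gradients and/or Hessians, and the exact norm and constants may differ):

```latex
\sup_{w \in \mathcal{W}} \big\| \nabla F_i(w) - \nabla F_j(w) \big\| \;\le\; \delta
\qquad \text{for all machines } i, j,
```

with an analogous condition on Hessians when second-order information is relevant. Under a random partition, concentration of measure typically gives delta on the order of one over the square root of n, the number of points per machine, which is the sense in which the random-partition scenario is a special case.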
Regarding the random partition, the results there are actually going to rely on some very new results, which might be of independent interest, having to do with without-replacement sampling in stochastic gradient methods. I'll talk about that, but also point out how it gives something new for distributed learning with a random partition.

Okay, so let's start with the arbitrary partition scenario. We don't assume anything about the relationship between the functions of the different machines. Maybe the simplest baseline one can think of here is to just reduce it to standard first-order, non-distributed optimization. We can sort of ignore the fact that we are in a distributed scenario, where this F is an average of functions of different machines, and just have each machine compute gradients of this capital F. This requires one communication round: each machine computes the gradient of its local function, then they do a communication round to average. That gives us basically an oracle to compute gradients of big F, and now we can plug it into any kind of black-box first-order algorithm, for instance gradient descent. Then you get an algorithm where the number of communication rounds is the same as the number of iterations of this algorithm. So you can do gradient descent, and you can also do all the other things that people do in standard optimization: accelerated gradient descent, smoothing, and so on. That gives you upper bounds on the number of communication rounds you would need. If the local functions are strongly convex and smooth, you get a number of communication rounds that scales like the square root of the condition number, which here is of order 1/lambda, if the functions are lambda-strongly convex and smooth. You can also talk about functions which are non-smooth but lambda-strongly convex, just convex, and so on; you just derive it from the standard upper bounds for these algorithms.

On one hand, this is a very nice and simple approach. It's also almost fully parallelizable, because most of the time each machine just computes the gradient of its own local function. But it does require a relatively large number of communication rounds. When we do large-scale learning, problems in high dimensions, the lambda usually comes from an explicit regularization that we add to the problem, and for statistical learning reasons it's usually quite small; it actually decays with the amount of data that you have. So lambda is generally quite small, and then you may need to do many communication rounds, possibly also depending on epsilon, the desired accuracy.

Now, of course — yes? Just to be clear, everything is fully synchronized, and a communication round is one shot? Yes, it's a very simple, naive baseline that you can always do. Of course, as you might suspect, there are probably more sophisticated things you can do, and there has actually been a lot of work in recent years on algorithms for this situation, from ADMM to CoCoA, CoCoA+, and many others. But, at least in the worst case, do they actually improve on this simple baseline? And the answer, maybe surprisingly, is no. So again, at least in the worst case over, say, all strongly convex and smooth functions, you can't get something better in terms of the number of communication rounds, at least for a very, very large family of algorithms. Basically, these are algorithms which fall into the following template.
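Here is a minimal sketch of the baseline reduction just described: each iteration, every machine computes its local gradient, one communication round averages them, and a step is taken. The function names, the fixed step size, and the use of plain (rather than accelerated) gradient descent are illustrative choices, not something prescribed in the talk.

```python
import numpy as np

def distributed_gd_baseline(local_grads, w0, step_size, num_rounds):
    """Reduce distributed optimization to non-distributed first-order optimization.

    local_grads: list of callables, local_grads[i](w) = gradient of F_i at w.
    Each iteration costs exactly one communication round (the averaging step).
    """
    w = w0.copy()
    for _ in range(num_rounds):
        grads = [g(w) for g in local_grads]     # computed locally, in parallel, one per machine
        full_grad = np.mean(grads, axis=0)      # the communication round: average the gradients
        w = w - step_size * full_grad           # plain gradient step on F = (1/m) sum_i F_i
    return w
```

Any first-order method can be plugged in instead of the plain gradient step (accelerated gradient descent, smoothing, and so on), and the number of communication rounds then matches that method's iteration count.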
So each machine implicitly maintains some set of vectors w_j. Between communication rounds, what a machine can do is sequentially compute vectors which are basically either linear combinations of the vectors it has computed so far, or gradients of its local function at such a point, or even things like products of such points with local Hessians. And actually, it doesn't even have to be that the new point is in the span of these things; it can also be a linear combination of a point and its gradient, which, for instance, allows us to also consider algorithms that solve some kind of local optimization problem at each iteration. The machines can actually do this for as long as they want, as far as the lower bound I'm going to show is concerned. During a communication round, they can basically share some of the vectors they have computed. Okay. I don't know if this covers every possible imaginable algorithm for the problem, but it does cover the kinds of reasonable approaches that at least I can think of.

Yes? Why do you need gamma to be positive? Yeah, this is just for technical reasons. The point is that if gamma is negative, it means that you may be able to solve, at every round, local optimization problems which are non-convex, and this is actually something that would break the lower bound. But if you limit yourself to algorithms which are based on convex optimization, then you basically have these factors being positive. So you need gamma and nu to have the same sign, is that it? Yes. Yeah. Okay.

And I'll show you the proof idea; it's actually very simple. I'm going to focus on the case where we have just two machines. The local function of each machine will be just a quadratic function; only one of them will have a linear term, where e_1 is the first standard unit vector, and A_1 and A_2 are two matrices of the following form: they are sort of block diagonal, with overlapping, interlocking blocks. So what's the idea? Consider the first machine before any communication has happened. Because of the linear term, it can compute a vector with a non-zero value in the first coordinate, but before communicating, it won't be able to manufacture any vectors with non-zero values except in the first coordinate. Once a communication round happens, the machine will be able to make the second coordinate, actually even the third coordinate, non-zero, but then it again gets stuck. Because of the structure of these matrices, the number of communication rounds limits how many coordinates can be made non-zero in the vectors the machines compute. But the optimum of this problem actually requires all coordinates to be non-zero, and if only the first few coordinates are non-zero, that gives us a lower bound on the error. So after t communication rounds, the error can be no smaller than something exponentially small in t divided by the square root of 1/lambda, and that basically gives the lower bound for the strongly convex and smooth case. As some of you might recognize, if you look at what the global function, the average of F_1 and F_2, looks like, this is essentially the kind of hard function that has been used to prove lower bounds for first-order algorithms in the non-distributed optimization setting. The construction is the same, but here we make different structural assumptions. Again, as I said on the previous slide, the machines can compute local Hessians and multiply things by them.
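The exact matrices are in the paper; the following toy snippet is my own rough illustration of the mechanism, not the precise construction. It splits a chain-like quadratic coupling between the two machines so that machine 1's matrix only couples coordinate pairs (1,2), (3,4), ... and machine 2's only couples (2,3), (4,5), ...; the block values, the ridge term, and the dimension are arbitrary choices made just to show how the set of reachable non-zero coordinates grows by only about one coordinate per communication round.

```python
import numpy as np

d = 8

def interlocking(offset):
    """Block matrix coupling coordinate pairs (offset, offset+1), (offset+2, offset+3), ..."""
    A = np.zeros((d, d))
    for i in range(offset, d - 1, 2):
        A[i:i + 2, i:i + 2] += np.array([[1., -1.], [-1., 1.]])
    return A + 0.01 * np.eye(d)   # small ridge, so the local quadratic is strongly convex

A1, A2 = interlocking(0), interlocking(1)   # the two machines' quadratic forms

def local_closure(A, supp, has_linear_term=False):
    """Coordinates a machine can make non-zero with arbitrary local work:
    spans of known vectors, products with its own matrix, and (for the machine
    holding the e_1 linear term) the vector e_1 itself."""
    supp = set(supp) | ({0} if has_linear_term else set())
    while True:
        new = supp | {j for i in supp for j in np.nonzero(A[i])[0]}
        if new == supp:
            return supp
        supp = new

shared = set()   # coordinates both machines currently know how to make non-zero
for round_ in range(6):
    reach1 = local_closure(A1, shared, has_linear_term=True)
    reach2 = local_closure(A2, shared)
    print(f"round {round_}: machine 1 reaches {sorted(reach1)}, machine 2 reaches {sorted(reach2)}")
    shared = reach1 | reach2   # the communication round: machines broadcast their vectors
```

Running this shows the reachable support growing by roughly one coordinate per communication round, which is the whole content of the argument: in the real construction, reaching all coordinates (and hence a small error) forces a number of rounds that grows with the dimension, which is chosen as a function of lambda and epsilon.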
That's fine, and the lower bound would still hold. You can also do similar things to get results for, say, non-smooth functions. The basic idea is the same, but the construction is different: without smoothness, we again create two functions with this kind of interlocking structure, but now it's not a quadratic form, it's built from absolute values. And we get a 1/(lambda t^2) lower bound, which is matched if you run the simple baseline I talked about, specifically with accelerated gradient descent and proximal smoothing. Okay, any questions about this? Okay.

So next, we'll discuss the delta-related setting, which, as I said, is a situation where I'm not going to assume that the data was necessarily partitioned at random, but I will still assume that the functions have similar values, gradients or Hessians at any point in the domain. You can actually give lower bounds very similar to the ones I showed earlier, but where you now have these delta factors which make the lower bounds weaker. And the question is, can we get upper bounds? Can we get algorithms which utilize the delta-related setting and require less communication, in a way which depends on this delta? Yes? Is it surprising that the lower bound does not depend on m? Is it obvious, or is it just a fact of life? Well, I talked just about the m equals 2 scenario. If you talk about more machines, then things become a bit different. Might the matching upper and lower bounds still depend on m? Yes, that could be. These bounds do not depend on m, but yeah, in general, at least in this talk, I think of m, the number of machines, as being a constant. But I agree that it's a good question to understand what happens when it's not.

So again, for the delta-related setting, can we get algorithms that exploit it? Maybe a different way to think about this question: when we have a lot of data that we distribute between the machines, and we have concentration of measure effects, the delta actually becomes smaller and smaller. So in some sense, this is a situation where, by having more data, we can maybe reduce the amount of communication that our algorithm needs, because delta becomes smaller. I'm going to talk about one algorithm which does have this kind of nice dependence on delta; there have been follow-ups to it since, which I'll briefly mention at the end. The algorithm is called DANE, short for Distributed Approximate Newton-type method. For those of you who know the ADMM algorithm, the structure is very similar: it's an iterative algorithm where each machine solves a local optimization problem of the following form, and then the machines communicate to average gradients and average solutions. So after solving the local optimization problem, they compute the average, and this is the entire algorithm.

What is the intuition here? The crucial property is that this algorithm is essentially equivalent to doing an approximate Newton step. What is a Newton step? If the problem is sufficiently smooth and we have Hessians, then one of the classical ways to do optimization is to iterate something like the following. This has very fast convergence, quadratic in general, but we can't implement it here because it requires us to compute and invert Hessians, and as we said, computing and communicating Hessians is pretty expensive. Now, in our setting, an equivalent way to write the Hessian is as the average of the local Hessians.
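For readers following along without the slides, the two displayed objects being referred to are roughly the following (reconstructed from the DANE paper's description; eta and mu are the algorithm's step-size and regularization parameters, and the exact form on the slide may differ slightly):

```latex
\text{Newton step:}\quad
w^{(t)} = w^{(t-1)} - \big(\nabla^2 F(w^{(t-1)})\big)^{-1}\nabla F(w^{(t-1)}),
\qquad
\nabla^2 F(w) = \frac{1}{m}\sum_{i=1}^{m}\nabla^2 F_i(w);

\text{DANE local step:}\quad
w_i^{(t)} = \arg\min_{w}\;
F_i(w) - \big(\nabla F_i(w^{(t-1)}) - \eta\,\nabla F(w^{(t-1)})\big)^{\!\top} w
+ \frac{\mu}{2}\,\big\|w - w^{(t-1)}\big\|^2,
\qquad
w^{(t)} = \frac{1}{m}\sum_{i=1}^{m} w_i^{(t)}.
```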
And it turns out that, at least for quadratic functions, DANE is equivalent to taking steps which are not of this form, not exactly a Newton step, but of this form — ignore the u_i for now. In the Newton step we have the inverse of the average of the Hessians, whereas here we have the average of the inverses of the local Hessians. I should emphasize that the algorithm doesn't explicitly compute these Hessians; even if the dimension is huge, you don't need to store d-by-d matrices. Rather, implicitly, by solving these local optimization problems and averaging, this is essentially what you do. You don't solve the Newton step, but you solve a full local problem which is harder than the Newton step, no? No, because — okay, so I solve this local problem. It's complex. Well, it depends: I do have regularization here, so I could use things like SAG or SVRG or SDCA. You could also solve the Newton step by conjugate gradient, with a fast algorithm as well. Yes, but to do conjugate gradient, the number of iterations you would need would scale either with the dimension or with the square root of the condition number. So you can do that, but then the number of communication rounds would scale with the condition number, and you don't get this improvement in the number of communication rounds.

Okay, so this thing and this thing are not the same, because we swap the order of the inverse and the average. But the point is that, remember, we're talking about a situation where these Hessians are similar — we're in the delta-related setting. If they were exactly the same, there would have been no difference between this term and this term. In the delta-related setting they are different, but just a little bit, and the difference is quantified by this delta, which allows us to give a convergence guarantee, which is basically the following theorem. The idea is that in every iteration we shrink the distance to the optimum by something which depends on H-tilde-inverse and H, where H is the actual Hessian and H-tilde-inverse is the average of the inverse local Hessians. Again, if all the local Hessians are exactly the same, H is just the inverse of H-tilde-inverse, their product is the identity, and then setting eta to 1 I actually get convergence in a single iteration. In the more realistic setting where they're only delta-related, this quantity won't be exactly zero, but rather something that depends on delta and on lambda, the strong convexity parameter. Overall, you get an algorithm where the number of communication rounds is logarithmic in the required accuracy epsilon and depends on the square of delta over lambda. So if delta, for instance, is as small as lambda, the number of communication rounds is just logarithmic in the accuracy and independent of everything else.

Just to give you an illustration of this algorithm: this is on synthetic data, although we also did some experiments on real-world data. Here we compare DANE to maybe one of the most popular algorithms for this problem, namely ADMM. The left column is for four machines, the right column is for 16 machines, and the different lines correspond to different amounts of data which were randomly partitioned between the machines.
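To make the "average of inverse Hessians" picture concrete, here is a small sketch of the DANE iteration specialized to local quadratics F_i(w) = ½ wᵀA_i w − b_iᵀw, where the local subproblem has a closed form. The parameter defaults (eta = 1, mu = 0) and the function names are illustrative; in general the subproblem would be solved approximately with a fast stochastic solver rather than a matrix solve.

```python
import numpy as np

def dane_quadratic(As, bs, w0, num_rounds, eta=1.0, mu=0.0):
    """Sketch of DANE on local quadratics F_i(w) = 0.5 w'A_i w - b_i'w.

    Per round: one communication to average local gradients, a local subproblem
    on each machine (closed form for quadratics), one communication to average
    the local solutions. For quadratics the net effect is the approximate Newton
    step  w <- w - eta * mean_i[(A_i + mu*I)^{-1}] @ grad F(w),
    i.e. the average of inverse Hessians instead of the inverse of the average.
    """
    d = len(w0)
    w = w0.copy()
    for _ in range(num_rounds):
        grad = np.mean([A @ w - b for A, b in zip(As, bs)], axis=0)   # round 1: average gradients
        local_solutions = [
            # argmin_v F_i(v) - (grad F_i(w) - eta*grad)' v + (mu/2)||v - w||^2
            w - eta * np.linalg.solve(A + mu * np.eye(d), grad)
            for A in As
        ]
        w = np.mean(local_solutions, axis=0)                          # round 2: average solutions
    return w
```

When all the A_i are equal, the step is exactly a Newton step and the method converges in one round with eta = 1; when they are only delta-related, the mismatch between the two orderings of inverse and average is what the (delta/lambda)² bound controls.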
So you can clearly see that for the DANE algorithm, as each machine gets more and more data, the relatedness of the local functions becomes stronger, delta shrinks, and indeed the number of communication rounds that you need decreases — here the x-axis is the number of communication rounds and the y-axis is the log of the optimization error. In contrast, ADMM doesn't utilize relatedness between the local functions, so even if each machine gets more and more data, the convergence rate remains the same.

Now, the guarantees I talked about are just for quadratic functions. We can also provide some guarantees for non-quadratic functions, but they are a bit weaker, so I won't discuss them. As I said earlier, there have been some follow-ups to this work since: for instance, Yuchen Zhang and Lin Xiao had a very nice paper last year where they proposed a different and somewhat more sophisticated algorithm called DiSCO for the same setting as ours, where they improved the dependence on the ratio between delta and lambda — we had delta over lambda squared, they have just the square root of delta over lambda. So these are very nice algorithms. One thing that should be kept in mind about them is that they're still not necessarily very cheap in terms of runtime, because we still need to solve some local optimization problem at every round, which in practice can be a little bit expensive. So in terms of the amount of communication it's generally small, but in terms of runtime it might not be the best possible. Another thing I should point out is that the kind of analysis we currently have is for quadratic functions; for the algorithm of Zhang and Xiao, they were able to extend it to self-concordant losses under some assumptions, though the guarantees are slightly weaker. We still don't have an algorithm for the delta-related setting for, say, general strongly convex and smooth losses, at least one where you get a good dependence on delta. Yes?

Between this method and the one you just presented, the difference is like a power of four, right — you have a square and there they have a square root. What's the gain of, say, the one you're presenting over that one, if there is any? Well, in practice the difference between them is not necessarily that dramatic. I'm talking about this algorithm because it's the algorithm I worked on, and it gives a simple way to take advantage of the delta; in many cases their algorithm could work better, but I prefer to talk about my own work rather than someone else's. Do you think you could get away with just smoothness and not self-concordance? Yeah, it's a good question. The difficulty is that both our algorithm and their algorithm are very similar to quasi-Newton methods, and to do the analysis correctly for any kind of algorithm from the Newton family, it's very difficult to get something satisfactory without assuming something like self-concordance. In practice, I don't think it's really necessary in terms of performance — both of these algorithms you can easily run on anything — but the analysis, I don't know how to do. If it's not smooth, you might not even have a Hessian. But for, say, a function which is smooth but nothing more, can you show it does not converge in some situations? That's a good question. I don't know, actually. In practice it does work, but I'm not sure we know how to analyze it.
So if for the standard Newton algorithm we don't know how to analyze it, then in this kind of distributed setting it's only more difficult.

Okay, moving to the last scenario I'll discuss, maybe the easiest one in some sense: the data is in fact randomly partitioned between the machines. Again, this is a special case of the delta-related setting where delta is something like one over the square root of the number of data points per machine, but can we utilize the fact that it's a random partition to get even better results than in the delta-related setting? Just to translate the results of the previous algorithms: if we try to understand the regime where we can get a small number of communication rounds, something which is just logarithmic in the required accuracy, then the previous algorithms give you that as long as the strong convexity parameter is at least one over the square root of n. This is okay in many cases, but not always, because again, this lambda, the strong convexity parameter, often comes from some explicit regularization that we add, and it often decays with the number of data points. Usually in the literature what you see is something between one over the square root of the number of data points and one over the number of data points. So this covers one extreme of that regime, but many times we do want to use smaller regularization, and then the number of communication rounds is not as good. In contrast, in the random partition scenario, I'm going to discuss a much simpler approach, also in terms of the algorithm, that does allow you log of one over epsilon communication rounds as long as lambda is of order one over n, at least up to log factors. So we get this nice behavior in terms of communication for a much broader choice of the regularization parameter lambda.

But to explain this approach, I will need to take a detour and talk a bit about without-replacement sampling for stochastic gradient methods, after which we'll return to the distributed setting. So forget about the distributed setting for now; we just have some function which is the average of many individual functions, and we want to optimize it. A very popular family of algorithms for this is stochastic gradient methods. If we do something like stochastic gradient descent or the subgradient method, what we basically do is that each time we sample one of these functions f_i, compute its gradient or subgradient at the current point, take a step along that direction, and project back to the domain if needed. Now, the standard analysis assumes that these i_t's, these indices, are sampled uniformly at random and independently from 1 to n, and that works because then each g_t is an unbiased estimate of a gradient of the function I actually want to optimize, F(w). But there's actually a certain theory-practice discrepancy here, because in practice it's quite often much better to do sampling not with replacement but without replacement: if I already sampled some index, I don't sample it again. A different way to think about it is that I pick some permutation over the indices uniformly at random and then just go over the data according to that order. So I just do a random shuffle of the data and go over it.
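A minimal sketch of the two sampling schemes being contrasted, in the notation of the last paragraph; the fixed step size is an arbitrary choice, and the projection step mentioned in the talk is only indicated by a comment.

```python
import numpy as np

def sgd(grad_fs, w0, step_size, num_epochs, with_replacement=True, seed=0):
    """(Sub)gradient method on F(w) = (1/n) sum_i f_i(w).

    with_replacement=True : indices drawn i.i.d. uniformly (the standard analysis).
    with_replacement=False: a fresh random shuffle per epoch, then a sequential pass.
    """
    rng = np.random.default_rng(seed)
    n, w = len(grad_fs), w0.copy()
    for _ in range(num_epochs):
        if with_replacement:
            order = rng.integers(0, n, size=n)   # i.i.d. indices, repeats allowed
        else:
            order = rng.permutation(n)           # each index visited exactly once
        for i in order:
            w = w - step_size * grad_fs[i](w)    # step along the sampled (sub)gradient
            # (projection onto the domain would go here in the constrained case)
    return w
```

The without-replacement variant is also what you get "for free" when data is read sequentially from disk after a single shuffle, which is part of why it is so common in practice.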
And maybe I can then reshuffle the data and go over it again. Not only does it often work better in practice, giving faster convergence, it's also often much easier and faster to implement, because going over the data in sequential order — due to caching effects, or when the data resides on a slow storage device — is much faster than random access. Intuitively, this without-replacement sampling works better because I am sort of forcing the algorithm to process all the data equally; with replacement sampling, that only happens on average. But it turns out to be very difficult to analyze stochastic gradient methods when we sample in this way, because now the updates are correlated: I no longer pick the indices independently of everything before. There has been a little bit of work in this direction. There are classical results for incremental gradient methods, basically convergence bounds which work no matter in which order I go over the data; but since no randomness is assumed, the bounds are much weaker — you can actually show that in some cases they can be exponentially slower than with-replacement sampling. Very recently there was a very interesting work by Gürbüzbalaban, Ozdaglar and Parrilo which analyzed stochastic gradient descent for strongly convex and smooth problems and showed that if you do sufficiently many passes over the data with without-replacement sampling, then eventually you do get a small error. So as k, the number of passes, gets larger, you get a decrease which is like 1 over k squared, versus 1 over k in the with-replacement setting. However, they also have a very strong dependence on the number of data points: just to make the bound non-trivial you need to do at least n passes over the data, and to be better than with replacement, k has to be something like n cubed. This is a little bit unsatisfactory, because the case where we want to use stochastic gradient methods to begin with is when we don't want to do many passes over the data. If you're willing to do many passes, there are actually much better methods — just do plain gradient descent, or accelerated gradient descent, or fast stochastic methods. So in a situation where k is small, or maybe even one, these results unfortunately don't tell us much.

So what I'm going to talk about next is some new results which give an analysis for stochastic gradient methods with without-replacement sampling. Our goal is a little bit more modest, in the sense that we won't show that without replacement is strictly better, but at least we show that it's not worse in a worst-case sense. So again, considering scenarios like strongly convex and smooth functions, or just convex functions, we get the same kind of convergence rates as in the with-replacement sampling case. We cover various scenarios, either convex or strongly convex and smooth, and also an analysis for a without-replacement version of the SVRG algorithm, which is the one that will relate later on to distributed learning. I'll explain a little bit how this kind of analysis works. It basically uses ideas from stochastic optimization, but also from adversarial online learning and transductive learning theory; if you're not familiar with these, don't worry, I'll explain as we go along.
Okay, so the simplest case to explain is the situation where the functions f_i are just convex and Lipschitz, and we look at an algorithm which sequentially processes these functions according to some random permutation, producing iterates, and our goal is to prove that, in expectation, the average suboptimality of the iterates is order of one over square root of T. Based on this you can argue that if you pick a single w_t, where t is chosen uniformly at random, then in expectation this is the bound you get, or you can take the average of the w's. Basically, this is the kind of convergence bound we would like. I'm not actually going to talk about a particular algorithm: all I require is that the algorithm has a regret bound in the adversarial online learning setting. That's a setting where I make no statistical assumptions about these functions — they're arbitrary, though still convex and Lipschitz — and I want the average loss of the points w_t that the algorithm produces to be only order of one over square root of T worse than the average loss of any fixed point w in my domain. This holds, for instance, for stochastic gradient descent, but also for other algorithms.

And the proof idea I can basically show in two slides; I think I have time to go over it. It's the following. This is the quantity we want to bound. I'm using the fact that this sigma, this permutation, is chosen uniformly at random, so in expectation f_{sigma(t)} is just the big F. I add and subtract terms, and then I apply the regret bound that I assumed; this allows me to upper bound this whole thing by one over square root of T. The second term I write a little bit differently and do some simple algebraic manipulations, and what I end up with is this bound, where what I have here is the expected difference between the average loss of w_t on the losses seen so far, minus the average loss on the losses not yet seen according to the permutation. Now I'm going to do something which might appear very loose: I upper bound this expectation by the expectation of the supremum over every possible point w in my domain. Why do I do this? Because what I'm basically asking, when I try to bound this expression, is: I had my data, I fixed some t and randomly partitioned the data into a group of size t minus one and a group of the rest, and I ask, for any given point w, how large can the difference in average loss between these two groups be? Because of concentration of measure and uniform convergence effects, this is generally small. And actually, this is exactly the quantity that has been studied in transductive learning theory. In transductive learning we have some fixed data set which is split into a training set and a test set, and you can ask what the difference is between the empirical risk on the training set and the risk, the average loss, on the test set. There is an entire theory developed exactly for these things; in particular, it can be shown that you can upper bound this expectation of the supremum by a transductive version of Rademacher complexity, which depends on t and N. Here capital N is the total number of data points and T is the number of iterations the algorithm does. So you sample a single permutation and you do T steps? Yes. And here I'm assuming that T is less than or equal to N.
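Schematically, the decomposition just described looks something like the following; this is my reconstruction of the slide from the verbal description, with constants dropped and the indexing only approximate, so treat it as a sketch rather than the exact statement.

```latex
\mathbb{E}\!\left[\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^\star)\right]
\;\lesssim\;
\underbrace{\frac{\mathrm{Regret}_T}{T}}_{\text{online learning}}
\;+\;
\frac{1}{T}\sum_{t=1}^{T}
\mathbb{E}\!\left[\,\sup_{w \in \mathcal{W}}
\left(\frac{1}{t-1}\sum_{s < t} f_{\sigma(s)}(w)
\;-\;
\frac{1}{N-t+1}\sum_{s \ge t} f_{\sigma(s)}(w)\right)\right].
```

The first term is the assumed adversarial regret bound; the second, the gap between the average loss on the "seen" and "unseen" parts of a random split of the data, is exactly what transductive Rademacher complexity controls.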
I do just one random shuffle of the data. Actually, everything I do here can be generalized to a situation where you do repeated reshufflings. But then T will be larger than N. Yeah, so actually this term would dominate that term, but that's okay, because I just want to end up with a bound which is like one over square root of T. Do you need T to be less than N? No, it could actually equal N, that's fine. So you could do a full pass. Yes. Okay, so you want to be more general, so you also allow less than a full pass. I mean, the analysis allows me to do that.

Okay, so this kind of expectation of a supremum you can upper bound by the transductive Rademacher complexity of the domain W, and then you can instantiate it for various cases. Maybe the simplest one is linear predictors with bounded norm and losses which are convex and Lipschitz: we get the one over square root of T rate, and you can also show that all the parameters hidden in the O notation — the norm of the predictors and the Lipschitz constant — are all correct. You get the same thing as in the with-replacement case, up to maybe constants. You can also do a more sophisticated version of this analysis to get, say, the one over lambda T rate with lambda-strong convexity. There you do need to work harder, because uniform convergence doesn't give you one over T rates, but it turns out you can do a trick to get around it; I won't have time to go into the details.

To start getting back to the distributed setting, I want to focus on the results for the SVRG algorithm, which belongs to a family of algorithms, most of them developed over the past few years, including by members of the audience here, which are exactly targeted at solving optimization problems of this finite-sum form. They have cheap stochastic iterations, like stochastic gradient descent, but their convergence rate is linear, so to get epsilon accuracy the number of iterations only scales logarithmically with epsilon. All the analyses I'm familiar with strongly use with-replacement sampling; we instead consider without-replacement sampling, and we picked in particular the SVRG algorithm. The standard with-replacement version has the following form. It's a very simple algorithm that works in epochs: in each epoch we compute one full gradient, the gradient of the function we actually want to optimize, and then we do T stochastic iterations, where each time we pick one individual loss uniformly at random and do an update of this form. In expectation it still corresponds to the gradient of F, but these correction terms ensure that the variance, the noise in the update, gets smaller and smaller over time. The standard analysis says that you basically need to do log of one over epsilon epochs overall, and in each epoch the number of stochastic iterations needs to be at least one over lambda. Now, in a recent paper, Jason Lee, Lin and Ma made the very nice observation that you can take this algorithm and apply it basically as-is for distributed learning. In distributed learning, the individual functions f_i are distributed across different machines, but the machines can still simulate this algorithm: each time, they do a communication round to compute the full gradient, and then each machine runs these cheap iterations on a subset of its data.
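For reference, the stochastic update inside each SVRG epoch, with reference point \tilde{w} and the pre-computed full gradient \nabla F(\tilde{w}), is the standard variance-reduced step:

```latex
w_t \;=\; w_{t-1} \;-\; \eta\,\Big(\nabla f_{i_t}(w_{t-1}) \;-\; \nabla f_{i_t}(\tilde{w}) \;+\; \nabla F(\tilde{w})\Big),
```

where in the standard analysis i_t is drawn uniformly with replacement; the variance of the step direction shrinks as w_{t-1} and \tilde{w} approach the optimum, which is what drives the linear convergence.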
Now, there is a difficulty here, because the algorithm requires with-replacement sampling, and we're talking about a situation where the data was partitioned at random, which doesn't correspond to sampling with replacement. But at least as long as the number of iterations is — sorry, this should be square root of n — as long as you sample less than the square root of the total number of examples, then by the birthday paradox, with-replacement and without-replacement sampling are more or less the same, so this would work. When I take this constraint and plug it into the analysis, it means that we get in this way an algorithm for distributed learning and optimization; however, it's only applicable when the strong convexity parameter is at least one over the square root of n, which as we said earlier is quite restrictive. What you can do instead is simply run the same algorithm, but this time use without-replacement sampling, which fits much better with the randomly partitioned data that we deal with. So it's exactly the same as on the previous slide, but instead of picking an individual loss independently each time, we fix some permutation over the data and do the updates according to this permutation. And now this is something that I can actually simulate with randomly partitioned data, with T here all the way up to the order of the number of data points, again maybe up to log factors.

So you do less than a single pass over the data? Yes, the point here is that I have these expensive full gradient calculations, which require a full pass over the data, but the number of stochastic iterations I need to do is actually less than the size of my data. When we did, say, a fixed random permutation and several passes, it can diverge quickly unless the step sizes are much smaller. Yes — I'm going to talk about the analysis in a moment, but it's very important that if you do several passes over the data, you reshuffle the data each time, otherwise the analysis doesn't work. And I think this was also noticed: with these algorithms, if you don't re-permute each time, they can converge poorly or not at all.

Okay, is the basic idea of the algorithm clear? Okay. So this is an algorithm you can apply to anything, but what can we say in terms of rigorous guarantees? Currently we can give a bound for without-replacement SVRG, but only for regularized least squares; this is for technical reasons, as far as I can discern, but it's what we can currently do, and it's still an important setting. Again, using the same kind of parameter choices — log of one over epsilon epochs, and one over lambda without-replacement stochastic iterations per epoch — we get a similar kind of convergence rate as in the with-replacement case.
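Here is a small sketch of the without-replacement SVRG variant just described. It is not the exact algorithm or parameter setting from the paper: the step size is fixed, the per-epoch reshuffle shown here is what you would do in the single-machine case, and in the distributed simulation the permutation is simply the order induced by the random partition, while the full gradient is computed with one communication round.

```python
import numpy as np

def svrg_without_replacement(grad_fs, w0, step_size, num_epochs, inner_iters, seed=0):
    """SVRG with shuffled (without-replacement) inner iterations.

    grad_fs: list of callables, grad_fs[i](w) = gradient of f_i at w, F = (1/N) sum_i f_i.
    Each epoch: one full gradient at the reference point, then inner_iters steps
    over a fresh random permutation, so no data point is touched twice per epoch.
    """
    rng = np.random.default_rng(seed)
    N, w = len(grad_fs), w0.copy()
    assert inner_iters <= N, "inner iterations should not exceed a single pass"
    for _ in range(num_epochs):
        w_ref = w.copy()
        full_grad = np.mean([g(w_ref) for g in grad_fs], axis=0)  # full pass / communication round
        for i in rng.permutation(N)[:inner_iters]:                 # without-replacement order
            direction = grad_fs[i](w) - grad_fs[i](w_ref) + full_grad
            w = w - step_size * direction                          # variance-reduced step
    return w
```

In the regime discussed in the talk, inner_iters is of order one over lambda, so as long as lambda is at least of order one over the data size, a single shuffle (or the random partition itself) suffices and the number of epochs, hence communication rounds, is only logarithmic in the accuracy.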
So what this means in the context of distributed optimization is that, at least for regularized least squares, we can find an optimal solution. There are actually two implications here. For non-distributed optimization, it means that if we just want to run the without-replacement version of SVRG, we actually don't need to do any data reshuffling, all the way up to lambda being around one over the data size; again, this is good in situations where access to the data is expensive and reshuffling is not something you want to do too many times. In the context of distributed optimization, again when lambda is at least one over the data size, you get an epsilon-optimal solution with randomly partitioned data, and you only need a logarithmic number of communication rounds. Because of the structure of the algorithm, the runtime is actually dominated by the full gradient computation, which is fully parallelizable, because each machine can just compute its own gradient on its local data and only do an averaging at the end. So in terms of runtime you also get a speedup by using more machines.

Yes? So in your first application, you shuffle there once, or it's just random, and then you never change the permutation? Yeah, you just need to do it once. You do need to pass over the data several times, but you don't need to change the order; they're sequential passes. Because that seems to contradict — I guess the constant in front of the log of one over epsilon has some worse term; probably that's the slowdown that was mentioned, that you need a much, much smaller step size. No, so actually, again, the point is that when I say you don't need data reshuffling, it's because the number of stochastic iterations you do is not larger than the data size. You do more than one pass because in each epoch you need to compute the full gradient, but in the stochastic iterations you don't touch a data point more than once, and that is important; otherwise I don't know how to give this kind of guarantee. Between two epochs, do you need to change your permutation? No. So in this case, my guess, to reconcile this with the fact that empirically, when you don't change the permutation, you need to use a much smaller step size: empirically we were using something like SAG or SAGA, which does not recompute a full gradient, so at any given point it's like you start from scratch. Yeah, that's a good point. I mean, this is an analysis for SVRG; I don't have an analysis at the moment for SAG or SDCA, and the structure of those algorithms is also different. Here I'm really utilizing the fact that this algorithm uses full gradient computations and a small number of stochastic iterations. SAG or SDCA don't have a full exact gradient computation; instead they have more stochastic iterations, and that does break the analysis here, because it requires you, in the stochastic phase, to touch a data point more than once. Yes? Right, so if you don't reshuffle between epochs, that means there are some data points which you never see in your stochastic iterations? Yes, that is true — you see them in the full gradient computations. Yeah. Any other questions? Okay.

I think this is more or less the end, so to summarize: I talked about three scenarios in distributed learning and optimization and gave various kinds of results in each of them. Maybe the one we currently understand best, at least in terms of worst-case guarantees, is
the arbitrary partition case, where, maybe a bit disappointingly, the simplest baseline is also worst-case optimal. In the delta-related setting, the situation is that we can handle quadratic functions pretty well, maybe self-concordant losses, but we currently don't have algorithms, at least with provable guarantees, for generic, say, strongly convex and smooth functions, and the kind of algorithms we have are a bit heavy — they need to actually solve an optimization problem at every iteration. Finally, in the random partition setting, we currently have provable guarantees for least squares with the algorithm I presented based on SVRG. We still haven't experimented with it too much, but I suspect it would work more generally on strongly convex and smooth functions; we just don't have an analysis for it at the moment. Well, okay, actually it is possible to give some kind of analysis, but only in the situation where lambda is so large that there's actually no difference between with-replacement and without-replacement sampling, so it's not a very interesting result. So I think I'll stop here. Thank you very much.