Welcome everyone. It's a great pleasure to see so many people joining us for this TCS+. So before we start with, well, I'll say it afterwards maybe. So actually, let me just introduce everyone who's here. There's a lot of people, so I'm not sure I'll get everyone right, but I'll try. So there's Alexander joining us from UCB. I think we can't see you or hear you. I assume you're muted. Okay. Then we have Andrew joining us from the University of Wisconsin-Madison. Hi, guys. Then there is Budima joining us from EPFL. Welcome, guys. Then there is Chris joining us from Yale, I think. Yeah. Yep, Yale. Okay, cool. Clement is joining us from Stanford. Welcome, everyone. Then there is Kupien joining us from the University of Michigan. I'm sorry. Welcome. Emmanuel is joining us from the Max Planck Institute. Welcome, guys. So maybe this is the first time. I'm not sure. So I hope you enjoy it. Then Godam is joining us with a group just right next door, at MIT. Ilya is joining us from Columbia. Welcome. And Janish is joining us from Caltech. Welcome, guys. All right. So welcome, everyone. It's a great pleasure to have Jon Kelner give the talk today. Today is a somewhat special talk because, as you know, this talk is dedicated to the memory of Michael Cohen, who was a graduate student at MIT. Jon Kelner and Aleksander Madry, I believe, were both his advisors. And Michael is someone that we tried to get to talk in TCS+, and he was going to do it, but unfortunately, we didn't get to it. So Jon is giving the talk in his memory today. Let me also say that a couple of weeks from now, we'll have a talk by Sebastien Bubeck from Microsoft, who is also someone who worked with Michael. So it's really a pleasure to have Jon give the talk. Jon is an associate professor of applied math at MIT.
He also got his PhD from MIT, advised by Dan Spielman, and then he spent a little bit of time at the Institute for Advanced Study, and then he's been a professor at MIT since then. So Jon has done a lot of work all across algorithms, combinatorial optimization, and spectral graph theory. Maybe he's best known for some of his graph algorithms work, so solving systems of Laplacian equations and his work on max flow. But he has a whole range of other works. And so today, he's going to tell us about some recent work with Michael Cohen that was published at STOC on graph algorithms. So Jon, it's all yours. Thanks. Yeah, thank you very much for having me. So as Thomas mentioned, I'm going to be talking about some work with Michael Cohen and with several other people. So the main paper I'll be talking about today was with Michael, John Peebles, Richard Peng, Anup Rao, Aaron Sidford, and Adrian Vladu. And before I start, if there's any point where you have questions or want me to stop, just feel free to chime in if you can unmute yourself. I'm happy to take questions as we go along. And yeah, so today is going to be about work with Michael. And we've been eulogizing him a couple of times in a lot of these communities lately. So I think I'm not going to talk at length about it, but I just want to say that really, this is a huge loss for our community. The result I'm going to talk about today I think is a very exciting result. And perhaps, from an academic standpoint, one of the more surprising things about Michael's work is that this is just one slice of one slice of one slice of what he's done in his three years of grad school and before. So, you know, he really was a wonderful person and a wonderful researcher, and we're going to seriously miss him. We already do. Today I want to talk about some work that he worked on.
And it's actually a little funny, because Michael's original role in this work was basically to be kind of a dual oracle, where he would explain to us why everything we were trying wouldn't work and never would. And then eventually, when we convinced him to try to prove the result instead of disprove the result, things sped up quite a bit. And today I'm going to talk about one of our papers. This is in a sequence of three papers, but I'm going to focus on one of the three. Okay, so before I launch, hold on, let me just get my slides to advance. So before I launch into the real body of the talk, I want to give a sample application, because I think it's easy for me to bury the lead a little bit when talking about this, get caught up in high-level stuff, and kind of bury the point of one of the things that it accomplishes. What I want to give is just one application of this work that I think hopefully will convince you that it solves something interesting. And then in a second, I'll talk about the broader results. So the sample application I want to talk about is one of the very classic questions, maybe the classic question, in the theory of Markov chains. Namely, I give you a Markov chain, and you can think of that for this talk as a random walk on a directed graph G. And a classic, classic theorem says that if G is strongly connected and aperiodic, when you run a random walk for a sufficiently long period of time, the probability distribution converges to a stationary distribution pi. And finding pi is basically the first, maybe the fundamental, algorithmic question you could ask about a Markov chain. For undirected graphs, it's very, very easy to find, as I'm sure most of you know. You just basically write down the node degrees and divide by the total number of edges. However, in the directed case, all bets are off.
In the directed case, it's very possible that the stationary distribution has essentially nothing to do with the node degrees. And in general, it's a fairly complicated linear system. For example, I drew a graph here where you'll notice that the node degrees have nothing to do with the stationary probabilities that I've filled in. If you make a bigger version of this graph, you can see that they can vary wildly. Here is a classic graph where the stationary probabilities differ by an exponential factor. And in general, we know that we can find these because you can state the problem as a linear algebra problem. You can state finding the stationary distribution as a linear system, and then you can throw general linear algebra at it. So Thomas just disappeared. Can anybody see me still? Sorry, John, I'm here. Okay, good. So, yeah, the best previous algorithm was just general linear algebra. You can state this as a matrix problem, and then you can solve it by just throwing general linear algebra routines at it. So in general, I guess it's n to the omega, for whatever the matrix multiplication constant is at present. So omega right now is known to be less than 2.373. And that was it. So somewhat surprisingly, this is a question that's been investigated quite a bit, really in multiple communities. And in special cases, there are a whole bunch of results, and there are a bunch of heuristics. But really, the best algorithm for this in any kind of general setting was just to throw general linear algebra at the problem. And that was it. So this is a problem that's been around, I guess, since the late 1800s, and really all that was known was just to solve it with general linear algebra. And that's it. So one consequence of what I'm going to talk about today is going to be the first improvement over general linear algebra. And in fact, what we're going to do is go from n to the omega.
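To make the baseline concrete, here is a minimal sketch of the "general linear algebra" approach the talk describes: write the stationary-distribution condition as a linear system and hand it to a dense solver. The 4-node chain below is a hypothetical example, not the graph from the speaker's slide.

```python
import numpy as np

# A small strongly connected, aperiodic directed chain on 4 nodes, given
# as a column-stochastic transition matrix P: P[j, i] is the probability
# of stepping from node i to node j.  (Hypothetical example graph.)
P = np.array([
    [0.5, 0.5, 0.0, 1.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
])

def stationary_via_linear_algebra(P):
    """Solve (P - I) pi = 0 with the normalization sum(pi) = 1,
    by replacing one redundant equation with 1^T pi = 1."""
    n = P.shape[0]
    M = P - np.eye(n)
    M[-1, :] = 1.0      # replace the last (redundant) row with the normalization
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

pi = stationary_via_linear_algebra(P)
```

This dense solve is exactly the kind of n-to-the-omega (or n-cubed in practice) routine that the talk's result improves to almost linear time for sparse graphs.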
So, greater than n squared, you know, I think practically n cubed, to almost linear time. So today what we'll do is sketch the high-level picture of how to get an algorithm that will compute the stationary distribution of an arbitrary random walk in time O of m to the 1 plus little o of 1. So today, when I talk about almost linear, I'll mean something like this. When I mean m times polylogarithmic factors, I'll say nearly linear. And I should mention that in follow-up work after this paper, many of the same authors, including Michael, and also Rasmus Kyng, who wasn't on the previous paper, managed to improve this further, and we managed to turn the m to the 1 plus little o of 1 into a bona fide nearly linear time. Okay, so that's kind of an illustrative example. So hopefully that will convince you that there's something here. It will knock a factor of almost n squared, I guess, off the practical running time of algorithms that use these problems, for sparse graphs. And so, okay, that was my sort of broad, high-level example motivation. Let me take a big step back and talk about a more general overview motivation. So the real motivation that we had for this work came from the state of algorithmic spectral graph theory. And basically, algorithmic spectral graph theory, if you wanted to give a caricature of a definition of it, is something that exploits the relationship between three seemingly disparate objects. On the one hand, there are graph algorithms. On the other hand, there are Markov chains and random walks. And on the third hand, if you have three hands, are matrices and linear algebra. And exploiting this connection has led to huge advances in all three disciplines, both in theory and practice. I'm not going to run through all of them, but this has been one of the really great algorithmic success stories of the last couple of decades. And, you know, more recently, it's become a very hot topic because of what's been termed the Laplacian paradigm.
And what this has been is exploiting these connections, along with the fact that we can solve certain classes of linear systems associated to graphs very quickly, to really push on all three domains. And this has recently led to a slew of very fast algorithms for very classic problems in the theory of graph algorithms, such as max flow and min cut, multi-commodity flow, shortest paths, random spanning trees, etc. On the linear algebra side, the main tool, and one of the main consequences, I guess, of this connection, has been the ability to solve undirected Laplacian linear systems, and with them some generalizations like symmetric M-matrices, and with those tools, other linear algebraic questions like eigenvalues, eigenvectors, and heat kernels. And then on the Markov chain side, it turns out that you can take a lot of the very classic properties of a Markov chain for undirected graphs (stationary distributions are not particularly interesting, but other things like escape probabilities, hitting times, mixing times, etc.), state them as properties involving Laplacian linear systems, and then solve them too. Okay, so that's basically the state of the Laplacian paradigm for undirected graphs. And the motivation for this work is that essentially everything I just said really only applies in the undirected setting, with very, very few exceptions. So at a broad level, there's this gigantic gap across the board, in the spectral graph theory world and in this whole algorithmic setting, between what you can do for directed graphs and what we can do for undirected ones. So I want to mention a couple of key aspects of this gap. And the high-level goal that motivated this work was to try to close this gap. In some sense, we basically didn't have spectral graph theory algorithms for directed graphs.
So for the basic Markov chain problems, which I just motivated, for undirected graphs it's O tilde of m. And the way that works is that, for the stationary distribution, you can just write it down; it's linear time. For some of these other, more complicated problems, you can rely on the fact that we can solve these Laplacian linear systems in nearly linear time, and that lets you solve a whole bunch of other basic Markov chain problems in nearly linear time. For the directed case, however, again, as I said, for essentially all of the basic properties you would look for on a random walk on a directed graph, in the general case, all you can do is throw general linear algebra at it. So n to the omega, and arguably mn if you use conjugate gradient, but that's not really fully true if you worry about numerical precision. So this is on the Markov chain side. On the linear system and linear algebra side, there's a similar gap. It turns out that to any undirected graph you can associate a Laplacian, which I'll quickly define in a second. And for solving linear systems involving graph Laplacians, there are these highly celebrated works that let you do it in nearly linear time in the number of edges of the corresponding graph. However, for the directed case, as I said, it's just linear algebra: n to the omega. And for the graph problems, the situation is a little bit harder to state, but more or less, all of the recent progress in the Laplacian paradigm has, at least to some extent, been in the undirected world. Virtually all of the algorithmic results have relied on solving undirected Laplacian linear systems. There are a couple of results that involve directed graphs, but they did it through iterative machinery, through fairly involved analyses of interior point methods, and at their root they were really kind of working with undirected objects.
So if you really want to look at it precisely, essentially everything we've had in this Laplacian paradigm has been for undirected graphs, one way or another. And it's of course a very natural question whether any of this carries over to the directed setting. So if you were to try to trace back where this gap came from, really, at its root, the issue is in the state of spectral graph theory and the algorithms around it. In the undirected setting, we have a very polished theory, we have very powerful algorithms, we have a whole bunch of primitives that we can use that let us solve very complicated problems very quickly. In the directed setting, there's a huge gap: we have a much less well-formed theory and virtually none of the main algorithms. In particular, the key primitives that showed up throughout the Markov chain problems, the linear systems, and the graph problems were missing some of their basic fundamental pieces. In particular, most of the linear system solvers rely on graph sparsification. And all the Laplacian solvers do, except for the one or two that rely on low-stretch spanning trees, which are dual objects. And more broadly, for the Laplacian solvers themselves that we use, you can write down a directed analog of the question, but we don't have an answer. And so, kind of at the core of this, what's really going on with Markov chains and with the graph algorithm applications is that we have this big gap between what we can do in the undirected spectral world and what we can do in the directed one. And this has persisted for quite a while. And our goal here is going to be, at least to some extent, to start to close that gap.
And the results I want to talk about today are a sequence of three papers that Michael and I and the co-authors I previously mentioned wrote, in which we essentially eliminate the algorithmic gap for the Markov chain and Laplacian system settings. So I crossed off a large gap in all three, and it's at least a smaller gap now. And in two of the three, it's essentially no gap, up to polylogarithmic factors. So for all of the Markov chain problems that we're going to talk about, pick your favorite question about a Markov chain: hitting times, mixing times, stationary distributions, approximately. For all of these, it turns out we can now solve these problems in O tilde of m time. So instead of, you know, greater than quadratic, we can now do it in nearly linear time. Our main tool in doing that will actually be solving Laplacian systems, and we can do those in nearly linear time as well. And I should mention that the result we'll talk about today is going to be an almost linear algorithm. The improvement to nearly linear time is a follow-up paper that uses related techniques, but is a little bit more involved, and I won't talk about it too much today. And then the final thing I wanted to cross off was the word "broad" in "the broad discrepancy between the state of spectral graph theory in the two settings." I didn't want to cross off "discrepancy," because there's certainly still a large gap. But at the very least, we've shrunk it to some extent, particularly in terms of our algorithmic primitives. On the algorithmic side, what we're going to do today, and the way this whole thing is going to work, is we're going to show how to reproduce the basic primitives that have been so useful in the undirected setting. We're going to show that we can actually do them in the directed setting as well. So we're going to show how to sparsify graphs and solve Laplacian systems, and this will kind of be the root of all of our directed primitives.
Before I launch into that, are there any questions so far? Good. Okay. So just to state this a little more carefully, let me tell you what I'm going to cover today. Today, I'm going to describe a whole bunch of new algorithmic and theoretical tools for directed spectral graph theory. And I'm going to use these to get the first almost-linear-time algorithms for many of the central problems that you would ask about in the analysis of non-reversible Markov chains. I'm going to talk today about the almost linear ones and not about the nearly linear ones, because I think it makes a better self-contained talk. And the tools that we're going to introduce to do so are going to be, essentially, sparsification. To do sparsification, we'll need to know what it even means, and so we'll introduce a notion of directed graph approximation, and then, with it, a notion of sparsification. We'll then use this and a whole bunch of other stuff to solve directed Laplacian systems in nearly linear time. And then that will actually yield all of our algorithmic applications. In particular, this is the basic primitive one needs to find stationary distributions of random walks, personalized PageRank vectors, hitting times, mixing times, escape probabilities, to solve more general classes of linear systems with asymmetric matrices, and a couple of other things. So that's our plan. And I mentioned that for everything on this screen, previously the best algorithm was n to the omega. For all of these, we can now do them in nearly linear time, and today we'll do almost linear. Okay, so before I start, let's just lay down a couple of basic definitions that we'll use. As I said before, the paper is going to live in the intersection between graph algorithms and linear algebra. And so for that, we need to have a way of translating between the two, and for that, we're going to need to have some matrices we can associate to graphs.
And I'm going to go through a lot of these things a bit quickly, because I think most people have seen these before, but stop me if I go too quickly. Okay, so here's a graph. And the obvious matrix associated to it is the adjacency matrix. We can also make a diagonal matrix of the out-degrees of the vertices. So for example, the fact that the 2, 2 entry of D is 2 is because vertex number 2 has out-degree 2. We then define kind of our main object of study, which is the Laplacian of a directed graph. And the Laplacian of a directed graph we'll define as D minus A transpose. So in the undirected setting, you are probably used to seeing just D minus A, if you've seen these before. Of course, in the undirected setting, the adjacency matrix is symmetric, so it doesn't really matter. But it will have a more natural random walk interpretation if we work with A transpose instead of A. And then the other object: so the first two are sort of associating graphs and matrices; I now want to bring in Markov chains. And the way we'll do that is we'll look at the random walk matrix. So to get that, you're going to multiply on the right by D inverse, which is going to essentially normalize the columns of the matrix. And the property that the random walk matrix will have is that applying it to a vector, where you could think of the vector as a list of probabilities on nodes, will give you the probabilities after one step of the Markov chain, and after K steps if you apply W to the K. Okay. So that's our setup. And because what we're going to be doing is working with all the linear algebraic properties of these things, I want to just mention that already there's a crucial difference. And it's a difference that's going to look a little silly, but I claim that it's actually really at the core of some of the difficulties we're going to run into.
And that difference is that the columns of one of these matrices, the Laplacian matrix, clearly sum up to zero, just by construction, and by the fact that the out-degree is equal to the sum of the weights of the edges leaving the vertex. And I want to point out that while in the undirected case, the rows also sum up to zero, so the all-ones vector is in the kernel, in the directed case, that's completely false. So even in the graphs that I wrote down here, you'll notice the rows sum up to something fairly arbitrary. And this is going to present a kind of key problem for us, because if you want to work linear algebraically with a matrix, understanding its kernel is certainly an important property. For the undirected setting, we just wrote it down. For the directed setting, we already have some issues. You can write a similar property in terms of the random walk matrix that will bring this connection even more into focus. In the random walk matrix setting, note that while one transpose times W equals one transpose, and you can think of this as just basically saying probability is preserved, that the sum of all the probabilities doesn't change, it's definitely not the case that W times one equals one. This is not even the case for undirected graphs, for normalization reasons. And the way I want you to think of it is that W pi equals pi is the defining property of a stationary distribution. The fact that we don't know the stationary distribution, and that in general it's not the all-ones vector, clearly indicates that W times one shouldn't equal one. Okay, so this is our setup. And basically what we'll do today is focus on solving linear systems involving directed Laplacians, and everything else will reduce to it. Because it will just simplify the terminology for the rest of the talk, I'm going to assume that all graphs are strongly connected.
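The definitions and the asymmetry just described can be checked numerically on a small hypothetical example. With L = D minus A transpose and W = A transpose times D inverse, the columns of L sum to zero and one-transpose W equals one-transpose, but the row sums of L and the vector W times one come out arbitrary.

```python
import numpy as np

# Hypothetical unweighted directed graph on 3 vertices:
# edges 0->1, 1->2, 2->0, 2->1.
A = np.zeros((3, 3))
for (u, v) in [(0, 1), (1, 2), (2, 0), (2, 1)]:
    A[u, v] = 1.0

D = np.diag(A.sum(axis=1))      # diagonal matrix of out-degrees
L = D - A.T                     # directed Laplacian, L = D - A^T
W = A.T @ np.linalg.inv(D)      # random walk matrix, W = A^T D^{-1}

ones = np.ones(3)
col_sums_L = ones @ L           # zero: columns of L sum to zero by construction
row_sums_L = L @ ones           # NOT zero in general for a directed graph
col_sums_W = ones @ W           # all ones: a walk step preserves total probability
W_times_one = W @ ones          # NOT the all-ones vector in general

# One step of the walk started at vertex 0: all the mass moves to vertex 1.
p_next = W @ np.array([1.0, 0.0, 0.0])
```

Applying W repeatedly (W to the K times a probability vector) gives the distribution after K steps, as described in the talk.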
I may also lapse into assuming that all edges have unit weights. Neither of these is actually an essential assumption, but it will simplify the notation. So I want to now briefly expand upon this comment I made about kernels and stationary distributions, just to make it clear that this is really a fundamental issue. So if I give you a nice, strongly connected graph that's aperiodic, then the property we're looking for, the stationary distribution, is, well, let me first define it for you precisely. It's the distribution that, under these assumptions, the random walk converges to after a large number of steps. And formally it obeys the property that W pi equals pi: if you're in the stationary distribution and you take a step of the random walk, you remain in the stationary distribution. And so a natural question is why we care about these, and I claim we care about them for a couple of reasons. One is from the Markov chain side: they're kind of the basic property. They're theoretically useful, they have a ton of applications. But more important for the technical content of what we're going to say today, I claim that we really actually need to understand them if we have any hope of solving any of the problems we've written down. And the reason for that is that if I want you to solve Lx = b, you really should be able to understand Lx = 0. And as I mentioned a few seconds ago, Lx = 0 is, up to a scaling by D, the same as finding the stationary distribution. So if you take the definition of L and you manipulate it a little bit, you get that if I gave you a vector x so that Lx = 0, then Dx is a stationary vector for the random walk. So I claim that these are kind of crucial issues that are at the core of what we're going to look at. And it gives us a chicken-and-egg issue that seems fairly problematic.
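The correspondence just stated, that a null vector x of L gives the stationary distribution as pi = Dx up to scaling, can be verified directly on a small hypothetical strongly connected, aperiodic example:

```python
import numpy as np

# Toy graph: edges 0->1, 1->2, 2->0, 2->1.  It is strongly connected
# and aperiodic (it has directed cycles of length 2 and length 3).
A = np.zeros((3, 3))
for (u, v) in [(0, 1), (1, 2), (2, 0), (2, 1)]:
    A[u, v] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A.T
W = A.T @ np.linalg.inv(D)

# A null vector of L, taken as the last right-singular vector of the SVD.
x = np.linalg.svd(L)[2][-1]

# Rescale by D and normalize: the result is the stationary distribution.
pi = D @ x
pi = pi / pi.sum()
```

The check in the other direction is the one-line manipulation from the talk: L(D inverse pi) = pi - W pi = 0 whenever W pi = pi.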
On its face, we wanted to initially find the stationary distribution as one of our goals. To do that, we have to solve a linear system. But what I just said was that to understand the linear system, you kind of want to know the stationary distribution, because it corresponds to the kernel. And that's sort of where we're starting. Where we're starting is that, out of the gate, the first main difference that we see between the undirected and directed settings is that in the undirected setting, we can write down the kernel of the matrix by inspection. In the directed setting, we have no idea what it is. And if you've seen the other algorithms in the undirected setting before, anything involving preconditioning really needs to understand the kernel and vectors that are near it. And so if we don't have that, we seem to be in trouble. Are there any questions so far? Sorry, I just see a little chat message that says some people couldn't see the slides. Are they all good now? Oh yeah, it's fine. This was at the start of your talk, but I think it's fine now. Okay, great. Okay, so any questions on anything? Okay, so moving forward. The way we're going to get around the stationary distribution issue, and kind of the key reduction that will start to make this problem feel tractable, is that we're going to focus on Eulerian graphs. So Eulerian graphs are a nice general case where we do in fact know the stationary distribution. So if we believe that we need to understand stationary distributions to understand our problem, let's first focus on a case where we understand the stationary distribution but don't know how to solve our problem, and see what we can do there. And Eulerian graphs are this case. Okay, so what is an Eulerian graph? An Eulerian graph is just a graph where every vertex has the same in-degree and out-degree.
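The definition can be checked mechanically. A sketch on a hypothetical small example (a directed 4-cycle plus one edge in both directions), which also verifies that in-degree equal to out-degree is exactly what puts the all-ones vector in the kernel of L:

```python
import numpy as np

# Directed 4-cycle plus the edge 1<->3 in both directions: every vertex
# ends up with equal in-degree and out-degree, so the graph is Eulerian.
A = np.zeros((4, 4))
for (u, v) in [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3), (3, 1)]:
    A[u, v] = 1.0

out_deg = A.sum(axis=1)
in_deg = A.sum(axis=0)
is_eulerian = bool((in_deg == out_deg).all())

L = np.diag(out_deg) - A.T
# For an Eulerian graph, L @ ones = out_deg - in_deg = 0,
# while 1^T L = 0 holds for every directed Laplacian.
```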
And the reason this is relevant is that if you look algebraically at what that means, it means that if you look at A transpose times the all-ones vector, you get the vector of in-degrees, which is also the vector of out-degrees, which is A times the all-ones vector. And since we know that D times the all-ones vector is the vector of out-degrees, we can move things around, and we get that, in fact, in the Eulerian case, L times the all-ones vector is again zero. So all undirected graphs are Eulerian, because the edges go both in and out. And Eulerian is the property that is basically the reason why L times the all-ones vector is zero. Okay, so what's good about Eulerian graphs is that, just like undirected graphs, if the graph is strongly connected, the only thing in the kernel is this all-ones vector. And we can certainly check if a graph is Eulerian, and if it is, we know the kernel precisely, assuming it's strongly connected. Similarly, the stationary distribution we can just write down. So this is definitely a case where we want to talk about solving linear systems; it's not interesting to find the stationary distribution for an Eulerian graph. It's just proportional, as in the undirected setting, to the out-degrees, or the in-degrees, whichever you prefer to call them. And the reason this is going to be algorithmically useful for us is that it will give us a way to start to get out of our chicken-and-egg problem. And the idea is that we're going to try to find what's called an Eulerian rescaling. So, every strongly connected graph has a diagonal matrix X, where L times X is actually the Laplacian of an Eulerian graph. You could think of this as a node reweighting, so that the graph becomes Eulerian. And, in fact, it's pretty easy to write down: it's actually just D inverse times pi. So why is this useful?
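As an aside before the "why": the rescaling claim itself can be checked numerically on a hypothetical non-Eulerian example. With X = diag(D inverse times pi), the matrix LX has both zero row sums and zero column sums, i.e. it is the Laplacian of an Eulerian graph.

```python
import numpy as np

# A non-Eulerian, strongly connected, aperiodic toy graph.
A = np.zeros((3, 3))
for (u, v) in [(0, 1), (1, 2), (2, 0), (2, 1)]:
    A[u, v] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A.T
W = A.T @ np.linalg.inv(D)

# Stationary distribution pi, here via the eigenvector of W for eigenvalue 1.
vals, vecs = np.linalg.eig(W)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

# The Eulerian rescaling: X = diag(D^{-1} pi).
X = np.diag(np.linalg.inv(D) @ pi)
LX = L @ X
# LX now has zero row sums AND zero column sums: an Eulerian Laplacian.
```

The row-sum check is just the identity L(D inverse pi) = pi - W pi = 0; the column sums were already zero for any directed Laplacian, and scaling columns preserves that.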
It's useful because, in the first of the three papers, what we show is that we can actually use the existence of Eulerian rescalings, and the relation to the stationary distribution, to take the general problem of solving Laplacian systems and turn it into solving Eulerian linear systems. And roughly, the way it works is that we're going to show that by solving polylogarithmically many Eulerian systems, we can actually find an Eulerian rescaling of our graph. I won't get into this in too much depth, but intuitively the point will just be that you're going to maintain a guess at your scaling. If the graph were Eulerian, you would be in really good shape. If the graph isn't Eulerian, then when you solve a linear system loosely derived from the scaling that you currently have, you'll get information about how to make your graph more Eulerian, and you'll repeat. So, this is actually just a couple lines of pseudocode. The idea is just that you build an associated graph, and then, given a guess at your scaling, you can kind of reroute the directed edges a bit by reweighting the vertices so that you make your graph closer to Eulerian, and you can show that in a polylogarithmic number of solves you can actually nudge your graph all the way to being Eulerian. So, that's our first step, but I'm not going to focus on that today. That's in the first paper in the sequence of three, and what it shows is that, from a linear system solving standpoint, it's essentially fully general to focus on the Eulerian case. Sorry, naive question. So, you find a scaling, but it's not necessarily the scaling. I mean, it doesn't give you the stationary distribution, right? So, here you said there is a scaling that's based on the stationary distribution that will make the graph Eulerian, and then you said, oh, I have an algorithm to find a scaling, but it's not that particular one, right?
I mean, at the end of the day, it will be sort of our whole goal. Basically, in order to solve linear systems, we're going to find the stationary distribution, give or take. Yeah, but now you said, oh, I did this first step, so you haven't found the distribution yet, right? Right. So, basically, okay, so a little more precisely, what you do is you can take your graph and you can add some edges onto it to make it trivially Eulerian, but not obviously useful for solving linear systems. So, you can make it Eulerian by adding some stuff, and you'll then have an Eulerian system you can solve involving this newly created graph, and that will give you information about how far you were from being Eulerian. So, you'll get a sequence of vectors that are guesses at the stationary distribution, and those will get closer and closer to giving you, well, D inverse times the stationary distribution. And so you'll solve a sequence of Eulerian systems that are derived from your graph, but are not actually your graph; they're made by adding in some stuff to make it Eulerian, or almost Eulerian, and then you'll kind of nudge your way towards knowing the actual stationary distribution. Did that answer the right question? Well, I just wanted to clarify that your ultimate goal is to find the stationary distribution of a non-Eulerian graph, and now you said, oh, I did the first step, I made it almost Eulerian, but you haven't solved the original problem yet of finding the... So, I think probably maybe the more linear way of thinking about this is, let's make our goal be to solve linear systems in an arbitrary non-Eulerian graph. Oh, sorry. Yeah, yeah. Okay, sorry. And if you can do that, it's straightforward to then find the stationary distribution, and it will take us out of the realm of stationary distributions and into the realm of linear systems, where then we can get away with this.
So, I have a quick question, which I think might be the same, which is: if I take the scaling and then multiply by D, do I actually get the stationary distribution, or is it possible that you found just a different rescaling? Right, I think, is this the same question? Yeah, so you'll find the stationary distribution. I mean, assuming we're under these assumptions that I've made, where there's a unique stationary distribution, then up to a normalization by D, this is the same problem. Okay, okay, thanks. Other questions? Okay, so I now want to get into the meat of the topic. Well, sorry, before I go on, I should mention that once you can find your Eulerian rescaling, you could think of that as a reduction from solving general directed systems to solving Eulerian systems. Again, in this original first paper, we show how to reduce all of these Markov chain applications to solving directed Laplacian systems. So the logic of this whole thing is: if you want to solve any of the Markov chain applications, you can show, with varying amounts of work depending on the question, that it suffices to solve directed Laplacian systems. I should mention that some of the work here is that dealing with numerical stability is not always easy in these things. And then once you can solve directed Laplacian systems, you can solve the Markov chain things. And then we show that you can solve directed systems by solving Eulerian ones, and thus it suffices to solve those. We'll spend the rest of the talk today basically just focusing on how you would solve an Eulerian system.
Okay, so we are going to use an iterative method, and a good motivation for the methods we're going to use is to look at the simplest iterative method for solving linear systems, which, if you say it one way, is gradient descent, but we're going to describe it in an algebraic way. And our actual algorithm will be motivated a bit by this. Okay, so first, instead of solving something involving your original matrix, we're going to do a normalization so that our matrix looks like I minus something instead of D minus something. So if you multiply on the left and right by D to the negative one half, then instead of having D minus A transpose, you have I minus a normalized version of A transpose. So if we define script A to be D to the negative half times A transpose times D to the negative half, we can now think of solving Laplacian systems in terms of solving I minus script A. And if you want to get a good intuition for it, it's good to keep in mind the case of an out-regular graph. If our graph is d-out-regular, then the matrix D is just d times the identity, and so script A just becomes the random walk matrix: the adjacency matrix transposed, over d. Okay, so now there's a classical statement about Markov chains that you can check fairly easily, which is that this matrix script A has spectral norm strictly less than one on the complement of the kernel of the Laplacian, assuming, of course, that you have a strongly connected, aperiodic graph. And to motivate our iterative methods, I just want to ignore the kernel for a second: forget that part of the space, and just think of script A as being some matrix with norm actually bounded by some rho less than one. The kernel is just kind of a technical issue.
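To make the normalization concrete, here is a small numpy sketch of my own (the graph and its size are a hypothetical example, not from the talk): for a d-out-regular graph, script A = D^{-1/2} A^T D^{-1/2} is exactly the walk matrix A^T / d.

```python
import numpy as np

# A small 3-out-regular directed graph on 4 vertices (hypothetical example).
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
d_out = A.sum(axis=1)                    # out-degrees (all 3 here)
D = np.diag(d_out)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_out))

L = D - A.T                              # the directed Laplacian, L = D - A^T
scriptA = D_inv_sqrt @ A.T @ D_inv_sqrt  # the normalization from the talk

# Conjugating L by D^{-1/2} gives I - scriptA, and for a d-out-regular
# graph scriptA is exactly the random walk matrix A^T / d.
assert np.allclose(D_inv_sqrt @ L @ D_inv_sqrt, np.eye(4) - scriptA)
assert np.allclose(scriptA, A.T / 3.0)
```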
So I claim that you can think of basic iterative methods as Taylor series. We can think of gradient descent, if you write it out the right way, as a method of repeatedly evaluating the series for (I minus A) inverse, namely I plus A plus A squared plus A cubed, and so on. So this is an iterative method: by repeatedly multiplying by A and adding stuff, you can apply more and more terms of this series to a vector, and that will give you a better and better approximation of the result. And if you just play with the Taylor series, you'll see that in order to reduce the error by a constant, you'll need on the order of one over one minus rho terms just to make the Taylor series converge. We're going to call that quantity kappa: one over one minus rho. And so this is both good news and bad news. The good news is that this really is an algorithm for solving linear systems; it actually works. The bad news is that for our graph-theoretic problems, kappa is not necessarily very good, and it can actually take a pretty long time if you try to do this by straight gradient descent. So the matrices that we're looking at will actually be ill-conditioned, and doing any kind of naive linear system solve on them is not going to work. A good way to see this is to go back to our graph interpretation, and I first want to convince you that kappa can be n. Intuitively, the reason is just that if I take some number of steps of the iterative method, it corresponds to taking the same number of terms of this series, and each application of the matrix A corresponds to another step of the random walk. So if you look at b, Ab, A squared b, etc., each additional term looks at random walks of length one higher. In particular, if you're asking a global question, then you do need to be able to see the whole graph: you certainly need to be able to see all the vertices with your iterative machinery.
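The truncated-series iteration just described can be sketched as follows. This is a toy illustration of my own: the matrix is a random one rescaled to spectral norm 0.9, not a graph Laplacian, and the iteration count is chosen generously at a few multiples of kappa.

```python
import numpy as np

# Gradient-descent-style iteration viewed as a truncated Neumann/Taylor
# series: (I - A)^{-1} b = (I + A + A^2 + ...) b, applied by x <- b + A x.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = 0.9 * M / np.linalg.norm(M, 2)   # force spectral norm rho = 0.9
b = rng.standard_normal(5)

x = np.zeros(5)
for _ in range(200):                 # O(kappa log(1/eps)) terms suffice
    x = b + A @ x                    # adds one more term of the series

x_exact = np.linalg.solve(np.eye(5) - A, b)
assert np.allclose(x, x_exact, atol=1e-6)
```

Each pass through the loop appends one more power of A to the partial sum, which is why the number of iterations scales like kappa = 1/(1 - rho).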
You need to be able to get information from one vertex to another, which means that for any method that's just repeated multiplication and addition with the original graph, you're stuck with a lower bound of the diameter, which can be n. In fact, I should say this is the undirected motivation, and this is why you should be convinced that you can have a poly(n) kind of condition number dependence. For directed graphs, the situation is much more dire, because really what you care about is the mixing time. The actual quantity that determines how quickly these iterative methods converge is one minus rho, which is related to the mixing time, and for directed graphs that can be exponential. So I drew a graph here where, if you stare at it for a second, you'll see that the mixing time is actually exponential in the number of vertices. Pictorially, starting from the left side, you either move one to the right or all the way back to the left; starting from the right side, you either move one to the left or all the way back to the right. And the point is that to get from, say, vertex one to vertex one prime takes exponentially many steps, because you keep jumping back to one. Okay. So what this says is that we have an iterative method whose running time depends on this condition number, and the condition number can be atrocious for the problem that we're looking at. Any questions? Okay. So our iterative method for this paper is going to be motivated by a previous undirected solver that has a very elegant way of getting around this problem. What was the problem? The problem was really just that you need to be able to see high powers of your walk matrix, but if you do it by applying the matrix repeatedly, then you won't see walks longer than the number of iterations. We already know a very easy way to get to bigger lengths in a hurry, and that's just repeated squaring, right?
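Before getting to squaring, the exponential mixing behavior can be checked numerically. Below is a simplified one-sided reconstruction of my own of the kind of chain described above: from each state you step right with probability 1/2 or fall all the way back to the start with probability 1/2, so reaching the far end requires a run of consecutive right-steps and takes roughly 2^n steps in expectation.

```python
import numpy as np

# A chain with exponentially slow mixing (my simplified reconstruction).
n = 20
P = np.zeros((n, n))
for i in range(n - 1):
    P[i, i + 1] = 0.5     # step one to the right...
    P[i, 0] += 0.5        # ...or fall all the way back to the start
P[n - 1, n - 1] = 1.0     # absorb once you reach the far end

dist = np.zeros(n)
dist[0] = 1.0
for _ in range(10 * n):   # 10n steps is far too few to cross the chain
    dist = dist @ P

# Reaching state n-1 needs n-1 consecutive right-steps (prob 2^{-(n-1)}),
# so after 10n steps almost no probability mass has been absorbed.
assert dist[n - 1] < 1e-3
```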
So instead of multiplying by A repeatedly, we'll get longer walks by repeatedly squaring our graph, or our graph plus multiples of the appropriate thing, and that will let us access longer and longer walks. So if we need to look at walks of length n, instead of requiring poly(n) steps, we're just going to require log n squarings. Formally, there's a nice way to see how to do this. If you take the identity I wrote down before, the Taylor series for one over one minus A, you can factor it as (I plus A) times (I plus A squared) times (I plus A to the fourth), etc., all multiplied together. And you see that if you take k terms of this, you'll get all the powers up to 2 to the k plus 1, minus 1. So this actually gives us a very natural algorithm, right? It basically says the way we're going to get to long paths is that instead of just multiplying and taking one step, we're going to square; we're going to double our path lengths. And to get an algorithm out of this, all we have to do is repeatedly square our graph. So we'll compute A, A squared, A to the fourth, A to the eighth by repeated squaring. We'll do this up to log kappa times, which corresponds to the number of terms of the original series we need. And then we'll just multiply by I plus each of the appropriate powers of A, and this will get us what we want: after log kappa terms of this multiplication, we'll be very close to our solution. There is, however, one major problem with this. The identity is true, and it really does give us a small number of multiplications to solve the linear system. But the problem is that it's super slow. If you remember, our actual goal is to get something, say, nearly linear in the number of edges, whereas here, after a small number of steps, after squaring a logarithmic number of times in particular, you expect to get a very, very dense graph.
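The factored identity can be verified numerically. This is an illustration of my own with a random matrix of spectral norm below one, not the graph-structured version used in the algorithm:

```python
import numpy as np

# Verify (I - A)^{-1} = (I + A)(I + A^2)(I + A^4)... :
# k factors reproduce every power A^0, ..., A^{2^k - 1} of the series.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = 0.9 * M / np.linalg.norm(M, 2)   # spectral norm 0.9 < 1

I = np.eye(6)
prod = I.copy()
Apow = A.copy()
for _ in range(40):                  # log(kappa)-ish squarings, not kappa terms
    prod = prod @ (I + Apow)
    Apow = Apow @ Apow               # repeated squaring: A, A^2, A^4, ...

# After K factors, prod @ (I - A) = I - A^{2^K}, which is essentially I.
assert np.allclose(prod @ (I - A), I, atol=1e-6)
```

The point of the talk's algorithm is exactly this loop, except that `Apow` is a graph that must be re-sparsified after every squaring to keep it from densifying.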
And if you have a very dense graph, then just doing anything is already quadratic time, and you've lost any hope of a nearly linear algorithm. In fact, if you look at this a little more carefully, you're going to have to multiply multiple dense matrices together to get the repeated squarings, and that will give you something that runs in at least matrix multiplication time. So there's good news and bad news. The good news is that we actually have a reasonable iterative-looking thing that doesn't involve horrible numbers of iterations. The bad news is that each iteration is going to be horribly slow, because the matrix gets very dense very quickly. And our solution to that is going to be sparsification. Before I go on, are there any questions? So the trick is that we're going to interleave with what's called sparsification, and I want to quickly remind you, if you haven't seen it or don't remember it, what sparsification does. Sparsification goes back to Benczúr and Karger, who defined cut sparsifiers for undirected graphs. What they showed was the following: you give me any dense graph, so think of m, the number of edges, as vastly larger than the number of vertices. They showed that in nearly linear time you can find a new graph, H, that's sparse and gets very close to the original graph, G, in the sense that every cut in your sparse graph is multiplicatively within one plus or minus epsilon of the corresponding cut in the original graph, but your new graph only has n log n over epsilon squared edges. So this said that if you're willing to lose a little bit in your cuts, you can replace your dense graph with a very sparse one. For linear system solving, this was strengthened by Spielman and Teng when they defined spectral sparsification. And this is the same goal: you want to approximate a dense graph by a sparse one.
But you want to do it in a linear algebraic sense, which I've written down here. And these sparsifiers are strictly stronger. Namely, if you were to take the definition I just wrote and stare at it for a bit, you could convince yourself that any spectral sparsifier is actually a cut sparsifier as well. So if you can find a spectral sparsifier, you can find a cut sparsifier, but not necessarily vice versa. And Spielman and Teng showed that you can actually find these: in roughly the same running time, you can find a spectral sparsifier for any arbitrary dense graph. I should mention, just as a general caveat, that I've been writing n log n over epsilon squared. That's actually not theoretically tight; you can do better. It's irrelevant for the algorithmic applications today because it takes a little longer to compute, but you can actually find sparsifiers with only a linear number of edges, not linear times polylog. Okay, so using this, we can actually get around our density issue. What we'll do is interleave our repeated squaring steps with sparsification. So the idea is that instead of doing square, square, square, square, we'll do square, then sparsify, then square, then sparsify, then square, then sparsify. And you can show that you can do some implicit work with G so that you can actually sparsify G squared without ever writing down the whole graph G squared, and you can do it in nearly linear time. And then what we get is that instead of having this exact series that we had before, we now approximate each term in this identity. And the hope is that this is somehow enough to actually solve our original problem. For this, you need to show somehow that the errors don't accumulate, right?
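Here is a small numpy sketch of my own of spectral sparsification by sampling, run on the complete graph. On an expander like this, uniform edge sampling is a reasonable stand-in for the importance-based sampling the real constructions use; the sizes and sample count are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
m = len(edges)

def laplacian(weights):
    """Undirected Laplacian from a dict mapping edges to weights."""
    L = np.zeros((n, n))
    for (i, j), w in weights.items():
        L[i, i] += w; L[j, j] += w
        L[i, j] -= w; L[j, i] -= w
    return L

L_G = laplacian({e: 1.0 for e in edges})

# Sample s edges uniformly, reweighting each by m/s so E[L_H] = L_G.
s = 5000
H = {}
for idx in rng.integers(0, m, size=s):
    e = edges[idx]
    H[e] = H.get(e, 0.0) + m / s
L_H = laplacian(H)

# Spectral approximation: eigenvalues of L_G^{+1/2} L_H L_G^{+1/2} should
# all be near 1, apart from the shared all-ones kernel.
lam, V = np.linalg.eigh(L_G)
inv_sqrt = np.array([1.0 / np.sqrt(x) if x > 1e-9 else 0.0 for x in lam])
P = V @ np.diag(inv_sqrt) @ V.T
ev = np.sort(np.linalg.eigvalsh(P @ L_H @ P))
assert ev[0] < 1e-6                    # the kernel direction
assert 0.5 < ev[1] and ev[-1] < 1.5    # everything else is 1 +/- eps
```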
Every step, we're making some progress, but we're also introducing some error, and you have to show that the errors don't add up too quickly. And it turns out that for the actual identity I wrote down, that's not necessarily true, or at least we don't know that it's true. However, there's a symmetrized form of the identity, essentially the same idea but slightly longer and slightly messier, that has better PSD properties. It strongly relies on the fact that I minus A is a PSD matrix in the undirected case, but using it, you can write down a version of the identity where the errors are easy to balance. Okay, so this lets you show that you can do reasonable approximations of each of the terms, and when you iterate the whole process, you don't end up with a huge error. And this is an algorithm for solving linear systems. So basically: write down a slightly more complicated identity, apply it recursively, and at every step, sparsify. This gives you a crude solution, and then you can wrap it in an iterative method to get a more precise one. So this is the template we're going to follow, albeit with some caveats. Are there any questions about how this works? Okay, so moving right along. Our algorithm for solving Eulerian Laplacians is going to have the same general approach. We're going to take this identity, apply it iteratively, and interleave the sparsification. But to make it work, we're going to need two major pieces that differ substantially from the undirected case. One of them is that we'll need a notion of, and a construction for, sparsification for directed graphs.
The other, which is a more technical but pretty crucial one, is that because we can't use these symmetrized identities, essentially because we don't have symmetry at all, we're going to need a more involved iterative method. And something to keep in the back of your mind throughout this whole thing is that essentially all of the preconditioning methods people traditionally talk about, and essentially all of the iterative methods one typically talks about, more or less think about matrices as quadratic forms, and thus think about PSD matrices as the objects they're working with. So when they solve something with an asymmetric matrix, they basically multiply it by its transpose and solve symmetric things in some implicit way. And at the root of a lot of these difficulties is just that many of the things you would want to do for symmetric matrices don't even make sense, on their face, for asymmetric ones. So for example, even the notion of a PSD matrix inequality: you can't even write it down, because you don't have symmetric matrices; it doesn't really make sense. So what I want to do in the remaining time is sketch the main parts of the algorithm, and then, if people want to stick around longer, I can come back and do these in more depth. What I really want to start with is saying what directed sparsification should mean. And this turns out to be a stickier issue than you might think, because it's a very obvious question to ask; Benczúr and Karger even asked it right in their original papers. And there's been a lot of trouble generalizing sparsification machinery to the directed setting. There's a very good reason for that. The reason is that there are a lot of pretty sensible definitions, the ones you'd naturally write down.
They actually just don't exist, in a provable sense. So it's not just that we don't know how to construct them; you can actually prove that they don't exist. So for example, let's start easier. Instead of going for spectral sparsifiers, which are stronger, let's start with the weaker goal and just look for cut sparsifiers. You can still write down the definition, and it's perfectly sensible. It just says: take your graph, make it sparse, and get all the cuts right. The bad news is that these provably don't exist. We can write down some fairly simple graphs where any sparse graph that multiplicatively approximates the original one, sorry, any graph that multiplicatively approximates the original one, is completely dense. So essentially any approximation of the graph G has to have omega of n squared edges. Here's the graph. It's just a bipartite graph, and the issue is the following: I claim that if you remove any one edge from the graph, you already break a cut badly. The reason is that there's a cut where the only outgoing edge is, in fact, the specific edge you've chosen. So I've drawn a cut here: if you look at the edge from a to b, I've drawn a set where the only edge leaving that set is the edge from a to b. And so if you don't include that edge in your sparsifier, you've introduced an error of infinity; you've made the cut go from one to zero. So this says, right out of the gate, that directed sparsification has some real issues. If you try to do cut sparsification, it's not just that we don't know how to do it; there isn't any hope. The interesting and promising note here is that for Eulerian graphs, that problem goes away. In particular, for Eulerian graphs, the cuts are actually the same as in the corresponding undirected graph, and so cut sparsification doesn't present a problem.
So, just to sum up the issues we have to deal with in order to get a solver and a good notion of sparsification for it. One of them is that cut sparsification just doesn't exist at all in general. Another is a bit more of a meta issue, but a crucial one: when you talk about spectral sparsification, you look at x transpose L x, and whenever you talk about this stuff, you really are just working with this quadratic form. However, if you were to do anything like that for a directed graph, then looking at x transpose L x is the same as looking at x transpose L transpose x, which is the same as looking at x transpose times (L plus L transpose over two) times x, which is like the undirected graph you get by forgetting the directedness of the edges. And then, pushing forward, if I gave you a general directed graph, well, I guess I should say I slightly lied: only in the Eulerian case does L plus L transpose actually end up working out nicely. If I gave you a non-Eulerian graph, L plus L transpose over two is not even usually PSD. And if a matrix isn't PSD, then it doesn't really make sense to approximate it multiplicatively, because you have negative numbers, and that's weird. Okay, so finally, one last issue is that the kernels are super subtle, and they don't match either. And the main idea of the solution, which I've kind of tipped my hand on, is just to work with Eulerian graphs, where all these problems go away. So to define sparsification for general graphs, define it for Eulerian graphs and then push it through via an Eulerian rescaling. For Eulerian graphs, you know that sparsification is okay: you can show that L plus L transpose over two is PSD and that the kernels match.
And once you have a notion for Eulerian graphs, we're in good shape, because you can then use the existence of an Eulerian rescaling to generalize it to all graphs. So that brings us to our main goal for sparsification, which is finding a definition and achieving it. And given the stuff up there, a natural guess is L plus L transpose over two. Namely, take your directed graph and form the undirected graph you get by adding in the backwards edges, basically forgetting the directedness, and then make your definition of approximation just be that this associated undirected graph is well approximated. This turns out to be a really bad answer. It turns out that for linear system purposes, this loses most of the use of the original Laplacian, even in very simple cases. Essentially, what this is telling you to do is to try to solve your directed problem with an undirected problem, and you can get something, but not something very strong. And the example you need to look at to see it, which I'll mention a little bit later, is just a cycle. If I look at a directed and an undirected cycle, their linear algebraic structure is completely different. Solving directed cycles and solving undirected cycles have very little to do with each other; their relative condition number is very large. Okay. So instead, here is the definition of approximation that we use. We'll say that a graph approximates another graph if their Laplacians approximate each other, and we'll say that the Laplacian of G epsilon-approximates the Laplacian of H if the following is true. Instead of looking at quadratic forms, which throw away our directed structure, we're going to look at spectral norms, which are much more reasonable to talk about for asymmetric matrices. Intuitively, you want to say LH and LG approximate each other
if, in some appropriate norm, LG minus LH is small. The norm we'll use will be the one coming from the undirected structure. So we'll basically apply a change of norm: multiply on both sides by U to the negative one half, the pseudo-inverse square root of the undirected symmetrization, and then look at the spectral norm of the difference of the matrices. We'll say they're well approximated if that norm is small. You can think of it as saying that if I apply LH or LG to a vector, the difference between the answers is small compared to the result of putting that same vector through the undirected version of the graph. Okay. So then we'll say something is a sparsifier if it approximates the original graph and is sparse. John, there's a question someone was asking in the chat, which is why you didn't stick with the notion of cut sparsifier. So the cut sparsifier notion would be for an Eulerian graph. In an Eulerian graph, the cuts are actually the same as in the corresponding undirected graph. If every vertex has the same in-degree and out-degree, then you can show that if you want to look at the weight of the edges cut by some particular partition of the vertices, it's the same as if you just erased all the directedness and looked at the corresponding undirected thing. So for this matrix, you're saying the cuts in the corresponding undirected graph are the same as in the Eulerian graph? Intuitively, an Eulerian graph is a sum of cycles, and a cycle, whether you think of it as directed or undirected, cuts the same weight for every cut, up to a factor of two if you scale it right. But they both have the same cuts. So the point is that if you use the cut notion of sparsification, what you're basically doing for Eulerian graphs is saying: forget the directedness and look at undirected approximation.
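Here is a small sketch of my own of this approximation measure. The 10% edge perturbation is a hypothetical example, and I take U to be the symmetrization of L_G, which is one natural reading of the definition above; the real definition has some additional care about kernels.

```python
import numpy as np

def directed_laplacian(n, edges):
    """L = D_out - A^T for weighted directed edges (u, v, w)."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w      # out-degree on the diagonal
        L[v, u] -= w      # the -A^T term
    return L

def approx_error(L_G, L_H):
    """|| U^{+1/2} (L_G - L_H) U^{+1/2} ||_2 with U = (L_G + L_G^T)/2."""
    U = (L_G + L_G.T) / 2.0
    lam, V = np.linalg.eigh(U)
    inv_sqrt = np.array([1.0 / np.sqrt(x) if x > 1e-9 else 0.0 for x in lam])
    P = V @ np.diag(inv_sqrt) @ V.T
    return np.linalg.norm(P @ (L_G - L_H) @ P, 2)

# Directed 4-cycle, and the same cycle with one edge weight nudged by 10%.
n = 4
cyc = [(i, (i + 1) % n, 1.0) for i in range(n)]
L_G = directed_laplacian(n, cyc)
L_H = directed_laplacian(n, [(0, 1, 1.1)] + cyc[1:])

assert approx_error(L_G, L_G) < 1e-9          # a graph approximates itself
assert 0.0 < approx_error(L_G, L_H) < 1.0     # a small perturbation, small error
```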
And for the cycle, as an example, that's already a problem, right? A good way to see it is that the undirected cycle and the directed cycle are cut approximations of each other, but linear-algebraically, they're very different. So this is a stronger and more useful notion. Does that answer your question, whoever's question it was? I think so. Yeah, cool. Good. Okay, so you can show that these work great. We can show that sparsifiers exist and construct them, and it turns out that you can use these inside an iterative method to solve the problem. Okay, so right now what I've got left, after this sort of self-contained high-level description of the algorithm, is to flesh out how to construct sparsifiers and how to do the iterative method. I'll do sparsification first; it will take maybe five or ten minutes, and then the iterative method will take a couple more. I think sparsification is the one people will find more interesting. But let me just recap the high-level picture, and give people a chance to go if they have to leave right now, before doing the technical details of sparsification. So the high-level picture was: we want to solve all these Markov chain problems. We turn those into linear systems. To solve linear systems, we reduce to Eulerian systems. To solve Eulerian systems, we're going to mimic this repeated squaring and sparsification algorithm. The main issues are just the squaring and the sparsification, basically.
For the squaring, or for the identities that we use in the squaring, the iterative method is going to have to change in order to keep the errors from blowing up on us. For the sparsification, we don't have a definition, and even with a definition, we don't have a construction, so we have to figure out both. But once we have those pieces, basically our algorithm will be interleaving squaring and sparsification, with a little bit of iterative machinery wrapped around it to deal with the blow-up in errors. Okay, so are there any questions on this? I'm going to call this the end of my general talk without details, and right after this, I'll do the sparsification details. But are there any questions? I want to give people a chance to leave if they have to run. I don't see or hear any questions, and I don't see anyone leaving either, so I think a lot of people want to see the construction of sparsifiers, and that's what I'll do. So I'm going to sketch the construction of sparsifiers next; that's kind of the key technical piece. And I want to do it by building up a sequence of failed attempts based on the undirected case until we solve all of our problems. So let me first remind you how the cleanest algorithms for the undirected case work. For the undirected setting, you can get a sparsifier by the following very simple procedure. Step one: you compute some number that measures the importance of every edge. That's not the world's simplest thing; it requires some thought. But let's just say we're given those numbers that somehow measure the relative importance of edges. Then a sparsifier will just sample a bunch of edges proportionally to those numbers. So you take more important edges with higher probability, take n polylog n samples, and poof, you have a sparsifier.
This seems like a good idea, and a very natural idea for sparsifying a directed graph, or more specifically an Eulerian one, is just to mimic this procedure. So the first question is: is this our answer? Can we just do a kind of naive sampling method, where maybe with a little more work we compute some complicated-looking number for every directed edge, and then sample a bunch of directed edges according to these probabilities independently? That would be great, but unfortunately it's really not possible, at least as far as I can tell. And the reason is, again, the simple example of the cycle. So what I want you to do now is think about an undirected cycle as a bidirected graph: think of it as a cycle where between every pair of adjacent vertices I have a forward edge and a backward edge. And then let's look at what happens when we randomly sample. Suppose we were to write down any kind of smart probabilities for this graph. You'd kind of think they'd have to be basically the same for every edge, just by symmetry, but whatever. The issue that's going to show up is that when you do your sampling, you don't expect, with any kind of edge-independent sampling, to get the forward edges and the backward edges to line up. There'll be some deviation between your forward weights and your backward weights on your edges, and that's something you should expect. Now, obviously we don't need to sparsify an undirected graph like a cycle, but this is just to show that with any reasonable sampling procedure, meaning: take our bidirected cycle and pick n log n samples from any independent distribution over the edges, you'll see that your forward and backward weights won't be the same across the different original undirected edges. And you might think this would be okay.
I mean, they're pretty close, right? If you take enough samples, you'd expect a one plus epsilon or something. It turns out this is actually super false. You can see that if you just take n log n samples of these 2n edges, you'll get something that's a pretty poor approximation of the undirected cycle. And let me give you three ways to see why there's a problem. The first is that it's no longer Eulerian, right? If you look at just a single vertex, you'll see that its in-degree and out-degree have no reason to be the same, because there are some numbers on each of the four incident edges and there's no reason they should line up. That already is an issue: if our whole method is going to make sense, we really need to stay within Eulerian graphs for things to work well. A more fundamental one is that we're really looking for something that preserves the structure of random walks; that's the way to think about what any notion of approximation we have should do. And I claim that this method is going to really screw up the stationary distribution, commute times, hitting times, et cetera. A good way to see it is to imagine that I told you that every undirected edge had some imbalance, say it's one plus epsilon more going to the right than to the left, or vice versa. Then among all of these n edges, just by chance, you expect to have a non-negligible run of all things pointing to the right: a run of length k happens with probability exponentially small in k, so you get logarithmic-length runs of all-clockwise or all-counterclockwise one-plus-epsilons. And when you multiply these together, you get a substantial multiplicative change in what your walk does. It has a really substantial drift.
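The first problem, the loss of Eulerianness under independent sampling, is easy to observe directly. This is a demonstration of my own on a bidirected cycle, with hypothetical sizes:

```python
import numpy as np

# Sample edges of a bidirected cycle independently and measure the
# resulting in/out weight imbalance at each vertex.
rng = np.random.default_rng(3)
n = 50
edges = [(i, (i + 1) % n) for i in range(n)] + \
        [((i + 1) % n, i) for i in range(n)]   # forward and backward edges
m = len(edges)

s = 4 * n                                       # number of samples
w_in = np.zeros(n)
w_out = np.zeros(n)
for idx in rng.integers(0, m, size=s):
    u, v = edges[idx]
    w_out[u] += m / s                           # reweighted sampled edge u -> v
    w_in[v] += m / s

imbalance = w_out - w_in
assert np.any(np.abs(imbalance) > 1e-9)   # some vertex is no longer balanced
assert abs(imbalance.sum()) < 1e-6        # though total weight is conserved
```

The per-vertex imbalances are exactly what the patching step described next has to repair.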
And that means the probabilistic behavior and the stationary distribution change quite a bit. So that's bad news. Algebraically, the stationary distribution changes, so the kernel changes and the conditioning changes, and even if you deal with the kernels the right way, there's no good way of making this into a good preconditioner. It's really just a pretty bad approximation of the cycle. So straight naive independent sampling has some real issues. How do we fix it? Okay, let's start with the first problem we had, which was that we broke Eulerianness. If we didn't break Eulerianness, we'd also not break the stationary distribution and thus not break the kernel, right? Because those are all essentially the same thing. So here's our way of not breaking Eulerianness: we're just going to fix it after we break it. The new algorithm is: randomly sample just as before, but now we have these imbalances at the vertices, and we just throw in some edges to fix them up. We start with a non-Eulerian, but hopefully close-to-Eulerian, graph, and add in some edges to restore Eulerianness. We'll call that patching. We still have some issues. One is that we still don't know what probabilities to use. The other is that this patching is not random: the motivation behind random sampling was that random sampling gives good approximations when you do it right, and patching, as described, is not random. There's no reason the patch should be small compared to the graph, and you can actually come up with examples where, if you sample independently and naively and then do a naive patching step, the patch you add in is pretty big compared to your graph, big enough to screw up your approximation.
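Just to make the sample-then-patch idea concrete, here is a minimal Python sketch. This is my own toy illustration, not the paper's procedure: the real algorithm chooses sampling probabilities and patch routes far more carefully, and only patches inside expander pieces. It independently samples directed edges (reweighted so expectations are preserved) and then adds patch edges so every vertex again has equal weighted in- and out-degree.

```python
import random
from collections import defaultdict

def sample_and_patch(edges, weights, num_samples, rng):
    """Toy sketch of 'sample independently, then patch back to Eulerian'.
    Hypothetical illustration only."""
    total = sum(weights)
    probs = [w / total for w in weights]

    # independent sampling, reweighted so edge weights are preserved
    # in expectation (each sample contributes w_i / (p_i * num_samples))
    sampled = defaultdict(float)
    for _ in range(num_samples):
        i = rng.choices(range(len(edges)), probs)[0]
        sampled[edges[i]] += weights[i] / (probs[i] * num_samples)

    # imbalance(v) = weighted in-degree minus weighted out-degree
    imbalance = defaultdict(float)
    for (u, v), w in sampled.items():
        imbalance[v] += w
        imbalance[u] -= w

    # naive patching: route each vertex's surplus directly to deficits
    surplus = [(v, b) for v, b in imbalance.items() if b > 1e-12]
    deficit = [(v, -b) for v, b in imbalance.items() if b < -1e-12]
    while surplus and deficit:
        (u, su), (v, dv) = surplus[-1], deficit[-1]
        w = min(su, dv)
        sampled[(u, v)] += w   # new out-edge at u, new in-edge at v
        if su - w < 1e-12:
            surplus.pop()
        else:
            surplus[-1] = (u, su - w)
        if dv - w < 1e-12:
            deficit.pop()
        else:
            deficit[-1] = (v, dv - w)
    return dict(sampled)
```

The output is Eulerian by construction, which illustrates exactly the worry in the talk: the patch edges are deterministic, and nothing here controls how large they are relative to the graph.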
So what we need is to sample in a way that makes the patching doable without badly breaking the approximation. Our first idea, fixing Eulerianness, brought us to patching. Our second question is: how do we sample and patch so that the patches are actually small compared to the graph? For that, we're going to look at expanders. I claim that at least in the case of expanders, we can do this. The basic idea: our previous problems were that we didn't know the probabilities and that patching could add too much error. But you can fairly easily show that essentially uniform sampling, give or take some playing around because of degree imbalances and such, works fine for expanders, with the appropriate notion of "uniform". And if the underlying undirected graph is an expander, you can show that the deviation routes well in the expander, because a good expander can route lots of stuff very easily. It's not too much algebra to show that this is fine for expanders. Okay, so this is at least promising, right? It says that for an Eulerian graph whose underlying undirected graph is an expander, you can sample and then patch, and you can route the patch pretty well in the original graph, which says that in an appropriate sense we can actually sparsify expanders. However, this gives us a claimed algorithm with a mild problem.
The good news, before I complain about the problem, is the algorithmic attempt: take any graph and chop it up into a bunch of Eulerian expanders. Given a graph with some sparse cuts, cut along them to break the graph into a bunch of expanders; within each expander, sample uniformly (or appropriately weighted); then patch inside each expander so it stays Eulerian. The issue is that I lied a little in step one. It's actually not the case that you can always decompose an Eulerian directed graph into Eulerian expanders: the expander decomposition you want to do may break Eulerianness. And that brings us to our final tweak, which is: don't even worry about it. We're going to decompose the graph based on the undirected cuts; the cuts in an Eulerian graph are the same as the cuts in the underlying undirected graph. We already know from previous work how to decompose any undirected graph into expanders, so let's just do that and hope for the best. And you can show that the best actually is fine: instead of using a directed notion, we just use subgraphs that would become expanders if you erased all the directions on the edges, and go from there. So, maybe more precisely, with a little picture, here's our general algorithm, and this is actually the final one. We start by decomposing our graph into subgraphs that would be expanders if we erased the edge directions, with a mild technical tweak first. Think of it as forgetting the directions on the edges and decomposing into expanders. Then we put the edge directions back.
All we use the undirected decomposition for is to get our partition. Within each of these at-least-somewhat-well-connected subgraphs, which are not a priori Eulerian the way I described this, we uniformly sample the directed edges, with the same asterisk on "uniform" as before. And then we patch each subgraph so that it stays Eulerian. And that's it: use an undirected expander decomposition, do random sampling, patch back to Eulerian, and that actually gives you a sparsifier. Once you have the sparsifiers, the rest is the iterative method, which I'm happy to go into if people want, but I wasn't intending to. That's the algorithm. Okay. So, any questions on this before I either stop or talk about iterative methods?

I had a small question that you might have answered already. What about if you sample cycles? You said if you sample edges independently, you lose Eulerianness, but these graphs are unions of cycles. What would happen if you tried that?

Yeah, that's a really good question. We tried, and we didn't know how to make it work. Roughly, the issue is this: you can actually think of the undirected setting as doing exactly that with cycles of length two. When you sample an undirected edge, you can think of it as a bidirected edge that goes forward and backward; that's a cycle of length two. The natural guess, which we couldn't resolve, is whether you can decompose your graph into cycles appropriately. When you take a cycle, you pay its length as a cost; that's how many edges you get. If we could show, essentially, that you can find a covering by appropriately short cycles that don't overlap too much, you'd be in good shape.
The issue is that when you use cycles, they intersect each other and they might be long. It's pretty easy, using proofs similar to the undirected arguments, to show roughly that if your cycles are short and don't overlap too badly, then you can do independent sampling on cycles and get what you want. What we couldn't do was find the right definition or construction of a set of cycles, guaranteed to always exist, with these nice properties. We couldn't get a concentration bound and a cycle decomposition together such that the cycle decomposition guaranteed a good concentration bound on the number of edges in the output. I can't rule it out. It's very possible, and it would be very nice, if that could work. I think there are two possibilities: one is that it's impossible; the other is that we were doing the wrong concentration bounds or the wrong cycle decomposition. Not sure. It would be nice to get a sparsification result that completely avoids this decomposition step and somehow works with global objects. We couldn't quite get it yet. Other questions?

Okay, so this was where I actually intended to stop. But if people want me to do iterative methods, I'm happy to. I also will not be offended at all if people want to go home; vote with your feet or something.

Well, it's hard to get votes from online, so let me do it this way. Let me say a sentence of conclusion, then you'll continue, and anyone can stay or leave; in any case it's recorded, so you can also leave. How long do you need, John? Like five minutes? Yeah. So let me just conclude. Yeah, sorry, go ahead.
Well, I wanted to, because I realized I should have thanked the other organizers before we started, and I want to do it while it's still recorded, because they do a lot of work even though not all of them are here. The other organizers are Clément Canonne, Gautam Kamath, Anindya De, and Oded Regev, so we thank them for helping us out. We also thank John for the talk, even though it's going to continue. I also wanted to say that a couple of weeks from now Sébastien Bubeck from Microsoft will be giving a talk. And that was the intermission. So, John, anyone can stay; this is still live and still recorded, so let's see the iterative method.

Great. Yeah, thank you all, and thank you again to the organizers for organizing this. Okay, so, the iterative method. Basically the idea is going to be the same, but we have to deal with errors. The undirected version worked really well because it had this nice symmetric structure where we could use PSD inequalities, which we don't have here, and that kept the errors under control. When you try to write this out for the directed case, the errors grow a lot faster. The sketch here is a mild lie, but not a big one; it's just a technicality brushed under the rug. We'll still use the same identity we had, but we'll think of it recursively. Instead of thinking of it as multiplying a bunch of terms, think of the expansion identity as a recurrence. In the exact case, we have (I - A)^(-1) = (I - A^2)^(-1) (I + A). If these were scalars, this is like 1/(1 - x): you factor 1 - x^2 as (1 - x)(1 + x), cancel the (1 + x)'s, and you get back 1/(1 - x).
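Written out explicitly, the identity being described unrolls as follows (this is the standard factorization, valid whenever the spectral radius of A is below 1 so all the inverses exist):

```latex
\begin{aligned}
(I - A)^{-1} &= (I - A^{2})^{-1}(I + A)\\
             &= (I - A^{4})^{-1}(I + A^{2})(I + A)\\
             &= (I - A^{8})^{-1}(I + A^{4})(I + A^{2})(I + A) = \cdots
\end{aligned}
```

Each line mirrors the scalar factorization 1 - x^2 = (1 - x)(1 + x), applied to the highest-order factor of the previous line.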
So you can obtain the same algorithm just by repeatedly applying this identity. Apply it once and you get the identity above; apply it three times and you get (I - A^8)^(-1) (I + A^4)(I + A^2)(I + A), et cetera. What's nice about this formulation is that instead of an error term and some dot-dot-dots, you have an exact answer: the "error" in the third step, for example, is just that you're off by a factor of (I - A^8)^(-1). Rather than a series whose terms you're truncating, you have an exact expression for the inverse, with one term whose answer you don't yet know. And you can apply this repeatedly, replacing each square with a sparsifier in the same way. But if you look at it, the error grows exponentially with the number of applications: you lose a small constant factor at every step. So if you applied it start to finish, say with a polynomially bounded condition number, you'd have to square log n times, and you'd pick up a 2^(log n), a polynomial error over the log n applications. The cost of directedness is a constant factor per squaring, and those factors pile up, so you get 2 to the number of squarings and lose a poly factor. That's the issue: the errors really do accumulate, and it's problematic, at least as far as we can tell. And so the solution is kind of a naive one, which is just to fix the errors as we go along.
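As a quick numerical sanity check of the expanded identity (my own illustration, not from the talk), the three-application version can be verified directly with numpy on a random substochastic matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.random((n, n))
# scale rows so A is 0.9 times row-stochastic: spectral radius <= 0.9,
# hence (I - A) and (I - A^8) are both invertible
A = 0.9 * W / W.sum(axis=1, keepdims=True)

I = np.eye(n)
A2 = A @ A
A4 = A2 @ A2
A8 = A4 @ A4

lhs = np.linalg.inv(I - A)
# three applications of (I - A)^-1 = (I - A^2)^-1 (I + A):
rhs = np.linalg.inv(I - A8) @ (I + A4) @ (I + A2) @ (I + A)
max_err = np.abs(lhs - rhs).max()
```

In exact arithmetic the two sides are equal; numerically `max_err` is at the level of floating-point roundoff.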
So the errors are blowing up on us, and instead of going start to finish in one shot, sparsify, square, sparsify, square, we'll do some blocks of it, and when the errors get too big, we'll wrap it in the iterative method to push the error back down before recursing. You can think of it as paying a running-time hit; I'll say this more precisely on the next slide, but we're going to call the thing we construct a couple of times and fix our error by wrapping it in something iterative. In slightly more detail: think of the first identity I wrote, (I - A)^(-1) = (I - A^2)^(-1) (I + A), as a reduction. It says that applying (I - A)^(-1) reduces to applying (I + A) and then applying (I - A^2)^(-1). And if I apply it three times, say, I've reduced applying (I - A)^(-1) to applying (I - A^8)^(-1) and then (I + A^4), (I + A^2), (I + A). The point is that eventually, once I get a high enough power of A, the inverse of (I minus that power) is pretty close to the identity, and it's an easy system to solve. Putting this together: if I apply j minus i steps of the recursion, I've reduced applying (I - A^(2^i))^(-1) to the same thing with j. Your goal is to reach 2^(log n), or 2^(log kappa), and it takes j minus i steps of the recursion to go from i to j. The cost of this reduction is that you need to apply all the appropriate powers of A: A^(2^(j-1)), A^(2^(j-2)), all the way down to A. (My video is hiding part of the slide.)
Sorry, all the way down to A^(2^i); you have to apply all the corresponding powers. So a high-level picture of what we'll do algorithmically is to pre-compute approximations to these repeated squares; the approximations are for speed. Think of B_i as being approximately A^(2^i), with B_(i+1) approximately equal to B_i squared, where the notion of approximation is the one we defined before, with some sparsification parameter epsilon_sparsify that I'll set shortly. And you can find a sparsifier of this form efficiently with O-tilde(n / epsilon^2) edges. Let d = log kappa; if you have a fairly nice polynomial stationary-probability setup, kappa is poly(n) and d is O(log n). Okay. Then what you get is that (I - B_i)^(-1) is approximated, up to an O(epsilon) sparsification error, by (I - B_(i+1))^(-1) (I + B_i); this is just the identity with the A's replaced by B's. So what we're going to do here is interleave two kinds of steps: a sparsification step that makes things worse, and an iterative-method step that makes things better. With delta steps of the recursion, and two error parameters epsilon_high and epsilon_low, you can think of the reduction as taking an epsilon_high application of (I - B_i)^(-1) to an epsilon_low application of (I - B_(i+delta))^(-1), where epsilon_high is roughly 2^(O(delta)) times (epsilon_low plus epsilon_sparsify). Okay.
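The chain of approximate squares B_i can be sketched in a few lines. This is a stand-in illustration under my own simplifying assumption: here `sparsify` just truncates small entries, purely so the shape of the computation is visible, whereas the real routine is the Eulerian sample-and-patch sparsifier from the first half of the talk.

```python
import numpy as np

def sparsify(M, eps):
    """Stand-in for the directed Eulerian sparsifier: zero out entries
    below eps times the largest entry.  Purely illustrative; the real
    routine samples edges and patches the result back to Eulerian."""
    out = M.copy()
    out[np.abs(out) < eps * np.abs(out).max()] = 0.0
    return out

def squaring_chain(A, d, eps):
    """Compute B_0 = A and B_(i+1) = sparsify(B_i @ B_i), so that
    B_i approximates A^(2^i), with error compounding over the levels."""
    Bs = [A]
    for _ in range(d):
        Bs.append(sparsify(Bs[-1] @ Bs[-1], eps))
    return Bs
```

With `eps = 0` the chain is exact repeated squaring; with `eps > 0` each level introduces a sparsification error, which is exactly the error that must then be beaten back down by the iterative method.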
So now, given a black box for this approximate matrix-inverse application, you can use preconditioned Richardson iteration, which is a classical tool, to reduce the epsilon error at the cost of applying the black box t times: applying the approximate inverse t times raises the error to the t-th power. So the idea is that we now have to solve a recurrence. We want to square d times; break those d steps into d/delta blocks of size delta. We do a block of size delta, then call it a bunch of times inside the iterative method to improve the precision, then repeat, and then we solve the recurrence. Let T(k, epsilon) be the time to epsilon-approximately apply the k-th element of this chain. The first relation, the delta steps of recursion at the top, says that T(i, epsilon_high) equals T(i + delta, epsilon_low) plus some nearly-linear work in delta and 1/epsilon^2 for the sparsification.
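The preconditioned Richardson step mentioned here is classical, and a minimal numpy sketch looks like this (my illustration, not the paper's implementation): given an approximate inverse Z of M, iterate x <- x + Z(b - Mx), and each step contracts the error by the approximation quality of Z.

```python
import numpy as np

def richardson(M, Z, b, t):
    """Preconditioned Richardson iteration: x <- x + Z (b - M x).
    If the approximate inverse Z satisfies ||I - Z M|| <= eps < 1,
    each step shrinks the error by a factor eps, so t steps turn an
    eps-quality solver into roughly an eps^t-quality one."""
    x = np.zeros_like(b)
    for _ in range(t):
        x = x + Z @ (b - M @ x)
    return x

# hypothetical demo: a well-conditioned system and a rough inverse
rng = np.random.default_rng(1)
n = 6
M = np.eye(n) + 0.1 * rng.random((n, n))
Z = np.linalg.inv(M + 0.01 * rng.random((n, n)))  # only approximately M^-1
b = rng.random(n)
x = richardson(M, Z, b, 40)
```

This is exactly the "make things better" step: a crude constant-quality approximate inverse, applied t times, yields error exponentially small in t, at a multiplicative cost of t applications.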
Then you have another relation, for the step that makes things better: the sparsification made things worse, and this one says that to get T(i, epsilon_low) you call T(i, epsilon_high) about log(epsilon_low)/log(epsilon_high) times. That's how you go back and forth between epsilon_low and epsilon_high. So we now just have a recurrence, which you can hack around with. I copied it onto a new slide, and now you just solve it. We set epsilon_high to a constant, let the errors range between a constant and 2^(-O(delta)), and make the sparsification error match epsilon_low; in your head, think of doing pretty strong sparsification, getting the error down below a constant with our eventual settings. If you play around a little with the recurrence I've written out, you end up with a solution saying the time to solve your original linear system becomes about n times 2^(O(delta + d log(delta)/delta)), plus some nearly-linear terms we'll add in later. Note that making delta bigger makes one term in the exponent bigger and the other smaller; you balance the two terms with delta around sqrt(d log d), add in the overhead for the non-recursive work at the beginning, and you end up with a running time of O-tilde(m + n * 2^(O(sqrt(log kappa * log log kappa)))). And that's it. Note that 2^(sqrt(log n log log n)) is smaller than any polynomial in n, so this gives us an O-tilde(m + n * 2^(O(sqrt(log kappa * log log kappa)))) algorithm. Okay, now I'm actually done. Any questions?

If there's no question, I have a very small one, which is just that you mentioned you improved this later on. Is that the part of the algorithm that you do in a different way?

Yeah. I mean, we still use the same
sparsification routines. The new paper is semi-existent: it's in Rasmus's thesis, which is publicly available, and we hope, slash promise, to have it on the arXiv in the next small number of weeks, but it's also on Rasmus's web page. Basically, we have a bunch of different templates for solving undirected systems, and this was the one that was easiest to make go through. What we did was take a different one of the undirected solvers as our iterative template, one based on approximate Cholesky factorization and approximate Schur complements, where you do block-elimination-type things; it's based on the paper by Sushant Sachdeva and Rasmus Kyng. The way we got around the error issues there, and in some sense we did it in two ways, but one of them was this: the matrices you solve also get smaller in those methods. If you do something based on partitioning and approximate Schur complements, which is what these methods do, you reduce solving a big matrix to solving some smaller block matrices, so you can actually afford to run them a couple more times without blowing up your running time. Here we were solving original-size matrices, so we really needed the iteration count not to blow up; if you're reducing to slightly smaller matrices, you can call them a couple more times to make the errors go away. Because it reduces solving an n-by-n system to solving some n/2-by-n/2-ish systems, we could pay for it in the running time. We also got a stronger error analysis there for the iterative method we're using, which we haven't been able to translate back yet, but I think hopefully we can. So
basically the iterative method is where the issues were; the sparsification stayed the same. It took a little bit of better iterative methods and a little bit of better probabilistic analysis.

Thanks. Any other questions from the people who stayed? In that case, I'll thank everyone for staying and watching. We have a bunch of viewers on the YouTube channel; about 20 people were watching who weren't in the hangout, so this was a popular talk. Thanks a lot, John. Let me take this offline; it's going to stay on, John, but I'll turn it offline. Okay, great. Thank you.