Alright, well, thanks. So I'd like to tell you about approximate computation, which means what you think it does: you have some problem you want to solve that's intractable, either in the sense of being NP-hard or NP-complete, or in the sense that it takes n cubed time when you might only have something like n or n log n. And I'd like to tell you about its connections with something I'm going to call regularization. If you're familiar with that term, I'm going to be using it in exactly the same way, though taking a fairly non-standard approach to it. If you're not familiar with it, regularization is a general class of techniques in machine learning and statistical data analysis for computing, in some sense, a more robust or more useful answer, typically defined in terms of what the downstream practitioner cares about, so better classification or whatever; of course you can formalize it in various ways. I've thought a lot about some of these theory questions and also a lot about applications in scientific and internet data analysis, and I think some of the questions we're talking about here really get at the heart of the disconnect this workshop is addressing.

So the motivating observations are the following. I don't think anyone's going to argue with the observation that the theory of NP-completeness is a very useful theory. It's an imperfect guide to practice, but it's a very useful theory: it provides a qualitative notion of fast, and it provides qualitative insight into when algorithms will perform well or not. You heard about linear programming and some related things; in some sense that's the exception that proves the rule. I think it's also fair to say that in a wide, wide range of large-scale data analysis applications, the modern theory of approximation algorithms is nowhere near as analogously useful, and that's part of the motivation for what we're talking about here. The bounds you get are very weak, you can't get the constants, the dependence on various parameters is not even qualitatively right, and it doesn't provide an analogous sort of qualitative insight. So can we get beyond this?

I'll start with the conclusions. The first is the claim I just made. The second is a claim that if you've played around with practice (we were talking yesterday about putting your boots on and getting into some data), you'll know that approximation algorithms and approximation heuristics, which we've heard are essentially the same thing except that you maybe can't prove results about the latter, along with various heuristics like binning, pruning, and all sorts of things like that, oftentimes implicitly perform regularization, in the sense that if you apply them you'll often get more robust or better solutions as measured by what some downstream analyst cares about. So one question is: in a large-scale application, when we run an algorithm we're actually doing something, so can we characterize what we're doing? Can we characterize the regularization properties implicit in worst-case approximation algorithms? The usual perspective is: gee, I want to compute this intractable thing, it's intractable in some sense of the word, so I'm going to settle for the output of an approximation algorithm, and this output is good insofar as it approximates the intractable thing. But you've also just computed something.
So rather than just saying, gee, I computed something that approximates this, can you say: in fact, I exactly optimized something else? And in particular, can you say: in fact, I exactly optimized a regularized version of this? Because if you can, then what you're saying is that you've computed a better answer than if you had been given an oracle for the intractable problem. I'll present an empirical slide at the end having to do with a particular application in large social and information networks. Those graphs are sparse enough and noisy enough and expander-like enough that all of these issues come into play, and that's exactly what happens: in that application, if you had given me an oracle for the intractable problem we wanted to solve, it would have been worthless. On the other hand, understanding the various approximation algorithms, and essentially the implicit statistical properties or implicit geometric properties or implicit embedding properties going on inside them, was really crucial for the downstream application. So the solutions produced by approximation algorithms need not be something we just settle for; they can actually be what you want, and better.

Computer scientists, as you know, typically view the data as a record of everything that happened: all the purchases at Walmart, all the clicks at a web page. The goal is to chew on the data to find something interesting; what you want is probably hard in some sense of the word, so you develop approximation algorithms to be fast. But the data are all there is, and the task is to chew on the data. A very different perspective is what statisticians and machine learners would take, which is that the data are valuable because they're a particular noisy instantiation of something going on in the world, and the goal is to extract information about the world, something called inference, so you make some model and so on and so forth. Now, if what the statistician is doing takes the age of the universe to compute, and what the computer scientist is doing leads to worthless answers, you're obviously not going to do exclusively one or the other. But there really is a very significant disconnect between these two perspectives, and it gets at what we're talking about.

So regularization, as I said, arose in integral equation theory for solving, in some sense, ill-posed problems. You compute a better or more robust solution. It involves, sometimes explicitly but more typically implicitly, making some sort of assumptions about the data, in the sense that a given algorithm is, quote, the right thing to do for certain data, and it provides a trade-off between solution quality, typically measured by the objective function, and solution niceness, measured in some other way that the downstream person cares about, or whatever. You can formalize it in various ways, but essentially what's going on is a trade-off between solution quality and solution niceness. The way it's usually implemented is the following. I want to optimize some f of x, and g of x is the way I'm going to quantify niceness. So I might say g of x is a norm of the vector x, the one norm or the two norm or the infinity norm, and I optimize f of x plus lambda times g of x. Lambda is a parameter that toggles between the two, the niceness measure and the objective function; it's determined by some other criterion, and you optimize this combined thing.
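Written out, the combined objective just described is the standard penalized form

$$\min_{x}\; f(x) \;+\; \lambda\, g(x), \qquad g(x) = \|x\|_1 \ \text{or}\ \|x\|_2 \ \text{or}\ \|x\|_\infty,$$

where lambda is the knob that trades the objective off against the niceness measure.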
So, for example, you might be interested in least-squares regression, and you want to tack on plus lambda times a one-norm penalty or a two-norm penalty. If you think the data are sparse you might use the one norm; if you think the data are something else you might use the two norm. Conceptually, the interesting thing is that when you add regularization this way, you often get a harder problem: you go from a linear-algebra problem to, in this particular case, a linear-programming-type problem, which can be solved in other ways, but conceptually you've gone to a harder problem. So the question we're going to ask here is: can we, in some sense, formalize the idea that approximate computation can implicitly lead to better or more regular solutions?

We'll do this in two different contexts. In the first, I'll be talking about approximate computation of a certain vector, the top eigenvector of the Laplacian, and we'll have a very precise theoretical characterization of how the approximation algorithm is implicitly solving a regularized version of the original problem. The second example has to do with a certain graph approximation problem, graph partitioning, if you're familiar with that; there we do not have as precise a theoretical characterization, but all the theoretical evidence points in the right direction, and the empirical evidence is sort of overwhelming, and I'll present that as well.

So the basic idea: we're given a Laplacian matrix A, symmetric positive semi-definite, some nice matrix. You take that matrix, you choose a random vector or an arbitrary vector, you hit it with that matrix, and you iterate that process a million times. That's the power method. There are lots of variants of it, but basically you iterate a million times. When you do it a million times, measure-zero events don't happen, certainly not numerically, so you're not going to have chosen something exactly in the wrong subspace, and you get the top eigenvector of the Laplacian. Exactly. What if you do this process three times? Or five or ten or twelve times? Now, clearly you get something that approximates that top eigenvector, in the sense that the Rayleigh quotient is about right, but can you in some sense say that you got a regularized version of your problem? And variants of this are what's done in practice, in a couple of different ways. I described the power method; there are a bunch of other diffusion-based versions of this. Heat kernel: you put some mass somewhere and run for some number of time steps t. PageRank: you diffuse around and do some global teleportation, with a teleportation parameter gamma. Or q steps of a lazy random walk: this is the power method, except it's a little bit lazy. You do three or four or five or twelve steps of these. This is very common, and probably a lot of you have seen people run these sorts of diffusion-based procedures. So the question is: if I run for some number of time steps, or if I jump with some probability, or if I do only three steps of the power method, I approximate the Rayleigh quotient in some sense, but do I exactly solve some other problem, a regularized version of the Rayleigh quotient?
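To make the "three steps versus a million" picture concrete, here is a minimal power-method sketch; the small example matrix and the particular step counts are illustrative placeholders, not anything from the talk's slides.

```python
import numpy as np

def power_method(A, num_steps, seed=0):
    """Run num_steps iterations of the power method on a symmetric matrix A,
    starting from a random unit vector.  Many iterations give (essentially)
    the top eigenvector; a handful of iterations give a vector whose
    Rayleigh quotient is about right, i.e. the early-stopped approximation
    discussed above."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_steps):
        x = A @ x
        x /= np.linalg.norm(x)
    return x

# Illustrative placeholder matrix (symmetric positive semi-definite).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
x_converged = power_method(A, num_steps=10_000)  # "a million times", effectively
x_early     = power_method(A, num_steps=3)       # "what if you do this three times?"
print(x_converged @ A @ x_converged, x_early @ A @ x_early)  # Rayleigh quotients
```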
So here is the optimization problem; the solution of this is the eigenvector you want. I'm not going to explain what all the letters are, but the point is that I can write down the optimization problem, take my objective, and put regularization there in the usual way. Now, alternatively (here the x is a vector), I could write this as a semi-definite program, where X is now a symmetric positive semi-definite matrix. I've seemingly made the problem harder, but it turns out you can take the solution of this to be rank one, so capital X equals lowercase x times lowercase x transpose, and lowercase x is exactly that eigenvector. So the two formulations are equivalent in that sense. Anyway, having seemingly made my life harder, I can work with this formulation and put my regularization here. It turns out that for a lot of these spectral methods this is where the duality is tightest, so this is what I'm going to talk about. Now, I'm going to want to run a fast diffusion, so I'm not actually going to be solving any SDPs, but this is what's going on under the hood.

So let's look at this regularization: L dot X plus lambda times F of X. I've changed letters; the g from before is now F, sorry. Let's call this the (F, eta)-regularized SDP: I want to minimize L dot X plus, now calling the weight one over eta, times F of X, subject to the usual constraints. So here's a theorem that's very straightforward to prove. Given this setup, these are sufficient conditions for X star to be an optimal solution of this regularized SDP: that X star satisfies these constraints, and that X star is of this form. Again, I'm not going to explain all the letters, but the point is that X star takes a particular form that is actually pretty nice, so there's an explicit characterization here. And as a corollary (the F appears here in the objective and there in the characterization): if I choose F to be a generalized entropy, then I get the heat kernel matrix, and the number of time steps t corresponds to eta, the regularization parameter. If I choose F to be a log-determinant, I get PageRank, and the teleportation parameter is related to eta. And if I choose F to be a certain matrix p-norm, then I get the truncated lazy random walk, where doing three or four or five steps corresponds to different values of the regularization parameter. We heard Dan Spielman, yesterday or two days ago, saying that when you're doing more complicated iterative algorithms you can use up randomness and so on, and it's tricky. This sort of thing is empirically going on under the hood in much more general settings, but here we have a particularly simple setup, and we have a very precise characterization: when you run any of these diffusion-based procedures, these procedures for approximating the top eigenvector of the Laplacian, you are exactly computing, exactly computing, regularized versions of this Fiedler vector. Alright?
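For reference, the regularized SDP being described has roughly the following shape (a reconstruction, with constants and normalizations glossed over):

$$\min_{X \succeq 0,\ \operatorname{Tr}(X)=1}\;\; L \bullet X \;+\; \frac{1}{\eta}\, F(X),$$

where, as best I can state it here, the generalized-entropy choice is something like $F(X)=\operatorname{Tr}(X\log X)-\operatorname{Tr}(X)$, the log-determinant choice is $F(X)=-\log\det X$, and the matrix $p$-norm choice is $F(X)=\tfrac{1}{p}\operatorname{Tr}(X^{p})$; these are the three F's that the corollary pairs with the heat kernel, PageRank, and the truncated lazy random walk, respectively.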
Question from the audience: so the algorithm determines the F, and the number of iterations determines the lambda? Answer: correct, the eta is the lambda, yeah. Question: what about something like the Lanczos method? Can you make an analogy? Because these are all diffusion-based, and Lanczos is a little bit more of a subspace method. Answer: at a high level, Lanczos is no different from this; it's an iteration, and at each step you need to orthogonalize this against that. The orthogonalization makes things very tricky in terms of the analysis. So empirically the same thing is going on, and I'm sure that at least for idealized forms of it this sort of analysis would go through. But if you look at actual Lanczos code, there's a lot of complicated stuff going on that would be hard to analyze. Structurally, though, this is no different from Lanczos. Comment from the audience: part of the reason it works also just has to do with the Chebyshev polynomial. Answer: right, if you run a vanilla version of this diffusion, the power method, the vectors become very ill-conditioned, and that's dealt with by keeping them orthogonal. So this would carry over to that. Yeah.

Alright, so here we computed an eigenvector, and we have a very crisp characterization of the sense in which approximate computation implicitly performs regularization. I haven't solved an SDP; I don't need to call an SDP solver; I just run three steps of the power method. So you may ask: is this peculiar to vector computations, or can it apply more generally, say to graphs? So let's talk about graph partitioning. There are a bunch of ways to formalize it; basically you want to split the graph into two sets of nodes, each of which has some sort of internal niceness, and both of which are about the same size. These formulations are going to be intractable. I'll be talking about conductance. If you're not familiar with conductance, it's basically expansion; it's a degree-weighted version of expansion, in the sense of an expander graph. And the great thing about conductance is that it's been studied from every possible perspective, except maybe this one: a lot in theory, a lot in practice, by scientific computing people, by numerical analysts, by machine learners, and so on. So there's a very good understanding of the lay of the land, of both the theory and the practice. The lay of the land is this. There are spectral methods: compute an eigenvector, in fact the eigenvector I was just talking about, and use that eigenvector to cut up the graph. There are flow-based methods: a totally different algorithm, a totally different relaxation, doing something totally different, related to multi-commodity flow; very strong theory, lots of practice. Clearly, if you're going to implement these things in practice, you need to do local improvement, multi-resolution, et cetera, et cetera. If you're familiar with the Arora-Rao-Vazirani result, they basically combine these two in a certain way. But this is roughly the lay of the land, and we're going to be talking about spectral and flow, because they come with strong underlying theory and we actually know how they behave in practice.

To go through the underlying theory in a little more detail: if you're familiar with this stuff you've probably seen it, but maybe not from this perspective, so here's the high level. Spectral: you compute an eigenvector and appeal to Cheeger's inequality to get what are called quadratic worst-case guarantees. These quadratic worst-case guarantees are, in fact, in terms of the conductance, so there's no n, no number of nodes; you don't get some approximation guarantee of log n or whatever in terms of n. It's in terms of a structural parameter of the graph, which is awkward for theoretical computer science, right? Because we like to parametrize things in terms of the number of nodes, and you can hide all sorts of intractable stuff in graph parameters. This is sort of a magical graph parameter. So it's quadratic in the worst case.
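For those who haven't seen it, the standard definitions behind this are: the conductance of a set of nodes $S$ is

$$\phi(S) \;=\; \frac{|E(S,\bar{S})|}{\min\{\operatorname{vol}(S),\,\operatorname{vol}(\bar{S})\}}, \qquad \operatorname{vol}(S)=\sum_{v\in S}\deg(v),$$

with $\phi(G)=\min_{S}\phi(S)$, and Cheeger's inequality for the second-smallest eigenvalue $\lambda_2$ of the normalized Laplacian says

$$\frac{\lambda_2}{2} \;\le\; \phi(G) \;\le\; \sqrt{2\lambda_2},$$

so the cut you read off the eigenvector has conductance at most $\sqrt{2\lambda_2}\le 2\sqrt{\phi(G)}$: quadratically worse than optimal in the worst case, but with no dependence on $n$.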
You can ask: is this quadratic thing real, or an artifact of the analysis? In fact, it is real. There are graphs that are that bad, basically the Guattery-Miller-type cockroach graphs. And the worst case is a very local property, basically having to do with the diffusion: it takes a lot of effort to push probability mass down the cockroach's leg, the one-dimensional thing. In some sense, this method embeds you in a line or in a complete graph. Flow is a totally different method. You take the integer program and relax it in a very different way; you don't compute an eigenvector, you solve an LP, and you get log n worst-case bounds, nothing to do with a structural parameter, but the usual dependence on log n. The worst-case bounds are achieved on constant-degree expanders, basically because everyone is a log n distance from everyone else on average. The worst case is a very global property, nothing local going on; it has to do with the global capacity, because everyone is on average a long distance away. And this is effectively embedding you in L1.

So you have two methods with complementary strengths and weaknesses, and this actually highlights a fairly egregious theory-practice disconnect. If you ask anyone in theoretical computer science what's a good algorithm for graph partitioning, they'd say flow (maybe ARV now, but they'd say flow): it gives you log n, independent of anything in the graph, and maybe spectral is OK for expanders, because the quadratic of a constant is a constant. If you ask anyone outside theoretical computer science, they may not have heard of flow, but they'd say, gee, spectral is pretty darn good: for anything we apply it to, spectral partitioning works. (That's not quite true, because some people try to cut up things they shouldn't, but that's another story.) And they'd say, why would you be cutting up an expander? That makes no sense. So this highlights what seems to me a fairly egregious theory-practice disconnect: theory people say algorithm A is better than B, everyone else says B is better than A. It's a nice hydrogen atom for asking what's going on, if you want to go beyond worst-case analysis. Empirically, if you look at the nodes you pull out of the graph with method A or method B, what you compute is determined at least as much by the approximation algorithm as by the objective function. In many cases it will say more about whether you used a spectral approximation or a flow approximation than about the fact that you happened to use conductance as your objective rather than something else. Before, we had an explicitly imposed geometry: we had the g of x and so on and so forth, and you explicitly added it as an extra constraint or a Lagrange multiplier. As many of you know, what's under the hood in all sorts of approximation algorithms is embeddings. You don't do it explicitly, but you take the graph and stretch it out a bit in some strange way (it might be a log n stretch, a la Bourgain's result, whatever), and you embed it in a nice place. Embedding into that nice place, intuitively, might be trimming the corners, making things a little bit nicer or more robust or more useful for a downstream practitioner. So we don't have a crisp formalization of the way in which this is exactly doing regularization, the way I had for the eigenvector case.
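As a toy-scale illustration of what "the nodes you pull out of the graph" with the spectral method means, here is a minimal vanilla sweep-cut sketch; it's a dense, illustrative implementation, not the local-improvement or flow-based machinery discussed above, and the little example graph is made up.

```python
import numpy as np

def spectral_sweep_cut(A):
    """Vanilla spectral partitioning on a small dense adjacency matrix A:
    take the second eigenvector of the normalized Laplacian, sort the
    vertices by it, sweep over prefixes, and return the prefix (set of
    nodes) of lowest conductance."""
    deg = A.sum(axis=1).astype(float)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Normalized Laplacian  I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(deg)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    order = np.argsort(d_inv_sqrt * vecs[:, 1])   # sort by the (rescaled) second eigenvector
    vol_total, vol_S, cut = deg.sum(), 0.0, 0.0
    in_S = np.zeros(len(deg), dtype=bool)
    best_phi, best_set = np.inf, None
    for v in order[:-1]:                          # all proper prefixes
        cut += deg[v] - 2 * A[v, in_S].sum()      # cut changes by deg(v) - 2 * (edges from v into S)
        vol_S += deg[v]
        in_S[v] = True
        phi = cut / min(vol_S, vol_total - vol_S)
        if phi < best_phi:
            best_phi, best_set = phi, np.flatnonzero(in_S)
    return best_phi, best_set

# Tiny illustrative graph: two triangles joined by one edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(spectral_sweep_cut(A))   # finds one of the triangles, conductance 1/7
```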
But a fair question is: is this all nice talk, or do you actually see it? So this work actually grew out of a large-scale study of large social and information graphs, which challenge every niceness assumption you might want to imagine the data have. In particular, they're expanders, though not constant-degree, and they have long stringy pieces and, you know, extreme sparsity properties. So you might imagine you'd see differences between spectral and flow. The y-axis here is conductance, and down is good; the x-axis is size. Don't worry about why I'm tearing the graph up at multiple sizes; say I'm interested in finding clusters of 1,000 nodes in my advertiser-bidded-phrase graph or whatever. Down is good. Red is a flow-based method, and blue is a spectral-based method. What you see here is that flow is clearly better than spectral, no questions asked. So if you're interested in this class of graphs at this size scale, and you really believe conductance is what you want, there's no point in touching spectral; flow is much better, period. On the other hand, you might ask whether this is useful for some downstream application, the particular clustering or market identification or whatever. So on the right are two different niceness measures, and again down is good. You can quantify niceness in a bunch of different ways; these particular ones are the diameter of the cluster and the size of the deepest cut inside the cluster. At this size scale you get a lot of garbage as well, so you want to bias yourself toward the bottom, but in both of these cases spectral, in blue, is better than flow: you're getting a much nicer set of nodes by these downstream metrics that people care about, alright?

Stepping back, formalizations aside, this trade-off between solution quality and solution niceness is effectively the defining feature of regularization, except I haven't tacked on a regularization term here; you're observing it as a function of two different approximation algorithms for the same intractable objective. Now, you could say, gee, conductance is the wrong thing to be optimizing; choose some conductance-prime and optimize that. But conductance-prime is also intractable, you're going to have to relax it in the same way, and you can relax with spectral or flow or something else. So if you want to scale up to large graphs or large matrices or whatever, you're going to have to run some approximation algorithm, and what you see is determined at least as much by the approximation algorithm as by the original objective. In a lot of applications you see these properties. We have a very crisp characterization, in one context, of the way in which approximation algorithms lead to implicit regularization, and a lot of evidence, empirical and theoretical, in the other application. And it seems that, as Michael Mitzenmacher said this morning, a lot of the stuff we do is sort of tweaking worst case; this is basically asking, what are you actually computing, and what are the implicit properties of the thing you're actually computing? If you run algorithm A or B, what did you just compute, and when is it in fact useful or not? So these approximation algorithms really aren't something we settle for; in some cases they're better than the solution of the intractable thing. But clearly you can't claim that naively. And I suspect this is applicable a lot more generally, beyond these applications. So with that, let me wrap up, and I guess we can head to lunch, unless there are any questions.