So, let me begin my talk by saying that if you like this conference, you may also like another conference we're organizing in Aspen next spring. The deadline for applying is very soon, in about a week, and the details are on the Aspen Physics web page if you want to know more.

So, let me tell you about algorithms now. My title made me think of something I read about Netflix: in planning new TV shows, they use data mining to figure out which topics are likely to be popular, and then combine those. So it may seem like I've just picked two important buzzwords and combined them, and I should put them into context. If you've seen the Gartner hype cycle, from the innovation trigger to the peak of inflated expectations, the trough of disillusionment, the slope of enlightenment, and the plateau of productivity, you shouldn't take the quantitative placement too seriously, but machine learning is doing very well and quantum computing is on its way up. That's what I'm going to tell you about today. Obviously I don't need to tell you why machine learning is doing so well, but part of the reason is that we have these big data sets that we can analyze.

So let me give you my theorist's perspective on big data. The way I think about it is this: if your data size is N, then I would call the data big if you cannot afford anything much more than linear, or nearly linear, in N. Something like order N squared is out of reach. So here's a data center, that's the inside, and it looks pretty big. As an example, here's a very simple problem that we can definitely solve in polynomial time: linear regression. You want to find x to minimize the norm of Ax minus b, and your data size N might be the number of nonzero elements of the matrix A. If that is some very big number, say the number of hyperlinks in the web graph, then you cannot solve this with anything much more than linear: Gaussian elimination is out of the question, but an iterative solver that's nearly linear, like PageRank, you could achieve with great effort. So that's the kind of size I'm thinking of. People have had a lot of success with these nearly linear time algorithms, but there are also limitations; there's only so much sophistication you can fit in. And so it's natural, when you run into computing barriers, to ask: can we use a quantum computer to help? If Moore's law is running out, can quantum computers help with this task? That's what I want to address in this talk.

At first, it seems as though it's going to be hard to do anything useful with a quantum computer and a giant classical data set. If I have input size N, then I've probably already spent effort scaling with N just to acquire the data, before we even load it into the computer. If N is the number of hyperlinks in the web graph, it takes order N time just to pull those off the web, let alone to store them, let alone to program them into the computer. And that order N assumes your computer is fully connected, with every part talking to every other part at zero cost. Here are some images of what a quantum computer might look like, and one thing you don't see is a network of N squared wires connecting every qubit to every other one. There may be some long-range gates (in ion traps, for example, you have these collective modes), but it's not free. It's not as though things are fully connected.
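Going back to the linear-regression example for a moment: the sketch below (not from the talk) shows what a nearly linear-time solver looks like, namely plain gradient descent on the least-squares objective with a sparse matrix, where each iteration costs two sparse matrix-vector products, that is, O(nnz(A)) work. The dimensions, density, and iteration count are made up for illustration.

```python
# A minimal sketch: gradient descent on ||Ax - b||^2 with a sparse A.
# Each iteration does two sparse matrix-vector products, i.e. O(nnz(A)) work,
# in contrast to Gaussian elimination.  Sizes and the iteration count are
# illustrative only.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 1_000
A = sp.random(n_rows, n_cols, density=1e-3, format="csr", random_state=0)
b = rng.standard_normal(n_rows)

step = 1.0 / (A.data ** 2).sum()   # 1/||A||_F^2 <= 1/sigma_max(A)^2, a safe (if slow) step size
x = np.zeros(n_cols)
for _ in range(500):               # a constant number of passes over the nonzeros of A
    residual = A @ x - b           # O(nnz(A))
    x -= step * (A.T @ residual)   # O(nnz(A))

print("residual norm:", np.linalg.norm(A @ x - b))
```

In practice you would use a Krylov method such as LSQR or conjugate gradients, but the budget is the same: a modest number of passes over the nonzeros, which is the sort of cost a web-scale matrix still allows.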
So, coming back to the hardware picture: the cost of getting N pieces of data in is at least N, maybe even more. So quantum computers do not seem very compatible with that previous picture I drew of the giant data center. And when people think about exponential speedups, or even polynomial speedups, you would expect an advantage in the places where the difficulty of the problem grows rapidly with the input size. Even if you knew nothing about quantum, if you just knew it was a beyond-classical computing platform, qualitatively more powerful but also expensive to build, you would look for advantages exactly where we are finding them. You'd look in cryptography, because there we have engineered cryptosystems whose complexity scales exponentially with the size of the input, a sort of artificially generated complexity. Or you'd look for natural complexity, as in quantum simulation. Optimization is another case where the solution space you're searching over can be exponential in the input size. There the exponential is not so obvious, because maybe there are better ways of searching over it; maybe the landscape is friendlier than the worst case you could imagine. But if you don't know how to handle a complicated, bumpy solution landscape, then it takes exponential time. And you might look at classical systems where things aren't necessarily exponential but the constants are very large, like protein folding with its multi-scale dynamics. Those are other places where you might look.

So that's the typical place where you would look. But how can this help us analyze a big data set? In particular, can we use Grover search? Let's remember how Grover search works: given the ability to compute a function f, find an input x such that the function evaluates to one. Classically, I need order N evaluations; quantumly, I need order root N time, order root N evaluations. And there are all kinds of similar semi-structured or unstructured searches: maximizing a function, approximate counting, sampling from a Gibbs distribution, evaluating an AND-OR tree as in game-tree evaluation, finding triangles in a graph, et cetera. Many, many other algorithms exhibit quadratic speedups, but always in this kind of model, where I'm not searching over a physical data set. This does not say: I have a data set on a hard drive, and I'm searching for some record on that hard drive. It says: I have a function that I can evaluate on a quantum computer, and my cost is the number of evaluations of that function. So just to reiterate, the data model is this oracle model, where I have an input x, which goes into some kind of subroutine, and I can evaluate the function and get out the answer.

However, this oracle model does not fit what a data center looks like. If I have a data center, I cannot send it a superposition of queries for different bits; that's just not how it works. Worse, even if I could do that, that's probably not how I would use Grover. Say I take a data center and I want to ask: is there a one on any of the hard drives, or are they all zeros? There are better uses of Grover, but let's just say I want to do that one. I would not want to Grover search over the bits in the data center.
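To pin down the oracle cost model just described: the only resource being counted is evaluations of f. Here is a toy statevector simulation of Grover search (a sketch, with a made-up marked item), where roughly the square root of N applications of the oracle-plus-diffusion step concentrate the state on the marked item, versus roughly N classical evaluations.

```python
# Toy statevector simulation of Grover search over N items with one marked item.
# The point is the cost model: only oracle applications are counted, and about
# sqrt(N) of them suffice, versus about N classical evaluations of f.
# The marked index is made up for illustration.
import numpy as np

N = 1 << 10                                # search space of size 1024
marked = 731                               # the unknown x with f(x) = 1
state = np.full(N, 1.0 / np.sqrt(N))       # uniform superposition

n_iters = int(np.floor(np.pi / 4 * np.sqrt(N)))
for _ in range(n_iters):
    state[marked] *= -1.0                  # oracle: phase flip where f(x) = 1
    state = 2 * state.mean() - state       # diffusion: inversion about the mean

print(f"{n_iters} oracle calls; P(marked) = {state[marked] ** 2:.3f}")
```

The oracle here is just an index lookup because the simulation already knows the answer; in a real application, f would be a reversible circuit that the quantum computer evaluates on a superposition of inputs, which is exactly what a warehouse full of hard drives does not give you.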
Instead, what I would do is this. Each of those hard drives has a CPU attached to it, or maybe every four hard drives share a CPU; let's just send a message to all of them, tell them all to check their own hard drives and report back. In other words, the storage-to-CPU ratio does not usually grow in an unbounded way; as we add more storage, we're also adding more controller chips and more computing. Classical computers are not fully parallel, but the ratio of storage to computing doesn't get so out of control, so if you just want to search for one record, you would probably do it in a distributed way. The Grover model would not make sense for searching a data center even if you could query it in superposition. There is a proposal for quantum RAM, which is very intriguing and, I think, a worthwhile thing to work on, but we don't have it yet, and the proposals to build it don't seem tons easier than full-scale quantum computers. They might be, but no one is proposing terabytes of quantum RAM and claiming it would cost anything close to an ordinary data center; the proposals seem to involve a lot of the same elements as full quantum computers.

So the way we get speedups on quantum computers has been with what you might call synthetic data. You can use Grover to solve useful problems; one example is AES. If you want to search over all 2^128 of the 128-bit keys, you can do that with only about 2^64 evaluations of the AES function, and people have worked out that each evaluation is not that many gates. So that's something you could do. The quantum linear-system solvers and PDE solvers have a similar issue: you can't just load the input in. The input should not be something that lives on a hard drive; it has to be implicitly specified. The case where you have this is something like a finite-element model of an implicitly specified shape, and you could use that to prepare a quantum state corresponding to that shape discretized at a very fine scale. In combinatorial optimization, of course, you have a small amount of input data that implicitly defines a large search landscape. Shor's algorithm famously works with a periodic function whose truth table you never write down; you implicitly describe it with just a few numbers and then compute it in superposition. And very recent, beautiful work by Bravyi, Gosset, and König does a similar thing for Bernstein-Vazirani: that was an oracle problem, and they gave an explicit instantiation of the oracle as a computable function that takes an input and produces an output satisfying the oracle promise. These constructions are beautiful when they work, but we're not guaranteed they will always work; sometimes you get a flash of insight that lets you do it, but these seem to be, in a sense, special cases.

So what about big data? What can we do with it on a quantum computer? Let me specialize, although it's not a very specialized problem; here's a concrete version of what I mean. Suppose I want to do maximum likelihood estimation. What this means is that I have a set of models Y. It could be a mixture of Gaussians, with centers and covariance matrices, or it could be the weights of a Boltzmann machine, or whatever you like. And I have a giant data set X. I assume that the data set consists of i.i.d. samples drawn from the distribution specified by the model.
So assume there's a true model generating the data points, and I get a billion observations of them, something like that, and I want to figure out which model was most likely to have generated those points. So I want to maximize over Y the likelihood of model Y. That involves maybe this term R(Y), which you can think of as either a prior or a regularizer, something that doesn't depend on the data, plus a sum over every data point of the log likelihood of observing that data point given the model. So that's the canonical maximum likelihood estimation problem: a maximum over a sum. A closely related problem is Bayesian inference. Now I have a prior pi_0(Y), and I want to sample from the posterior distribution, which is basically the prior reweighted according to all of the data. The reweighting involves, again, the sum over X of this log likelihood, and then there's a normalization factor Z, the partition function. So these are the basic questions I want to consider, and the question I want to answer is: can a quantum computer speed this up? Please interrupt me, by the way, if there are questions about what the model is.

OK, so how long does this take to run? If Y has no structure, then I'm maximizing over the whole set Y, and in the inner loop I'm summing over the whole set X, so the runtime is going to be the size of X times the size of Y. Now, in practice we often do something much better: maybe we're doing gradient descent on Y, or we have some other way of handling structure in Y that's much better than brute-force search. However, there are cases where we don't know of much structure, and there are more cases where, after we take advantage of structure, we're still left with something that looks like a search. Maybe gradient descent finds the closest local optimum, but we still have to do a blind search over the choice of starting position; so there's an unstructured component, and then maybe a gradient descent part. I'm going to leave aside the possibility of useful structure, assuming we've already taken advantage of it, and just suppose that we can't do anything intelligent with Y.

So in this case, can we use Grover to speed this up? Can we get the Grover speedup here? Exponential would be even better, but we should at least know this much before we go on to fancier algorithms. Or maybe there are better algorithms. One thing we could do is a Metropolis walk: take a point Y, find a nearby point, and maybe move to it or not, depending on whether it improves our score. There are quantum speedups of this; in general, if you have a classical walk, there are quantum walks that achieve the square root of the hitting time. So can you do a quantum walk that gets a quadratic speedup over the classical Metropolis algorithm? Or maybe you're going to run simulated annealing and you want to replace it with adiabatic evolution. That's something where we don't always know what the improvement is, but we'd like to try it out; it seems to be kind of incomparable to a lot of classical algorithms. But adiabatic evolution involves evaluating the cost function as you go along; you have to turn the cost into an energy. So how can all of these deal with the problem of a large amount of classical data? For concreteness, I'm going to focus on Grover in this talk, but most of the same considerations, maybe with some small modifications, apply to the more sophisticated algorithms as well.
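Here is a minimal sketch of the brute-force baseline on a made-up one-dimensional Gaussian-location model: every candidate model in Y is scored against every point in X, so the work is |Y| times |X|. Grover over Y would cut the first factor to roughly its square root, but it would leave the sweep over X in the inner loop untouched, which is the problem this talk is about.

```python
# Brute-force maximum likelihood over a finite model set Y: every candidate
# model is scored against every data point, so the work is |Y| * |X|.
# The model family (a unit-variance 1-D Gaussian with unknown mean) and the
# grid of candidate means are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 2.5
X = rng.normal(true_mean, 1.0, size=100_000)   # the "big" data set
Y = np.linspace(-10, 10, 201)                  # candidate models (means)

def log_likelihood(y, data):
    """sum over x of log p(x | y) for a unit-variance Gaussian with mean y."""
    return -0.5 * np.sum((data - y) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)

# |Y| * |X| work: one pass over all the data for each candidate model.
scores = np.array([log_likelihood(y, X) for y in Y])
print("maximum likelihood estimate of the mean:", Y[np.argmax(scores)])
```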
OK, so let's zoom in on maximum likelihood estimation. When you look at this at first, it seems almost symmetric: you have a max over Y and a sum over X, and Grover gets a square-root advantage for both max and sum, while classically you have to iterate over all of the points in either case. So max and sum almost seem the same; the problem almost looks symmetric between X and Y. But there's a big difference in the type of data. Y is not a physical data set; you could call it a synthetic set. This is something we can query in superposition. Maybe these are weights and centers of clusters, and I represent a model just by a bit string of length log of the size of Y that indexes into the set; I never write down a giant table of all of them. X, on the other hand, lives on a hard drive. These are individual records, and I do not have superposition access to them. So there is an asymmetry in the type of data, even though the problem, when you just write it mathematically, doesn't look so asymmetric.

So what can we do with this? Well, one simple thing is to have a classical computer controlling a quantum computer. You have your ion traps getting blasted with lasers, you're doing readout, and the lasers are being programmed by a classical computer that, as it goes along, reads the next record off the hard drive and figures out which laser pulses to do. So we can run Grover: the number of iterations over Y is the square root of the size of Y, but in each inner loop you still have to iterate over the whole data set; there's no savings there. That's OK, I mean, it's better than nothing, but it's not that impressive, and it doesn't seem very practical if your effort is going to depend in this way on the size of the whole data set.

So what I want to talk about in the rest of the talk is: can we reduce the dependence on the size of X? Can we run this on a quantum computer? I'm willing to Grover over Y, I'm willing to spend square root of the size of Y (keeping in the back of my mind that maybe I would do adiabatic evolution, maybe a quantum walk), but I really don't like that X there. I don't like the size of the data set also entering in; that seems to me devastatingly large. If I couldn't afford more than nearly linear time on a classical computer, it's hard to imagine I could on a quantum computer.

So I'll describe three results. None of them is that technically difficult, but I think they're interesting things to look at in this new model. The first is that you can non-adaptively reduce the data set. I have a data set X, which is gargantuan; I can use a classical computer to reduce it to a smaller set, X prime, and then feed that to the quantum computer. The quantum computer then iterates over this smaller data set, doing Grover in the same way: you still have to search over all of Y, but now the inner loop depends only on the size of X prime. So that's the first thing I'll talk about, non-adaptive data reduction. The second and third algorithms are interactive between the quantum and the classical computer. The second is called adaptive importance sampling, and it's a way of using the quantum computer to help construct X prime, instead of doing it just on the classical computer.
And then, in the final step, the quantum computer does the optimization. This one I think is promising, but I do not have any good provable performance guarantees. I think it's worth investigating; like a lot of quantum algorithms, we should just try it. There are some promising things about it, but we do not know provably how it will perform. Finally, there's a special case of this, zero-sum games, where we can get provable runtime bounds which, I think you could argue, cannot be improved by very much.

OK, so let me talk about the first one, data reduction. The key idea has a long history in the classical literature; I found it under the name coreset, and there may be other equivalent names for it. The idea is that you have some data set X and you want to summarize it. Well, that color is not that great; hopefully you can see there are four little red dots here, each somewhat fatter than the dots in X. So you have a data set X with a bunch of points, and you replace it with some X prime with fewer points, and these points are weighted. I'll come back to the weights, but the point is there's a small number of points, and each one has a weight, which is some positive number. What we would like is that when you sum over this smaller set X prime, appropriately weighting f(x, y), you approximate the original sum over X. Remember, f(x, y) is the log likelihood function: the log of the probability of obtaining data point x from model y. The thing we want to maximize is the sum of this over all x, and we have to iterate over the whole data set to do it. A coreset is something where you instead iterate over a much smaller set, hopefully much smaller: you iterate over X prime. And because there are maybe n points here and much, much fewer than n points there, you are probably going to have to weight the points to make the two sums comparable. So you add weights, which can be positive numbers, generally not all the same. That's a coreset.

How do we come up with such an X prime? Here's one. Again, here's X, and here's X prime, the red dots within it, and our goal is to have this kind of approximation. One approach that's been widely used is called importance sampling, and what it means is that you sample points from X. The most naive thing you could do is sample them uniformly, but this is not so good, because in a situation like this, where there's a big cluster and a few smaller ones, uniform sampling might just miss the smaller clusters entirely. And if the smaller clusters are where the Higgs boson shows up and the big one is where it doesn't, it would be very sad to miss them. So even if a cluster is small, if it's far from the other clusters you want to make sure you hit it with at least one point. Importance sampling tries to do that; it means we sample from X in a non-uniform way. The challenges are these. First, we need the resulting estimator to be unbiased: we're doing a randomized construction of X prime, and we want this approximate equality to be an equality in expectation. Next, we want to control the variance: we want the expected squared difference to not be too large for any fixed y. And finally, there are a lot of different y's, and we want to control the error simultaneously for all of them.
The importance sampling approach is to sample each x with probability proportional to s(x), which you could call the sensitivity; technically, it's an upper bound on the sensitivity of x. And whatever that probability is, you assign a weight that goes like 1 over s(x): if I make some points much more likely to be chosen, I have to put them in with a smaller weight so that the whole thing comes out unbiased. So for any distribution s(x), I do this sampling and I get an unbiased estimator. To keep the variance low, one method is to choose s(x) to be at least what's called the sensitivity of x. What this quantity on the right-hand side says is: for a given point x, what is the largest fraction it contributes to the sum, over any model y? If there's any model y where this data point contributes a large fraction, then we say the answer is very sensitive to the point x. Now, if you look at that sum, it looks a lot like the original problem we wanted to solve, so computing it exactly is infeasible. But you only need upper bounds on it. The bigger your upper bounds, the more wasteful things are, but if you can get some coarse approximation here, maybe by doing a very crude version of the original problem you want to solve, you can use that to get upper bounds for these importance sampling probabilities. And the better your bound is, the better you'll control the variance.

The final thing is that we want to control the error for all models y. Controlling the variance means that if you draw these samples, the estimate will probably be close for a given y; but there are a lot of y's, maybe a discretization of an infinite set, so we have a very large number of them, and we don't want to just do a union bound over all of them. One common approach is to bound something called the pseudo-dimension, which is basically a measure of how rich this family of models is: as you vary y, you induce different distributions over X, and the pseudo-dimension measures something like the dimensionality of this space of models. If it is bounded, then you don't need too many points in your coreset. So in particular, these are the ingredients: s(x) is an upper bound on the sensitivity, capital S is the total sensitivity, and d is the pseudo-dimension. You can prove (this is from a review article, which has a good overview of importance sampling) that the size of the coreset you need goes like the total sensitivity squared, times the pseudo-dimension, times some terms that depend on the error. Epsilon is how accurate your coreset is: you want all of those sums to come out within a ratio of 1 plus or minus epsilon. And delta is the probability that you chose a bad coreset, because this is, after all, a randomized construction.

So we have these provable bounds. What do they look like for some sample problems? One important one is k-means. You have d-dimensional points and you want to group them into k clusters, so your model y is basically the positions of those k cluster centers. f(x, y) is determined by the minimum distance from x to one of those centers, and you want distance squared because it's k-means; that means the preferred location for the center of a cluster is the centroid of its points. There's also k-median, where this 2 is a 1, and many other variants of the problem.
And you can bound the pseudo-dimension; it scales like dk log k, the dimension times the number of cluster centers. To get an upper bound on the sensitivity, you can use a lousy approximation algorithm to get a very coarse approximate clustering, maybe one that uses too many clusters and has points that are further away than is optimal. That still gives you an upper bound on the sensitivity, and one that does not depend on the number of data points. The total sensitivity is a sum over all the data points, so you might worry it could scale like the size of the data set, but in this case it only scales like the number of clusters. And so the number of points you need in the coreset grows like some function of d and k; it does not scale like the size of the original data set. So you get a big reduction in data size.

And then what does the overall algorithm look like? You use a classical computer to estimate these s(x) and to draw the samples x1 through xm; these are just independent samples, and you can draw a sample with one pass through the whole data set. So this is a reasonable thing to ask of a classical computer: sweep through the data set once, or even m times, where m is not too large; and if you want m samples, you can probably do it with a single sweep. What you're left with is a smaller number of points on which you still have to solve the clustering problem. The clustering problem hasn't gone away, but now it's so small that brute-force search becomes a more feasible possibility. And here, if you use Grover, you get a quadratic advantage; if you use some other quantum method, you might do better. All of a sudden these quantum techniques become possible: all of a sudden we have a data set reduced to a size where the quantum computer can meaningfully access it. We couldn't have fed the whole original data set through the quantum computer, but this reduced one we can.
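Here is a minimal sketch of the pipeline just described, for k-means. The sensitivity proxy used here (a uniform term plus the squared distance to the overall mean, roughly in the spirit of so-called lightweight coresets) is a simplified stand-in for the sharper bounds discussed above, and all of the sizes are made up.

```python
# A minimal sketch of non-adaptive data reduction by importance sampling, for
# k-means.  A crude proxy for the sensitivity gives sampling probabilities
# q(x), and weights 1/(m*q(x)) keep the weighted cost unbiased.  Sizes, the
# proxy, and the test centers are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0, 1, (100_000, 2)),     # big cluster near the origin
                    rng.normal(8, 0.5, (1_000, 2))])    # small, far-away cluster
n = len(X)

dist2 = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
q = 0.5 / n + 0.5 * dist2 / dist2.sum()                 # sensitivity proxy (sums to 1)

m = 500                                                 # coreset size
idx = rng.choice(n, size=m, p=q)
X_prime, weights = X[idx], 1.0 / (m * q[idx])

def kmeans_cost(points, w, centers):
    """Weighted k-means cost: sum over points of w * min over centers of squared distance."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return float(np.sum(w * d2.min(axis=1)))

centers = np.array([[0.0, 0.0], [8.0, 8.0]])            # one candidate model y
print("full-data cost:  ", kmeans_cost(X, np.ones(n), centers))
print("coreset estimate:", kmeans_cost(X_prime, weights, centers))
```

The quantum computer would only ever see X_prime and the weights; searching over candidate centers on that reduced set is the step where Grover, or something fancier, would come in.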
So this is the first method I want to present for using a large data set on a small quantum computer. It's nice, I like it, but there is something unsatisfactory about it: the use of the quantum computer was not very sophisticated. A lot of the work was done by the classical computer and by our probabilistic arguments to get down to a small data set, and then we hand this partially chewed problem to the quantum computer for the final steps. I think that's still a promising use of quantum computers and something we should keep investigating, but can we use a quantum computer in an interactive way, one where its answers could depend on the whole data set? In this first method, by the time you turn on the quantum computer, you've thrown away most of the data. Could we use it in a way where the answer from the quantum computer could potentially point to anything in the data? That's what I want to discuss for the rest of the talk. One approach I'll propose, following classical papers that considered adaptive coresets on purely classical computers, is to do Bayesian inference. I'm moving from maximum likelihood to Bayesian inference out of convenience, really; there are important differences between them, but if you can do one, you can probably do the other too, in almost every case.

So, OK, let's go back to our Bayesian inference problem. I want the probability of Y to be given by my prior times the update from all of the data points, and I propose to use a quantum computer in the following way. Suppose that at step T, I have a coreset of size T: instead of using the entire data set, of size N, say, I've chosen T points, x1, x2 through xT, and corresponding weights, and I send those to the quantum computer. Think of T here as pretty small, because in every iteration of Grover the quantum computer has to go through all of them, so I really do not want T to be too large. Then, using this input, I can use the quantum computer to do Gibbs sampling: to sample from the Gibbs distribution corresponding to this Bayesian inference problem. The cost, if I do Grover, will be the square root of the number of models, times T for every inner loop. If I do something smarter, maybe the dependence on the number of models improves; that's kind of an orthogonal question, but the key point is that whatever smarter thing I do, the inner loop will only depend on T.

And now I need to use this quantum answer. So I've sampled this point Y_t; I'm assuming I'm doing T rounds, so at this point I've sampled Y1 in the first round, Y2 in the second round, up to Y_T in the T-th round. And I'm going to use those to choose the next classical point in the coreset. In this step the quantum computer has returned a point, and now the classical computer has to extend the coreset by adding one more point. The proposal I'll use was made fairly recently by Campbell and Broderick, and it uses a version of gradient descent due to Frank and Wolfe from 60 years ago. Their version of gradient descent is very beautiful. It says: if I want to do gradient descent over a convex set, one thing I could do is just move in the direction of the gradient, and if I go outside the convex set, walk back until I'm inside it. But another thing I could do is take the point in the convex set that has the largest inner product with the gradient and move in that direction. The advantage is that if my convex set is something like the probability simplex, the points I move toward are very simple: they're just corners of the convex set, which for the probability simplex look like a single one and all the rest zeros. So it's a method of gradient descent that automatically produces sparse distributions, and that's what we're going to use here to build the coreset.

To do this, I want to think about my space of models as a vector space. For each model Y there is a basis state, ket Y, and each data point x corresponds to a vector in this space whose coefficients are the log likelihoods. The exact log likelihood is then just the sum of these vectors phi(x) over all the x's; that's the vector I want to approximate. And let's say that up to a certain point, my coreset gives me this approximation. It's written as a sum over all the x's, but you can think of most of the weights as being zero: I'm summing over the whole data set, most of the weights are zero, some are nonzero, and I get some vector of log likelihoods. Each of these, by the way, is a vector whose length is the size of Y. And now this residual vector is what I want to correct, so I'd like to choose my next coreset point to move me in the direction of this residual.
And this is the Frank-Wolfe idea: instead of just moving in the direction of the residual, which is going to be a very dense vector, I choose the single data point that moves me best in that direction. I want a sparse update of my vector; I just want to add one point to my coreset. It would totally defeat the purpose to add all of the points. So I choose x_{t+1} to maximize the inner product between the residual and the direction I'd be moving in, and for technical reasons I want to weight this by the sensitivity: I prefer points with low sensitivity, because a point with high sensitivity will make things more unstable. I also have to choose the weight of the new point, but that's just a scalar, so I can do that with a line search; I can just do a binary search over the best choice.

If you look at this, it doesn't look great, because I have to iterate over the whole data set, and for each point I have to compute an inner product over a set of size Y. To make this more efficient, I'm going to approximate the inner product. Remember, the quantum computer returned samples Y1 through YT, which are supposed to be samples from our Gibbs distribution, our posterior. So I'm going to use these samples to estimate the inner product: I estimate the inner product between two vectors by multiplying their components at the sampled models and averaging. Now each of these inner products only takes time T to evaluate, and I can loop over the whole data set in time that scales like T times the size of X. Remember, T is not too big. On the quantum computer, my cost was the cost of Grover times T, so I don't want T to be too big there either. On the classical computer, my cost is a sweep over the whole data set, which I'm resigned to doing at least once, and that again is not too large. So I have an algorithm whose runtime is not much worse than it has to be on either the classical or the quantum side. I think this is promising. The weakness is that I don't have provable bounds on how big T has to be to get a good guarantee, so this is one of those things we should just run and see how well it does, and I'm excited to work on that in the future.
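Here is a heavily simplified sketch of that interactive loop on a one-dimensional Gaussian-mean toy model. The quantum Gibbs sampler is replaced by a classical stand-in that samples models from the posterior defined by the current weighted coreset, the inner products with the residual are estimated only at the sampled models, and the sensitivity weighting and the proper line search are omitted; every size and parameter here is made up.

```python
# A heavily simplified sketch of the interactive, Frank-Wolfe-style coreset
# construction.  sample_models() is a classical stand-in for the quantum Gibbs
# sampler; everything else mirrors the loop described in the talk, minus the
# sensitivity weighting and the real line search.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(2.5, 1.0, size=50_000)     # data set
Y = np.linspace(-10, 10, 2001)            # discretized model set (candidate means)

def loglik(data, models):
    """log p(x | y) for a unit-variance Gaussian, up to an additive constant."""
    return -0.5 * (np.asarray(data)[:, None] - np.asarray(models)[None, :]) ** 2

def sample_models(core_idx, core_w, n_samples):
    """Stand-in for the quantum sampler: draw models from the Gibbs/posterior
    distribution defined by the current weighted coreset (uniform prior)."""
    logpost = (core_w[:, None] * loglik(X[core_idx], Y)).sum(axis=0)
    p = np.exp(logpost - logpost.max())
    return rng.choice(len(Y), size=n_samples, p=p / p.sum())

core_idx, core_w = np.array([], dtype=int), np.array([])
for t in range(20):                                    # number of rounds, chosen arbitrarily
    s = sample_models(core_idx, core_w, n_samples=50)  # models returned by the "quantum" step
    ll_s = loglik(X, Y[s])                             # log-likelihoods at the sampled models only
    resid = ll_s.sum(axis=0) - core_w @ ll_s[core_idx] # full-data minus coreset value, at samples
    scores = ll_s @ resid                              # estimated <residual, phi_x> for every x
    best = int(np.argmax(scores))                      # greedy Frank-Wolfe-style choice
    gamma = float(resid @ ll_s[best] / (ll_s[best] @ ll_s[best]))  # crude stand-in for a line search
    core_idx = np.append(core_idx, best)
    core_w = np.append(core_w, max(gamma, 0.0))

core_post = (core_w[:, None] * loglik(X[core_idx], Y)).sum(axis=0)
print("coreset size:", len(core_idx), " MAP estimate from coreset posterior:", Y[np.argmax(core_post)])
```

Whether a loop like this needs few enough rounds to be worthwhile is exactly the open question mentioned above; the sketch is only meant to make the division of labor concrete (the sampler sees T weighted points, the classical sweep sees everything once per round).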
But now let me move to the last topic, which is a case where I can give you a provable bound. This is a problem called saddle point optimization. Let's put in a little more structure: the data points X live inside a convex set, Y is the set of probability distributions over M items, and f(X, Y) is linear in the coordinate Y. So it's basically a weighted average: a sum over i of f_i(X), where Y is a probability distribution telling me the weights of that average. The problem I want to solve is to maximize over Y the minimum over X of f(X, Y). This is called a minimax problem; I suppose written this way it looks like maximin, but it's equivalent, by von Neumann's minimax theorem, to minimizing over X the maximum over i, because of the various linearities and convexities I've assumed. This family includes a few important cases.

One of them is zero-sum games. Suppose you have a Y player who wants to maximize the function and an X player who wants to minimize it, and each is allowed to choose a mixed strategy: they don't have to pick a deterministic strategy, they can choose a distribution over strategies. So now Y ranges over distributions over M items, X over distributions over N items, and you want to evaluate this maximum of a minimum. A related problem is linear programming. Take each f_i(X) to be the inner product of a_i with X minus b_i, where a_i is some vector and b_i a scalar. We want to know: are all of these less than or equal to zero? If so, then X lies on one side of a bunch of hyperplanes; in other words, it lies in a particular intersection of half-spaces. And that's equivalent to this right-hand side being less than or equal to zero, because you take the minimum over X of the maximum over the constraints i: if you can find an X that satisfies all the constraints, this will be at most zero, and if you can't, it will be greater than zero. Finally, this sort of looks like maximum likelihood estimation, except the sum over X is replaced by a minimum. You could think of it as maximum likelihood estimation in the regime where the sum is dominated by the lowest point, by the minimum value: you have a sum over a giant number of points, and it could be dominated by the typical behavior or by the worst-case behavior, and in the latter regime this is a good approximation.

So what algorithm will we use to solve this? It's based on the randomized fictitious play algorithm of Grigoriadis and Khachiyan from '95. What fictitious play means is that the X player chooses a strategy, then the Y player chooses a response to that, then the X player responds to that, and so on. If each player chooses the best response, this is clearly not going to work well: think of rock, paper, scissors, where one player chooses scissors, the other chooses rock, the first chooses paper, and it just cycles forever. But what Grigoriadis and Khachiyan showed is that by adding a little bit of randomness, you can get this to converge to an equilibrium distribution efficiently. And you can again think of this algorithm as a version of Frank-Wolfe; it takes a bit more massaging to make it look like Frank-Wolfe, but it's in the same family.

So how does this look? Say you have this payoff matrix: X has six strategies and Y has six strategies. If X plays strategy x and Y plays strategy y, the payoff is A_xy for player Y, and it's zero sum, so it's minus that for X. Y wants to maximize the value, X wants to minimize it. The algorithm goes as follows. You first choose X1 arbitrarily; let's say I pick X1 equal to two. Then you want Y to respond to that, but it's not going to be a best response; it's going to be biased toward better responses. There's a parameter epsilon, which we use to set a sort of temperature, so that the better responses are more likely.
Remember, Y wants to maximize the payoff, so we choose each response with a probability that is exponentially larger for higher payoffs and smaller for lower payoffs, according to this Gibbs-like distribution. Let's sample from this, and suppose we get Y1 equal to three. That picks out this column, and now we do the same thing for X, except the X player wants to minimize, so there's a minus sign: you choose X2 with probability depending exponentially on minus epsilon times the entries of this column. And let's say X2 comes out to be six. In the next step, Y responds to both of the previous strategies of X, that is, to the cumulative payoff from round one and round two; those are these two columns here. The key thing to notice is that Y only needs to look at these purple rows, and the number of rows it needs is given by the number of rounds. So the key question is how many rounds you need, and they showed that this converges to an epsilon-approximate equilibrium after a number of rounds that grows only logarithmically in the size of the matrix (assuming all the entries are between zero and one). And again, this is something that can run well on a hybrid classical-quantum machine. To sample the classical player's strategy, you iterate through the whole data set once per round, and for each entry you do a calculation of size T, where T is the total number of rounds; quantumly, you do Grover, or maybe some more efficient algorithm, and again the inner loop takes time T. So this again runs in nearly optimal time on both the classical and the quantum hardware.
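Here is a minimal sketch in the spirit of the Grigoriadis-Khachiyan procedure just described, on a random payoff matrix: each player samples a response from a Gibbs-like distribution over the payoffs accumulated against the opponent's past plays, and the empirical mixed strategies are the output. The matrix size, epsilon, and the number of rounds are made up.

```python
# A sketch in the spirit of Grigoriadis-Khachiyan randomized fictitious play
# for a zero-sum game with payoff matrix A (entries in [0, 1]).  Each player
# samples a response from a Gibbs-like distribution over the payoffs
# accumulated so far.  Sizes, epsilon, and the round count are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 200
A = rng.random((n, m))              # payoff to the y-player; the x-player pays it
eps = 0.1
T = 400                             # rounds; the provable bound grows like log(nm)/eps^2

def gibbs_sample(scores, beta):
    """Sample an index with probability proportional to exp(beta * scores)."""
    z = beta * scores
    p = np.exp(z - z.max())
    return rng.choice(len(scores), p=p / p.sum())

x_payoff = np.zeros(n)              # cumulative payoff of each x-strategy against y's past plays
y_payoff = np.zeros(m)              # cumulative payoff of each y-strategy against x's past plays
x_counts, y_counts = np.zeros(n), np.zeros(m)

x = rng.integers(n)                 # arbitrary first move for the x-player
for _ in range(T):
    y_payoff += A[x]                            # update y's view with x's latest move
    y = gibbs_sample(y_payoff, +eps)            # y prefers high payoffs
    y_counts[y] += 1
    x_payoff += A[:, y]                         # update x's view with y's latest move
    x = gibbs_sample(x_payoff, -eps)            # x prefers low payoffs
    x_counts[x] += 1

x_mix, y_mix = x_counts / T, y_counts / T       # empirical mixed strategies
print("estimated game value:", x_mix @ A @ y_mix)
```

In the hybrid setting described above, the roles are asymmetric: one player's strategy set is the synthetic model space, searched with Grover or a quantum Gibbs sampler over the epsilon-weighted cumulative payoffs, while the other player's update is the classical sweep over the data, with each round only touching the rows and columns played so far.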
OK, so I'm almost out of time, so let me just close with my manifesto. I love quantum supremacy as a topic of research, quantum doing something that classical can't, and we have to get to that point before we go on to the useful stuff. But in the long run we're still going to have big, cheap classical computers, so we'd like to use them. You could think of it this way: we've seen this before with chess. At one point we wanted to know whether a computer could beat the best chess player, and we found out that the answer is yes; thanks to IBM for that. But now, if you look at the best ratings, the best humans and the best computers are both beaten by humans and computers playing together. This is sometimes called centaur chess, or advanced chess. And I think this is a place to look in quantum algorithms: can we have classical and quantum computers working together?

So let me briefly mention some next steps. These are kind of abstract speedups I've been talking about; can we apply them to practical questions? We'd also like to look at classical data that is high dimensional: not only many data points, but data points that correspond to, say, gene microarrays, with a ton of dimensions per data point. What can you do then? There may be more hybrid algorithms for synthetic X, like belief propagation, kind of like the way people do it with the D-Wave machine; I think there's more room for that. And quantum simulation is another area where there's been some work on hybrid simulation; I think there's a lot of scope for more research there as well. So with that, I'll stop. Thank you for your attention.

Okay, we have a few minutes for questions.

Hi, thanks so much. I've got a million questions, for instance about why, if it were so easy to reduce your data set, classical machine learning people and data scientists don't already do that with their methods and sample-complexity arguments; I suspect you can get into a lot of trouble there. But my much easier question is this: you reduce the size of the data set to T, and you say T is small and then everything is nice, but the size of a data set is also influenced by the dimensionality of the data. In big data, you don't only have a billion data points; the data points themselves can be a million dimensional. If those numbers are of the same order, then you've really only reduced the size quadratically, in a sense. You can be happy with that, but can you comment on it? I felt the dimensionality of the data was missing as a problem in your treatment.

Right, those are both great questions. Dealing with high-dimensional data is, I would say, future work; I think there probably is scope for quantum advantage there, but it's not immediately obvious what to do. As for your first question, about why classical computers don't do this already: sometimes they do. In the non-adaptive case they do, but then they are still left with a very hard search problem at the end. So classical computers do the data reduction and then solve the smaller problem, and the point is that a quantum computer can solve the smaller problem better. The second answer is that they couldn't already have done the version that requires an adaptive back-and-forth with a quantum computer. Now, it's an important point that you still want to work in a regime where the set of models Y is pretty big, or pretty hard to search over, because if Y is very easy to search over, you can do that step on a classical computer as well and there'd be no need for the quantum computer. So you're absolutely right that we have to look for the space that fits within quantum capabilities and is out of reach of classical: Y should be quite large, large enough to be hard for classical but not so large as to be impossible for quantum. I think there's a parameter range where this is possible, but you're right that it's not the entire parameter space.

Any more questions? Then let's thank the speaker again.