on sampling problems. Go ahead.

Okay, thank you for the invitation. So the topic of my talk is trace estimation, which I think will be familiar to at least some people in the audience. But there's also an interesting story about dimension here. The idea comes from a setting where the dimension is exponentially large, and what I've done is downgrade it to a setting where the dimension is only linearly large, but still perhaps large. That's maybe not the usual pipeline of ideas, but in the lower-dimensional setting certain improvements are possible, and that's the topic of the talk.

Okay, so trace estimation. Suppose you can only perform matrix-vector multiplications, or matvecs, by an N by N matrix M, and you want to estimate the trace of this matrix. Then the most widely used generic approach is arguably a randomized algorithm called the Hutchinson estimator. I'm abusing the nomenclature a little bit, because this is really a Hutchinson-type estimator: it estimates the trace as the expected value of the quadratic form z^T M z, averaged over Gaussian random vectors z. This is an unbiased estimator, and there's a simple formula for its variance: it's just the Frobenius norm of the matrix, squared. And when M is positive semidefinite, the trace is lower-bounded by the Frobenius norm, which means that, quite literally, the relative standard deviation of the Hutchinson estimator is at most one. So it doesn't depend on the dimension at all. It's hard to imagine you could ever do better, and you really can't.

So why wouldn't you always use the Hutchinson estimator? Well, in many scenarios you're really interested in computing some trace of the form tr(AM), where only M is positive semidefinite. In this case you can't really get any a priori control — or no satisfying control. The reason is that the Frobenius norm of AM can grow faster than the trace in some high-dimensional limit, because in general the off-diagonal entries of AM are not controlled by the diagonal entries. And just for future reference, I'll comment that we may in fact also be interested in computing several of these traces at once, for the same fixed M but several different A_k, under the assumption that you can multiply by these A_k efficiently.

So a major example of interest is the problem of computing the diagonal of a positive semidefinite matrix. This can be viewed in the framework of the last slide by thinking of it as computing a bunch of traces simultaneously, where the A_k are just the indicator matrices along the diagonal. In this case, if you just apply the Hutchinson estimator, you get a diagonal estimator given by the expected value of the entrywise product of a Gaussian random vector z with the product Mz — the circle product means entrywise product. You can compute the variance of this estimator explicitly, and it turns out to be just a norm: the variance for the ith entry's estimator is essentially the squared norm of the ith column of M. So the relative error will be large if the off-diagonal part of M dominates the diagonal part, and it's very simple to understand when it would or wouldn't work well. In particular, if the matrix is very sparse or local in some sense, this has a chance.
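As a concrete reference point, here is a minimal Python sketch (mine, not code from the talk) of both Hutchinson-type estimators just described, where `matvec` stands in for the only access we assume to the matrix, z ↦ Mz:

```python
import numpy as np

# A minimal sketch (not the speaker's code) of the Hutchinson-type
# estimators described above, using Gaussian probe vectors.

def hutchinson_trace(matvec, n, num_samples=100, seed=0):
    """Estimate tr(M) as the average of z^T (M z) over Gaussian z."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.standard_normal(n)
        total += z @ matvec(z)
    return total / num_samples

def hutchinson_diagonal(matvec, n, num_samples=100, seed=0):
    """Estimate diag(M) as the average of z ∘ (M z), the entrywise product."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n)
    for _ in range(num_samples):
        z = rng.standard_normal(n)
        acc += z * matvec(z)  # the "circle" (entrywise) product
    return acc / num_samples

# Sanity check on an explicit PSD matrix:
B = np.random.default_rng(1).standard_normal((50, 50))
M = B @ B.T
print(hutchinson_trace(lambda z: M @ z, 50), np.trace(M))
```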
And in fact, if you know a sparsity pattern a priori, you can kind of derandomize the Hutchinson estimator; this has been considered in a nice paper of Tang and Saad. There's one other tool in your arsenal if you want to, say, compute the diagonal of a matrix that you only know how to multiply by, and that's to exploit low rank. If the matrix is low rank and you know the low-rank factors, then everything can be done easily: you can explicitly compute the diagonal. So there's some hope that you could combine this with a randomized bit: if the matrix is only approximately low rank, use the low-rankness together with some randomized algorithm to mop up the rest of the error. That approach is called Hutch++, which is pretty famous now, and it will work in some settings. But there are a lot of matrices that are neither low rank nor — I'm saying this tongue-in-cheek — "diagonally predominant": it doesn't have to be diagonally dominant, just that the off-diagonal doesn't predominate in the matrix. There are many matrices that satisfy neither of these criteria. In fact, Hutch++ can be worse than Hutchinson if splitting the matrix into a low-rank piece plus a remainder ruins the sparsity structure of the remainder. So some attention has been paid to trying to balance these considerations, but at the end of the day, you can't really combine these ideas on certain matrices of interest to get a constant-variance estimator while still maintaining linear scaling of computation.

There are a lot of examples of trace estimation problems that fit into the framework I just set out. One is statistical leverage scores, or Mahalanobis distances, which will come up later in an example. Also, if you do Gaussian process regression and you want error bars for your predicted function, you need to compute the diagonal of a matrix. The heat kernel signature, a shape descriptor used in computer graphics and graph analysis, can be viewed as the diagonal of a matrix exponential, and certain graph centrality scores can also be viewed as a diagonal. In electronic structure, there are certain eigensolver-free approaches to density functional theory that require one to estimate the diagonal of a matrix known only implicitly. Those are all diagonal estimation problems. But also, if you have some problem where you optimize an objective involving log-determinants and you need its gradient, that also has this kind of structure: for example, if you're doing maximum likelihood or maximum a posteriori estimation within a Gaussian process regression framework, this will come up. And actually, I don't really have time to talk about it in the talk, but my motivation for doing this was coming from semidefinite programming. It turns out that if you can do this kind of estimation, that suggests randomized algorithms for pretty generic semidefinite programs.

The inspiration for the idea of the talk comes from quantum statistical mechanics, where you want to compute so-called thermal expectations of the form in this equation. Here the numerator is the kind of thing that we saw before, and the denominator is called the partition function. And e to the minus beta H — that's a positive definite matrix.
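The equation on the slide isn't captured in the transcript; from the surrounding description, the thermal expectation in question should be the following (a reconstruction, with A the observable, H the Hamiltonian, beta the inverse temperature, and Z the partition function):

```latex
% Reconstruction of the slide formula from the verbal description:
\langle A \rangle_{\beta}
  = \frac{\operatorname{tr}\!\bigl(A\, e^{-\beta H}\bigr)}
         {\operatorname{tr}\!\bigl(e^{-\beta H}\bigr)}
  = \frac{\operatorname{tr}\!\bigl(A\, e^{-\beta H}\bigr)}{Z},
\qquad
Z = \operatorname{tr}\!\bigl(e^{-\beta H}\bigr).
```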
And H defines the physics of the problem. We'll apply this to settings that are not physical at all, but that's where the approach is coming from. And beta is an inverse temperature. Actually, estimating Z in our setting just by Hutchinson is not the real problem; if you can estimate this quotient, then you're happy. And there's a way to do this that was introduced in the quantum many-body physics literature, called the minimally entangled typical thermal states (METTS) algorithm, introduced by Stoudenmire and White. This was developed in a setting where the vectors are of exponentially large dimension, and the approach is specifically useful when applied within an approximation framework that stores vectors in a compressed tensor format, specifically matrix product states. The original motivation for the whole algorithm, which is featured in the phrase "minimally entangled," actually has no bearing on today's talk whatsoever: they came up with it for a very different reason that was specific to this tensor approximation framework. In our setting, we're going to assume that we can deal with vectors directly, but not the operators directly — we don't ever want to form a full matrix, but forming a full vector is fine. In the original inspirational setting, even forming a vector is not an option. So the point is to introduce some improvements in this, in some sense, easier setting, and to demonstrate broad applicability of the idea by combining it with other numerical linear algebra tools.

Okay, so what is METTS? The basic idea of METTS is to rewrite the trace in the numerator in the following way. (I can't see my own mouse, so I guess I'll just have to use words.) The trace is rewritten by first splitting the exponential — crucially — into two halves, the matrix square root of the exponential, then cyclically permuting, and rewriting the trace as a sum over the diagonal elements. Then one defines this handy unit vector phi_i, the normalized vector proportional to e to the minus beta H over two times the ith standard basis vector; that's what appears in the last expression in the first line. If you rewrite the first equation in terms of this phi vector, you get the last equation that just appeared. The important point is that you can view it as a statistical average over unit vectors, namely the phi_i, drawn from a distribution p over the index i. In the red box, that expression is viewed as the ith entry of a probability vector p: the entries are nonnegative and sum over i to one. So you can think of this expression as an expectation. And it's important that in the quadratic form involving phi_i, the phi_i is actually a unit vector now — not a vector of potentially arbitrary size, which is what we would have had if we had just naively looked at the diagonal elements defining the trace in the beginning. I'll come back to why unit vectors are good, provided we can sample p. But in order to evaluate this statistical average, we're going to want to do it by sampling from this probability vector. So we need to sample p.
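Again the slide formulas aren't in the transcript, but the verbal description pins them down; a reconstruction:

```latex
% Split the exponential into halves, cyclically permute, and normalize:
\operatorname{tr}\!\bigl(A\,e^{-\beta H}\bigr)
  = \sum_{i} e_i^{\top}\, e^{-\beta H/2}\, A\, e^{-\beta H/2}\, e_i
  = \sum_{i} \bigl\lVert e^{-\beta H/2} e_i \bigr\rVert^{2}\,
    \phi_i^{\top} A\, \phi_i ,
\qquad
\phi_i = \frac{e^{-\beta H/2}\, e_i}{\bigl\lVert e^{-\beta H/2}\, e_i \bigr\rVert} ,
% so that, dividing by Z = tr(exp(-beta H)),
\frac{\operatorname{tr}\!\bigl(A\,e^{-\beta H}\bigr)}{Z}
  = \mathbb{E}_{i \sim p}\bigl[\phi_i^{\top} A\, \phi_i\bigr] ,
\qquad
p_i = \frac{\bigl\lVert e^{-\beta H/2}\, e_i \bigr\rVert^{2}}{Z}
     = \frac{\bigl(e^{-\beta H}\bigr)_{ii}}{Z} .
```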
And we can't form p fully, because forming p fully, as you can see in the last expression, involves computing the diagonal of the matrix exponential — which is exactly what we're trying to do in the first place. So we want to sample p, and we've kind of shifted all the difficulty there; we can't build it explicitly. And this METTS paper introduced, in my opinion, a very ingenious idea, which we'll generalize, and it's not really even clear to me what the precedent for it was: define a Markov chain which has p as the invariant distribution. Well, of course, that alone is not so ingenious — that's a classic approach to sampling from something you can't sample from directly. What's ingenious is the very specifically adapted choice of Markov chain that just works for this example, defined by the formula in the last line. The transition probability from the ith index to the jth index is just the jth entry of phi_i squared: you take that unit vector phi_i, you look at all of its entries squared — those add up to one — and you pick an index randomly according to those weights. You can actually think of phi_i as a wave function and the step as a random measurement of the wave function, if you're happy with that intuition. I'll explain why this works later, in the other setting; for now, it's just an inspired choice that happens to work. What it means is you can get samples by iterating these steps: you build a unit vector phi = phi_i, then you sample the next index i according to the entries squared of phi, and you repeat. You wait for that Markov chain to mix, and your sample i will be a sample from the probability vector p that we wanted, which we can use to estimate the trace.

There is a catch, which is exactly what we'll address. But first I'll say what's good about this approach. If the matrix exponential is approximately rank one, which happens in the low-temperature limit, then this will mix in one step: multiplying by that matrix square root will essentially project you onto the top singular vector, and then it will never leave. And that's what you want. What happens that's unfortunate — and which wasn't really the emphasis of the original work — is when beta is small. In this case the sampler very much gets stuck, and the mixing time becomes arbitrarily long for this Markov chain. That's because the matrix exponential is becoming very nearly diagonal, and if you multiply a standard basis vector by a nearly diagonal matrix, you get approximately the same standard basis vector back. So if you resample from the entries of that, you'll probably stay in the same place. You tend to stay in the same place and not move around if the temperature is high. And this is precisely the regime where Hutchinson does a good job — when the matrix is nearly diagonal. So we definitely want to be performing well in that regime if we want a conclusive advantage; in fact, we really want something that works uniformly well across all temperatures. And there's some flexibility available to us that wasn't available in the original setting — for reasons that will remain somewhat obscure, it's unavailable there.
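Here is my reconstruction (not the METTS reference code) of this index-space Markov chain; `half_exp_matvec` is a hypothetical stand-in for v ↦ e^{-βH/2}v, and the dense `expm` at the end is for illustration only:

```python
import numpy as np
from scipy.linalg import expm

def metts_chain(half_exp_matvec, n, num_steps, rng):
    i = rng.integers(n)                   # arbitrary starting index
    phis = []
    for _ in range(num_steps):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        v = half_exp_matvec(e_i)          # v = exp(-beta*H/2) e_i
        phi = v / np.linalg.norm(v)       # the unit vector phi_i
        phis.append(phi)                  # feed phi^T A phi into the estimator
        i = rng.choice(n, p=phi**2)       # "measure": next index ~ squared entries
    return phis

# Toy usage with a small dense symmetric H:
rng = np.random.default_rng(0)
n, beta = 30, 1.0
H = rng.standard_normal((n, n)); H = (H + H.T) / 2
E_half = expm(-beta * H / 2)
phis = metts_chain(lambda v: E_half @ v, n, num_steps=200, rng=rng)
```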
But for us, it's simple: instead of using a trace formula in terms of the diagonal entries, we can use the trace formula as an expected value over Gaussian random vectors, just like in the Hutchinson estimator, but applied in a different way. We get an expression for the numerator of our thermal expectation that looks very similar to the one from before, except that instead of a sum over the index i, it involves an integral over a Gaussian vector z. If you define an analogous unit vector phi(z), you get an analogous formula for the thermal average, where again you interpret the thermal average as an expectation of a quadratic form phi(z)^T A phi(z), taken with respect to a probability distribution over the vector z, which is the remaining part of the expression under the integral.

Okay, so to recap, we rewrite the thermal expectation in the following way, where the probability density p(z) is now given by this formula, which up to proportionality you can rewrite as I've done in the last expression. So we want to sample from this actually very simple-looking density: you take a Gaussian PDF and multiply it by a quadratic form. And that's actually not easy to do directly, even though it looks like maybe there should be some weird trick for it. So again, we'll define a Markov chain, very analogous to the one before. The principle here — which will also explain the earlier rule in a new way — is that to sample from this p(z), we essentially introduce a resolution of identity: we do a sort of fake integration over an extra Gaussian random vector w, and then realize that this means we can view the original p that we wanted to sample from as the marginal of a symmetric joint PDF, given by the last expression, where f again is the standard Gaussian PDF. Then one can sample from this joint distribution — of which we really want one marginal — by Gibbs sampling, so alternately sampling from p(z given w) and p(w given z). But since the distribution is symmetric, in the Gibbs sampling we can really just sample the next z given our last z, and so on forever, and kind of forget that we're doing Gibbs sampling at all. This is a different interpretation of what was going on in the METTS paper. Although you can view it as a Gibbs sampler, you never really see that there are two marginals, because it's symmetric.

Anyway, the problem reduces to sampling from the conditional PDF given by the last expression. If you can do that, you get an algorithm for estimating the trace: build a unit vector phi(z), then sample your next z according to that conditional probability distribution. And it turns out step two can actually be done explicitly, by a trick where you split z into parts that are parallel and perpendicular to phi. Maybe I'll omit the details — there's a sketch below.
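Since the details are omitted in the talk, the following is only my guess at the explicit conditional step, under the assumption that the symmetric joint density is p(z, w) ∝ f(z) f(w) (wᵀ e^{−βH/2} z)². In that case, conditionally on phi, the component of the new vector along phi has density ∝ t² e^{−t²/2} — a chi(3) magnitude with a random sign — and the perpendicular component is standard Gaussian:

```python
import numpy as np

# Hedged sketch of the parallel/perpendicular conditional-resampling trick,
# under the joint-density assumption stated in the text above.

def resample_conditional(phi, rng):
    n = phi.shape[0]
    g = rng.standard_normal(n)
    g_perp = g - (g @ phi) * phi          # Gaussian in the complement of phi
    t = np.sqrt(rng.chisquare(3)) * rng.choice([-1.0, 1.0])  # |t| ~ chi(3)
    return t * phi + g_perp

def thermal_sampler(half_exp_matvec, z0, num_steps, rng):
    """Iterate: phi = normalize(exp(-beta*H/2) z); then z' ~ p(z' | z)."""
    z, phis = z0, []
    for _ in range(num_steps):
        v = half_exp_matvec(z)
        phi = v / np.linalg.norm(v)
        phis.append(phi)                  # post burn-in, use phi^T A phi
        z = resample_conditional(phi, rng)
    return phis
```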
There's also an alternative useful perspective, which is to think of phi, instead of z, as the sampling variable, and to imagine that we're sampling from some density q on the unit sphere. Viewing phi as sampled from a density q on the unit sphere, you can rewrite the kth thermal expectation you're looking for as an expectation. More compactly, if you want many at once, we group them with the bold notation. And the key point, guaranteed by the fact that these phis are unit vectors, is that if you can draw samples, then the natural estimator — just the obvious choice given the expression on the last slide — is unbiased and has a variance that can be bounded very simply, controlled in terms of these A_k matrices. In the case of diagonal estimation, the whole expression simplifies and you get a very simple bound on the covariance. So in some sense, the difficulty of the diagonal estimation problem has been shifted entirely to the difficulty of actually sampling from this distribution. Once you can get samples, the variance of your estimator will be controlled nicely.

Okay, before getting into a couple of quick examples, I have to explain one extra thing. In the original METTS algorithm, this Hamiltonian H is sort of a given object, and that happens in some of the settings that follow: you're given H and you want to compute the diagonal of e to the minus beta H, for example. In that case, all we need to do to run the algorithm I just described is to perform efficient matvecs by a matrix exponential. There are many approaches for doing this that rely only on the ability to multiply by H; the one I just cited is preferable in a lot of settings, but there are many others. So that's one way this can be used. But oftentimes these trace estimation problems don't appear with a matrix exponential. For example, you often want to compute thermal averages of the following form, where you're presented not with M directly, but with M as the inverse of a positive semidefinite matrix K — for example, a kernel matrix. And you don't need to take a logarithm of K inverse, divide by two, and re-exponentiate: you can do all of this without dealing with matrix exponentials at all. All you need is to directly perform matvecs by the square root of K inverse, in other words by K to the minus one half. There are very beautiful approaches to doing this based on contour integration, the philosophy of which is that a matvec by K to the minus one half — sorry, I put M here, but I meant K — is in some sense not much harder than a linear solve with K. So it roughly reduces to the complexity of solving linear systems in K, which is probably all you can hope for if K inverse is appearing in the problem in the first place.

There are a couple of extensions I'll mention briefly. It is possible to exploit approximate low-rankness in this framework in a way that doesn't damage the variance bound — it does no harm to the method, and it compares favorably to Hutch++. Also, it's possible to apply this in a setting where you factorize a positive semidefinite matrix in a non-symmetric way. Our derivation considered the symmetric factorization into the matrix square root, but in fact you can consider general factorizations M = B B^T, as long as you know how to multiply by B and B transpose.
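Coming back to the matrix-exponential matvecs mentioned above: in Python, one readily available option is SciPy's `expm_multiply`, an implementation in the Al-Mohy–Higham family of methods (which may or may not be the specific reference cited on the slide). It only needs products with H:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

# Matvecs by a matrix exponential without ever forming it.
n, beta = 1000, 0.5
H = sp.random(n, n, density=0.01, random_state=0)
H = (H + H.T) / 2                      # symmetric sparse "Hamiltonian"

def half_exp_matvec(v):
    # v -> exp(-beta*H/2) @ v, the only operation the sampler needs
    return expm_multiply(-(beta / 2) * H, v)

z = np.random.default_rng(0).standard_normal(n)
print(np.linalg.norm(half_exp_matvec(z)))
```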
Okay, so I'll give a couple of quick examples. The first is this thing I mentioned before, the heat kernel signature. What it is: given a graph, you form its graph Laplacian L, take the matrix exponential e to the minus tL, and look at its diagonal. That's a vector defined on the graph vertices which varies with time t, and it's called the heat kernel signature. In fact, at small times this recovers a notion of scalar curvature — even for graphs, though originally for Riemannian manifolds — and, modulo some caveats, if you know the entire signature, then in principle it determines the whole space up to isometry. Consequently, the heat kernel signature is used as a shape descriptor in computer graphics, where the graphs are triangle meshes, but also more generally in graph analysis. To my knowledge, there's no scalable algorithm for estimating it across all t. If you were to form the matrix exponential directly, the scaling would be cubic in the graph size, which is not really scalable. And it's not really suitable for either Hutchinson or a low-rank estimator: although the matrix exponential is approximately diagonal for small t, it's very much not so as t goes to infinity, and although it becomes low rank as t goes to infinity, there's an intermediate regime where it's neither local nor low rank. So you need a new idea. And this thermal sampling approach works uniformly well across t, as I'll show.

Oh, I got my mouse back — that would have been useful earlier. Okay, so this is a movie of the heat kernel signature on the graph as a function of time. On the left, what I'm showing is the relative error, computed by taking 10,000 samples, for Hutchinson versus thermal sampling of the heat kernel signature at various times. Notably, these 10,000 samples are not effective samples: for the thermal sampling approach they're not actually independent, just 10,000 samples taken after some burn-in time of the Markov chain. An important upshot is that since the performance stays constantly good across time, the mixing time of this Markov chain is also essentially independent of t — it's mixing uniformly fast across all temperatures from zero to infinity. That would be the main concern with this approach, but in practice, somewhat miraculously, it seems not to be a concern at all. Here's another example showing exactly the same kind of result, but on a different triangle mesh.

Okay. The other quick example is what's called the ridge leverage score. This is given by the matrix diagonal I've just written here; it's motivated by matrix sketching applications in terms of a given matrix A. Depending on the context, you can think of the rows as corresponding either to samples or to features. And this lambda, the ridge regression parameter appearing in the expression, is somehow tuned to yield a rank-k approximation: you need to take larger lambdas to get lower-rank approximations, which conveniently makes the matrix inside the inverse better conditioned. You could evaluate this diagonal directly with M k-squared scaling, but you may actually want to avoid this: even if k is far less than M, the k-squared might still be a nuisance. Thermal sampling instead achieves order MN scaling. The idea here is to split the matrix whose diagonal we're estimating into non-symmetric factors.
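The slide's formula isn't in the transcript; assuming the standard definition of lambda-ridge leverage scores, τ_i(λ) = [A(AᵀA + λI)⁻¹Aᵀ]_{ii}, a direct dense baseline looks like the sketch below. Note that this matrix factors non-symmetrically as BBᵀ with B = A(AᵀA + λI)^{−1/2}, which is what the non-symmetric extension can exploit:

```python
import numpy as np

# Direct dense baseline for ridge leverage scores, under my assumption
# about the definition on the slide:
#     tau_i(lambda) = [ A (A^T A + lambda I)^(-1) A^T ]_{ii}.

def ridge_leverage_scores(A, lam):
    m, n = A.shape
    G = A.T @ A + lam * np.eye(n)          # (n x n) regularized Gram matrix
    # tau_i = a_i^T G^{-1} a_i for each row a_i of A
    return np.einsum("ij,ij->i", A, np.linalg.solve(G, A.T).T)

A = np.random.default_rng(0).standard_normal((200, 50))
tau = ridge_leverage_scores(A, lam=1.0)
print(tau.shape, tau.min() >= 0)           # scores are nonnegative
```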
So this is a non-symmetric factorization example. Without saying much about this example — it's a fairly generic toy problem — you can see the same kind of graph as before: the performance over a range of ridge regression parameters lambda. You can see that Hutchinson's performance is very much lambda-dependent, and yet thermal sampling's isn't. In particular, the mixing time remains constantly good across all temperatures — or lambdas, lambda being kind of like a temperature. So this is the reference for the talk, and thanks for your attention.

Thank you, Michael. Are there some questions from the audience? Yes.

Hi, thank you very much for the nice talk. I have a question mostly on the numerics. You compare with Hutchinson — did you also try to compare with low-rank methods? I'm wondering how large the sweet spot is, because you said for t small and t large other stuff works, so I'm wondering what's the range where this is much better.

Let's just take this example. In this one, the matrix will become low rank — approximately rank one — as t goes to infinity, so low rank will become good in that regime. In principle, you could imagine starting with Hutchinson and then switching; maybe it's inelegant, but you might be happy with it, switching when the rank gets small. But it's not totally satisfying, because basically the rank you need for a good low-rank approximation at any fixed t will itself depend on the dimension. So for any fixed t, if you want something that scales linearly, even using a low-rank approximation won't work, and neither will combining the two, although I didn't show the result. Hopefully that answers the question.

I see, thanks a lot. I was curious to know, which notion of optimality were you using for the sketching problem? What does it mean here, optimal sampling?

Yeah, so of course it's somewhat context-dependent, but basically, say you have a matrix that's approximately low rank, and you want to view its column space as being approximately spanned by some subset of columns. Which columns do you choose? That would be the kind of motivating question here. For many matrices, you don't want to just choose a bunch of random columns; you actually want to favor certain columns over others, and you can use these ridge leverage scores to score them according to which ones ought to be sampled more. That was worked out in those papers by Musco and others — well, there are two Muscos, I think twins or brothers.

Okay, thanks. Any last question? All right, sorry.

Hi, thanks for your talk. I have a question about something you mentioned regarding the use of MPS for compression. I don't know if I understood it right: the idea is that in order to construct the vectors phi that you need, you use this compression scheme. Is this right?

Well, I don't think my slide is that informative, but I'll still come to it. Yeah, in the original setting, the vectors themselves are too large to be stored, so they are represented as MPS, and otherwise you do everything exactly the same.
So as long as you can do this kind of imaginary time evolution by H — in other words, multiplying by e to the minus beta H — within the class of MPS tensors, then you can run this algorithm. You can't run the improved version, which requires these Gaussian random vectors, but you can run the original one. And that's literally how it works.

Right, thanks. And the H there would be an MPO?

Yes — so it only requires you to be able to do imaginary time evolution of an MPS by a Hamiltonian which is an MPO, and that's very standard within the tensor network physics community.

Okay. Thank you very much, Michael, for the talk again.