In the interest of time, we are going to start again. I think most people are back in the room. So for this second session, we are actually going to focus on generative models. Our first speaker will be Michael Albergo from NYU, who is currently finishing his PhD. Okay, take it away. Can you hear me okay? Okay, thanks Mary Lou. So yes, I'm going to talk about a topic in generative modeling. It's some work I've been doing with Eric Vanden-Eijnden and Nick Boffi, who's here in the audience with us, at NYU. The topic is going to be what we're calling stochastic interpolants, which we're trying to describe as a unifying framework for flow-based and diffusion-based generative modeling. So just some background on me. I'm not an applied mathematician by training; I'm technically a physicist. I'm interested in problems and computational approaches to quantum field theory, so some things in high-energy physics, some things in condensed matter. In addition to that, I'm interested in developing machine learning techniques that are inspired by the underlying physics and the scientific computing demands that come from that. Some people call this ab initio AI, so let's roll with that. Just some diagrams here of some things we might like to compute: the strong-force vacuum energy density, things about how you measure what's going on inside the nucleus of an atom, and some sort of generative model on a manifold related to simulating that. Before I go any farther, I just want to say thanks to my collaborators and mentors, in particular Nick and Eric, who were instrumental in the work we're going to talk about. So props to them. The agenda today is going to be density estimation and sampling with transport maps.
So I want to give a little bit of motivation and background on the flow-based picture of transport and how people use it for generative modeling, and then I'm going to frame the challenge: basically, we don't have the maps that provide this transport a priori, and the challenge is to learn expressive and scalable maps. One inspiration for that is going to be score-based diffusion, which we know has proven to be very successful of late. What I hope to do is take this transport picture and the diffusion picture and unify them into what we're calling stochastic interpolants. What we'll get from this is an unbiased generative modeling paradigm that allows us to use either deterministic or stochastic processes for the generative process. We'll discuss a little bit what the trade-off is between the deterministic and stochastic processes, and ways that we could design these stochastic interpolants for different things we might want to do. So just to frame the problem setup: the goal is to estimate an unknown probability density function that I'm calling rho_1, and we're going to do this from sample data. And why is this a compelling problem? Well, over the past 10 years, we've gone from drawing black-and-white digits to being able to query a large language model to draw, for example, an image of bears acting as chemists in a cartoon fashion, in a matter of seconds. What I think is compelling about this trajectory is that the measure transport perspective on these various advances has started to emerge over the past five years or so. In particular, it's well founded in the diffusion picture. So the transport framework is going to rely on the following notions. We're going to introduce what I call a base density, rho_0. This is usually something like a Gaussian. We want to build an invertible map that I'll call T that pushes rho_0 onto a target density that I'm calling rho_1.
So if we do this, we get the ability to write down the likelihood rho_1 expressly in terms of the transport T. It relies on this determinant of the Jacobian of the inverse of the transport map, given here on the bottom. And this is useful because if you don't know rho_1 but have samples from it, for example, you could learn this map T in a way that allows you to compute likelihoods under rho_1. Of course, if we don't have T a priori, we need some estimator T-hat to do this. And if you want to be able to compute this likelihood in any computationally reasonable fashion, you need this determinant term to be tractable, so you'll need to put some sort of constraints on T-hat, your approximation to the true map. But you also want it to be maximally unconstrained so that, in a learning sense, you can learn an expressive map to model your data. Now, this perspective has about a 25-year history; I mean, you could go back to the 60s if you wanted to. Some early work by Chen and Gopinath on Gaussianization from 2000, Esteban Tabak and Eric Vanden-Eijnden from 2010, and then Tabak and Turner in 2013 describe learning this map as: okay, it's probably hard to do all at once. What if we break it down into smaller sequential steps? So if I learn a set of maps T_k, and I do this under some sort of maximum-entropy perspective, then learning them sequentially breaks the learning paradigm down into simpler tasks. And in the mid-2010s, Laurent Dinh, Danilo Rezende, and George Papamakarios, with their collaborators, spearheaded ways to make each of these T_k an invertible neural network, to add expressivity and turn this into an optimization problem amenable to neural networks. But you can imagine, instead of taking many little T_k, taking the k-equals-infinity limit, where each T_k is infinitesimally small. And this allows me to rethink this transport T as the solution of a continuous-time flow.
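The change-of-variables idea just described can be sketched numerically. This is a minimal, hypothetical example, not the speaker's code: an invertible affine map `T` in 2D stands in for a learned transport T-hat, and the likelihood under the pushed-forward density is computed as rho_0(T^-1(x)) times the Jacobian determinant of the inverse map.

```python
import numpy as np

# Hedged sketch: the change-of-variables formula with a hand-picked invertible
# affine map T(x) = A x + b in 2D, standing in for a learned transport T-hat.
A = np.array([[2.0, 0.5], [0.0, 1.5]])   # invertible by construction
b = np.array([1.0, -1.0])

def T(x):          # push-forward map, base -> target
    return A @ x + b

def T_inv(x):      # its inverse
    return np.linalg.solve(A, x - b)

def log_rho0(x):   # standard Gaussian base density
    return -0.5 * x @ x - x.size / 2 * np.log(2 * np.pi)

def log_rho1(x):
    # log rho_1(x) = log rho_0(T^-1(x)) + log |det J_{T^-1}(x)|.
    # For an affine map, log |det J_{T^-1}| = -log |det A|, which is tractable.
    return log_rho0(T_inv(x)) - np.log(abs(np.linalg.det(A)))

# Crude sanity check: rho_1 integrates to roughly 1 on a grid.
xs = np.linspace(-8, 8, 200)
grid = np.array([[x, y] for x in xs for y in xs])
dx = (xs[1] - xs[0]) ** 2
mass = sum(np.exp(log_rho1(p)) for p in grid) * dx
print(round(mass, 2))
```

The tractability constraint the talk mentions shows up here as the triangular-ish structure of `A`: for a general learned map, the determinant term is exactly what architectures like coupling layers are designed to keep cheap.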
So this flow will be governed by an ordinary differential equation. This is sort of the idea of Will Grathwohl, Ricky Chen, and company. And if we do this, the determinant we had to compute turns into a trace of the Jacobian of what I'm going to call a velocity field B. Making this exchange allows us to rely on some clever trace-estimation techniques to evaluate the likelihood in a better way. There are the advances of neural ordinary differential equations that make a lot of this optimizable and compatible with neural networks, and we think this is the right way to be thinking about things going forward. So I want to spend some time explaining a little bit what this continuous-time flow looks like. What we can do is think about the flow map, which I'll call X_t, and this is going to be generated by our velocity field B_t. You can think about it at the level of the particles, in this form here, where I have some initial condition at time t = 0: we say that the flow map at t = 0 is equal to your initial condition x, and that its time dynamics are given expressly in terms of the velocity field B evaluated at the flow map X_t. Pictorially, this means something like starting at X_0(x) here, and then following these flow lines along this map as the probability density changes from some rho_0 to some rho_1. But of course, we can also think about this at the level of the distribution, as the picture here suggests. And to do that, we need to think about what I'm calling a transport equation, or continuity equation, which basically says that the time derivative of rho_t, which is now a time-dependent probability density, is given by the divergence of a current, telling us that the probability density needs to be conserved as it evolves over time.
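The particle picture above can be sketched with an Euler integrator. This is a toy illustration under my own assumptions, not the talk's implementation: the velocity field B(x) = -x is chosen by hand so that the Jacobian trace is exact, and the log-density along the trajectory is updated by the continuous change-of-variables rule, d/dt log rho_t(x_t) = -Tr(dB/dx).

```python
import numpy as np

# Hedged sketch: a continuous-time flow dx/dt = B_t(x) together with the
# instantaneous change of variables for the log-density.  We pick the linear
# field B(x) = -x, whose Jacobian is -I, so Tr(dB/dx) = -d exactly.
d = 2
def B(t, x):
    return -x                                   # toy velocity field

x = np.array([1.0, -2.0])                       # particle at t = 0
logp = -0.5 * x @ x - d / 2 * np.log(2 * np.pi) # standard Gaussian at t = 0

dt, n_steps = 1e-3, 1000                        # Euler steps on [0, 1]
for k in range(n_steps):
    x = x + dt * B(k * dt, x)
    logp = logp - dt * (-d)                     # subtract Tr(dB/dx) = -d

# Under B(x) = -x the exact solution is x_t = e^{-t} x_0, and rho_1 is the
# density of N(0, e^{-2} I); the Euler result should be close to both.
print(np.round(x, 3), round(logp, 3))
```

For a neural B in high dimension the trace is estimated stochastically (Hutchinson-style) rather than computed exactly, which is the trace-estimation trick the talk refers to.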
So if rho_t solves this transport equation, then that tells us that our representation of the transport, rho_t(x) at time t = 1, should arrive directly at the density that we wanted to model, which I'll call rho_1 here. So if rho_t solves the continuity equation, or transport equation, here, then we can rely on Benamou-Brenier theory to say that there exists some velocity field B_t in this continuity equation that does this push-forward for us at the level of the samples themselves. And then the question is: how do we find a sufficient B_t, such that this continuity equation is satisfied, that maps rho_0 to rho_1? Some initial work in this direction was done by Will Grathwohl and company. Because we can take the change-of-variables formula given here in the top right, which is just the continuous-time version of the likelihood that I wrote down on the previous slides, we can effectively compute a KL divergence between the target rho_1 and our model density at time 1. And if we do this, then we can perform some sort of negative log-likelihood minimization: if B_t is some big neural network, and we can use clever tricks like the adjoint sensitivity method to compute the gradients of the objective function with respect to B_t, then we can effectively train these models. The trouble is that computing this objective involves integrating the ODE itself. That means it involves solving the equations given here in the top left every time you want to compute the loss in your training paradigm. So this is going to be very expensive. Moreover, there are many B_t that would take you from rho_0 to rho_1, and it might be useful to have your learning algorithm choose that path ahead of time. So this asks: is there a simpler paradigm for learning B_t, instead of relying on this maximum-likelihood estimator?
Well, one inspiration for that could be score-based diffusion. We know that this has been a resoundingly successful generative modeling paradigm. I could feed one of these score-based models the sentence "a brain riding a rocket ship headed toward the moon," and I'll get that image pretty immediately. This is the work of Yang Song and Stefano Ermon, built on the ideas of Jascha Sohl-Dickstein and Aapo Hyvärinen, and some of the denoising perspective on these models goes back to Vincent in 2011. But the main idea is the following. You have a data density, perhaps images of dogs, and you devolve these images under Gaussian noise until, at time t = infinity, the signal has been totally lost to noise. This is described by a stochastic differential equation called an Ornstein-Uhlenbeck process, given here at the top. A priori that doesn't seem so useful, because it doesn't tell you anything about going backward. But if you look at the SDE that describes the backward evolution, from the Gaussian to the data, then what pops up is this term grad log rho_t, which is what we call the score of the density. And if you want to learn a model that tells you how to go backward from the Gaussian to the data, it amounts to learning a model of this score function, because then this SDE is simulable and you can sample from the data density. Moreover, Song and company show that there's also an ODE formulation of this. I've stripped down this SDE to a basic form, but it tells you how you can write down a deterministic map in this case, again relying on the score. So, what is really useful about this perspective? Why does it work so well in comparison to some of the other stuff? We know maximum likelihood is a really useful concept, so how can something be beating it?
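The forward noising process just described can be simulated directly. A minimal Euler-Maruyama sketch of the OU SDE dX_t = -X_t dt + sqrt(2) dW_t, with a toy 1D "data" distribution of my own choosing (a point mass at 4); the ensemble should relax toward a standard Gaussian as t grows, illustrating how the signal is lost to noise.

```python
import numpy as np

# Hedged sketch: Euler-Maruyama simulation of the forward Ornstein-Uhlenbeck
# noising process  dX_t = -X_t dt + sqrt(2) dW_t.
rng = np.random.default_rng(0)
n, dt, T = 20000, 1e-2, 5.0
x = np.full(n, 4.0)                 # toy 1D "data": a point mass at 4

for _ in range(int(T / dt)):
    x = x - x * dt + np.sqrt(2 * dt) * rng.standard_normal(n)

# The exact law at time T is N(4 e^{-T}, 1 - e^{-2T}), which is already
# essentially a standard Gaussian at T = 5: the data has been forgotten.
print(round(x.mean(), 2), round(x.var(), 2))
```

This also makes the truncation point concrete: the process only reaches the Gaussian exactly at t = infinity, so any finite stopping time T leaves a small residual mean of 4e^{-T}.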
Well, one idea that we think is quite important is that there is data available at every time t under rho_t, along the whole map from rho_0 to rho_1. And this is quite useful, because if you can evaluate the path everywhere, then you are effectively choosing a path in the space of measures from rho_0 to rho_1 that allows you to turn generative modeling into a regression problem: I just need to learn S at every time t under the density rho_t, and I can do this in a quadratic-loss form. This is basically saying that if I can fit a model S-hat of t to grad log rho_t using the score-matching techniques of Aapo Hyvärinen, then this simplifies the learning paradigm quite a bit. But there are some limitations to this that may be empirically important, or maybe not. Go ahead. So you're asking what grad log rho_t is with respect to a denoiser. Yeah, so if we go back: in this picture, what we have is a noise process here that I'm writing as dX_t = -X_t dt + sqrt(2) dW_t, with incremental Brownian noise. What's happening is, as we go this way, we're adding noise to this image until the signal is entirely lost. The reverse process is not the same as the forward process. What comes up is that you need information about the time-dependent density anywhere along this map. If I say this is rho_1, and this over here is rho_0, then I need to know rho_t everywhere at the intermediate times. In fact, you need the gradient of the log density. Exactly. Yeah, it's the density that follows this: if you're looking at the deterministic dynamics, it follows this continuity equation; in the other case, it would follow a Fokker-Planck equation. So they established this simple regression problem. There are a few caveats to this. One is that it requires that you transport to a Gaussian; the OU process relies on this Gaussianity. So if you want to build a generative model between any other two densities, this is going to be one limitation.
The other caveat is that technically this OU process only converges to the Gaussian in the t-equals-infinity limit. So in this noising interval, you actually need to be able to evaluate rho_t(x), or grad log rho_t(x), across the entire interval zero to infinity. In practice, this means we have to truncate at some large T, and in doing so, we introduce a slight bias into the model. I will say it converges exponentially quickly, so T does not really need to be that large for this to work decently well. But it sort of asks the question: once this is thought of as a regression, it's not a priori clear that you need this OU process to make the generative modeling paradigm work. So we can ask: how can we work on the fixed interval t in [0, 1]? We choose an arbitrary starting density rho_0 and some other density rho_1, and we build a connection between them that directly gets us the velocity field that appears in the continuity equation, as I introduced it in the flow-based transport picture, without having to rely on this OU process or, initially, on any notion of the score. But we'll see that the score is still useful. So to do that, I'm going to describe a really simple function that I'm going to call an interpolant function. This is going to be I(t, x0, x1): a function of t, time in the interval [0, 1], samples from the base density, and samples from the target. And importantly, this is a function with the boundary conditions that at time t = 0 the interpolant evaluates to x0, and at time t = 1 it evaluates to x1. The trivial example here is just (1 - t) x0 + t x1. And the takeaway I want you to get from this is that if x0 and x1 are drawn IID from their respective densities, then I(t, x0, x1), this interpolant function, is a stochastic process which samples the intermediate density rho_t(x). And we call this sample x(t).
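The interpolant just described is easy to sample. A small sketch, with a toy two-mode Gaussian mixture of my own choosing standing in for the target density: draw (x0, x1) IID, evaluate the linear interpolant at a fixed t, and the result is a sample of the intermediate density rho_t.

```python
import numpy as np

# Hedged sketch: samples of the linear interpolant
#   I(t, x0, x1) = (1 - t) x0 + t x1,
# with rho_0 a standard Gaussian and rho_1 a toy two-mode Gaussian mixture.
rng = np.random.default_rng(0)
n = 50000

x0 = rng.standard_normal(n)                      # base samples
modes = rng.choice([-3.0, 3.0], size=n)          # target: mixture means
x1 = modes + 0.5 * rng.standard_normal(n)        # mixture component std 0.5

def interpolant(t, x0, x1):
    return (1 - t) * x0 + t * x1

# Boundary conditions hold exactly: x(0) = x0 and x(1) = x1.
assert np.allclose(interpolant(0.0, x0, x1), x0)
assert np.allclose(interpolant(1.0, x0, x1), x1)

# Halfway along, rho_{1/2} is already bimodal, with modes near +/- 1.5.
x_half = interpolant(0.5, x0, x1)
print(round(x_half.mean(), 2), round(x_half.std(), 2))
```

The key point from the talk is that, unlike the OU process, this gives samples of rho_t at every t in [0, 1] with no infinite-time limit and no requirement that either endpoint be Gaussian.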
So just in this picture, this means if I take this standard normal Gaussian at time t = 0 to this mixture model at time t = 1, we can get a smooth interpolation between rho_0 and rho_1 at the level of the density by sampling the stochastic process. More formally, we would say that rho_t(x) is equal to the expectation, over samples from the base density and samples from the target density, of a Dirac delta function evaluated along the interpolant. My claim is going to be that this does indeed satisfy a continuity equation. If we look at the form of rho_t, just as a reminder, this continuity equation is given here. And you can ask why. It amounts to looking at the time derivative of rho_t that we have written up here. If you apply the chain rule to rho_t, you can see that we can rewrite this expectation over rho_0 and rho_1 in terms of the time derivative of the interpolant times the gradient of this delta function, which allows us to write down a current density expressly, which I'm calling J_t(x) here. And moreover, if I rewrite J_t(x) in a way that pulls out the interpolant density rho_t, I can get a form for the velocity field directly. So now we have that B_t(x) is equal to the current divided by the time-dependent density. If you plug this in above, you'll find that we solve this continuity equation generically, with this written as a current. This is true in the case where rho_t(x) is greater than zero, where we have support; otherwise, it's equal to zero. So at this point, it's going to be useful to introduce a slightly different definition: a conditional expectation with respect to this interpolant x_t, okay?
So the important part is we're going to say this conditional expectation of any function f of t, x0, and x1, given that the interpolant is equal to x, is such that the integration under the time-dependent density rho_t is accessible via expectations only with respect to rho_0 and rho_1. So we can get these conditional expectations by only taking expectations over rho_0 and rho_1 when we condition on the interpolant. And if we think about things this way, it gives us a very simple form for the velocity field that I wrote down before as the ratio of a current density over the probability density. If you take the form of the current and the form of the density and write them out, you see that B is expressly a conditional expectation of the time derivative of the interpolant. This is nice because we're going to want to evaluate B_t under the time-dependent density rho_t, and if we can do so by just sampling from our endpoints, then this is a useful paradigm for setting up some sort of optimization. The proposition that I'll make is that the PDF rho_t that we just described, satisfying the continuity equation, has a velocity field B which is the unique minimizer of a simple quadratic objective function. It's basically saying that because we can sample from rho_0 and rho_1 to get conditional expectations under rho_t, we can just match a model of the velocity field, B-hat, to the time derivative of the interpolant evaluated under rho_0 and rho_1, taken, of course, over the full time interval t = 0 to t = 1. Here I've used some notation to say that x(t) is explicitly written as the interpolant evaluated at x0 and x1. Okay, why is this nice? The loss is directly estimable. This allows us to take something like the score-based diffusion paradigm, but now have a generative model connecting any two densities. It does not require the OU process.
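The quadratic objective just stated can be estimated by plain Monte Carlo. A hedged 1D sketch with Gaussian endpoints N(0,1) to N(m,1) of my own choosing, where the minimizing velocity field (the conditional expectation E[x1 - x0 | x_t = x]) happens to be available in closed form, so we can check that it attains a lower loss than a deliberately biased model.

```python
import numpy as np

# Hedged sketch: Monte Carlo estimate of the quadratic objective
#   L(b) = E_{t, x0, x1} | b(t, x_t) - d/dt I(t, x0, x1) |^2,
# for the linear interpolant, where dI/dt = x1 - x0.
rng = np.random.default_rng(0)
m, n = 2.0, 200000

def loss(b):
    t = rng.uniform(0, 1, n)
    x0 = rng.standard_normal(n)             # rho_0 = N(0, 1)
    x1 = m + rng.standard_normal(n)         # rho_1 = N(m, 1)
    xt = (1 - t) * x0 + t * x1              # interpolant sample x(t)
    return np.mean((b(t, xt) - (x1 - x0)) ** 2)

def b_exact(t, x):
    # closed-form conditional expectation E[x1 - x0 | x_t = x] for these
    # Gaussian endpoints (a Gaussian-conditioning calculation)
    a, c = 1 - t, t
    return m + (c - a) * (x - c * m) / (a**2 + c**2)

def b_biased(t, x):
    return b_exact(t, x) + 0.5              # a deliberately shifted model

print(loss(b_exact) < loss(b_biased))       # the true field wins
```

Note the loss at the minimizer is not zero: the residual is the conditional variance of x1 - x0 given x_t, which is exactly why the objective is a regression onto a conditional expectation.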
And because we're satisfying this ODE, we have likelihood evaluations available to us, and sampling is efficient because we have very fast ODE integrators. You can write down a bound between the model density at time one and the target rho_1 based on the Wasserstein-2 distance. I'd say it's maybe a little vacuous, because it involves an exponential of the Lipschitz constant of a neural network, which is probably not so controlled, but at least you can do it. And it works in practice. For example, what we have above is a map from some complicated density in 2D to this checkerboard density over here, from time t = 0 to time t = 1. You can see what the interpolant says the path is, and you can see how well the flow learns it. You can benchmark it on the ImageNet data, and importantly, it scales quite well in high dimensions: on a 128-by-128-by-3-dimensional problem, we can get this to work decently on a single GPU. But okay, this only gave us a deterministic map. This was the part of the paradigm that is the flow-based transport picture of things. But we know that score-based diffusion works under stochastic dynamics as well, and it might be interesting to think about whether we can do something similar to complete the picture here. Indeed we can, and I'll say that in perspective it amounts to learning what we'll call an interpolant score. This is a score function that's related to the dynamics of the interpolant, if we change the interpolant a little bit. So we had this interpolant before that I was calling I(t, x0, x1). But now, if we add some Gaussianity to the interpolant via a factor here that I call gamma_t z, where z is distributed as a Gaussian variable and gamma_t is such that it vanishes at the endpoints, then we can make the following proposition, relying on the same notion of conditional expectation that I wrote down before.
Okay, the exact velocity field B_t(x) is now written not just as a conditional expectation of the time derivative of the interpolant I_t, but also with respect to this gamma_t z term. Not much has changed: we've just added a (d/dt gamma_t) z term to the definition of B here. But you can use the same sort of proof techniques to then show that the score of the time-dependent density rho_t is given as minus one over gamma_t times the conditional expectation of our Gaussian latent variable z. And in fact, this minimizes a very similar quadratic objective to the one that we wrote down before: instead of B, we now have an objective function for S. So now we have a way of learning the score function, and can potentially use it in some sort of stochastic generative model. So, to summarize: before, we described the transport equation given here and learned a velocity field B, right? Now we can think about a Fokker-Planck-type equation. Because we have the score available to us, we can ask what sort of equation this augmented drift, which I call B forward or backward here, given as B plus or minus some epsilon times the score, satisfies, and this allows us to write down an SDE that can be integrated stochastically as a generative model. It depends on B_F and it depends on the score. So in the first paradigm, you can learn just B-hat. In the second paradigm, if you want some sort of stochastic dynamics, you learn B and S. But okay, if you can do this, is there any trade-off between the two? Why would one want to use the ODE versus the SDE? Is there an accuracy trade-off that we can think about here? We've introduced an epsilon into this stochastic differential equation; is there any dependence on that? Why would one want to do this? Well, you could start by asking Nick, who's in the audience, who has a clever way of showing basically the following statement. If I have some model density rho-hat, okay?
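The stochastic generative model can be sketched end to end in a case where nothing has to be learned. A hedged 1D Gaussian-to-Gaussian example of my own construction, N(0,1) to N(m,1) under the linear interpolant, where both the velocity B and the score S of rho_t = N(tm, (1-t)^2 + t^2) are known in closed form; the forward SDE with drift B + epsilon S is integrated by Euler-Maruyama and should land on rho_1.

```python
import numpy as np

# Hedged sketch: Euler-Maruyama integration of the generative SDE
#   dX_t = [ B_t(X_t) + eps * S_t(X_t) ] dt + sqrt(2 eps) dW_t,
# whose Fokker-Planck equation matches the interpolant's continuity equation.
rng = np.random.default_rng(0)
m, eps = 2.0, 1.0
n, dt, n_steps = 50000, 1e-3, 1000

def B(t, x):                         # exact velocity for these endpoints
    a, c = 1 - t, t
    return m + (c - a) * (x - c * m) / (a**2 + c**2)

def S(t, x):                         # exact score grad log rho_t
    a, c = 1 - t, t
    return -(x - c * m) / (a**2 + c**2)

x = rng.standard_normal(n)           # X_0 ~ rho_0 = N(0, 1)
for k in range(n_steps):
    t = k * dt
    drift = B(t, x) + eps * S(t, x)
    x = x + drift * dt + np.sqrt(2 * eps * dt) * rng.standard_normal(n)

# X_1 should be approximately distributed as rho_1 = N(m, 1).
print(round(x.mean(), 2), round(x.std(), 2))
```

Setting `eps = 0` recovers the deterministic ODE sampler with the same B, which is the "epsilon chosen after training" freedom the talk highlights: one pair (B, S), a family of samplers.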
I've removed the time and the x from the labeling here just to keep things simple. And I want to know, if I've pushed forward from rho_0 to rho_1 using the deterministic dynamics from the transport equation we gave before, with the approximate velocity B-hat, what the KL is between pushing forward with the approximate model B-hat versus pushing forward with the exact model B. Well, you'll see that the KL between these two densities expands into a difference of score terms, grad log rho-hat and grad log rho, as well as a difference between the drift terms B-hat and B. And you would hope that if you've matched B-hat to B, and gotten at least close, this should control the KL. But unfortunately, matching the Bs is not actually sufficient in this scenario, because the Fisher divergence is uncontrolled by small errors in B-hat minus B. Okay, but slightly differently: if we think about an approximate model for the forward drift of the SDE, which we call here B_F-hat, which is B-hat plus epsilon S-hat, then sort of miraculously this problem disappears, because the introduction of the score term actually cancels some things in the equations for the KL divergence. And now you can actually control the KL solely based on the difference of the drifts, B_F-hat minus B_F. So this is suggestive: it says maybe we can do better if we use an SDE as a generative model. But does this mean much practically? There are two scenarios in which we wanted to test this. The first one is the case of Gaussian mixtures. For the Gaussian mixture case, the exact forms of B, B_F, the score, et cetera, the velocity fields, are given analytically. So we can do the following. We can say: let's look at a 128-dimensional Gaussian mixture, and let's take small cross-sections, two dimensions of it, so we can visualize things a little bit. So this here is a cross-section of this high-dimensional density over two dimensions.
You can see that what we're looking at here is basically the difference between the true density, based on the analytical solution, and the model. And what we'd like is for each frame of this to be entirely gray; that means we've learned perfectly. Where you see red means we've underestimated the target Gaussian mixture, and where you see blue means we've overestimated it, or sorry, the other way around. But ideally, you'd like to see just gray, which says that these two match exactly. And what we have here is over different values of epsilon: epsilon zero corresponding to the ODE, epsilon four, epsilon twelve. We see that there seems to be a sweet spot, a certain epsilon where you do better than you would using the ODE alone or with too much noise added into the stochastic dynamics. And in particular, you can plot the KL, by doing some kernel density estimation as we've done here, to see that there is a minimizer of this curve for B and S. There are a few ways of learning this, but I figured we'd stick with just B and S. We see that it falls around epsilon equals five, so we're pretty close to finding an optimal epsilon. The theory says that the optimal epsilon should be a ratio of the losses for B and S, so maybe what this is telling us is that we've learned S a little bit better than we've learned B. But I will say these results are not necessarily generic. If you try this on an image data set, you'll see basically no difference between the models; there are too many small hacks that go into getting those things to work very well. But in the case of Gaussian mixtures, we have things nicely controlled. So the nice thing about this framework, let's go here, is that epsilon is chosen after everything else. You've learned B, you've learned S, and then you're free to choose what epsilon you use in the integration. So you can use the same networks for all of them.
And then, of course, you're comparing it to the analytical solution, so you can do this. The last thing I want to say is that, now that we've written down this framework of thinking about things in terms of interpolants, it opens up the freedom to ask: what other sorts of mappings can we write down? We have a set of rules. We have some boundary conditions on the interpolant: it needs to begin and end at the right place. And then, if you want the score, there needs to be some Gaussianity somewhere in the interpolant, either at an endpoint or by including this latent variable gamma_t z in the middle. But you can do some different things then, besides just mapping to a Gaussian. I thought it might be fun to show a few. For example, you can write down what we call a mirror interpolant, which is just the function x(t) equals data from your target density, x1, plus this Gaussian latent term gamma_t z. What this is saying is that as t goes on, you add noise to your interpolant and then you remove it, but you go from the density rho_1 back to the density rho_1. So it's like a map to and from itself. And what this means is that if you have some image, like a flower here with a little bug on it, and you use your model of the score, for example here, you can basically edit these images under stochastic dynamics. Depending on the level of noise, you're going to change this flower into another flower from the image data set that is different from the original flower. Which is kind of a weird way of controlling your image generation process. But this may take many SDE steps, so it's maybe not the best idea in practice. And then the more conventional one that we're used to thinking about is this one-sided interpolant, as we're calling it, which says one of the densities is a Gaussian and you'd like to go to some image data set.
So again, we can write down the B, we can write down the S, exactly, and we can write down our loss functions to do this. But just a point of stress, as Hugo was just allowing me to point out: we have a tunable amount of diffusion. So this first line here is generating a pair of lilies from Gaussian noise. And if we take the initial starting condition that we had for epsilon equals zero and add some noise, now for epsilon equals one, you can see that we drift to a slightly different flower. If you add a little more noise, you end up with something that is not even reminiscent of the original structure, other than maybe the colors. And if you increase the noise even more, you can end up at an entirely different image in your data set. So there's some stuff to think about here about a framework for the initial parts of the generating process. The caveat is that the more noise you want to add, the more stochastic steps you need to take in these integrators. So it's not necessarily clear that the benefits we saw in the Gaussian mixture case from using an SDE really apply to costly image-generation-type problems. Okay, so just as a quick summary: we discussed a method we're calling stochastic interpolants, which allows us to build deterministic or stochastic generative models between arbitrary densities, and hopefully it provides a language for designing new types of maps like this. Some questions going forward, though: can we use this interpolant paradigm to study the inductive bias in transport-based generative models? What are the assumptions we've made in the ODE and SDE frameworks, and are they realistic for discerning any discrepancy between the two? Are there better ways to sample? Are there ways to use this for variational-inference-type tasks, and ways that we can actually study things a little more closely? And if you want some more information, you should see Nick, who's here for the week.
If you want to learn a little bit more about optimizing the transport, in terms of optimal transport or the Schrödinger bridge problem, you can check out the two papers; more experimental details and some preliminary code are available below. Thanks. Thank you very much, Michael. Do we have questions from the audience? I think it should work better when you're very high dimensional. Well, I would say it's not necessarily the case that this is going to work better, because in fact you could think of score-based diffusion as a subset of this. I think I actually have a backup slide; I thought maybe someone might ask this. If you want to think about score-based diffusion as an interpolant, then you choose the time-dependent coefficients on the initial data x0 and the Gaussian noise z to be the following terms: take e to the minus t and the square root of one minus e to the minus two t, and you take time to infinity. That's score-based diffusion right there. So in principle, they should be doing approximately the same stuff. If you actually benchmark them on some image data sets, you're going to see that the FID scores that people compute to compare how well they're doing on their generative modeling task are all in about the same ballpark. The one thing I would say that actually is a nice feature of score-based diffusion is that you only have to learn one term, just the score, for both the ODE and the SDE. If you want to use the ODE for what we've described here, then you need to learn just B, but if you want to use the SDE, then you need to learn B and S. The one thing I will say is that the cases where I think this is more useful are in scientific domains, scientific problems for generative modeling, because you can more exactly compute the likelihood and trust those numbers, since we don't truncate at any point. They take their large T, and they have to truncate at some point; this introduces a bias into those quantities. Yeah.
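The diffusion-as-interpolant coefficients just quoted are easy to check numerically. A small sketch verifying that the choice x(t) = e^{-t} x_data + sqrt(1 - e^{-2t}) z is variance-preserving at every t and forgets the data as t grows, which is the sense in which score-based diffusion sits inside this framework.

```python
import numpy as np

# Hedged sketch: the score-based-diffusion choice of interpolant coefficients,
#   x(t) = e^{-t} x_data + sqrt(1 - e^{-2t}) z,   z ~ N(0, I).
t = np.linspace(0.0, 10.0, 101)
a = np.exp(-t)                       # coefficient on the data
b = np.sqrt(1 - np.exp(-2 * t))      # coefficient on the Gaussian noise

# Variance preservation: a(t)^2 + b(t)^2 = 1 at every t.
print(np.allclose(a**2 + b**2, 1.0))

# The data coefficient decays exponentially; by t = 10 it is negligible,
# which is the truncation-at-large-T issue mentioned in the talk.
print(a[-1] < 1e-4, round(b[-1], 6))
```

The point of the comparison in the talk is that this particular path only reaches the Gaussian exactly at t = infinity, whereas the interpolants above hit both endpoints exactly on [0, 1].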
Ah, boom, proved. That wasn't actually going to be our question. Yeah, so a proof is, I think, hard in general for that case if you don't want to think about Gaussians. Nick might have some ideas, though, because he has some toy examples of this case. This is the case, but I think there's some Gaussianity hidden there. But in a practical sense, for example, one common thing in scientific computing is sampling under some target density that we want, where we only have access to the log-likelihood. If you want to build a map, you'd hope maybe you could build a map to some nearby theoretical description that you can evaluate analytically, some perturbation of a Gaussian system that is close to the target system that you want. This would be a paradigm that allows you to build structured information into that map. No problem. So can you clarify what you meant by the first density? You're saying basically like a Dirac measure or something strange? Okay. Yeah, well, there's some trouble that emerges when one density exists on a lower-dimensional manifold. People have done a lot of work on trying to write down what these transport maps look like when you're not perfectly preserving the topology of the space, and there are some tricks, like augmenting to, you know, R3, that you can do that then make everything work. But in general, if something was in R2 and going to R3, this wouldn't really work. Yeah, no, there are ways to do that. Yeah. But you just can't actually change dimensions. And, yes. Your interpolant has a lot of freedom, right? Take the simplest case of just a Gaussian over here and a Gaussian over here. Two ways to do it would be to do what you put in the picture, which is to sweep the Gaussian over; but another way, which is a shorter distance in information space, is to just grow one Gaussian and shrink the other in place. Right. Can you speak to what guides sane choices of the interpolant?
Yeah, so actually, of the two you've talked about, one amounts to what's called a Moser flow, and you can do it by interpolating at the level of the densities themselves. That's the one where the Gaussians appear and disappear in place. The problem is that it results in a pretty ill-conditioned velocity field, so the velocity field you'd have to learn to do something like that, and we can talk about it offline if you like, doesn't look nice in any sort of learnable sense. If you wanted to do the one where you're just shifting it over, then an interpolant that uses trigonometric terms, so a cosine of pi t over two and a sine of pi t over two, would give you something that is variance-preserving along that path, which should work a lot better. Any other questions? Well, if not, let's thank Michael again.
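The variance-preserving property of that trigonometric interpolant can be checked directly. A toy sketch, under the assumption of unit-variance Gaussian endpoints: cos(pi t / 2) x0 + sin(pi t / 2) x1 keeps the variance of x(t) constant along the path, while the linear interpolant's variance dips at the midpoint.

```python
import numpy as np

# Hedged sketch: comparing the trigonometric interpolant
#   I(t, x0, x1) = cos(pi t / 2) x0 + sin(pi t / 2) x1
# against the linear one, for two unit-variance Gaussian endpoints.
rng = np.random.default_rng(0)
n = 200000
x0 = rng.standard_normal(n)          # Gaussian "over here"
x1 = rng.standard_normal(n)          # Gaussian "over there"

def trig(t):
    return np.cos(np.pi * t / 2) * x0 + np.sin(np.pi * t / 2) * x1

def linear(t):
    return (1 - t) * x0 + t * x1

# Boundary conditions hold for both; variances differ along the path:
# cos^2 + sin^2 = 1 keeps the trig path at variance 1 for all t, while
# the linear path drops to (1-t)^2 + t^2 = 0.5 at the midpoint.
for t in (0.0, 0.5, 1.0):
    print(t, round(trig(t).var(), 2), round(linear(t).var(), 2))
```

This is one concrete sense in which the choice of interpolant conditions the velocity field you have to learn, as the answer above suggests.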