Okay. For the sake of time, let's start the last session of today, which is going to be a second session on generative models, and I would argue a session on generative models for scientific computing. The first talk will be given by Ajil, who, unfortunately, is still in Berkeley. Take it away.

All right. Thank you for having me. I'm sorry I couldn't make it there; I was really looking forward to it, but I had problems with my visa. The title of this talk is compressed sensing using generative models. It will be split roughly half and half between theory and practice, and this is joint work with my collaborators from UT Austin. Here's a brief outline of the talk: I will spend some time on background, then describe the posterior sampling algorithm we've been looking at recently, and finally applications to compressed sensing MRI.

So what is the problem of compressed sensing? Very simply put, you want to recover some unknown signal, for example an image, from noisy linear measurements. For example, if you want to get a brain scan from a patient, you put them into an MRI machine, the machine takes some readings of their brain, and from those readings you would like to reconstruct the brain scan. Similarly, in astronomy and oil exploration, you get some measurements of an unknown signal and you would like to reconstruct that signal from those measurements.

In the simplest form, we model this as a system of linear measurements, where y denotes the measurements you obtain, A denotes a matrix that takes you from the image to the measurements, x* denotes the unknown image or signal, and there may be some additive noise. For notation's sake, m is the number of measurements you obtain and n is the dimensionality of the signal you are trying to reconstruct. The fundamental questions in compressed sensing are how many measurements you need to recover the signal, and what algorithm you should use for recovery.

Now let's take an even simpler case where there is no noise in the measurements. We can write this as a system of linear equations where, again, you have the measurements, you have the measurement matrix, and you are trying to recover the unknown image from them. Elementary linear algebra tells you that if you have fewer observations than there are dimensions in the signal, this is an under-determined system of linear equations with infinitely many solutions, and hence uniquely recovering x* is impossible if m is smaller than n.

However, if you impose additional structure on the problem, it is indeed possible to recover x* exactly from very few measurements. For example, the seminal result by Candès, Romberg, and Tao in 2006 showed that if x* is k-sparse, meaning it is a long vector with at most k non-zero entries and the rest of the entries zero, then the number of measurements you need scales as the number of non-zeros times the logarithm of the dimension of x*. This also assumes that A, the measurement matrix, has some nice properties; for example, if it is drawn Gaussian i.i.d., it will satisfy them. The intuition is that since you are only considering vectors that are sparse, these form a low-dimensional subset of R^n.
Since you are choosing the k entries where x* is non-zero, there are only n choose k choices for the support, and the number of observations you need is roughly the log of the number of choices for the support, which is k log(n/k). The algorithm for recovery is simply the lasso: you optimize for an image x such that A x is as close as possible to y in, say, the L2 norm, and you add a penalty on the L1 norm of x to ensure that it is sparse. If you run this with an appropriate choice of lambda, you will exactly recover x* using very few measurements.

A natural question to ask is why people care about sparsity. It's because if you look at images in a suitable harmonic basis, for example the wavelet basis, they are sparse: if you take this image of a castle and perform a wavelet transform, you see that most of the wavelet coefficients are negligible. Instead of recovering the image in the pixel domain, you can recover it in the wavelet domain, and since this is a bijective transform, it is enough to recover the sparse wavelet coefficients from compressed measurements of the image. This is the classical 1990s and 2000s signal processing approach to compression and compressed sensing, and many algorithms like JPEG exploit sparsity in an appropriate harmonic basis to achieve tremendous compression rates.

In this talk, we will not use sparsity anymore; we will talk about how we can leverage machine learning. I'm not trying to be adversarial to sparsity; I only mean that this is a more contemporary version of it. The intuition is that every year millions of MRIs are taken, and this gives us a data-driven way of modeling these images. If you have millions of images, you should be able to look at them and develop a better, more sophisticated understanding than simply saying they are sparse in a wavelet basis. This structural understanding of what a brain MRI looks like should allow you to take fewer measurements in the future, which implies shorter scan times and makes MRIs more accessible and inexpensive. And if you want to model images in 2023, the best way, in my opinion, is through deep convolutional neural networks or transformers; in this talk in particular, I am going to concentrate on generative models.

We had prior work in 2017 which considered low-dimensional generative models such as GANs and VAEs. These are continuous, differentiable functions from a low-dimensional space R^k to a high-dimensional space R^n, where k is much smaller than n. The rough idea is that since the model is sufficiently smooth and is a mapping from a low-dimensional space to a high-dimensional space, the range of images captured by your generative model is low-dimensional. And this is analogous to what... Can you hear us, Ajil? Sorry, my Zoom crashed. I'm not sure what went wrong there. Okay, just give me a minute. Yes, sorry about that. Okay. So, I'm sorry about that, I don't know what happened. The first thing we were able to prove back in 2017 is that if you have a generative model that is Lipschitz, then the number of measurements you need scales as the input dimension of your generative model times the logarithm of the Lipschitz constant.
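To make the classical pipeline concrete before moving on, here is a minimal sketch, not from the talk, of sparse recovery from Gaussian measurements via the L1-penalized least-squares problem described above. The dimensions, noise level, and penalty are illustrative, and scikit-learn's Lasso uses its own scaling convention for the objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative sizes: n-dimensional signal with k nonzeros, m << n measurements.
n, k, m = 1000, 10, 200
rng = np.random.default_rng(0)

# k-sparse ground truth x* and an i.i.d. Gaussian measurement matrix A.
x_star = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_star[support] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)      # scaling keeps ||A x|| comparable to ||x||

# Noisy linear measurements y = A x* + eta.
y = A @ x_star + 0.01 * rng.normal(size=m)

# l1-penalized least squares; alpha plays the role of lambda (illustrative value).
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```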
Coming back to generative models: if you are now considering ReLU neural networks with d layers, then the number of measurements scales as the input dimension times the depth of the network times the logarithm of the output dimension. These results are really analogous to what we find in sparsity, where the bound was just k log n; now we have these substitutes, which are the Lipschitz constant of the neural net, or its depth times the logarithm of its output dimension. There are also matching lower bounds. And generative models are actually a strict generalization of sparsity: one can create handcrafted generative models that output sparse vectors, but if instead someone gives you data, you can learn generative models that capture richer structure in that data. The algorithm is also a generalization of the lasso, and it is provably polynomial time in certain settings, though that part is work that is not by us.

So low-dimensional generative models are good because they allow you to prove theorems that are very analogous to what we find in sparsity. But empirically, when we run them, we find a very interesting phenomenon. We trained a generative model that goes from a hundred-dimensional space to roughly a 12,000-dimensional space. The y-axis here shows the reconstruction error between the estimated image and the true image, and the x-axis shows how many measurements each algorithm was given. The red line is the algorithm we proposed in 2017, and the green and blue lines show the performance of the lasso. What we find is that the red line drops off very quickly when you have few measurements, but once you hit around 500 measurements it plateaus. The reason is that we only have a model going from R^100 to a 12,000-dimensional space, with some fixed number of layers, and once you hit 500 measurements you have already found the best possible image the generative model can produce. This experiment was run on human faces: you take someone's face, you take measurements of that face, and you try to reconstruct what the person's face would be. Since the generative model is not perfect and cannot capture every human face, it has some fundamental representation error, a limited capacity, which leads to this plateauing phenomenon: after 500 measurements, giving the algorithm more information doesn't help, because the generative model simply cannot do any better. In the lasso we can tune the lambda parameter, which allows for more or less capacity, but in generative models this low-dimensional latent space is fixed, and hence there is nothing you can tune unless you train a new generative model from scratch.

Another interesting phenomenon, observed by Asim et al., is that if you train a GAN on human faces, it tries to impose a face on everything, so it has a huge dataset bias. For example, on this image of a barn that has something that looks like eyes and a mouth, a generative model tries to reconstruct a face, whereas the lasso gets grainy images but is at least able to tell that this thing is not a human face. This tells you there is something really problematic with using low-dimensional generative models or GANs. And around this time is when diffusion models and score-based generative models started becoming more popular.
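Before turning to diffusion models, here is a rough sketch, again not from the talk, of the latent-space recovery behind the 2017 approach: you search over the generator's inputs for a latent code whose output matches the measurements. The generator here is a toy stand-in, and the sizes, learning rate, and step count are arbitrary; the point is that the estimate always lies in the fixed range of G, which is exactly why the error plateaus once the measurements exceed what that range can express.

```python
import torch

# Toy stand-in generator G: R^k -> R^n (in practice a trained GAN or VAE decoder).
k, n, m = 100, 12288, 500
torch.manual_seed(0)
G = torch.nn.Sequential(
    torch.nn.Linear(k, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, n),
)

A = torch.randn(m, n) / m ** 0.5           # i.i.d. Gaussian measurement matrix
x_star = G(torch.randn(k)).detach()        # pretend the truth lies in the range of G
y = A @ x_star + 0.01 * torch.randn(m)     # noisy linear measurements

# Recover by searching over the latent space:  min_z || A G(z) - y ||^2
z = torch.randn(k, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((A @ G(z) - y) ** 2).sum()
    loss.backward()
    opt.step()

x_hat = G(z).detach()   # the estimate always lives in the (fixed) range of G
```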
Score-based models appeared around 2018 or 2019, and diffusion models around 2020. For those of you not familiar with them, I will briefly go through what a diffusion model is, without talking about how it is trained, and the rest of my talk is basically going to argue that we should all be using diffusion models and posterior sampling with diffusion models.

Diffusion models address the following problem. If you have a probability distribution p and some variance beta squared, let p-tilde be the annealing of p: take p and convolve it with a Gaussian of variance beta squared, which makes p a little smoother. Of course, p-tilde is the distribution of a random variable that is drawn from p and then has Gaussian noise of variance beta squared added to it. There exist algorithms such that, given i.i.d. samples x_1, ..., x_m from your distribution p, a diffusion model can be trained as a function from R^n x R to R^n, where the first argument is a point x in R^n and the second is beta, the amount of annealing applied to the distribution. The diffusion model is trained to approximate the gradient of the log of the annealed density at x. The key point is that diffusion models are trained over a whole range of betas, so you can pass in any beta in that range and the model approximates the gradient of the log annealed density at that noise variance. And since diffusion models are supported on the whole of R^n, they don't suffer from the same capacity problems that GANs do; we also observe that they don't exhibit the same dataset biases observed in GANs. Before I go to the next part, are there any questions? Feel free to interrupt me if there are.

Now I'll talk about how we used diffusion models for posterior sampling, in our experiments and in theory. Posterior sampling is a very intuitive algorithm. Assume your signal comes from some probability distribution, and you are given linear measurements with Gaussian noise. Then a possible estimate for the unknown x* is to draw an x-hat distributed according to the posterior distribution of images conditioned on the measurements. If you ask a Bayesian, they would tell you this is the obvious thing to do, but I think until very recently we just didn't have generative models rich enough to capture this posterior distribution and to allow you to sample from it.

If you want to perform posterior sampling, you can approximate it via Langevin dynamics, which is the following iterative procedure: you initialize your image to i.i.d. Gaussian noise, and then, for T large enough, you repeat an update in which the image at time t+1 is the image at time t, plus a positive step size times the gradient of the log posterior, plus Gaussian noise whose variance is proportional to the step size you just took. If you write this posterior out using Bayes' rule, you get the gradient of the log marginal density, plus the gradient of the log likelihood, minus the gradient of the log normalization factor. But since you are differentiating with respect to your current image, the gradient of the normalization factor with respect to the image is zero, and that term goes away.
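In symbols, the ingredients just described are roughly the following (with s_theta denoting the learned score network and zeta_t fresh Gaussian noise; the annealing subtlety discussed next still applies):

```latex
% Score model: approximate the gradient of the log annealed density
s_\theta(x,\beta) \;\approx\; \nabla_x \log \tilde p_\beta(x),
\qquad \tilde p_\beta = p * \mathcal{N}(0,\beta^2 I).

% Langevin dynamics targeting the posterior p(x \mid y):
x_{t+1} = x_t + \eta_t \, \nabla_x \log p(x_t \mid y) + \sqrt{2\eta_t}\,\zeta_t,
\qquad \zeta_t \sim \mathcal{N}(0, I).

% Bayes' rule splits the posterior gradient; the normalizer p(y) does not depend on x:
\nabla_x \log p(x \mid y) \;=\; \nabla_x \log p(x) + \nabla_x \log p(y \mid x).
```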
The diffusion model provides an approximation for the gradient of the log marginal, and the log likelihood term is Gaussian, because the measurements were generated through a linear transformation plus Gaussian noise. If you write that out, the gradient of the log Gaussian density becomes the expression shown here, the gradient of the normalization factor is constant with respect to x and so vanishes, and you have a closed-form expression for what the Langevin updates should be. So, assuming you have access to the true distribution of the image, and you are given linear measurements with Gaussian noise, you can write out the update rules for Langevin dynamics to sample from the posterior distribution, and the gradient-log-p term comes from the diffusion model.

Here is a bit of a disclaimer: when I was running these experiments in 2020, diffusion models were not very well understood, and neither were algorithms built on top of them, so it wasn't clear how one should be sampling. There has been a lot of progress since, but three years ago I essentially hacked this procedure to get it to work in practice. The reason things break is not that the equations on the previous slide are incorrect; they are correct, but they don't account for the fact that diffusion models learn the annealed distribution. While I said a diffusion model approximates the gradient of the log marginal of your image distribution, diffusion models don't actually do this: they learn the gradient of log p-tilde, where p-tilde is an annealed, modified version of p. If you write out the earlier equations while properly accounting for the annealing, the derivation no longer goes through. A more principled version appeared at ICLR 2023, by these authors, so it took roughly two or three years to figure out a principled way of using diffusion models that accounts both for posterior sampling and for the annealing that happens while training them. In my experiments, I simply introduced some hacks and hyperparameters that seemingly allowed me to get things to work, at least empirically.

We then ran our experiments on the Flickr Faces (FFHQ) dataset. Recall that I said GANs suffer a huge dataset bias. The top row here shows high-resolution images; the second row shows blurry, downscaled versions of them; the third row shows reconstructions by PULSE, which is an algorithm that uses a GAN and does MAP estimation, and you can see that it really destroys ethnic and racial features. If you instead use diffusion models and posterior sampling, we seem to preserve these details much better. The key point is that the GAN and the diffusion model were trained on the same dataset, Flickr Faces, so whatever racial biases exist in that dataset are inherited by both generative models, yet clearly one is better than the other. They also differ algorithmically: PULSE produces a point estimate via MAP estimation, whereas posterior sampling is randomized and samples from the posterior. And recall that diffusion models are supported on the whole ambient space R^n.
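A schematic code version of those updates looks like the sketch below. The score_model, step size, and step count are placeholders, and, as discussed above, a practically correct sampler must also account for the annealing in the learned score, which this sketch deliberately ignores.

```python
import torch

def posterior_langevin(score_model, A, y, sigma, n, steps=1000, eta=1e-5):
    """Schematic Langevin sampler for p(x | y) with y = A x + N(0, sigma^2 I).

    `score_model(x)` is assumed to return an approximation of grad log p(x);
    a real implementation would anneal over noise levels, which is omitted here.
    """
    x = torch.randn(n)                                   # initialize from Gaussian noise
    for _ in range(steps):
        prior_grad = score_model(x)                      # grad log p(x) from the diffusion model
        likelihood_grad = A.T @ (y - A @ x) / sigma**2   # grad log p(y | x) for Gaussian noise
        x = x + eta * (prior_grad + likelihood_grad) + (2 * eta) ** 0.5 * torch.randn(n)
    return x
```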
Given that diffusion models are supported on all of R^n, it's not clear why we should be able to use them for compressed sensing: most compressed sensing results, in fact almost all of them, require some lower-dimensional structure in the image that can be exploited to get away with fewer measurements. So the question now is: if we use diffusion models, how do we characterize what this lower-dimensional structure is?

To achieve this, we propose a new notion of approximate covering numbers, which is a measure of the complexity of a probability distribution. Given any distribution p and positive parameters epsilon and delta, the approximate covering number of p at (epsilon, delta) is the smallest number of epsilon-radius balls needed to cover 1 - delta of the mass of p. Looking at this heat map of a probability distribution, the blue areas are low-density regions and the red areas are high-density regions. If I place epsilon-radius balls along the diagonal, their union covers most of the mass under p; some mass lies outside, but the union of the balls covers a large fraction. And if you count these balls, they lie only along the diagonal rather than filling the whole plane. This tells you that if you are willing to throw away a little of the probability mass captured by a diffusion model, there may exist some lower-dimensional structure, and you should be able to exploit it to get compressed sensing results similar in spirit to the classical ones for sparsity. We argue that this notion of approximate covering numbers exactly characterizes how many measurements you need for compressed sensing recovery with a given probability distribution.

We have the following theorems, which work for i.i.d. Gaussian matrices A. In this setting the variance of A is scaled so that the norm of your measurements is roughly the same as the norm of the ground truth x*, and the additive noise eta is scaled so that it has norm roughly epsilon. The upper bound says that if you run posterior sampling for this measurement process, where A is Gaussian and the noise is also Gaussian, then the number of measurements you need scales logarithmically in the (epsilon, delta) approximate covering number of the distribution of x*, and the posterior sample x-hat is epsilon-close to x* in L2 norm with probability 1 - 3 delta. This is the sort of guarantee you would get in frequentist compressed sensing, but for an algorithm that is randomized, and there is also a failure probability that depends on how much mass of your distribution p you are not covering, because recall that the (epsilon, delta) covering number only covers a 1 - delta fraction of p with small-radius balls. So this result says that if you do posterior sampling with a number of measurements scaling logarithmically in this covering number, the posterior sampling estimate is epsilon-close in L2 norm.
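Written out, the definition and the guarantee just stated are roughly the following (constants suppressed; this is a paraphrase of the statements above, not the exact theorem):

```latex
% (epsilon, delta)-approximate covering number of a distribution p on R^n:
C_{\epsilon,\delta}(p) \;=\; \min\Big\{ N \;:\; \exists\, x_1,\dots,x_N
   \text{ such that } p\Big(\textstyle\bigcup_{i=1}^{N} B(x_i,\epsilon)\Big) \ge 1-\delta \Big\}.

% Upper bound (Gaussian A, noise of norm roughly epsilon): if
m \;\gtrsim\; \log C_{\epsilon,\delta}(p_{X^\ast}),
% then the posterior sample \hat{X} satisfies
\|\hat{X} - X^\ast\|_2 \le O(\epsilon)
\quad\text{with probability at least } 1 - 3\delta .
```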
Furthermore, we are able to show a robustness result: if there is a mismatch between the distribution of your ground-truth image, say it is drawn according to some distribution r, and you only have an approximate posterior, meaning you sample x-hat from p(x | y), then the upper bound still holds as long as the distribution r of your image and the distribution p of your generative model are epsilon-sqrt(delta)-close in Wasserstein distance.

The reason I say that posterior sampling is almost optimal is that we can show a very close lower bound: if your image is drawn according to p, and assuming the support is compact, meaning the norm of x* is at most R, then any reconstruction algorithm that gets epsilon-close to x* with probability 1 - delta requires a number of measurements that scales as the logarithm of the (epsilon, delta) covering number. The reason I say almost optimal is that there is an extra multiplicative log(1 + R^2) factor in the denominator, and the constants multiplying epsilon and delta differ between the lower and upper bounds: in the upper bound the covering number is taken at (epsilon, delta), while in the lower bound it is taken at roughly (3 epsilon, 4 delta).

An interesting thing about this lower bound is that traditional compressed sensing lower bounds construct a minimax instance p that is hard to reconstruct and then prove a lower bound for that instance. In contrast, this lower bound works for any probability distribution: you give me a distribution p of your choice, and the lower bound holds for it. It is not a minimax lower bound but an instance-wise lower bound, and our approach to proving it is purely information-theoretic. Recall that the upper bound said that if you want to get epsilon-close with probability 1 - 3 delta, the number of measurements you need is the log of the (epsilon, delta) covering number. To show the lower bound, you should think of the whole process as a Gaussian channel: the mutual information between x*, the ground-truth image, and x-hat, the reconstructed image, is at most the mutual information between x* and y.
The first inequality is the data processing inequality, and the second follows from Shannon's AWGN channel capacity theorem, which says that the mutual information between x* and y is at most m times the log of the SNR; this applies because y is a linear transformation of x* with Gaussian noise added, so you can directly apply the AWGN result. The only part that remains is to show that if you are able to get epsilon-close to x*, then the mutual information between x* and x-hat must be at least the log of the (epsilon, delta) covering number. This is very close in spirit to Fano's inequality; the main difficulty is that Fano's inequality is typically proven for a uniform distribution over a discrete set of elements, whereas here p is continuous and arbitrary, and we still need the lower bound. I won't talk about how we show it, but we are able to prove a version of it. If you now take the capacity inequality, which upper-bounds the mutual information between x* and x-hat in terms of the number of measurements, and combine it with this lower bound on the mutual information in terms of the log covering number, the lower bound transfers directly: the number of measurements you need must be at least the log of the (epsilon, delta) covering number.

Now I'll talk about applications of posterior sampling that we have been working on for the last year or two. Compressed sensing MRI has become very popular, and devices using deep-learning-based compressed sensing techniques are actually receiving FDA approval. The basic way to model MRI as a compressed sensing problem is the following equation. Think of x* as your brain scan. Once you put a person into an MRI machine, the machine effectively performs a Fourier transform of the brain at some discretization level, so F here is the discrete Fourier matrix, and it then selects some subset of the Fourier coefficients, based on a choice by the lab tech or the radiologist. So the whole process is: your brain gets Fourier transformed, the machine selects a subset of those Fourier coefficients, and there is some additive Gaussian noise of variance epsilon squared. This is very clearly a linear measurement process, so everything I said directly transfers: you can just run Langevin dynamics for this compressed sensing MRI problem.

We compared our results to MoDL and VarNet, which are state-of-the-art deep learning techniques trained to take one-third of the Fourier coefficients and reconstruct the brain: you feed in a particular third of the Fourier coefficients and these deep learning models output brain reconstructions. What we find is that if we run Langevin dynamics using a diffusion model trained on brain scans, we are competitive. The leftmost image shows the ground truth that a radiologist would like to see, the next two columns show reconstructions by the state-of-the-art deep learning baselines, and the last column is our algorithm using Langevin dynamics and diffusion models. We are competitive with these models, which is quite surprising, because they have been trained end to end.
They were trained on examples of the form: here is one-third of the Fourier coefficients, and here is the brain scan I would like to see; you train this in a regression-like framework to take in Fourier coefficients and output a brain scan. Since they have been optimized for exactly this setting, we expected to be beaten by MoDL and VarNet by a much larger margin, but what we find is that Langevin dynamics, which has no prior information about the measurement process or about which Fourier coefficients will be selected, is competitive with these algorithms.

Now change the measurement process a little: instead of taking one-third of the Fourier coefficients, take one-twelfth. The deep learning baselines fail, because they have only been taught how to use one-third of the coefficients to get a good brain scan out, whereas Langevin dynamics doesn't care at all: it has been trained only on brain scans, and if you tell it you are now giving it one-twelfth of the Fourier coefficients, you run the same algorithm again and only have to change the measurement matrix, because you have gone from one-third to one-twelfth. As long as you can specify that matrix to Langevin dynamics, it gets reconstructions that are much better than these baselines. This is one reason to use generative models: they are much more modular than end-to-end deep learning techniques for compressive reconstruction. You can use the same generative model, and recall that these models are very expensive to train, so if you train one, you can use it for any lab setting or scan setting in the clinic.

Another interesting thing we observed is that if you train all of these models on brain scans and then test them on a similar compressed sensing problem but with knee scans, the algorithms don't break completely, and we find that Langevin dynamics tends to be a bit more robust. In this case you get one-fourth of the Fourier coefficients of a knee scan. Again, the left column shows the ground truth that a radiologist would like to see, the gold standard, and the next two columns are the end-to-end deep learning baselines, which seem to introduce many more artifacts than Langevin dynamics; Langevin still has some artifacts, which you can see here, but this was very surprising, since none of these models has ever seen a knee before. They were trained purely on brain scans, and you feed them Fourier coefficients of a knee. The Fourier coefficients do contain the information that this is a knee, but since you have heavily sub-sampled and kept only a quarter of the coefficients, there will be some blurriness. These models know that a brain has certain smoothness and structural properties, and they combine that with the information in the Fourier coefficients of the knee you just scanned to produce pretty reasonable knee reconstructions, for models that have never seen a knee or know what a knee is. Quantitatively, we find that Langevin dynamics tends to be better than the other algorithms: we tested it across different anatomies, scanning techniques, and acceleration factors, where we tune how many Fourier coefficients are shown to the models, and the blue line, which is Langevin, tends to be higher than the rest in the majority of cases.
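To make the modularity point concrete, here is a rough sketch, with an illustrative random mask and image size, of the subsampled-Fourier forward operator described earlier; moving from one-third to one-twelfth of the coefficients only changes the mask handed to the same Langevin sampler, while the trained diffusion model is untouched.

```python
import torch

def mri_forward(x, mask, sigma=0.0):
    """Subsampled-Fourier measurements  y = P F x + noise  for a 2-D image x."""
    k_space = torch.fft.fft2(x, norm="ortho")                 # F x: discrete Fourier transform
    noise = sigma * (torch.randn_like(k_space.real)
                     + 1j * torch.randn_like(k_space.real))   # complex Gaussian noise
    return mask * (k_space + noise)                           # P: keep only sampled coefficients

# Going from ~1/3 to ~1/12 of the Fourier coefficients only changes `mask`;
# the same diffusion prior and the same Langevin updates are reused in both settings.
H, W = 256, 256
mask_third   = (torch.rand(H, W) < 1.0 / 3.0).float()
mask_twelfth = (torch.rand(H, W) < 1.0 / 12.0).float()
```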
Another interesting application we have worked on is motion correction. During an MRI scan, if a patient moves their head, either rotating it or translating it, the movement manifests as a blur. For example, this L1-wavelet reconstruction shows what happens if a patient moves: you can't really tell what the underlying image was. But assuming the motion is rigid, you can modify the Langevin updates to estimate both the brain scan and the angle and position the brain was in. If you do that, our reconstruction here, which uses motion-corrupted measurements, has quality close to what L1-wavelet would obtain if there were no motion at all: the third column is L1-wavelet with no motion, whereas the fourth column is our Langevin technique even with motion corruption in the measurements. The bottom row shows comparisons of a zoomed-in section of the ground-truth image: the ground truth is here, this is L1-wavelet with motion corruption, L1-wavelet without motion corruption, and this is ours with motion corruption, and you can see that it is much better. These are the errors of each algorithm, magnified by a factor of 10.

Okay, that concludes my talk. Here is the code for our MRI project; we have also released our models publicly, so if you would like to use them, please go ahead. And these are the references from my talk. Thank you.

Do we have questions for Ajil?

Hi, very nice talk. I was wondering if you could tell us something about the comparison between the computational complexity of the Langevin approach you proposed versus the one based on deep learning.

Right. Just to be clear, we don't have any theory guarantees for how long Langevin dynamics would take on this problem; I'm speaking purely about processing time on the computer. It took me roughly five to six minutes per image, whereas the end-to-end baselines are much quicker, taking five to ten seconds. But I think that was only true in 2020 or 2021. Now there are sampling techniques, MCMC-style techniques (I don't know if they are exactly doing posterior sampling), that run in ten to twenty seconds, so speed is no longer a problem.

Hi. I would like to know: if you are able to sample from the posterior instead of just taking one sample, why wouldn't you average over it? Shouldn't that be better?

Oh, yes. The pragmatic reason is that it was very slow, so averaging would make it even slower. Also, if I sampled more than once, I could cherry-pick whichever reconstruction looks best, but that would violate the statistical validity of my results, so we chose to take the first sample. We did run experiments where we sample multiple times and then create uncertainty estimates, and we saw some diversity over multiple rounds of Langevin dynamics. So it is totally valid to do this multiple times and take the average or median or some other statistic. Okay, thank you.

I have a question, so I guess I will go ahead. Thank you very much for the nice talk. I would argue that one of the risks of algorithms based on generative priors is that the generative model actually carries a lot of prior information,
and in some sense is able to hallucinate a solution even when you don't have enough measurements to actually do a reconstruction. So you could get something that looks very good and be very confident that it is the right one, when maybe it is not. I don't know if you have any comment on how your methods deal with this issue.

Yeah, we don't have a comparison right now, and actually it's not just generative models; all deep learning models have some artifact or another. Even MoDL and VarNet, which you would think are more robust, start hallucinating artifacts if you train them on MRI scans done with, say, GE scanners and then use them on Siemens scanners. It may not look like a tumor, but there will be some shading, some artifacts that appear. So yes, both end-to-end deep learning baselines and generative model techniques do hallucinate. I guess the threshold should be that radiologists are human and they do make mistakes, so as long as the generative model or the deep learning model makes mistakes that are rarer than a radiologist's, I think they are okay. But obviously this needs a lot more study, and I know that my collaborator Jon Tamir is working with radiologists at UT's medical center to try to see how these compare. Okay, thank you very much.

Maybe one last quick question? Okay, if not, I guess for the sake of time we will leave it there. Thank you very much, Ajil.