Hello, everyone. My name is Wasim. I'm an entrepreneur in residence at fast.ai, and I'm currently at the fast.ai headquarters in Australia, although I'm originally from Cape Town, South Africa. I'm joined here today by Tanishk. Tanishk works at Stability AI, and we've been working together with a couple of other people on diffusion models and generative modeling, and that's been super fun. So Tanishk, do you want to introduce yourself as well?
Yeah, so my name is Tanishk. I'm a PhD student at UC Davis, but I also work at Stability AI. I've been exploring and playing around with diffusion models for the past several months, and it's been great to explore that with the fast.ai community as well in these last few weeks.
Awesome. Cool. So this talk is me trying to understand the math behind diffusion. If you've done the fast.ai courses before, you know that you don't need to understand the math to be effective with any of these models. In fact, you don't even need the math to do novel research and contribute to these. But for me, it came out of interest. I think it's kind of beautiful how diffusion models were discovered, and I think a large part of that was thanks to some really clever math. So I wanted to understand that. I don't have a math background, so I want to help describe how I think about it and how you can interpret all of these notations. Cool. So I can just dive into it, I think. The first bit of math that we see in this paper is q of x superscript zero, and they call this the data distribution.
Do you want to mention exactly which paper this is?
Right, good question. This is the 2015 paper. Do you remember the authors of that paper, Tanishk?
I think it's Jascha Sohl-Dickstein, who now works at Google, I think, and it's from Surya Ganguli's lab.
Yeah. So this was the paper, as far as I understand, that introduced this idea of diffusion: 2015, by those authors. They start out by defining this data distribution, and they use this notation, and already a lot of people, myself included, find this quite confusing. But let's go through what's described here. So they have an x. In math, x is often used as the input variable, much like y is often used as the output variable. The fact that it has a superscript also implies something: x superscript zero implies that there might be a sequence of x's. I think it's useful to get comfortable with this idea of simple, compact notation implying a lot more than might be obvious at first glance. So x tells us something about this quantity: it's an input variable. And the zero implies that there might be others: you might have an x1, an x2, and so on. We'll see that later. The third part is q. q is what we call a probability density function. The first part here is probability, and the question is, what does q have to do with probabilities? Well, usually we use the letter p to describe probability density functions of interest, and because q comes right after p, it's another common one. So it's kind of like how you use x and y: we use p and q. And the fact that we use q here instead of p suggests that there might be a p that we'll introduce later. Maybe p is the thing that we're modeling, and q is supplementary to that. Does that sound right, Tanishk?
Yeah, yeah. And I think it's also helpful to think about x zero in a more practical, concrete way.
Of course, if we work with images, then x zero is what's representing the images. So it's useful to think about it from that concrete, practical angle as well.
Right. So x zero might be an MNIST digit. And then we've got q. I'll just use this to mean q is some function, so we look at it as a box. It takes in x zero, and it gives us the probability that this x zero, which is an image, looks like an MNIST digit. In this case, this would be 0.9, or maybe even 0.98. So this is quite a high probability that this is an MNIST digit.
Hi, this is Jeremy. Can I jump in for a moment?
Please do.
Oh, thank you. I just wanted to double check. This looks a lot like the magic API that we had at the start of lesson one, where you feed in a digit and it gives you back a probability. Is that basically what q is doing here?
Absolutely. It's a magic API. That's a good way to think of it. We couldn't write down what q is, but we imagine that somebody has it in some way. So this is a concrete example, and if you did something to this image, you might get a smaller number. Another thing worth mentioning here is probability density functions. These are the magic APIs that give us a number that tells us how likely the thing is. They very rarely make it all the way into your code. But it turns out that they are very useful tools for working with random quantities, because they allow you to represent random quantities as ordinary functions. And because they're functions, you have centuries' worth of math to analyze and understand them. So you'll often find probability density functions in papers, and eventually they work out to really simple equations or formulas that end up in your code. Do you want to add anything, Tanishk?
I think that all sounds correct. I think you'll probably go over some examples of probability density functions, especially ones relevant to this paper. It's useful to think about the sorts of functions you may have in a simplified case, and that's probably what we're going to talk about next, right?
Yeah, that's exactly what we're talking about. So we have this q of x zero, and then we introduce another one. Like you said, this is going to turn out to have a really nice, simple form. But before that, the next thing we define is q of x t given x t minus one. We'll say what we define this to be, but to begin with, this is another probability density function. And this bar over here means it's a conditional probability density function, which you can think of as: you are given the thing on the right, and you calculate probabilities over the thing on the left. In this case, you can think of it as something that takes images, so maybe another magic API, and produces other images. But we don't know what these look like yet, because we haven't defined it. This one we would call x t minus one, which could be x zero, and this would be x t, which in the x zero case would be x one. Something worth noting is that this notation can be a little bit confusing, because we said q is one thing earlier, and now we're saying q is another thing. So this here, I'm going to need your help on this one, Tanishk. I think people would usually, in the strictest sense, define the first one like this, maybe, and the second one with a subscript. And the notation that we see here on the left is just a shortcut, where they wanted to save the space of writing that and implied it by what was in the parentheses. Is that true?
Yeah, I mean, I think here they use the letter q, and then of course later on we'll see p, to describe different aspects of the diffusion model, the different processes of the diffusion model, which we'll see. So they use the same letter to show that this corresponds to this process, and the other letter corresponds to the other process of the diffusion model. We'll obviously go over that. So I think that's why those letters are being used in that manner. But if you do want to make it more specific and more clear, I think that notation is fine as well.
Right. Okay, that makes sense. So let's describe what this q does to the image on the left to produce the one on the right. I'll start over here so we have more space. I'll write it out first, and then we can go into the details. Okay. So, kind of like the bar, you can think of this semicolon as grouping things together: you have the things on the left and the things on the right. My understanding is that the two things on the right are the parameters of the model, sorry, of the probability distribution. And the thing on the left is, actually, Tanishk, could you help me understand what the thing on the left is?
Right. Well, so this is again a probability distribution, and the thing on the left is saying that this is a probability distribution for this particular variable. So that's just representing what it is a probability distribution for. And then the stuff on the right are the parameters of this probability distribution. So anytime you have a normal distribution describing some variable, you'll have that sort of notation, where it's the normal distribution of some variable, and then these are the parameters that describe that normal distribution.
All right.
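For readers following without the video, the expression being written out at this point is presumably the forward transition from the 2015 paper. The sketch below reconstructs it from the pieces named later in the conversation (a mean of x t minus one times the square root of one minus beta, and a covariance of I times beta t). The superscript-in-parentheses typesetting is an assumption based on the "x superscript zero" mentioned earlier.

```latex
q\left(x^{(t)} \mid x^{(t-1)}\right)
  = \mathcal{N}\left(x^{(t)};\; x^{(t-1)}\sqrt{1-\beta_t},\; \mathbf{I}\beta_t\right)
```

Reading left to right inside the fancy N: the variable the distribution is over, then the semicolon, then the mean and the covariance.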
So just to clarify, the bit after the semicolon is the bit that we're used to seeing to describe a normal distribution, which is the mean and variance of the normal distribution. So we're going to be sampling random numbers from that normal distribution according to that mean and that variance. Is that right?
Yes, that's correct.
Yeah. So we need to say a bit more about the normal distribution. We kind of skipped past that. So we have this fancy N, and fancy letters in math usually refer to well-known distributions. The N here stands for normal, which is also known as a Gaussian distribution, and it's probably the most well-known probability distribution you can find. When I say well known, I mean that these things pop up everywhere. In all sorts of fields, measuring all sorts of things, it turns out that they follow roughly something that looks like this distribution. And because they pop up so much, people have studied all of their properties, and we understand them really well now. The reason they're often used in cases like this is that they have really useful properties and they're easy to work with. One reason is that they're described by just two parameters: the mean and the covariance. Another property is that they have what people would call thin tails, which means that you only need to describe their behavior in a small region of space. You can kind of just ignore the rest.
Do you mind drawing a quick example of a normal distribution?
That's a good point. Let's say our random variable is just one-dimensional, so just a single number, a float. This is sort of what the normal distribution would look like. And in this case, that would be our mean.
And the variance would describe the width over here, which in this case you'd write with a small sigma, because you're doing a single variable. In our case, we use a capital sigma, which is the symbol for multiple variables or multiple dimensions. I also didn't say that this is the Greek letter mu. So: capital sigma, mu, and lowercase sigma.
I just wanted to note that typically the lowercase sigma represents the standard deviation, which is the square root of the variance. So, for example, sometimes in papers you may see sigma squared, and that's just the variance. It depends on the notation, but sigma is often the standard deviation, and sigma squared would then be the variance.
Cool. We can also show with our example what this would look like. So we start out with an MNIST digit, put it through this magic API, and what would we get out?
Okay, so something we didn't describe is what this I means. Did you want me to talk about that, Wasim?
Ah, yes, please.
Okay, sure. Because I think this is something that actually, can I borrow your pen? It actually came up in the lesson we were doing, in an interesting way. So in that lesson...
Do you want to get in the video?
Oh, no, they know what I look like. Oh, well, okay, I'm in the video now. Hi, Tanishk, nice to see you. Yeah, so in the lesson, we did this thing for CLIP. I don't know if you remember, Wasim, where we had the various pictures down here. I'm so embarrassed, you're better at the graphics tablet than I am, and it's my graphics tablet. And we had the various sentences along here. Right. And we said, oh, it'd be kind of cool to take the dot product of their embeddings, because if their dot products are high, that means they're similar to each other.
And if we subtracted the means from those first, and instead of having images down here, what if we had the exact same vectors on each side? Then what you've got down here is basically x minus the average (if we subtract that first), squared. And that is the variance, right? So that's the variance for each one of these vectors. But what's interesting, as you pointed out, is that normally, at high school, when we look at a normal distribution, it looks like this. But you're not just doing one normal distribution. You've got a whole bunch of normal distributions for all of your different pixels. They're the pixels, right, Tanishk? A normal distribution for every pixel. So there's a whole bunch of them. One of them might have a normal distribution that's there, another one might have a normal distribution that's here, and another one might have a normal distribution that's here. And it's more than that, though, because it's possible that one pixel tends to be higher when another pixel tends to be higher, or one pixel tends to be higher when another pixel's lower. So it actually creates this surface in n-dimensional space, where n is the number of pixels. So now, what happens if we multiply this by this, just like we did in CLIP? If this number is high, it's saying that when this pixel's high, this pixel tends to be high, and vice versa. Or if it's low, it's saying that when this pixel tends to be high, this one tends to be low. Or, interesting to us, what happens (oopsie daisy, sorry about that) if this is zero? That says that if this is high, then this could be anything, and if this is high, this could be anything. There's no relationship between them.
So statistically, we would say that these two pixels are independent. And that basically means we could do that for all of these. We could say, oh, these are all zeros. What that says is that every pixel is independent of every other pixel. Now, of course, in real pictures, that's not how real pixels work. But that's the assumption we're making, because we start with a very special matrix called I, which is ones down the diagonal and zeros everywhere else. It's very special because if I multiply it by a matrix, I get back the original matrix. And if I multiply it by a scalar, say beta, I get beta, beta, beta down the diagonal and lots of zeros. So if I multiply something by this matrix, I'm just multiplying it by beta. What's interesting about this is that this is what Wasim wrote: I times beta t. So what he's saying is that we've now got a covariance matrix where, for each individual pixel (pixel number one, pixel number two, and so on), the diagonal gives the variance, beta. And the covariances, the relationships between the pixels, are zero: they're expected to be independent. So that's where we're going from the statistics you do in high school to the statistics you do at university: suddenly the covariance is a matrix, not an individual number. Does that sound about right to you, Tanishk?
Yeah, that's a great explanation of it.
Awesome. Cool. So now let's try to describe what this would do to MNIST digits. Let's put back our mean and our covariance, and let's look at how this behaves at the edges, sort of. It's really hard to understand this. I don't think anybody can just look at this and know what it means.
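To make the "I times beta" covariance concrete, here is a minimal sketch, assuming a toy two-pixel image and an illustrative beta of 0.25 (neither comes from the talk). The helper names `sample_pixels` and `covariance` are hypothetical. It draws many noise vectors whose pixels are independent Gaussians, then estimates the covariance matrix entries from the samples.

```python
import random

def sample_pixels(beta, n_pixels, rng):
    """Draw one noise vector with covariance beta * I: every pixel is an
    independent Gaussian with mean 0 and variance beta."""
    std = beta ** 0.5
    return [rng.gauss(0.0, std) for _ in range(n_pixels)]

def covariance(a, b):
    """Sample covariance between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

rng = random.Random(0)
beta = 0.25  # illustrative value, not from the talk
draws = [sample_pixels(beta, n_pixels=2, rng=rng) for _ in range(20000)]
pixel0 = [d[0] for d in draws]
pixel1 = [d[1] for d in draws]

var0 = covariance(pixel0, pixel0)   # diagonal entry: should be close to beta
cov01 = covariance(pixel0, pixel1)  # off-diagonal entry: should be close to 0
print(round(var0, 3), round(cov01, 3))
```

The diagonal estimate lands near beta and the off-diagonal estimate near zero, which is the "every pixel is independent of every other pixel" assumption in numbers.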
What we typically do is try to describe it at the edges. So we'll start with what happens if beta is zero. And we'll work with x zero instead of x t minus one, which would mean an MNIST digit. So if beta is zero, we get our x zero times the square root of one minus zero, which is one, and the square root of one is one, so that falls away. So we just have a mean of our previous image. And the covariance is just zero. So we have a normal distribution with a mean of our previous image and a variance of zero, which means we have the same image.
Yeah, just to clarify, when you have a variance of zero, that means there's really no noise. The distribution sits at that mean, and it's saying that's the only point you can get from it. So that's why it just becomes the same image: there's no noise, because the variance is zero.
Yeah, exactly. And then when our beta is one, we still have this, and then we have the square root of one minus one, and that becomes zero. So this whole thing becomes zero. And this thing becomes I times beta t, which is I. And if it's just I, then as Jeremy described, it would imply a variance of one. So our image through this function would just be pure noise: mean of zero, standard deviation of one. And somewhere in between, what would it produce? It would be some mixture. So maybe the lighter pixels of an eight and some noise, maybe a bit darker. And we can draw this, and you would have seen this in the previous lecture. You can draw the sequence of things that become progressively more noisy, in very small steps, all the way until it becomes pure noise. This is what we call the forward diffusion process. And we can now describe some of these things.
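The three cases just described (beta of zero, beta of one, and something in between) can be sketched in a few lines. This is a minimal illustration using only the standard library, assuming a toy four-pixel "image". The function name `forward_step` and the pixel values are hypothetical.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """Sample x_t from N(sqrt(1 - beta) * x_prev, beta * I), pixel by pixel."""
    keep = math.sqrt(1.0 - beta)   # how much of the previous image survives
    std = math.sqrt(beta)          # standard deviation of the added noise
    return [keep * x + std * rng.gauss(0.0, 1.0) for x in x_prev]

rng = random.Random(0)
x0 = [0.0, 0.5, 1.0, 0.5]  # toy "image" with four pixels (illustrative)

same = forward_step(x0, beta=0.0, rng=rng)   # variance 0: the image is unchanged
noise = forward_step(x0, beta=1.0, rng=rng)  # mean 0, variance 1: pure noise
mixed = forward_step(x0, beta=0.3, rng=rng)  # in between: faded image plus noise
print(same, noise, mixed)
```

With beta equal to one the multiplier on the previous image is exactly zero, so the output no longer depends on the input at all.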
So this would be a sample from our data distribution, q of x zero. This would be the conditional probability density function, so q of x one given x zero, and so on. And the terminology that mathematicians use to describe this is a Markov process with Gaussian transitions. This can sound quite scary, but we've just described exactly what this is. When we say process, it usually means something where there's a sequence involved. When we say Markov, it means that the thing at time t depends only on the thing at t minus one. The transition is this function: how do you actually go from t minus one to t? And Gaussian is the fact that that transition is a normal distribution. Does that sound right?
Yes. Just to also clarify a couple of things. When we say that we're sampling from the data distribution, what that refers to is trying to find some random sample, some random data point, that has a high likelihood. So we're looking at that magic API, as we were talking about, and we're trying to get some data points that have a high value from that API. For some distributions, when it's very simple and we know how it works, like a Gaussian distribution where we know the parameters, it's very easy to do that sampling. And in other cases it's quite difficult, so we have to figure out alternative ways of doing that sampling. But that's why, in this case, with the forward process, we just have these simple Gaussian transitions. We already know the parameters of those Gaussian transitions, so we can easily do that sampling.
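As a rough illustration of a Markov process with Gaussian transitions, the sketch below chains the same transition many times, feeding each output back in as the next input. The constant beta of 0.01, the 1000 steps, and the 500-pixel toy image are illustrative assumptions, and `run_forward_process` is a hypothetical helper name.

```python
import math
import random

def run_forward_process(x0, betas, rng):
    """Apply the Gaussian transition once per beta. Each step looks only at
    the previous step's output, which is the Markov property."""
    x = list(x0)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)
        std = math.sqrt(beta)
        x = [keep * v + std * rng.gauss(0.0, 1.0) for v in x]
    return x

rng = random.Random(0)
x0 = [3.0] * 500        # toy "image": 500 pixels, all far from zero
betas = [0.01] * 1000   # many small steps (illustrative schedule)
x_final = run_forward_process(x0, betas, rng)

mean = sum(x_final) / len(x_final)
var = sum((v - mean) ** 2 for v in x_final) / (len(x_final) - 1)
print(round(mean, 2), round(var, 2))  # roughly 0 and 1: close to pure noise
```

Even though every pixel started at 3.0, after enough small steps the sample statistics look like those of unit Gaussian noise.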
And going back to that, I think it's worthwhile to also show and think about how this is done practically. One of the nice properties of Gaussian distributions as a whole is that you can simply take some normal noise with a mean of zero and a variance of one. I think they typically call that a unit normal distribution, just normal of zero, one. And then if you want to get to some other distribution, with whatever mean you specify and whatever variance you specify, you can simply take a sample from that unit normal, multiply it by the standard deviation (the square root of the variance), and then add your mean. So there's a simple equation that you can use to get any particular mean and variance. That's how you would get the samples for these other distributions that we have defined throughout the forward process. So, for example, when you're coding this up, a lot of software libraries have a way of getting a sample from this normal distribution of zero, one, and then you just use that equation to get it to the desired mean and variance. That's how it happens under the hood when you implement this in code.
That's really helpful. Yeah. And this idea that we can't really sample from this thing, that's exactly the problem that generative modeling is trying to solve: how do you represent this in such a way that you can easily sample from it? And it turns out that if you have one of these processes with many, many steps, let's say a thousand of these steps going to the right, and they're all very small steps that eventually go to noise, then somebody, maybe in the 1950s, I think, discovered that you can represent the process of going backwards in exactly the same functional form, with just different parameters.
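The "simple equation" for turning unit normal noise into a sample with any mean and variance can be sketched as follows. This is a minimal standard-library example. The function name `sample_normal` and the target mean and variance are illustrative assumptions. Note that the unit noise is scaled by the standard deviation, which is the square root of the variance.

```python
import random

def sample_normal(mean, variance, rng):
    """Shift and scale a unit normal sample to the requested mean and variance."""
    eps = rng.gauss(0.0, 1.0)   # sample from N(0, 1)
    std = variance ** 0.5       # scale factor is the standard deviation
    return mean + std * eps

rng = random.Random(0)
samples = [sample_normal(mean=2.0, variance=9.0, rng=rng) for _ in range(50000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / (len(samples) - 1)
print(round(sample_mean, 2), round(sample_var, 2))  # close to 2 and 9
```

Array libraries typically expose the same idea as a function that returns unit normal noise with a given shape, which is then scaled and shifted in one vectorized expression.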
So what that means is, if we say p is the thing that goes backwards, so the previous one given the current one, this p has the same functional form. So the transitions are also normal. But the mean is some unknown, so we'll use a square, and the variance is some unknown, so we'll use a triangle. Is that correct?
Yeah, that's correct. And just going back to our previous point about p versus q, here we can see that q was describing the forward process, the steps that we're doing going forward, and p is describing what we're doing in the reverse direction. So that's why these papers use q for one process and p for the other. That's what they're indicating, at least in the diffusion model literature.
Mm-hmm. And p is kind of like x: it's the one we want to figure out. So q is kind of like y, and p is kind of like x. That's how I like to think of it. So we have this functional form, and the next question is how we can use it, because we just don't know what these parameters are. How can we figure out what they are? This goes back to early statistics literature, where you can fit a model like this by maximizing what's called the likelihood function. So we can try different parameters until we find the ones that maximize the likelihood. It turns out that we can't quite do this exactly, because you would need to calculate an integral, and that integral is over very high-dimensional, continuous values. So you can't actually calculate it.
I think you can think of it this way: we have these thousands of steps that we're trying to go through in the reverse process, and there are going to be many possible values for each step. So it's hard to evaluate it over all these thousands of steps and all the possible values for all these different steps.
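Written out, the reverse transition and the integral that makes the likelihood hard might look like the sketch below. The symbols are assumptions for illustration: the subscript theta marks learned parameters, mu and Sigma stand in for the "square" and "triangle" drawn in the talk, and T is the total number of steps.

```latex
% Reverse transition: same Gaussian form, unknown mean and covariance
p\left(x^{(t-1)} \mid x^{(t)}\right)
  = \mathcal{N}\left(x^{(t-1)};\; \mu_\theta\left(x^{(t)}, t\right),\; \Sigma_\theta\left(x^{(t)}, t\right)\right)

% Likelihood of one data point: integrate over every intermediate step
p\left(x^{(0)}\right)
  = \int p\left(x^{(T)}\right) \prod_{t=1}^{T} p\left(x^{(t-1)} \mid x^{(t)}\right)
    \, dx^{(1)} \cdots dx^{(T)}
```

Each of the T integration variables is a whole image, which is why this integral cannot be evaluated directly.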
So I think that's where the challenge arises. That's what makes it difficult: you have to evaluate it over these multiple steps and try to find these functions for all these different steps.
And so you might see people talk not about the likelihood function, but about the log likelihood. Correct me if I'm wrong here, Tanishk, but I think the log here is almost a bit of a computational trick. It has a few properties. The first is that it's always increasing, which people would call monotonic. And because it's always increasing, you get the same parameters whether you optimize the log likelihood or the likelihood. It also turns products into sums, and that's helpful because we have joint distributions, which turn out to be products. So we have a lot of products here, and they become sums, which are easier to work with. And the last thing is that this normal distribution has exponential functions in it, and those disappear with the log. So this is a much friendlier thing to optimize.
Yep, that's correct.
Cool. And then there's one more step. We still can't optimize the log likelihood of the thing that this eventually describes. But again, and this is kind of the beauty of math, somebody figured out a long time ago that there's a way to optimize some other quantity, called the ELBO for short, which stands for evidence lower bound. The evidence is just another name for the likelihood, and the lower bound means it's a lower bound on the evidence. If you optimize that, it's almost as good as optimizing the thing that we really want to. But this one we can calculate very, very easily.
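As a sketch of the two ideas above, the first line below shows the log turning a product of transitions into a sum, and the second shows the usual textbook form of the evidence lower bound for this kind of model. The notation (T for the number of steps, the 0:T shorthand for a whole trajectory) is an assumption, not something written in the talk.

```latex
% The log turns a product over steps into a sum
\log \prod_{t=1}^{T} p\left(x^{(t-1)} \mid x^{(t)}\right)
  = \sum_{t=1}^{T} \log p\left(x^{(t-1)} \mid x^{(t)}\right)

% Evidence lower bound (ELBO) on the log likelihood
\log p\left(x^{(0)}\right) \;\ge\;
  \mathbb{E}_{q\left(x^{(1:T)} \mid x^{(0)}\right)}
  \left[ \log \frac{p\left(x^{(0:T)}\right)}{q\left(x^{(1:T)} \mid x^{(0)}\right)} \right]
```

The right-hand side only needs samples from the forward process q, which is easy to sample from, rather than the intractable integral over all intermediate steps.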
And so you can use this as a loss function to train two neural networks that predict our square from earlier, which was our mean, and our triangle, which is our variance, of this reverse process. Once you have that, you go all the way back here. So then you have these values. You can start with pure noise and keep calling these neural networks, sampling from those normal distributions, applying that iteratively over many steps, and you recover the data distribution. One thing that's important to clarify here is that you can recover the whole distribution, but you can't necessarily take a single image, convert it to pure noise, and then convert it back. So this operates at the distribution level. You can take this magic API and reconstruct that whole API. And if you can do that, then you can generate images: digits, or cats, or dogs, or whatever you want.
I just want to clarify one thing about the loss function. With this evidence lower bound loss function, the approach is that we have this forward process: we can go from the original images and figure out these intermediate distributions, going all the way to noise. What we're really doing with the evidence lower bound is trying to match the distributions that we're optimizing to those distributions that we saw in the forward process. And there's a specific type of function that is able to do that. It's called a KL divergence. That's the sort of function that can compare probability distributions. And again, because we're dealing with Gaussians, you can calculate that analytically, and a lot of the math becomes very simple. So that's again the thing with Gaussians: we know them quite well and the math is very simple.
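To show what "you can calculate that analytically" means, here is a minimal sketch of the closed-form KL divergence between two one-dimensional Gaussians. The function name `kl_gaussians` and the example parameters are illustrative assumptions. The diffusion loss applies the same idea to the multi-dimensional Gaussians of the forward and reverse processes.

```python
import math

def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for two one-dimensional Gaussians, in closed form."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

identical = kl_gaussians(0.0, 1.0, 0.0, 1.0)  # same distribution on both sides
shifted = kl_gaussians(0.0, 1.0, 1.0, 1.0)    # means one unit apart, same width
print(identical, shifted)
```

No sampling or numerical integration is needed: the divergence is zero when the two distributions match and grows as the means or widths drift apart, which is what makes it usable as a training signal.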
So that allows us to do this comparison between these distributions very easily and optimize it. We want to minimize the difference between the distributions we see in the forward process and the distributions we finally determine for the reverse process.
Perfect. Then there's one more major step, I think, to get closer to the form that you would have seen in Jeremy's lesson. There was a 2020 paper, and the initials of that model are DDPM. Tanishk, do you know what this stands for?
Yeah, it stands for denoising diffusion probabilistic model.
Okay, cool. And what they did was they said, let's assume that this variance is just a constant, so we don't learn it. And we also assume that the step size from earlier, the variance of the noise that we added at each step, is a constant. We don't learn that either. We're just predicting the mean, and these are set to some really convenient values. Then the loss turns out to be that you predict the noise. So you can restructure this whole thing: you need to train a network that takes in images, so here's your network, and it tells you what part of this image is noise, thanks to these simplifying assumptions. And even with these assumptions, it turns out you can train models that produce much better images. Now, I think this relates to something from the lesson that Jeremy gave. Tanishk, do you remember that there was something about the gradient or something like that?
Yes, yes. So this is the idea of adding noise and learning to remove noise. Like Jeremy was saying in his lesson, what we want to do is figure out the gradient of this likelihood function.
So this is just a different way of thinking about this. If we had some information about this gradient, then we could, for example, use that information to do the kind of optimization we talked about and produce images with high likelihood. So the idea is that we can add noise to the images that we have. Those are samples that we have, and adding noise takes us away from the regular images and decreases the likelihood. And we want to learn how to get back to high-likelihood images, and use that to provide some sort of estimate of our gradient. This denoising process actually allows us to do that. There are actually theorems, I think from the 1950s, that demonstrate that, especially in the case of the Gaussian noise that we're working with, this denoising process is equivalent to learning what is known as the score function. The score function is the gradient of the log of the likelihood. So again, they have this log here, which makes the math nicer and easier to work with. But the general idea is the same because, as we talked about, log is a monotonic function. The score function specifically refers to the gradient of the log likelihood, and this denoising process allows us to learn it. So that's what we're doing with noise prediction. We had this whole probabilistic, likelihood-based framework, and it came back down to just predicting the noise. That's what the DDPM paper showed in 2020. But it turns out that this is equivalent to calculating the score function and using that information to sample from our distribution. So that's how these two approaches connect.
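A small numerical check can make the link between "predicting the noise" and "the gradient of the log likelihood" more tangible. The sketch below is a simplification under stated assumptions: it looks at a single pixel with a known clean value and a known noise level, so the distribution in question is the Gaussian used to add the noise, not the full data distribution. All names and values are illustrative.

```python
import math
import random

def gaussian_log_density(x, mean, std):
    """Log of the one-dimensional normal density N(x; mean, std^2)."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def score(x, mean, std):
    """Analytic gradient of the log density with respect to x."""
    return -(x - mean) / std ** 2

rng = random.Random(0)
clean, std = 0.7, 0.4            # clean pixel value and noise level (illustrative)
eps = rng.gauss(0.0, 1.0)        # the noise that was added
x_noisy = clean + std * eps      # the noisy pixel

# Finite-difference gradient of the log density at the noisy pixel.
h = 1e-5
numeric = (gaussian_log_density(x_noisy + h, clean, std)
           - gaussian_log_density(x_noisy - h, clean, std)) / (2.0 * h)
analytic = score(x_noisy, clean, std)
from_noise = -eps / std          # the same quantity written in terms of the noise
print(numeric, analytic, from_noise)
```

All three numbers agree, so in this simplified setting knowing the noise is the same as knowing the score up to a factor of minus one over the noise level. That is the intuition behind why a noise-predicting network can stand in for a score estimate.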
So there's a lot of literature talking about maybe that sort of probabilistic likelihood perspective of diffusion models. And there's also a lot of literature talking about this score-based perspective. But, you know, this hopefully allows you to think about the similarities and how these two approaches connect with each other. Yeah. Awesome. Yeah. And that's kind of the, you know, the beauty, I think, of the math side of things here: you find all of these relationships between different fields and also between different centuries, basically, and that allows you to do really kind of powerful and unexpected things. Okay, so we can just do a quick recap of where we got to. So we started out with our data distribution, which we want to model. We said, you know, we'll define this forward diffusion process, which is a way of kind of adding noise to this data. And because we added it in this specific way, thanks to, you know, some discovery in the 1950s, the reverse process has the same form. And then, you know, we already know how to train a neural network for this using the ELBO. And then a couple of years later came the discovery that, you know, with some simplifying assumptions, in the end all we do is predict the noise. And I just remembered, we actually take the MSE of this noise prediction, the mean squared error, which is a nice, very simple framing of the model. And then Tanishk spoke about another way to derive all of this, which is the score function approach, the gradient of the log likelihood. Okay, cool. Yeah, I highly recommend checking out the course lesson as well if you haven't. You know, if you don't understand this, there's no need to be intimidated. You can still be very effective without ever using math. You can be very effective at deep learning, as Fast AI has shown us. And you can do novel research as well. For me, it's interesting. And, you know, it's even beautiful in a way. So I recommend checking it out. But don't feel intimidated.
You can find the course lesson links in the Fast AI forum. We'll add those links as well in the description of this video. We'll also have a topic in the forum for this lesson. You can have discussions there, post any comments, add any, you know, relevant links to the math. And then we have another lesson, you know, a video by Jono, which I really recommend checking out. He's a, you know, he's a great teacher. And I think he was the first person to do a full course on Stable Diffusion. Yeah, Jono's video is kind of a deep dive into some of the code a little bit more and into some of the concepts a little bit more. So I feel like between these three videos, it's a good overview. You know, I think, I mean, just to clarify, you don't need to understand all the math that was described in this video. That's not to say you won't need to understand math. We'll be covering lots of math in these lessons. But we'll be covering just the math you need to understand and build on the code. And we'll be covering it over many, many more hours than this rather rapid overview. Perfect. Cool. And yeah, thank you so much, Tanishk. I had a lot of fun. And thank you so much, Wasim. That was awesome. Awesome. Cool. Bye-bye.