Okay, hi everybody. This is lesson 19, with extremely special guests Tanishk and Jono. Hi guys, how are you? Hello. Hey Jeremy, good to be here. And it's New Year's Eve 2022, finishing off 2022 with a bang, or at least a really cool lesson. Most of this lesson is going to be Tanishk and Jono, but I'm going to start with a quick update from the last lesson. What I wanted to show you is that Christopher Thomas on the forum came up with a better winning result for our challenge, the Fashion-MNIST challenge, which we are tracking here. And be sure to check out this forum thread for the latest results. He found that he was able to get better results with Dropout. Then Peter on the forum noticed I had a bug in my code. The bug in my code for ResNets, actually I won't show you, I'll just tell you, is that in the ResBlock I was not passing along the batch norm parameter. And as a result, all the results I had were without batch norm. So then when I fixed batch norm and added Dropout at Christopher's suggestion, I got better results still. And then Christopher came up with a better Dropout and got better results still for 50 epochs. So let me show you the improvement: 93.2% for 5 epochs. I won't show the change to batch norm because that'll just be in the repo now. So the batch norm is already fixed. So I'm going to tell you about what Dropout is and then show that to you. Dropout is a simple but powerful idea: with some particular probability, so here let's say a probability of 0.1, we randomly delete some activations. And when I say delete, what I actually mean is we change them to 0. So one easy way to do this is to create a binomial distribution object where the probability is 1 minus p, and then sample from that. And that will give you zeros with probability 0.1. So in this case, oh, this is perfect, I have exactly one 0. Of course, randomly, that's not always going to be the case.
But since I asked for 10 samples, and 0.1 of the time they should be 0, I so happened to get exactly one of them. And so if we took a tensor like this and multiplied it by our activations, that would set about a tenth of them to 0, because multiplying by 0 gives you 0. So here's a Dropout class. You pass in what probability of Dropout there is, and store it away. Now we're only going to do this during training time. So at evaluation time, we're not going to randomly delete activations. But during training time, we will create our binomial distribution object. We pass in the 1 minus p probability, and then you say how many binomial trials you want to run each time, so how many coin tosses or dice rolls or whatever, and it's just one. And this is a cool little trick: if you put that one onto your accelerator, you know, GPU or MPS or whatever, it's actually going to create a binomial distribution that runs on the GPU. That's a really cool trick that not many people know about. And so then if I sample and make a sample exactly the same size as my input, that's going to give me a bunch of ones and zeros in a tensor the same size as my activations. And then another cool trick: this is going to result in activations that are on average about one tenth smaller. So if I multiply by 1 over 1 minus p, in this case 1 over 0.9, then that's going to scale up my activations to undo that difference. Jeremy? Yeah. In the line above where you have probs equals 1 minus p, should that be 1 minus self dot p? Oh, it absolutely should. Thank you very much, Jono. Not that it matters too much because, yeah, you can always just use nn.Dropout at this point. And I've only used 0.1, which is why I didn't even see that. So as you can see, I'm not even bothering to explore this, because I'm just showing how to repeat what's already available in PyTorch. So yeah, thanks, Jono. That's a good fix.
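To make the mechanics concrete, here is a minimal plain-Python sketch of the same idea. The class in the lesson uses torch.distributions.Binomial so it can run on the GPU; this hypothetical `dropout` function just uses the stdlib `random` module to show the zeroing and rescaling:

```python
import random

def dropout(acts, p=0.1, training=True):
    # Identity at evaluation time, or when p is 0
    if not training or p == 0:
        return list(acts)
    scale = 1 / (1 - p)  # scale survivors up to undo the average shrinkage
    # Zero each activation with probability p, otherwise keep it, scaled
    return [0.0 if random.random() < p else a * scale for a in acts]

print(dropout([1.0] * 10, p=0.1, training=False))  # unchanged at eval time
```

With p=0.1, roughly a tenth of the values come back as 0.0 and the survivors come back as 1/0.9 times their original value, so the mean is preserved on average.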
Yeah, so if we're in evaluation mode, it's just going to return the original. If p equals 0, then these are all going to be ones anyway, so we'll be multiplying by 1 divided by 1, and there's nothing to change; dropout with p of 0 does nothing, in effect. Yeah, and otherwise it's going to zero out some of the activations. So a pretty common place to add dropout is before your last linear layer, and that's what I've done here. So yeah, if I run the exact same epochs, I get 93.2, which is a very slight improvement. And the reason for that is that it's not going to be able to memorize the data or the activations, you know, because there's a little bit of randomness. So it's going to force it to try to identify just the actual underlying differences. There are a lot of different ways of thinking about this. You can almost think of it as a bagging thing, a bit like a random forest: each time it's seeing a slightly different random subset. Yeah, but that's what it does. I also added a Dropout2d layer right at the start, which is not particularly common; I was just kind of showing it. This is also how Christopher Thomas's model trained as well, although he didn't use Dropout2d. What's the difference between Dropout2d and Dropout? So this is actually something I'd like you to do yourself as an exercise: implement Dropout2d. The difference is that with Dropout2d, rather than using x.size() as the size of our tensor of 1s and 0s, in other words potentially dropping out every single batch, every single channel, every single x-y position independently, instead we want to drop out an entire grid area, a whole channel at a time. So if any value in a channel is zero, they're all zero. You can look up the docs for Dropout2d for more details about exactly what that looks like. But yeah, so the exercise is to try to implement that from scratch, and come up with a way to test it.
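As a starting point for that testing exercise, here is one hedged idea: a plain-Python check, written for a hypothetical nested-list [batch][channel][h][w] layout rather than real tensors, that a channel-wise dropout should pass and an element-wise dropout will generally fail:

```python
def channels_all_or_nothing(x):
    # For a Dropout2d-style layer, every channel should be either
    # entirely zeroed or entirely kept, never partially zeroed.
    for img in x:                       # loop over the batch
        for ch in img:                  # loop over channels
            vals = [v for row in ch for v in row]
            n_zero = sum(v == 0.0 for v in vals)
            if n_zero not in (0, len(vals)):
                return False
    return True

# A 1-image, 2-channel example: one channel fully dropped, one fully kept
ok = [[[[0.0, 0.0], [0.0, 0.0]], [[1.1, 1.1], [1.1, 1.1]]]]
# A channel with only one value zeroed: element-wise dropout, not 2d
bad = [[[[0.0, 1.1], [1.1, 1.1]], [[1.1, 1.1], [1.1, 1.1]]]]
print(channels_all_or_nothing(ok), channels_all_or_nothing(bad))  # True False
```

Running your Dropout2d implementation many times and asserting this property holds (and that the fraction of dropped channels is close to p) is one reasonable test.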
So, like, actually check that it's working correctly, because it's a very easy thing to think that it's working and then realize it's not. So then, yeah, Christopher Thomas actually found that if you remove this entirely and only keep this, then you end up with a better result for 50 epochs. And so he's the first to break 95%. So I feel like we should insert some kind of animation or trumpet sounds or something at this point. I'm not sure if I'm clever enough to do that in the video editor, but I'll see. Okay, so that's about it for me. Did you guys have any other things to add about dropout, how to understand it or what it does or interesting things? Oh, I did have one more thing, but you go ahead if you've got anything to mention. So I was going to ask, because I think the standard is to remove the dropout before you do inference, but I was wondering if there's anyone you know of, or if it works, to use it for some sort of test time augmentation. Oh, thank you, because I wrote a callback for that. Did you see this, or... Okay, so this is a test time dropout callback. Nice. So yeah, before each epoch, if you remember, in Learner we put the model into training mode, and what that actually does is it puts every individual layer into training mode. That's why, for the module itself, we can check whether that module is in training mode. So what we can actually do is, after that's happened, we can go back in this callback and apply a lambda that says: if this is a Dropout, then put it in training mode all the time, including at evaluation. And so then you can run it multiple times, just like we did for TTA, with this callback. Now, that's very unlikely to give you a better result, because it's not showing it different versions or anything like that, like TTA does, that are kind of meant to be the same.
But what it does do is it gives you a sense of how confident it is. If it kind of has no idea, then that little bit of dropout is quite often going to lead to different predictions. So this is a way of doing some kind of confidence measure. You'd have to calibrate it by looking at things that it should be confident about and not confident about, and seeing how that test time dropout changes the predictions. But the basic idea has been used in medical models before. I wouldn't say it's totally popular, which is why I didn't even bother to show it being used. But I just wanted to add it here because I think it's an interesting idea and maybe could be more used than it is, or at least more studied than it has been. A lot of stuff that gets used in the medical world is less well known in the rest of the world, so maybe that's part of the problem. Cool. All right. So I will stop my sharing, and we're going to switch to Tanishk, who's going to do something much more exciting, which is to show that we are now at a point where we can do DDPM from scratch, or at least everything except the model. And so to remind you, DDPM doesn't have the latent VAE thing, and we're not going to do conditional. So we're not going to get to tell it what to draw. And the UNet model itself is the one bit we're not going to do today; we're going to do that next lesson. But other than the UNet, it's going to be unconditional DDPM from scratch. So, Tanishk, take it away. Okay. Hi. Welcome back. Sorry for the slight continuity problem. You may notice people look a little bit different. That's because we had some Zoom issues, so a couple of days have passed and we're back again. And Jono recorded his bit before we did Tanishk's bit, and we're going to put them in in the opposite order. So hopefully there's not too many confusing continuity problems as a result, and it all goes smoothly. But it's time to turn it over to Tanishk to talk about DDPM.
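The test time dropout trick being described could be sketched as follows. This is not the exact callback from the lesson's repo, just a minimal stand-in with a hypothetical class name, showing the `apply` of a lambda that flips only Dropout layers back into training mode after the model has been put into eval mode:

```python
import torch.nn as nn

class TestTimeDropout:
    # Sketch of a callback that keeps Dropout stochastic at eval time,
    # so repeated predictions give a rough confidence signal.
    def __init__(self, model):
        self.model = model

    def before_epoch(self):
        # Re-enable train mode for Dropout layers only; everything else
        # (batch norm, etc.) stays in whatever mode the Learner set.
        self.model.apply(lambda m: m.train() if isinstance(m, nn.Dropout) else None)

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
model.eval()                            # every layer in eval mode
TestTimeDropout(model).before_epoch()
print(model[0].training, model[1].training)  # False True
```

Running the model several times on the same input then gives varying predictions, and the spread of those predictions is the crude confidence measure being discussed.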
So we've reached a point where we have this miniai framework, and I guess it's time to now start using it to build more sophisticated models. As we'll see here, we can start putting together a diffusion model from scratch using the miniai library, and we'll see how it makes our life a lot easier. And also it'll be very nice to see how the equations in the papers correspond to the code. So I have here, of course, the notebook that we'll be working from; the paper, Denoising Diffusion Probabilistic Models, which was published in 2020. It was one of the original diffusion model papers that set off the entire trend of diffusion models, and it's a good starting point as we delve into this topic further. And I also have some diagrams and drawings that I will show later on. But yeah, basically, let's just get started with the code here, and of course the paper. So just to provide some context: this paper was published by a group at UC Berkeley. I think a few of them have gone on now to work at Google. And this is a big lab at UC Berkeley. Diffusion models were actually originally introduced in 2015, but this paper in 2020 greatly simplified diffusion models and made them a lot easier to work with, and got these amazing results, as you can see here, when they trained on faces and, in this case, CIFAR-10. And this really was a big leap in terms of the progress of diffusion models. And so just to briefly provide an overview... If I could just quickly step in and mention something first, which is, you know, when we started this course, we talked a bit about how perhaps the diffusion part of diffusion models is not actually all that important.
Everybody's been talking about diffusion models, particularly because that's the open source thing we have that works really well. But this week, actually, a model that appears to be quite a lot better than Stable Diffusion was released that doesn't use diffusion at all. Having said that, the basic ideas, like most of the stuff that Tanishk talks about today, will still appear in some kind of form, you know, but a lot of the details will be different. But strictly speaking, I don't even know if we've got a word anymore for the kind of modern generative model things we're doing. So in some ways, when we're talking about diffusion models, you should maybe replace it in your head with some other word which is more general and includes this paper that Tanishk is looking at here. Iterative refinement, perhaps? Yeah, that's not bad. Iterative refinement. I'm sure by the time people watch this video, somebody will have decided on something. We will keep our course website up to date. Yeah, this is the paper that Jeremy was talking about. And yeah, every week there seems to be another state of the art model. But yeah, like Jeremy said, a lot of the principles are the same, but the details can be different for each paper. And like Jeremy was saying, I just want to zoom back a little bit and talk about what we're trying to do here, just to provide a review. Right? So let me just, yeah. So with this task, we're trying to, in this case, do image generation. Of course, it could be other forms of generation, like text generation or whatever. And the general idea is that we have some data points; in this case, we have some images of dogs. And we want to produce more like the data points that we're given. So in this case, maybe dog image generation or something like this.
And so the overall idea that a lot of these approaches take for some sort of generative modeling task is that they try to model p of x, which is basically the likelihood of a data point x. So let's say x is some image. Then p of x tells us, like, what is the probability that you would see that image in real life? And we can take a simpler example, which may be easier to think about, of a one-dimensional data point, like height, for example. With height, of course, we have a data distribution that's kind of a bell curve. You have some mean height, which is something like 5 foot 9 or 5 foot 10. And then, of course, you have some more unlikely points that are still possible, like, for example, someone who's 7 feet, or on the other end, someone who's around 3 feet. So here the x-axis is height, and the y-axis is the probability of some random person you meet being that tall. Exactly. So, yeah, this is basically the probability. And so, of course, you have this sort of peak, which is where you have higher probability, and those are the sorts of values that you would see more often. So this is what we would call our p of x. And the important part about p of x is that you can use it to sample new values, if you know what p of x is, or if you have some sort of information about p of x.
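As a tiny illustration of sampling from a bell-curve p(x) rather than uniformly, here is a sketch using only the stdlib `random` module. The mean of 69 inches and spread of 3 inches are made-up numbers for the example:

```python
import random

random.seed(42)
# Draw 10,000 heights from an assumed bell curve p(x): mean 69", sd 3"
heights = [random.gauss(69, 3) for _ in range(10_000)]

near_mean = sum(66 <= h <= 72 for h in heights) / len(heights)   # within 1 sd
extreme = sum(h < 60 or h > 78 for h in heights) / len(heights)  # beyond 3 sd
# Values near the mean dominate, and 3-sd extremes are vanishingly rare,
# which is exactly what uniform sampling between 3 and 7 feet would miss.
print(near_mean, extreme)
```

Roughly 68% of samples land within one standard deviation of the mean and well under 1% land beyond three, which is the shape of the curve being drawn on screen.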
So for example, let's say you have some game with human characters in it, and you just want to randomly generate a height for a character. You wouldn't want to just select a random height between 3 and 7 feet, uniformly distributed. You would instead want the height to depend on this sort of function, where you would more likely sample values in the middle and less likely sample these sorts of extreme points. So it's dependent on this function p of x. So having some information about p of x will allow you to sample more data points. And that's kind of the overall goal of generative modeling: to get some information about p of x that then allows us to sample new points and create new generations. So that's a high level description of what we're trying to do when we're doing generative modeling. And of course, there are many different approaches. We have our famous GANs, which used to be the common method back in the day before diffusion models. We have VAEs, which I think we'll probably talk a little bit more about later as well. We'll be talking about both of those techniques later here. Yeah. And there are many other techniques, including some niche techniques that are out there as well. But of course, the popular ones now are these diffusion models, or, as we talked about, maybe a better term might be iterative refinement, or whatever the term ends up being. But yeah, so there are many different techniques. So this is kind of the general diagram that shows what diffusion models are. And if we look at the paper here, let's pull up the paper. You see here, this is what they call a directed graphical model.
It's a very complicated term, but it's just showing what's going on in this process. And there's a lot of complicated math here, but we'll highlight some of the key variables and equations. So basically the idea is, okay, let's see here. So this is an image that we want to generate, right? And x0, x zero, is basically... these are actually the samples that we want. So x0 is what we want to generate, and these would be, yeah, these are our images. And we start out with pure noise: that's x, uppercase T. Pure noise. And the whole idea is that we have two processes. We have this process where we're going from pure noise to our image, and we have the opposite one, from our image to pure noise. So the process where we're going from our image to pure noise, this is called the forward process. I'm typing instead of handwriting, which is probably a good thing given my handwriting. So hopefully it's clear enough; let me know if it's not. So we have the forward process, which is mostly just used for training. Then we also have our reverse process. So this is the reverse process. Should I write it up here? Reverse process. So this is a bit of a summary, I guess, of what you and Waseem talked about in lesson 9b. Yes. And it's mostly to highlight what the different variables are, so that as we look at the code we can see the different variables in the code. Okay. So we'll be focusing today on the code, but the code will be referring to things by name, and those names won't make sense very much unless we see what they're used for in the math. Okay. I won't dive too much into the math; I just want to focus on these sorts of variables and equations that we see in the code. So basically the general idea is that we do this in multiple different steps.
You know, we have here from time step zero all the way to time step uppercase T. So there's some fixed number of steps. But then we have this intermediate process where we're going from some particular time step... yeah, we have this time step, lowercase t, which is some noisy image, and we're transitioning between these two different noisy images. So this is sometimes called the transition kernel. Basically, it's just telling us how we go from one to the other; in this direction, we're going from a less noisy image to a more noisy image, and going backwards is going from a more noisy image to a less noisy image. So let's look at the equations here. So the forward direction is trivially easy: to make something more noisy, just add a bit more noise to it. And the reverse direction is incredibly difficult. In particular, going all the way from pure noise back to the image is strictly speaking impossible, because none of that person's face exists anymore; but somewhere in between, you could certainly go from something that's partially noisy to less noisy with a learned model. Exactly. And that's what I'm going to write down right now, in terms of the symbols in the math. So yeah, I'm just trying to pull out and write down the equations here. Let me zoom in a bit. So we have q of x t given x t minus one. Or actually, you know what, maybe it's just better if I just snip it from here. So the one describing our forward process is this equation here. I'll just make that a little smaller for you guys, just so, right there.
And basically, to explain: we have this sort of script N, which is maybe a little bit confusing notation, but this is referring to a normal distribution, or Gaussian distribution. And this is just saying: okay, this is a Gaussian distribution that's describing this particular variable. So N is our normal or Gaussian distribution, and it's representing this variable x t. And then this here is the mean, and this is the variance. So just to clarify again, I think we've talked about this before as well: this is, of course, a bad drawing of a Gaussian, but our mean is just the middle point here, and the variance just describes the sort of spread of the Gaussian distribution. So if you think about this a little further, you have this beta, which is one of the important variables that describe the diffusion process: beta t. You'll see beta t in the code. And basically, beta t increases as t increases. So your beta t will be greater than your beta t minus one. So if you think about that a little bit more carefully, you can see that, okay, at t minus one, at this time point here, and then going to the next time point, your beta t increases, so you're increasing the variance. But then you have this one minus beta t, and you take the square root of that and multiply it by x t minus one. So as your t is increasing, this term actually decreases. So your mean is actually decreasing, and you're getting less of the original image, because the original image is going to be part of x t minus one. Just to let you know, we can't see your pointer, so if you want to point at things, you would need to highlight them or something. Yeah, so I'll just, let's see, yeah.
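Written out, the transition kernel being described here is, in the paper's notation:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
```

So the mean shrinks the previous image by a factor of the square root of one minus beta t, while beta t sets the variance of the added noise.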
Or, yeah, basically, I wasn't pointing at anything in particular. I was just saying that if we have our x t here, as the time step increases, you're getting less contribution from your x t minus one. So that means your mean is going towards zero. And the variance keeps increasing, so you basically just have a Gaussian distribution, and you lose any contribution from the original image as your time step increases. So that's why, when we start out from x zero and go all the way to our x capital T here, this becomes pure noise: because we're doing this iterative process where we keep adding noise, we lose the contribution from the original image, and that leads to the image being pure noise at the end of the process. So just something I find useful here is to consider one extreme, which is to consider x one. At x one, the mean is going to be root one minus beta one times x naught. The reason that's interesting is that x naught is the original image. So we're taking the original image, and at this point one minus beta one will be pretty close to one. So at x one, we're going to have something whose mean is very close to the image, and whose variance will be very small. And so that's why we will have an image that just has a tiny bit of noise. Right, right. And then another thing: sometimes it's easier to write out q of x t directly, because q of x t is only dependent on x t minus one, and then x t minus one is only dependent on x t minus two; each of these steps only depends on the previous one. So based on the different laws of probability, you can get your q of x t given x naught in closed form. So yeah, that's what's shown here: x t given x naught, the original image.
So this is another way of seeing this more clearly. So I'm going back here. Yeah, so this is another way to see it more directly. So this is, of course, our clean image, and this is our noisy image. And you can also see again that alpha bar t is dependent on beta t; basically, it's the cumulative product of one minus beta. We'll see the code for it, I guess, so it might be clearer there what alpha bar t is. But basically the idea is that alpha bar t is going to be less than alpha bar t minus one. So this keeps decreasing; it decreases as the time step increases. And on the other hand, this is going to be increasing as the time step increases. So again, you can see the contribution from the original image decreases as the time step increases, while the noise, as shown by the variance, is increasing as the time step increases. Anyway, so that hopefully clarifies the forward process. And then the reverse process is basically a neural network, as Jeremy had mentioned. Let me screenshot this and paste it. Yes, this is our reverse process. And basically the idea is, well, this is a neural network, and this is also a neural network, which we learn during the training of the model. The nice thing about this particular diffusion model paper that made it so simple was that we completely ignore this one and actually just set it to constants based on beta t. We can't see what you're pointing at, so I think it's important to mention what this is here, this term here. So this one, the variance, we just kind of ignore; it's just a constant dependent on beta t. So you only have one neural network that you need to train, which is basically this mean.
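The two equations being pasted in here, the closed-form forward jump and the learned reverse step, are:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t}\left(1 - \beta_s\right)

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right)
```

In the first equation, alpha bar t falls towards zero as t grows, so the signal term shrinks while the noise variance grows; in the second, only the mean is learned, with the variance fixed as a constant derived from beta t.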
And the nice thing about this diffusion model paper is that it also reparameterizes this mean into an easier form, using a lot of complicated math which we'll not get into here. But basically you get this simplified training objective, where, let's see here. Yeah, you see the simplified training objective: you instead have this epsilon theta function. Let me just screenshot that again. This is our loss function that we train with, and we have this epsilon theta function. And you can see it's a very simple loss function, right? This is just... you can just write this down. This is just an MSE loss. And we have this epsilon theta function here. For folks like me who are less mathy, it might not be obvious that it's a simple thing, because it looks quite complicated to me, but once we see it in code, it'll be simple. Yes, yes. You'll see in code how simple it is, but this is just an MSE loss. We've seen MSE loss before, and you'll see how this is basically MSE. So just to take a step back again: what is this epsilon theta? Because this is a new thing that seems a little bit confusing. Basically, you can see here that this is actually equivalent to this equation here; these two are equivalent. This is just another way of saying that, because it's saying x of t. So this is giving x of t in just a different way. Epsilon here is a sample from a normal distribution with a mean of zero and a variance of one, and then you have all these scaling terms that change the mean and variance to match the equation that we always had here. So this is our x of t. And so what epsilon is, it's actually the noise that we're adding to our image to make it into a noisy image. And what this network is doing is trying to predict that noise.
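For reference, the simplified objective being screenshotted is:

```latex
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\rVert^2\right],
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
```

The argument of epsilon theta is just x t written via the reparameterization x t equals root alpha bar t times x naught plus root one minus alpha bar t times epsilon, so the whole thing is an MSE loss between the actual noise and the predicted noise.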
So what this is actually doing is it's a noise predictor, and it's predicting the noise in the image. And why is that important? Basically, the general idea is: if we were to think about our distribution of data, let's just think about it in a 2D space. Here, each data point represents an image, and they're in this blob area, which represents the distribution. So this is in distribution, and this is out of distribution. And basically the idea is that if we want to generate some random image, and we were to take a random data point, it would most likely be a noisy image. So if we take some random data point, it's going to be just noise. But we want to keep adjusting this data point to make it look more like an image from our distribution. That's the whole idea of the iterative process that we're doing in our diffusion model. So the way to get that information is actually to take images from your data set and add noise to them. So that's what we do in this process: we have an image here and we add noise to it. And then what we do is we train a neural network to predict the noise. And by predicting the noise and subtracting it out, we're going back towards the distribution. So adding the noise takes you away from the distribution, and then predicting the noise brings you back to the distribution. So if we know, at any given point in this space, how much noise to remove, that tells us how to keep going towards the data distribution and get a point that lies within the distribution. So that's why we have noise prediction, and that's the importance of it: to be able to do this iterative process where we can start out at a random point, which would be, for example, pure noise, and keep predicting and removing the noise, walking towards the data distribution. Okay. So let's get started with the code.
And so here we of course have our imports, and we're going to load our data set. We're going to work with our Fashion-MNIST data set, which is what we've been working with for a while already. And this is basically the same code that we've seen before in terms of loading the data set. And then we have our model, which is going to remove the noise from the image. What our model is going to take in is the noisy image, and it's going to predict the noise. So the shapes of the input and the output are the same: they're going to be in the shape of an image. So what we use is a UNet neural network, which takes in an input image. And we do see your pointer now, by the way, so feel free to point at things. Yeah. So yeah, it takes in an input image. UNets were originally used for segmentation, but they can also be used for any sort of image to image task, where we're going from an input image and then outputting some other image of some sort. This is a new architecture which we haven't learned about yet; we will be learning about it in the next lesson. But broadly speaking, those gray arrows going from left to right are very much like ResNet skip connections, but they're being used in a different way. Everything else is stuff that we've seen before. So we can basically pretend those don't exist for now. It's a neural network where the output is the same size, or a similar size, to the input, and therefore you can use it to learn how to go from one image to a different image. Yeah. So that's what the UNet is. Like Jeremy said, we'll talk about it more. The sorts of UNets that are used for diffusion models also tend to have some additional tricks, which, again, we'll talk about later on as well. But for the time being, we will just import a UNet from the Diffusers library, which is the Hugging Face library for diffusion models.
So they have a UNet implementation, and we'll just be using that for now. And yeah, strictly speaking, we're cheating at this point because we're using something we haven't written from scratch, but we're only cheating temporarily, because we will be writing it from scratch. And then, of course, we're working with one-channel images, since our Fashion MNIST images are one-channel images, so we just have to specify that. And then the channels of the different blocks within the UNet are also specified. And then let's go into the training process. So basically, the general idea is that we want to train with this MSE loss, and what we do is we select a random time step. And then we add noise to our image based on that time step. So of course, if we have a very high time step, we're adding a lot of noise. If we have a lower time step, we're adding very little noise. So we're going to randomly choose a time step, and then add the noise accordingly to the image. And then we pass the noisy image to the model, as well as the time step. And we are trying to predict the amount of noise that was in the image, and we evaluate that prediction with the MSE loss. I have some pictures of some of these variables I could share, if that would be useful. So I have a version. So I think Tanishk is sharing notebook number 15, is that right? And I've got here notebook number 17. And so I took Tanishk's notebook, and just as I was starting to understand it, I liked to draw pictures for myself to understand what's going on. So I took the things which are in Tanishk's class and just put them into a cell. So I just copied and pasted them, although I replaced the Greek letters with English written-out versions. And then I just plotted them to see what they looked like. So in Tanishk's class, he has this thing called beta, which is just a linspace. So that's just literally a line.
So beta, there are going to be a thousand of them, and they're just going to be equally spaced from 0.0001 to 0.02. And then there's something called sigma, which is the square root of that. So that's what sigma's going to look like. And then he's also got alpha bar, which is the cumulative product of 1 minus this. And that's what alpha bar looks like. So you can see here, as Tanishk was describing earlier, that when T is higher (this is T, the x-axis), beta is higher. And when T is higher, alpha bar is lower. So yeah, if you want to remind yourself: each of these things, beta, sigma, alpha bar, they've each got a thousand things in them, and this is the shape of those thousand things. So this is the amount of variance, I guess, added at each step. This is the square root of that, so it's the standard deviation added at each step. And then if we do 1 minus that, it's just the exact opposite. And then this is what happens if you multiply them all together up to that point. And the reason you do that is because if you add noise to something, then add noise to that, then add noise to that again, you have to multiply together all those amounts of noise to say how much noise you would get. So yeah, those are my pictures, if that's helpful. It's good to see the diagram, and see the actual values and how they change over time. So yeah, let's see here, sorry. Yeah, so like Jeremy was showing, we have our linspace for our beta. In this case, we're using more of the Greek letters, so you can see the Greek letters that we see in the paper, and now we have them here in the code as well. And we have our linspace from our minimum value to our maximum value. And we have some number of steps. So this is the number of time steps. So here we use a thousand time steps, but that can depend on the type of model that you're training. And that's one of the parameters of your model, or hyperparameters of your model.
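The schedule just plotted can be reproduced in a few lines. This is a sketch: the 0.0001 to 0.02 endpoints follow the original DDPM paper, so check them against the values actually used in the notebook:

```python
import torch

n_steps = 1000
# Linear DDPM schedule: variance added at each step
beta = torch.linspace(0.0001, 0.02, n_steps)
sigma = beta.sqrt()                      # standard deviation added at each step
alpha_bar = (1.0 - beta).cumprod(dim=0)  # cumulative product of (1 - beta)

# alpha_bar starts near 1 (t=0: almost clean) and falls towards 0 (t=T: pure noise)
print(alpha_bar[0].item(), alpha_bar[-1].item())
```

Plotting `beta`, `sigma`, and `alpha_bar` against the timestep reproduces the three pictures described above.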
And this is the callback you've got here. So this callback is going to be used to set up the data, I guess, so that you're going to be using this to add the noise, so that the model's then got the data that we're trying to get it to learn to denoise. Yeah. So the callback, of course, makes life a lot easier in terms of setting up everything and still being able to use the miniai Learner with maybe some of these more complicated and a little bit more unique training loops. So yeah, in this case, we're just able to use the callback in order to set up the data, I guess, the batch that we are passing into our learner. I just want to mention: when you first did this, you wrote out the Greek letters in English, alpha and beta and so forth. And at least for my brain, I was finding it difficult to read, because they were literally going off the edge of the page and I couldn't see it all at once. And so we did a search and replace to replace them with the actual Greek letters. I still don't know how I feel about it. I'm finding it easier to read because I can see it all at once without scrolling, and I don't get overwhelmed. But when I need to edit the code, I kind of just tend to copy and paste the Greek letters, which is why we used the actual word beta in the init parameter list, so that somebody using this never has to type a Greek letter. But I don't know, Jono or Tanishk, if you had any thoughts over the last week or two, since we made that change, about whether you guys like having the Greek letters in there or not. I like it for this demo in particular. I don't know that I'd do this in my code, but because we're looking back and forth between the paper and the implementation here, I think it works in this case just fine. Yeah, I agree.
I think it's good for, yeah, when you're studying something or trying to implement something, having the Greek letters is very useful to be able to, I guess, match the math more closely, and it's just easy to take the equation and put it into code, or, vice versa, look at the code and try to match it to the equation. So I think for educational purposes, I tend to like the Greek letters. So, yeah. Yeah, so, you know, we have our initialization, where we're just defining all these variables. We'll get to the predict in just a moment, but first I just want to go over the before_batch, where we're actually setting up our batch to pass into the model. So remember that the model is taking in our noisy image and the time step. And of course, the target is the actual noise that we are adding to the image. So, basically, we generate that noise. So that's what the epsilon is, that target. So epsilon is not the amount of noise, it's the actual noise? Yes, epsilon is the actual noise that we're adding to the image during training. And that's the target as well, because our model is a noise-predicting model: it's predicting the noise in the image, and so our target should be that noise. So we have our epsilon, and we're just generating it with this randn function. It's the standard normal distribution, with a mean of zero and variance of one. So that's what that's doing, with the appropriate shape and device. Then the batch that we get originally will contain the clean images, right? These are the original images from our dataset. So that's x0. And then what we want to do is add noise. So we have our alpha bar, and we have a random time step that we select. And then we just simply follow that equation, which again, I'll show in just a moment.
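The noise-adding step being described in before_batch can be sketched as a standalone function. This is a sketch under the schedule above; the name `noisify` and the exact signature are assumptions:

```python
import torch

def noisify(x0, alpha_bar):
    """Pick a random timestep per image and add the matching amount of noise.
    Returns ((x_t, t), eps): the model's two inputs, and the target noise."""
    n = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (n,), device=x0.device)  # random timestep per image
    eps = torch.randn_like(x0)                                    # the actual noise (the target)
    ab = alpha_bar[t].reshape(n, 1, 1, 1)                         # broadcast over channels/height/width
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return (xt, t), eps

alpha_bar = (1 - torch.linspace(0.0001, 0.02, 1000)).cumprod(0)
(xt, t), eps = noisify(torch.rand(4, 1, 32, 32), alpha_bar)
```

The `((x_t, t), eps)` structure matches the batch layout discussed below: the first element holds the two model inputs, the second the target.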
That equation, you can make a tiny bit easier to read, I think, if you were to double-click on that first alpha bar underscore t, cut it, and then paste it, sorry, in the x_t equals torch dot square root. Take the thing inside the square root, double-click it, and paste it over the top of the word torch. That would be a little bit easier to read. That's ingenious. And then you'll do the same for the next one. There we go. Those parentheses. Yep. Yeah. So basically, yeah. So I guess let's just pull up the equation. So let's see. There's a section in the paper that has the nice algorithm. Let's see if I can find it. No, no. Here. I think earlier. Yes, training. So we're just following these same training steps here. We select a clean image that we take from our dataset; this fancy equation here is just saying, take an image from your dataset. Take a random time step between this range. Then this is our epsilon that we're getting: just get some epsilon value. And then we have our equation for x_t. This is the equation here. You can see that it is the square root of alpha bar t, times x0, plus the square root of 1 minus alpha bar t, times epsilon. So that's the same equation that we have right here, right? And then what we need to do is pass this into our model. So we have x_t and t. So we set up our batch accordingly. These are the two things that we pass into our model. And of course we also have our target, which is our epsilon. And so that's what this is showing here: we pass in our x_t as well as our t here, right? And we pass that into our model. The model is represented here as epsilon theta. And theta is often used to represent that this is a neural network with some parameters, and the parameters are represented by theta. So epsilon theta is just representing our noise-predicting model. So this is our neural network, and we have passed our x_t and our t into it.
And we are comparing it to our target here, which is the actual epsilon. And so that's what we're doing here: we have our batch, where we have our x_t and t and epsilon. And then here we have our prediction function. And because we actually have, I guess in this case, two things in a tuple that we need to pass into our model, we just get those elements from our tuple with this. Yeah, we get the elements from the tuple, pass them into the model, and then Hugging Face has its own API in terms of getting the output. So you need to call .sample in order to get the predictions from your model. So we just do that. And then we have learn.preds, and that's what's going to be used later when we're doing our loss function calculation. It's worth looking at that a little bit more, since we haven't quite seen something like this before. And it's something which I'm not aware of any other framework that would actually let you do: literally replace how prediction works. So miniai is kind of really fun for this. Because you've inherited from TrainCB, and TrainCB has predict defined, and you've defined a new version, it's not going to use the TrainCB version anymore; it's going to use your version. And what you're doing is, instead of passing learn.batch[0] to the model, you've got a star in front of it. So the key thing is what that star is going to do. We know that learn.batch[0] has two things in it, because the learn.batch that you showed at the end of the before_batch method has two things in its first element. So that star will unpack them and send each one of those as a separate argument. So our model needs to take two things, which the diffusers UNet does take. So that's the main interesting point.
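The star-unpacking and .sample pattern being described can be shown with a stand-in model, so it runs without diffusers or miniai. `DummyUNet`, the string payloads, and this stripped-down `predict` are purely illustrative:

```python
from types import SimpleNamespace

# A stand-in model mimicking the Hugging Face convention: it takes two
# arguments (x, t) and returns an object whose .sample holds the prediction.
class DummyUNet:
    def __call__(self, x, t):
        return SimpleNamespace(sample=f"pred(x={x}, t={t})")

# Sketch of the TrainCB-style override: learn.batch[0] is the tuple (x_t, t);
# the * unpacks it into two separate arguments, and .sample unwraps the output.
class DDPMCB:
    def predict(self, learn):
        learn.preds = learn.model(*learn.batch[0]).sample

learn = SimpleNamespace(model=DummyUNet(), batch=(("xt", "t7"), "eps"))
DDPMCB().predict(learn)
print(learn.preds)
```

The same shape applies with the real pieces: the model is the diffusers UNet, and `learn.batch[0]` holds the noisy images and timesteps.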
And then something I find a bit awkward, honestly, about a lot of Hugging Face stuff, including Diffusers, is that generally their models don't just return the result, but they put it inside some name. And so that's what happens here: they put it inside something called sample. So that's why Tanishk added .sample at the end of the predict, because of this somewhat awkward thing which Hugging Face likes to do for some reason. But yeah, this is something that people often get stuck on, I see, on Kaggle and stuff like that. It's like, how on earth do I use these models? Because they take things in weird forms and they give back things in weird forms. Well, this is how. If you inherit from TrainCB, you can change predict to do whatever you want, which I think is quite sweet. Yep. So yeah, that's the train loop. And then of course, you have your regular train loop that's implemented in miniai, where you have your loss function calculation. You get the predictions, learn.preds, and of course the target is learn.batch[1], which is our epsilon. So, you know, we have those, and we pass them into the loss function, calculate the loss, and it does the backpropagation. So, I'll just go over that. We'll get back to the sampling in just a moment, but just to show the training loop. So, most of this is copied from, I think it's the 14_augment notebook, the way you've got the T_max and the sched. The only thing I think you've added here is the DDPM callback, right? Yes, the DDPM callback. And the loss function. Yes, so basically, we have to initialize our DDPM callback with the appropriate arguments, so the number of time steps and the minimum beta and maximum beta. And then of course, we're using an MSE loss, as we talked about. It just becomes a regular train loop, and everything else is from before. Yeah, you have your scheduler, your progress bar, all of that we've seen before.
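Putting the pieces together (random timestep, noisify, predict the noise, MSE against the true noise), the objective being trained here is the simplified DDPM loss from the paper:

$$ L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big] $$

where epsilon theta is the noise-predicting UNet, and the argument inside it is exactly the x_t equation from before_batch.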
I think that's really cool: we're using basically the same code to train a diffusion model as we've used to train a classifier, just with one extra callback. Yeah, yeah. Callbacks are very powerful for allowing us to do such things. You can take all this code, and now we have a diffusion training loop, and we can just call learn.fit. And yeah, you can see, we've got a nice training loop, nice loss curve. We can use torch's saving functionality to save our model, and we can load it in. But now that we have our trained model, the question is, how do we use it to sample new images? So the basic idea, of course, was that, you know, basically we're here, right? Let's see here, okay. So the basic idea is that we start out with a random data point. And of course, that's not going to lie within the distribution at first, but now we've learned how to move from one point towards the data distribution. That's what our noise-predicting function does: it basically tells you in what direction and how much to move. So the basic idea is that, yeah, I guess I'll start from maybe a new drawing here. Again, we have our distribution, and we have a random point. And we use our noise-predicting model that we have trained to tell us which direction to move, so it tells us some direction. At first, that direction is not going to be... like, you can't follow that direction all the way to get the correct data point. Because basically what we're doing is trying to reverse the path that we were following when we were adding noise. Because we had an original data point, and we kept adding noise to it, and maybe it followed some path like this. And we want to reverse that path to get back.
So our noise-predicting function will give us an original direction, which is going to be kind of tangential to the actual path at that location. So what we would do is follow that direction all the way: we're going to try to predict the fully denoised image by following this noise prediction. But our fully denoised image is also not going to be a real image. So let me show you an example of that over here in the paper, where they show this a little bit more carefully. Let's see here. So x0, yeah, so basically you can see the different data points here. It's not going to look anything like a real image. You can see all these points, you know, it doesn't look like anything. Okay, so what we do is we actually add a little bit of noise back to it, and we have a new point, where we can then get a better estimate of which direction to move. Follow that all the way again to a new point, and then add back a little bit of noise. You get a new estimate of this noise prediction, follow that all the way again, add a little bit of noise again to the image, and converge onto an image. So that's kind of what we're showing here. That's a lot like SGD. With SGD, we don't take the gradient and jump all the way. We use a learning rate to go some of the way, because each of those estimates of where we want to go is, you know, not that great, but we just do it slowly. Exactly. And at the end of the day, that's what we're doing with this noise prediction.
We are predicting the sort of gradient of this p(x), but of course we need to keep making estimates of that gradient as we're progressing, so we have to keep evaluating our noise-prediction function to get updated and better estimates of our gradient, in order to finally converge onto our image. And then you can see that here, where we have this fully predicted denoised image, which at the beginning doesn't look anything like a real image, but then as we continue to run the sampling process, we finally converge on something that looks like an actual image. Again, these are CIFAR-10 images, and it's still maybe a little bit unclear how realistic these very small images look, but that's the general principle, I would say. And so that's what I can show in the code: this idea that we're going to start out basically with a random image, right? And this random image is going to be like a pure noise image, and it's not going to be part of the data distribution. It's not anything like a real image; it's just a random image. And so this is going to be our x, I guess x uppercase T, right? That's what we start out with. And we want to go from x uppercase T all the way to x0. So what we do is we go through each of the time steps, and we have to put it in this sort of batch format, because that's what our neural network expects, so we just have to format it appropriately. And we'll get to z in just a moment; I'll explain that in just a moment. But of course we just grab our alpha bar, beta bar, and so on, which is getting those variables that we need. And we call beta bar b_bar because we couldn't figure out how to type it, so we used b_bar instead. Yes, yes. So, yeah, we weren't able to get beta bar to work, I guess. But anyway, at each step, what we're trying to do is predict what direction we need to go, and that direction is given by our noise-predicting model, right?
So what we do is we pass x_t and our current time step into our model, and we get this noise prediction, and that's the direction that we need to move in. So basically, we take x_t, and we first attempt to completely remove the noise. That's what this is doing; that's what x0 hat is. That's completely removing the noise. And of course, as we said, that estimate at the beginning won't be very accurate. And so what we do is we have some coefficients here: a coefficient for how much we keep of this estimate of our denoised image, and how much of the original noisy image we keep. And on top of that, we're going to add in some additional noise. So that's what we do here: we have x0 hat, and we multiply it by its coefficient, and we have x_t, and we multiply it by some coefficient, and we also add some additional noise; that's what z is. So that's basically a weighted average of the two, plus noise. And then the whole idea is that as we get closer and closer to time step equals zero, our estimate of x0 will be more and more accurate. So our x0 coefficient will get closer and closer to one as we're going through the process, and our x_t coefficient will get closer and closer to zero. So basically, we're going to be weighting more and more of the x0 hat estimate, and less and less of the x_t, as we're getting closer and closer to our final time step. And so at the end of the day, we will have our estimated generated image. So that's an overview of the sampling process. So yeah, basically the way I implemented it here was I had the sample function that's part of our callback, and it will take in the model and the shape that you want for the images that you're producing. So if you want to specify how many images you produce, that's going to be part of your batch size, or whatever, and you'll just see that in a moment.
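The sampling loop just described can be sketched as a standalone function. The coefficient formulas are the standard DDPM posterior-mean ones; the variable names and the assumption that `model(x_t, t)` returns the predicted noise directly (unwrap `.sample` first if you use a diffusers UNet) are mine, not necessarily the notebook's:

```python
import torch

@torch.no_grad()
def sample(model, sz, n_steps, beta):
    alpha = 1 - beta
    alpha_bar = alpha.cumprod(0)
    xt = torch.randn(sz)                                   # start from pure noise, x_T
    for t in reversed(range(n_steps)):
        t_batch = torch.full((sz[0],), t)                  # the timestep, batched
        z = torch.randn(sz) if t > 0 else torch.zeros(sz)  # no extra noise on the last step
        ab = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps = model(xt, t_batch)                           # predicted noise
        # estimate of the fully denoised image (invert the noisify equation)
        x0_hat = (xt - (1 - ab).sqrt() * eps) / ab.sqrt()
        # weighted average of x0_hat and xt, plus a little fresh noise
        x0_coeff = ab_prev.sqrt() * beta[t] / (1 - ab)
        xt_coeff = alpha[t].sqrt() * (1 - ab_prev) / (1 - ab)
        xt = x0_coeff * x0_hat + xt_coeff * xt + beta[t].sqrt() * z
    return xt

# smoke test with a dummy "model" that predicts zero noise
preds = sample(lambda x, t: torch.zeros_like(x), (2, 1, 8, 8), 100, torch.linspace(0.0001, 0.02, 100))
```

Note how `x0_coeff` heads towards one and `xt_coeff` towards zero as t approaches 0, which is exactly the weighting behaviour described above.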
But yeah, it's just part of the callback, so then we basically have our DDPM callback, and we can just call the sample method of our DDPM callback, and we pass in our model, and then here you can see we're going to produce, for example, 16 images, and it just has to be a one-channel image of shape 32 by 32, and we get our samples. And one thing I forgot to note was that I am collecting, at each time step, the x_t's. So the predictions here, you can see that there are a thousand of them. We want the last one, because that is our final generation. So we want the last one, and that's what we're... Not bad, actually. Yeah, and this is... We've come a long way since DDPM, so this is slower and less great than it could be, but considering that, except for the UNet, we've done this from scratch, well, actually from matrix multiplication, I think those are pretty decent. Yeah, and they're only trained for about 5 epochs; it took maybe like 4 minutes to train this model, something like that. It's pretty quick, and this is what we could get with very little training, and it's pretty decent, and you can see, of course, some clear shirts and shoes and pants and whatever else. Yeah, and you can see fabric, we've got texture, and things have buckles. Well, yeah, you know, I was going to compare: we did generative modeling the first time we did Part 2, back in the days when, I think it was called, Wasserstein GAN was just new, which was actually created by the same guy that created PyTorch, or one of the two guys, Soumith, and we trained for hours and hours and hours and got things that I'm not sure were any better than this. So things have come a long way.
Yeah, and of course, then we can see how this sampling progresses over time, over the multiple time steps. So that's what I'm showing here, because during the sampling process we are collecting, at each time step, what that estimate looks like, and you can kind of see here, and so this is the estimate of the noisy image over the time steps. Oops, and I guess I had to pause. Yeah, you can kind of see, but you'll notice that, so what we did is we selected an image, which is the ninth image, so that's this image here, so we're looking at this image particularly, and we have a function here that's showing the time steps during the sampling process of that image, and we're just getting the images. And what we are doing is only showing basically from time step 800 to 1000, and here we're looking at maybe every five steps, going from 800 to 1000, and this kind of makes it visually easier to see the transition. But what you'll notice is I don't start all the way from 0; I start from 800. And the reason we do that is because, actually, between 0 and 800 there's very little change; it's just mostly a noisy image. And it turns out, as I make a note of here, it's actually a limitation of the noise schedule that is used in the original DDPM paper, especially when applied to some of these smaller images, when we're working with, like, size 32 by 32 or whatever. And so there are some other papers, like the Improved DDPM paper, that propose other sorts of noise schedules. And what I mean by noise schedule is basically how beta is defined. So we had this definition of torch.linspace for our beta, but people have different ways of defining beta that lead to different properties. So people have come up with different improvements, and those sorts of improvements work well when we're working with these smaller
images. And basically the point is, if from 0 to 800 it's just mostly noise that entire time, we're not actually making full use of all those time steps. So it would be nice if we could actually make full use of those time steps and actually have something happening during that time period. So there are some papers that examine this a little bit more carefully, and it would be kind of interesting for some of you folks to also look at these papers and see if you can try to implement those sorts of models, with this notebook as a starting point; it should be a fairly simple change in terms of the noise schedule or something like that. So I actually think this is the start of our next journey. Our previous journey was going from being totally rubbish at Fashion MNIST classification to being really good at it. I would say now we're, like, a little bit rubbish at doing Fashion MNIST generation, and yeah, I think, you know, we should all now work from here over the next few lessons and so forth, with people trying things at home and all of us trying to make better and better generative models, initially on Fashion MNIST, and hopefully we'll get to the point where we're so good at that that we're like, oh, this is too easy, and then we'll pick something harder, and eventually that'll take us to Stable Diffusion and beyond, I imagine. That's cool. I've got some stuff to show you guys, if you're interested. I tried to better understand what was going on in Tanishk's notebook, and to do it in a bunch of different ways, and also see if I could just start to make it a bit faster. So that's what's in notebook 17, which I will share. So we've already seen the start of notebook 17. One thing I did is I just drew a picture for myself, partly just to remind myself what the real ones look like, and they definitely have more detail than the samples that Tanishk was showing. But they're not, you know, they're just 28 by 28; I mean, they're not super amazing images, and they're just
black or white. So even if we're fantastic at this, they're never going to look great, because we're using a small, simple dataset, as you always should when you're doing any kind of R&D or experiments. You should always use a small and simple dataset up until you're so good at it that it's not challenging anymore, and even then, when you're exploring new ideas, you should explore them on small, simple datasets first. So after I drew the various things, what I liked to do... one thing I found challenging about working with your class, Tanishk, is that I find when stuff is inside a class, it's harder for me to explore. So I copied and pasted the before_batch contents out and called it noisify. And one of the fun things about doing that is it forces you to figure out what the actual parameters to it are. And so now, rather than putting them in the class, I've got all of my various things: these are the three parameters to the DDPM callback's init. So then these things we can calculate from that. And with those, then, all we actually need is: what's the image that we're going to noisify, and what's the alpha bar, which, I mean, we can get from here, but it's more general if you can pass in your alpha bar. So yeah, this is just copying and pasting from the class, but the nice thing is then I could experiment with it. So I can call noisify on my first 25 images with a random t, and so I can print out the t's, and I could actually use those as titles. And so this lets me... I thought this was quite nice. I might actually rerun this, because actually none of these look like anything, because, as it turns out, in this particular case all of the t's are over 200, and as Tanishk mentioned, once you're over 200, it's almost impossible to see anything. So let's just rerun this and see if we get a better one. There we go, there's a better one. So with a t of 7, right, so remember, t equals 0 is the pure image, so at t equals 7 it's just a slightly speckly image, and by 67 it's a
pretty bad image, and by 94 it's very hard to see what it is at all, and by 293 maybe I can see a pair of trousers; I'm not sure I can see anything. So yeah, by the way, there's a handy little... I think we've looked at map before in the course. There's an extended version of map in fastcore, and one of the nice things is you can pass it a string, and it basically just calls this format string if you pass it a string rather than a function. And so this is going to stringify everything using its representation. That's how I got the titles out of it, just by the way. So yeah, I found this useful, to be able to draw a picture of everything. And then I wanted to, yeah, look at what else I could do. So then, you won't be surprised to see, I took the sample method and turned that into a function. And I actually decided to pass in everything that it needs; I mean, you could actually calculate pretty much all of these, but I thought, since I've calculated them before, I may as well pass them in. So this is all copied and pasted from Tanishk's version, and so that means the callback now is tiny, because before_batch just calls noisify, and the sample method just calls the sample function. Now, what I did do is, I wanted to try as many different ways of doing this as possible, partly as an exercise to help everybody see all the different ways we can work with our framework. So I decided not to inherit from TrainCB, but instead I inherited from Callback. So that means I can't use Tanishk's nifty trick of replacing predict. So instead, I now need some way to pass in the two parts of the first element of the tuple as separate things to the model, and return the .sample. So how else could we do that? Well, what we could do is actually inherit from UNet2DModel, which is what Tanishk uses directly, and we could replace the model, and specifically we could replace the forward function, that's the thing that gets called, and we could just call the original forward function
but rather than passing in x, we're passing in *x, and rather than returning that, we'll return that .sample. Okay, so if we do that, then we don't need the TrainCB anymore, and we don't need the predict. And so if you're not working with something as beautifully flexible as miniai, you can always do this: replace your model so that it has the interface that you need it to have. So now, again, we do the same as Tanishk: create the callback, and now, when we create the model, we'll use the UNet subclass which we just created. I wanted to see if I could make things faster. I tried dividing all of Tanishk's channels by two, and I found it worked just as well. One thing I noticed is that it uses GroupNorm in the UNet, which we have briefly learned about before, and in GroupNorm it splits the channels up into a certain number of groups. And I needed to make sure that those groups had more than one thing in them. So you can actually pass in how many groups you want to use in the normalization; that's what this is for. You've got to be a little bit careful of these things. I didn't think of it at first, and I ended up... I think the norm groups might have been 32, and I got an error saying you can't split 16 things into 32 groups. But it also made me realize, actually, in Tanishk's version, probably you had 32 channels in the first block with 32 groups, and so maybe the GroupNorm wouldn't have been working as well. So there are little subtle things to look out for. So now that we're not using anything inherited from TrainCB, that means we either need to use TrainCB itself, or just use our TrainLearner; and then everything else is the same as what Tanishk had. So then I wanted to look at the results of noisify here, and we've seen this trick before, which is: we call fit, but don't call the training part of the fit, and use the SingleBatchCB callback that we created way back when we first created Learner. And now learn.batch will contain the tuple of tuples, which we can then use that trick to show. So, I mean, obviously we'd expect it to look
the same as before, but I always like to draw pictures of everything all along the way, because very often the first six or seven times I do a thing, I do it wrong. Given that I know that, I might as well draw a picture to try to see how it's wrong until it's fixed. It also tells me when it's not wrong. "Isn't there a show_batch function now that does something similar?" Yes, you wrote that, show_image_batch, didn't you? I can't quite remember. Yeah, we should remind ourselves how that worked; that's a good point, thanks for the reminder. Okay, so then I just go ahead and do the same thing that Tanishk did. The next thing I looked at was: how am I going to make this train faster? I want a higher learning rate, and I realized, oddly enough, that the diffusers code does not initialize anything at all; they just use the defaults. Which goes to show that even the experts at Hugging Face don't necessarily think, "oh, maybe the PyTorch defaults aren't perfect for my model". Of course they're not, because they depend on what activation function you have, what ResBlocks you have, and so forth. So I wasn't exactly sure how to initialize it. Partly by chatting with Katherine Crowson, who's the author of k-diffusion, partly by looking at papers, and partly by drawing on my own experience, I ended up doing a few things. One is the thing we talked about a while ago: taking every second convolutional layer and zeroing it out. You could do the same thing using batch norm, which is what we tried. Since we've got quite a deep network, that seemed like it might help, basically by having the non-identity path in the ResNets do nothing at first, so they can't cause problems. We haven't talked about orthogonal weights before, and we probably won't, because you would need to take our computational linear algebra course to learn about that. It's a great course; Rachel Thomas did a fantastic job of it, and I highly recommend it, but I
don't want to make it a prerequisite. But Kat mentioned she thought using orthogonal weights for the downsamplers was a good idea. Then for all the up blocks, they also set the second convs to zero. And something Kat mentioned she found useful, which I think is also from the Dhariwal paper, is to zero out the weights of the very last layer as well, so the model starts by predicting zero as the noise, which is something that can't hurt. So that's how I initialized the weights; I called init_ddpm on my model. Something I found made a huge difference is that I replaced the normal Adam optimizer with one that has an epsilon of 1e-5; the default, I think, is 1e-8. To remind you, this is where we divide by the exponentially weighted moving average of the squared gradients. If that's a very, very small number, dividing by it makes the effective learning rate huge, so we add this epsilon to keep it from getting too large. It's nearly always a good idea to make this bigger than the default; I don't know why the default is so small. I found that until I did this, any time I tried to use a reasonably large learning rate, the training would explode somewhere around the middle of the one-cycle schedule. So that makes a big difference. This way I could train, and I got 0.16 after five epochs. And then sampling: it all looks pretty similar, and we got some pretty nice textures, I think. So then I was thinking: how do I make it faster? One way is to take advantage of something called mixed precision. Currently we're using 32-bit floating point values; that's the default, also known as single precision. GPUs are pretty fast at 32-bit floating point math, but they're much, much faster at 16-bit floating point math. However, 16-bit floating point values aren't able to represent a very wide range of numbers, or much precision in the differences between numbers, so they're
quite difficult to use. But if you can use them, you'll get a huge benefit, because modern GPUs, modern NVIDIA GPUs specifically, have special units that do matrix multiplies of 16-bit values extremely quickly. You can't just cast everything to 16-bit, because then there's not enough precision to calculate gradients and so forth properly, so we have to use something called mixed precision. Depending on how enthusiastic I'm feeling, I guess we ought to do this from scratch as well; we'll see. We do have a from-scratch implementation, because we actually implemented this before NVIDIA did, in an earlier version of fastai. Anyway, we'll see. Basically, the idea is that we use 32-bit for the things that need 32-bit, and 16-bit for the things where 16-bit suffices. So that's what we're going to do: use mixed precision. For now, we'll use NVIDIA's fairly automatic code to do that for us. Actually, we had a slight change of plan at this point, when we realized this lesson was going to be over three hours long and we should split it into two. So we're going to wrap up this lesson here, and we'll come back and implement this mixed precision idea in lesson 20. We'll see you then.
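As an aside, the interface-adaptation trick described above, subclassing the model and overriding forward so it unpacks the input tuple and returns the raw tensor rather than diffusers' output wrapper, can be sketched without diffusers installed. Here DummyUNet and Output are stand-ins invented for illustration; they mimic the diffusers UNet2DModel interface (a forward taking x and timesteps, returning an object with a .sample attribute), but they are not the real library classes:

```python
class Output:
    """Mimics a diffusers-style output object that wraps the prediction in .sample."""
    def __init__(self, sample): self.sample = sample

class DummyUNet:
    """Stand-in for UNet2DModel: forward takes (x, timesteps), returns an Output."""
    def forward(self, x, t): return Output(("pred", x, t))

class UNet(DummyUNet):
    # Adapt the interface: the learner passes a single tuple (noised_x, t),
    # so unpack it with * and return the raw prediction, not the wrapper.
    def forward(self, xt): return super().forward(*xt).sample

print(UNet().forward(("imgs", "timesteps")))  # ('pred', 'imgs', 'timesteps')
```

The point of the wrapper is that the rest of the training loop never needs to know about diffusers' calling conventions; the model now has whatever interface the framework expects.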
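The GroupNorm pitfall mentioned above (you can't split 16 channels into 32 groups, and 32 channels with 32 groups leaves only one channel per group) comes down to a simple divisibility constraint. A minimal sketch of the check, not PyTorch's actual implementation:

```python
def channels_per_group(num_channels, num_groups):
    # GroupNorm requires the channel count to divide evenly into groups
    if num_channels % num_groups != 0:
        raise ValueError(
            f"can't split {num_channels} channels into {num_groups} groups")
    return num_channels // num_groups

print(channels_per_group(32, 8))   # 4 channels normalized together per group
print(channels_per_group(32, 32))  # 1 -- each channel normalized alone
# channels_per_group(16, 32) would raise, matching the error described above
```

This is why, after halving the channel counts, the number of groups had to be passed in explicitly rather than left at the default of 32.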
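The effect of the Adam epsilon described above can be illustrated numerically. Adam scales each parameter's step by roughly 1 / (sqrt(v) + eps), where v is the moving average of squared gradients, so when v is near zero, eps alone bounds the step size. This is just the denominator term, not a full Adam implementation:

```python
import math

def adam_step_scale(v, eps):
    # Adam's effective per-parameter step scale is ~ 1 / (sqrt(v) + eps);
    # when v is tiny, eps is all that stops this from exploding
    return 1 / (math.sqrt(v) + eps)

print(adam_step_scale(1e-16, eps=1e-8))  # ~5e7: an enormous effective step
print(adam_step_scale(1e-16, eps=1e-5))  # ~1e5: several hundred times smaller
```

With eps=1e-8 and a near-zero v, the learning rate is effectively multiplied by tens of millions, which is consistent with the mid-training explosions described above; raising eps to 1e-5 caps that amplification.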