Hello, Jono. Hello, Tanishk. Are you guys ready for lesson 21? Ready? Yep. I'm excited. I don't know what I would have said if you had said no. So good. I'm actually particularly excited because I had a little bit of a sneak preview of something that Jono has been working on, which I think is a super cool demo of what's possible with very little code with miniai. So let me turn it over to Jono. Great. Thanks, Jeremy. Yeah, so as you'll see when it's back to Jeremy to talk through some of the experiments and things we've been doing, we've been using the Fashion MNIST dataset at a really small scale to really rapidly try out these different ideas and see some maybe nuances or things that we'd like to explore further. And so as we were doing that, I started to think that maybe it was about time to explore just ramping up a level, like seeing if we can go to the next slightly larger dataset, slightly harder difficulty, just to double check that these ideas still hold for longer training runs and different, more difficult data. That's a really good idea, because I feel pretty confident that the learnings from Fashion MNIST are going to move across. Most of the time these things seem to, but sometimes they don't, and it can be very hard to predict. So this seems like a very wise choice. Yeah. And so we'll keep ramping up, but as a next step, one above Fashion MNIST, I thought I'd look at this dataset called CIFAR-10. And the CIFAR-10 dataset is a very popular dataset, originally for things like image classification, but also now for many papers on generative modeling. It's kind of like the smallest dataset that you'll see in these papers. And so, yeah, if you look at the classification results, for example, pretty much every classification paper since they started tracking has reported results on CIFAR-10 as well as their larger datasets. And likewise with image generation, it's very, very popular. All of the recent diffusion papers will usually report CIFAR-10, and then maybe ImageNet, and then whatever large, massive dataset they're training on. So we were somewhat notable in 2018 for managing to train... so for CIFAR-10, 94% classification accuracy is kind of the benchmark. There was a competition a few years ago where we managed to get to that point at a cost of like 26 cents worth of AWS time, I think, which won a big global competition. So I actually hate CIFAR-10, but we had some real fun with it a few years ago. Yeah, and it's good. It's a nice dataset for quickly testing things out, but we'll talk about why we also, like us as a group, don't like it at all, and we'll pretty soon move on to something better. So one of the things you'll notice in this notebook is I'm basically using all of the same code that Jeremy's going to be looking at and explaining, so I won't go into too much. But the dataset's also on HuggingFace, so we can load it just like we did with Fashion MNIST, and the images are three-channel rather than single-channel, so the shape of the data is slightly different to what we've been working with. That's weird. Yeah, so instead of a single-channel image, we have a three-channel red, green and blue image, and this is what a batch of data looks like. And you've got the same two images in your batch, so that's batch by channel by height by width, right? Yeah, that's right, batch by channel by height and width. I was a little confused by the 32 by 32. Oh, yeah, I got it now. That's kind of arbitrary.
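A minimal sketch of what loading CIFAR-10 this way might look like, following the same HuggingFace-datasets pattern used for Fashion MNIST in earlier notebooks. The dataset and column names ("cifar10", "img") are the standard HuggingFace ones; the rest of the helper code here is illustrative rather than the exact notebook code.

```python
# Load CIFAR-10 from HuggingFace and show that a batch is (batch, channel, height, width)
from datasets import load_dataset
import torchvision.transforms.functional as TF
import torch

dsd = load_dataset("cifar10")                       # splits: "train", "test"

def transform(b):
    # to_tensor gives a (3, 32, 32) float tensor in [0, 1] for each RGB image
    b["img"] = [TF.to_tensor(o) for o in b["img"]]
    return b

dsd = dsd.with_transform(transform)
b = dsd["train"][:128]                              # a batch-sized slice
xb = torch.stack(b["img"])
print(xb.shape)                                     # torch.Size([128, 3, 32, 32])
```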
So if you plot these, one of the things you'll notice if you look at this is, okay, I can see these are different classes. I know this is an airplane, a frog, an airplane (but it's actually a puzzle with an airplane on the cover), a bird, a horse, a car. That one, if you squint, you can tell it's a deer, but only if you really know what you're looking for. And so when we started to talk about generating these images, this is actually quite frustrating. Like, if I generated this, I'd say this might be the model doing a really bad job. But it's actually that this is a boat, this is a dog. It's just that this is what the data looks like. So I've actually got something that can help you out, which I'll show later today, which is, for something like this, it's really hard to see whether the generations are good, because the images themselves are bad. It can be helpful to have a metric that can tell how good generated samples are. So I'll be showing a metric for that later today. Yeah, and that'll be great, and I'd hope to have it automated. But anyway, I just wanted to flag that for visually inspecting these, it's not great. And so we don't really like CIFAR-10, because it's hard to tell, but it's still a good one to test with. So the Noisify and everything, I'm following what Jeremy is going to be showing exactly. The code works without any changes, because we're adding random noise in the same shape as our data. So even though our data now has three channels, the Noisify function still works fine. If we try and visualize the noised images, because we're adding noise in the red, green and blue channels, and some of that's, you know, quite extreme values, yeah, it looks slightly different. It looks all crazy RGB. But you can see, for example, this frog doesn't have as much noise and it's vaguely visible. But it is a nigh impossible task to look at this and tell what image is hiding under all of that noise. So I think this is really neat, that you could use the same Noisify. Yeah, and it still works thanks to, it's not just that shape thing, but I guess just thanks to PyTorch's broadcasting. This often happens: you can change the dimensions of things and it just keeps working. Exactly. And we've been paying attention to those broadcasting rules and the right dimensions and so on. Cool. So I'm going to use the same sort of approach to loading the UNet, except that now obviously I need to specify three input channels and three output channels, because we're working with three-channel images. But I did want to explore for this demo, like, okay, how could I maybe justify wanting to do this kind of experiment tracking thing that I'll talk about. And so I'm bumping up the size of the model substantially. I've gone from the default settings that we were using for Fashion MNIST, but the Diffusers default UNet has, what, nearly 20 times as many parameters: 274 million versus 15 million. So we're going to try a larger model, and we're going to try a longer bit of training. And so I could just do the same training that we've always done, just in the notebook: set up a learner with ProgressCB to plot the loss and track some metrics. But, yeah, I don't know about you, but once it's beyond a few minutes of training, I quickly get impatient, and I have to wait for it to finish before we can sample. So I'm doing the DDPM sample, but I actually interrupted the training to say, I just want to get a look at what it looks like initially and to plot some samples.
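To make the broadcasting point above concrete, here is a minimal sketch of the noisify idea: the per-item alpha-bar is reshaped to (batch, 1, 1, 1), so it broadcasts over however many channels the images have, which is why the same function handles one-channel Fashion MNIST and three-channel CIFAR-10. Variable names here are illustrative rather than the notebook's exact ones.

```python
import torch

def noisify(x0, alphabar):
    n = len(x0)
    t = torch.randint(0, len(alphabar), (n,), device=x0.device)   # random timestep per image
    eps = torch.randn_like(x0)                                     # noise, same shape as x0
    ab = alphabar[t].reshape(-1, 1, 1, 1)                          # (B,) -> (B,1,1,1) for broadcasting
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                    # works for 1- or 3-channel images
    return (xt, t), eps                                            # (noised input, timestep), target

alphabar = torch.linspace(0.99, 0.01, 1000)
(xt, t), eps = noisify(torch.randn(8, 3, 32, 32), alphabar)
print(xt.shape, t.shape, eps.shape)
```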
And again, the sampling function works without any modification, but I'm passing in my size to be a three-channel image. Yeah, and so we could do it like this. But at some point I would like to, A, keep track of what experiments I've tried, and B, be able to see things as it's going over time, including, like, I'd love to see what the samples look like if you generated them after the first epoch, after the second epoch. And so that's where my little callback that I've been playing with comes in. Just before you do that, I'll just mention there are simple ways you could do that, right? One popular way a lot of people use is that they'll save some sample images as files every epoch or two. Or, the same way that we have an updating plot as we train with fastprogress, we could have an updating set of sample images. So there are a few ways we could solve that. That wouldn't handle the tracking that you mentioned, of looking over time at how different changes have improved things or made them worse; that would, I guess, require saving multiple versions of a notebook or keeping some kind of research journal or something. That'd be a bit fiddly. It is. And all of that's doable, but I also find I'm a little bit lazy sometimes. Maybe I don't write down what I'm trying, or, yeah, I've saved untitled copy number 37 of a notebook. So, yeah, the idea that I wanted to show here is just that there are lots of other solutions for this kind of experiment tracking and logging, and one that I really like is called Weights and Biases. So I'll explain what's going on in the code here, but I'm running a training with this additional Weights and Biases callback, and what it's doing is allowing me to log whatever I'd like. So I can log samples at different steps. Okay, so you're switching to a website here called wandb.ai. So that's where your callback is sending information to. Yeah, so Weights and Biases accounts are free for personal and academic use. And, like, I don't think I know anyone at Weights and Biases, but it's a very nice service. You sign up, and you log in on your computer, or you get a little authentication token. And then you're able to log these experiments, and you can log into different projects. And what it gives you is, for each experiment, anything that you call wandb.log on at any step in the training, that's getting logged and sent to their server and stored somewhere where you can later access it and display it. You also have these plots that you can visualize easily, and you can also share them very easily, and these reports that integrate this data sort of interactively. And why that's nice is that later you can go and look at, so this is now the project that I'm logging to, you can log multiple runs with different settings. And for each of those, you have all of these things that you've tracked, like your training loss and validation loss. You can also track your learning rate if you're using a learning rate schedule, and you can save your model as an artifact and it'll get saved on their server, so you can see exactly what run produced what model. It logs the code if you set that: you can pass save_code equals true, and then it keeps a copy of your whole Python environment, what libraries were installed, what code you ran. So for being able to come back later and say, oh, these images here, these look really good, let's go back and see, oh, that was this experiment here.
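The core Weights and Biases pattern being described here is just three calls: init, log, finish. A bare-bones sketch, with a made-up project name and config, might look like this.

```python
import wandb

# Start an experiment; config values show up on wandb.ai alongside the run
wandb.init(project="fashion_ddpm", config={"lr": 1e-2, "epochs": 5, "bs": 512})

for step in range(100):
    loss = 1.0 / (step + 1)              # stand-in for a real training loss
    wandb.log({"train/loss": loss})      # anything logged here is sent to the server

wandb.finish()                           # sync everything up and close the run
```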
I can check what settings I used. In the initialization, you can log whatever configuration details you'd like, and any comments. And, yeah, there are other frameworks for this. Yeah, in some ways, when I first saw Weights and Biases, it felt a bit weird to me, actually, sending your information off to an external website. Because, I mean, before Weights and Biases existed, the most popular way to do this was something called TensorBoard, which Google provides, which is actually a lot like this, but it's a little server that runs on your own computer. And so when you log things, it just puts it into this little database on your computer, which is totally fine. But I guess actually there are some benefits to having somebody else run this service for you instead of running your own little TensorBoard or whatever server. One is that you can have multiple people working on a project collaborating. So I've done that before, where we will each be sending different sets of hyperparameters, and then they'll all end up in the same place. Or if you want to be really anti-social, you know, you can interrupt your romantic dinner and look at your phone to see how your training is going. So, yeah, I'm not going to say it's always the best approach to doing things, but I think there are definitely benefits to using this kind of service. And it looks like you're showing us that you can also create reports for sharing this, which is also pretty nifty. Yeah, yeah. So I like it for working with other people, or when you want to show somebody the final results and be able to pull together the results from some different runs, or just say, oh, look, by the way, here's the set of examples from my two most recent runs, and it tracks the different steps. What do you think of this? And, yeah, being able to have this place where everyone can go and inspect the different loss curves. For any run, they can say, oh, you know, what was the batch size for this? Let me go look at the info there. Okay, I didn't log it, but I logged how many epochs and the learning rate. So, yeah, I find it quite nice, especially in a team or if you're doing lots and lots of experiments, to be able to have this permanent record that somebody else deals with, and they handle the storage and the tracking. Yeah, it's quite nice. Wait, is this all the code you had to write? That's amazing. Yeah. So this is using the callback system. The way Weights and Biases works is that you start an experiment with wandb.init, and you can specify any configuration settings that you've used there. And then anything you want to log, it's wandb.log, and you pass in a dictionary with the name of whatever you're logging. Okay, I'm logging the loss for the batch. And once you're done, wandb.finish syncs everything up and sends it to the server. And this is why you've inherited from MetricsCB and replaced that underscore log that we previously used to let fastprogress do the logging, and you've replaced it to let Weights and Biases do the logging. So, yeah, it's really sweet. Yeah, yeah. So this is using the callback system. I wanted to do the things that MetricsCB normally does, which is tracking the different metrics that you pass in. So this will still do that, and I just offload to the super class, the original MetricsCB methods, for things like after_batch. But in addition to that, I'd also like to log to Weights and Biases. And so before fit, I initialize the experiment.
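A rough sketch of the shape of such a callback, assuming a miniai-style MetricsCB with a `_log` method and `before_fit`/`after_fit`/`after_batch` hooks that receive the learner. The project name and the `sample_fn` helper are made up for illustration, and the sample-plotting part anticipates what Jono describes next; treat this as a sketch of the idea rather than the notebook's exact code.

```python
import wandb
import matplotlib.pyplot as plt

class WandBCB(MetricsCB):                                   # MetricsCB from miniai (assumed API)
    order = 100
    def __init__(self, config, *ms, sample_fn=None, **metrics):
        super().__init__(*ms, **metrics)
        self.config, self.sample_fn = config, sample_fn

    def before_fit(self, learn):
        wandb.init(project="fashion_ddpm", config=self.config)   # start the experiment
        super().before_fit(learn)

    def after_fit(self, learn):
        wandb.finish()                                      # sync everything to the server

    def _log(self, d):
        # d is the dict of metrics MetricsCB would normally print; send it to W&B too
        if self.sample_fn is not None:
            samples = self.sample_fn()                      # e.g. run DDPM sampling
            fig, axs = plt.subplots(4, 4, figsize=(4, 4))
            for ax, im in zip(axs.flat, samples):
                ax.imshow(im); ax.axis("off")
            wandb.log({"samples": wandb.Image(fig)})        # log the figure as an image
            plt.close(fig)
        wandb.log(d)
        print(d)                                            # keep the in-notebook printout

    def after_batch(self, learn):
        super().after_batch(learn)
        wandb.log({"train/loss_batch": learn.loss.item()})  # per-batch loss
```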
Every batch, I'm going to log the loss. After every epoch, the default MetricsCB callback is going to accumulate the metrics and so on, and then it's going to call this underscore log function. So I chose to modify that to say: I'm going to log my training loss, I'm going to log my validation loss if I'm doing validation, and I'd like to log some samples. And Weights and Biases is quite flexible in terms of what you can log. You can create images or videos or audio or whatever. But it also takes a matplotlib figure. And so I'm generating some samples and plotting them with the true image and passing back that matplotlib figure, which I can then log. And that becomes these pretty pictures that you can see over time: every time that log function runs, which is after every epoch, you can go in and see what the images look like. We could maybe make your code even simpler in the future: if show_images had an optional return_fig parameter that returns the figure, then we could replace those four lines of code with one, I suspect. Yeah. And I mean, I just sort of threw this together, it's quite early still. You could also, and this is what I've done in the past, just create a PIL image, where you can make a grid or overlay text or whatever else you'd like, and then just log that as a wandb.Image. Otherwise, apart from that, I'm just passing in this callback as an extra callback to my set of callbacks for the learner, instead of a metrics callback. And so when I call fit, I still get my little progress bar. I still get this printed-out version, because my log function still also prints those metrics, just for debugging. But instead of having to watch the progress in the notebook, I can set this running, disconnect from the server, go have dinner, and then I can check on my phone or whatever: what do the samples look like? Okay, cool, they're starting to look like less than random nonsense, but still not necessarily recognizable. Maybe we need to train for longer. That can be the next experiment. What I should probably do next is get some proper metrics, but Jeremy is going to talk about that. So for now, that's pretty much all I had to show, which is just to say, yeah, as you move to these longer, 10-minute, one-hour, 10-hour experiments, it's worth setting up a bit of infrastructure for yourself, so that you know what settings you used, and maybe you're saving the model so you have the artifact as a result. And, yeah, I like the Weights and Biases approach, but there are lots of others. The main thing is that you're doing something, rather than just creating endless different versions of your notebook. I love it. One thing I was going to note, that I don't know if many people know, is that Weights and Biases can also save the exact code that you used for that run. So if you make any changes to your code and then you don't know which version of your code you used for a particular experiment, you can figure out exactly what code you used. So it's all completely reproducible. And so I love Weights and Biases and all these different features it has. I use Weights and Biases all the time for my own research, almost daily. I had a run going just last night and checked on it this morning. So I use it all the time for my own research. And, yeah, I use it especially to just know, like, oh, this run had this particular config. And then the models go straight into Weights and Biases.
And then if I want to run a model on the test set, I literally take it off of Weights and Biases, like, download it from Weights and Biases, and run it on the test set. So I use it all the time. Just having the ability to have everything reproducible and know exactly what you're doing is very convenient, instead of having to manually track it in some sort of big Excel sheet or some sort of journal or something like that. This is often a lot more convenient, I feel. So, yeah. Lest we get into too much shilling for Weights and Biases, I'm going to put a slightly alternative point of view, which is that I don't use it, or any experiment tracking framework, myself. Which is not to say maybe I couldn't get some benefits by doing so, but I fairly intentionally don't, because I don't want to make it easy for myself to try a thousand different hyperparameters or do kind of ill-directed, you know, sampling of things. I like to be very directed, you know. And so that's kind of the workflow I'm looking for: one that allows that to happen. I'm constantly going back and refactoring and thinking, what did I learn and how do I change things from here, and never doing like 17 learning rates and six architectures and whatever. Now, obviously, that's not something that Jono is doing at the moment, but it would be easy for him to, you know, if he wanted to, write a little script that just does 100 runs with different models and different settings. Yeah. And then I can look at my Weights and Biases runs and just filter by the best loss. Yeah. Which is very tempting. So I would say to people: yeah, definitely be aware that these tools exist. And I definitely agree that, as of when we're recording this, which is early 2023, Weights and Biases is by far the best one I've seen. It has by far the best integration with fastai, and as of today, if Jono's pushed it yet, it has by far the best integration with miniai. I think also fastai is the best library for using with Weights and Biases; it works in both directions. So, yeah. It's there, consider using it, but also consider not going crazy on experiments. Because, you know, I think experiments have their place, clearly, but carefully thought-out hypotheses, testing them, changing your code: that's overall the approach that I think is best. Well, thank you, Jono. I think that's awesome. I've got some fun stuff to share as well, or at least I think it's fun. And what I wanted to share is, well, first of all I should say, we had all said that we were going to look at UNets this week. We are not going to look at UNets this week, but we have a good reason, which is that we had said we're going to go from foundations to Stable Diffusion. Well, that was also a lie, because we're actually going beyond Stable Diffusion. And so we're actually going to start showing today some new research directions. I'm going to describe the process that I'm using at the moment to investigate some new research directions. And we're also going to be looking at some other people's research directions that have gone beyond Stable Diffusion over the past few months. So we will get to UNets, but we haven't quite finished, as it turns out, the training and sampling yet. Now, one challenge that I was having as I started experimenting with new things was that I started getting to the point where actually the generated images looked pretty good.
And it felt, you know, almost like being a parent: each time a new set of images would come out, I would want to convince myself that these were the most beautiful. And so, yeah, when they're crap, it's obvious they're crap, you know, but when they're starting to look pretty good, it's very easy to convince yourself you're improving. So I wanted a metric which could tell me how good they were. Now, unfortunately, there is no such metric. There's no metric that actually says, would these images look to a human being like pictures of clothes? Because only asking a person can do that. But there are some metrics which give you an approximation of that. And as it turns out, these metrics are not actually a replacement for human beings looking at things. But they're a useful addition, and I certainly found them useful. So I'm going to show you the two most common, well, there's really the one most common metric, which is called FID, and I'm going to show another one called KID. So let me describe and show how they work. And I'm going to demonstrate them using the model we trained in the last lesson, which was in the DDPM v2 notebook. You might remember we trained one with mixed precision and we saved it as fashion DDPM MP, for mixed precision. Okay, so this is all the usual imports and stuff. This is all the usual stuff. But there's a slight difference this time, which is that we're going to try to get the FID for a model we've already trained. So basically, to get the model we've already trained, we can just torch.load it and then .cuda() to pop it on the GPU. So I'm going to call that the smodel, which is the model for samples, the samples model. And this is just a copied and pasted DDPM sampler from last time. So that's for our sampling: we're going to do sampling from that model. And once we've sampled from the model, we're then going to try and calculate this score called the FID. Now, what the FID is going to do is, it's not going to say how good these images are. It's going to say how similar they are to real images. And the way we're going to do that is we're going to actually look, for the images that we generated in these samples, at some statistics of some of the activations. So what we're going to do is, we've generated these samples, and we're going to create a new DataLoaders, which contains no training batches and contains one validation batch, which contains the samples. It doesn't actually matter what the dependent variable is, so I just put in the same dependent variable that we already had. And then what we're going to do is use that to extract some features from a model. Now, what do we mean by that? If you remember back to notebook 14, we created this thing called summary, and summary shows us, at different blocks of our model, the various output shapes. In this case, it's a batch size of 1024. And so after the first block, we had 16 channels, 28 by 28, and then we had 32 channels, 14 by 14, and so forth, until just before the final linear layer, we had the 1024 batch items and 512 channels with no height and width. Now, the idea of FID and KID is that the distribution of these 512 channels for a real image has a particular kind of signature, right? It looks a particular way. And so what we're going to do is we're going to take our samples.
We're going to run them through a model that's learnt to predict fashion classes, and we're going to grab this layer, right? And then we're going to average it across the batch, right, to get 512 numbers, and those are going to represent the mean of each of those channels. So those channels might represent, for example: does it have a pointed collar? Does it have, you know, smooth fabric? Does it have sharp heels? And so forth, right? And you could recognise that something's probably not a normal fashion image if it says, oh yes, it's got sharp heels and flowing fabric. It's like, oh, that doesn't sound like anything we recognise, right? So there are certain sets of means of these activations that don't make sense. So that's... This is a metric for... It's not a metric for an individual image, necessarily, but it's across a whole lot of images. So if I generate a bunch of fashion images and I want to say, does this look like a bunch of fashion images? If I look at the means, like, maybe X percent have this feature and X percent have that feature. So I'm looking at those means as, like, comparing the distribution within all these images I generated: do roughly the same amount have sharp collars as in the real data? Yeah, that's a very good point, too. Yeah. And it's actually going to get even more sophisticated than that. But let's just start at that level, which is these feature means. So the basic idea here is that we're going to take our samples and we're going to pass them through a pre-trained model that has learned to predict what type of fashion something is. And of course, we trained some of those earlier. Specifically, we trained a nice 20-epoch one in the data augmentation section, which had 94.3% accuracy. And so if we pass our samples through this model, we would expect to get some useful features. One thing that I found made this a bit complicated, though, is that this model was trained using data that had gone through this transformation of subtracting the mean and dividing by the standard deviation. And that's not what we're creating in our samples. And so, generally speaking, samples in most of these kinds of diffusion models tend to be between negative one and one. So I actually added a new section to the very bottom of this notebook, which simply replaces the transform with something that goes from negative one to one, and just creates those data loaders and then trains something that can classify fashion, and I saved this as not data_aug, but data_aug2. So this is just exactly the same as before, but it's a fashion classifier where the inputs are expected to be between minus one and one. Having said that, it turns out that our images, our samples, are not between minus one and one. Actually, if you go back and you look at DDPM v2, we just used ToTensor, and that actually makes images that are between zero and one. So actually, that's a bug. Okay, so our images have a bug, which is they go between zero and one. So we'll look at fixing that in a moment, but for now, we're just trying to get the FID of our existing model. So let's do that. So what we need to do is we need to take the output of our model and multiply it by two, so that'll be between zero and two, and subtract one. So that'll change our samples to be between minus one and one, and we can now pass them through our pre-trained fashion classifier. Okay, so now, how do we get the output of that pooling layer? Because that's actually what we want, to remind you.
We want the output of this layer. So just to kind of flex our PyTorch muscles, I'm going to show a couple of ways to do it. So we're going to load the model I just trained, the data_aug2 model. And what we could do, of course, is use a hook, and we have a hooks callback, so we could just create a function which just grabs the output. So very straightforward. Okay, so that's what we want: we want the output. And specifically, so these are all sequentials, so we can just go through and go 0, 1, 2, 3, 4, 5 to get to the layer that we want. Okay, and so that's the module that we want to hook. So once we've hooked that, we can pass that as a callback, and we can then, well, it's a bit weird calling fit, I suppose, because we're saying train equals false, but we're just basically capturing. This is just to make one batch go through and grab the outputs. So this means now, in our hook, there's going to be a thing called outp, because we put it there. And we can grab, for example, a few of those to have a look. And yep, here we've got a 64 by 512 set of features. Okay, so that's one way we can do it. Another way we could do it is that sequential models are actually what's called, in Python, collections. They have a certain API that they're expected to support. And one of the things a collection can do, like a list, is you can call del to delete something. So we can delete this layer and this layer and be left with just these layers. And once we do that, that means we can just call capture_preds, because now the model doesn't have the last two layers. So we just delete layers 8 and 7 and call capture_preds. And one nice thing about this is it's going to give us the entire 10,000 images in the test set. So that's what I ended up deciding to do. There are lots of other ways I played around with which worked, but I decided to show these two as being two pretty good techniques. Okay, so now we've got what the real images look like at the end of the pooling layer. So now we need to do the same for our samples. So we'll load up our DDPM MP model, we'll call sample, let's just grab 256 images for now, make them go between minus one and one, and make sure they look okay. And, as I described before, create data loaders where the validation set just has one batch, which contains our samples, and call capture_preds. Okay, so that's going to give us our features, and the reason why is because we're passing the samples to the model, and the model is the classifier, which we've deleted the last two layers from. So that's going to give us our 256 by 512. So now we can get the means. Now, that's not really enough to tell us whether something looks like real images. So maybe I should draw here. So we started out with our batch of 256 samples and our 512 channels, and we squished them by taking their mean. Sorry, wrong way around: we squished them this way, so we're left with 512, because this is the mean for each channel. Okay, and we did exactly the same thing for the much bigger, you know, full set of real images. So this is our samples, and this is our real data. And when we squish that, there are 10,000 of them by 512, and we again get 512. So we could now compare these two. Right, but you could absolutely have some samples that don't look anything like real images but have similar averages for each channel.
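Before moving on, here is a compact, self-contained sketch of the two feature-grabbing tricks just described: hooking the pooled-output layer, and deleting the head of an nn.Sequential so the model returns the features directly. The classifier here is a tiny stand-in, not the real data_aug2 model, and the layer indices are illustrative.

```python
import torch
import torch.nn as nn

clf = nn.Sequential(                        # toy stand-in for the pretrained fashion classifier
    nn.Conv2d(1, 8, 3, 2, 1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, 2, 1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # <- the pooled features we want
    nn.Linear(16, 10))

x = torch.randn(64, 1, 28, 28)

# Option 1: a forward hook on the flatten layer that stashes its output
feats = {}
def hook(m, inp, outp): feats["pooled"] = outp.detach()
h = clf[5].register_forward_hook(hook)
clf(x)
h.remove()
print(feats["pooled"].shape)                # torch.Size([64, 16])

# Option 2: nn.Sequential supports del, so just drop the head and call the model
del clf[6]                                  # remove the final linear layer
print(clf(x).shape)                         # torch.Size([64, 16]), pooled features directly
```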
So we do a second thing, which is we create a covariance matrix. Now, if you've forgotten what this is, you should go back to our previous lesson where we looked at it, but just to remind you: a covariance matrix, in this case, we do it across the channels, so it's going to be 512 by 512. It takes each of these columns, and in each cell, so here's cell (1,1), it basically says: what's the difference between each element in this column and the mean of the whole column, multiplied by exactly the same thing for a different column. Now, on the diagonal, it's the same column twice, so the entries on the diagonal are the variance. But more interestingly, the ones off the diagonal, like here, are actually saying: what's the relationship between column 1 and column 2? So if column 1 and column 2 are uncorrelated, then this would be 0. If they were identical, then it would be the same as the variance here. So it's: how correlated are they? Why is this interesting? Well, if we do exactly the same thing for the reals, that's going to give us another 512 by 512. And it's going to say things like, so let's say this first column was kind of like "has pointy heels", and the second one might be "has flowing fabric". And this is where we say, okay, generally speaking, you would expect these to be negatively correlated. So over here in the reals, this is probably going to be negative. Whereas if over here it was, like, zero, or even worse, positive, it'd be like, oh, those are probably not real, because it's very unlikely you're going to have images where pointy heels are positively associated with flowing fabric. So we're basically looking for two datasets where their covariance matrices are kind of the same and their means are also kind of the same. All right. So there are ways of comparing these, basically comparing two sets of data to say, are they from the same distribution? And you can broadly think of it as being like: do they have pretty similar covariance matrices, and do they have pretty similar mean vectors? And this is basically what the Fréchet Inception Distance does. Does that make sense so far, guys? Yes. What's striking me now is the similarity to when we were talking about the style loss and those kinds of things. The types of features that occur together, without worrying about which items of the data they occur in. The Gram matrices or whatever. Exactly. Now, the particular way of comparing: so I've got the means and I've got the covariance for my samples, and I've actually just created this little calc_stats function. I'm showing you how I build things, not just the things that I built. I always create things step by step and check their shapes, then I copy the cells, merge them, and turn them into functions. So here's something that gets the means and the covariance matrix. And then I basically call that, both for my sample features and for the features of the actual dataset, or the test set of the dataset.
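A sketch of what that calc_stats idea looks like: squish an (N, 512) feature matrix down to a 512-vector of per-channel means plus a 512 by 512 covariance matrix. The function name follows the transcript; the rest is illustrative.

```python
import torch

def calc_stats(feats):
    feats = feats.squeeze()
    # means: (512,), covariance across channels: (512, 512)
    return feats.mean(0), feats.T.cov()

feats = torch.randn(256, 512)        # e.g. pooled features for 256 samples
means, covs = calc_stats(feats)
print(means.shape, covs.shape)       # torch.Size([512]) torch.Size([512, 512])
```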
Now what I'd do with those stats is calculate this thing called the Fréchet Inception Distance, which is here. And basically what happens is we multiply together the two covariance matrices, and that's going to make them bigger, so we then need to basically scale that back down again. Now, if we were working with non-matrices, if you multiply two things together, then to bring it back down to the original scale you could take the square root, right? Particularly if it was a thing multiplied by itself: you take the square root and you get back to the original. And so we need to do exactly the same thing to re-normalize these matrices. The problem is that we've got matrices, and we need to take the matrix square root. Now, the matrix square root, you might not have come across this before, but it exists, and it's the thing where the matrix square root of a matrix A, times itself, is A. Now I'm going to cheat slightly, because we've used the float square root before and we did not re-implement it from scratch, because it's in the Python standard library and also it wouldn't be particularly interesting. Basically, the way you can calculate the float square root from scratch, well, there are lots of ways, but the classic way that you might have done in high school is to use Newton's method, which is where you can basically solve: if you're trying to calculate A equals root x, then you're saying A squared equals x, which means A squared minus x equals 0, and that's an equation that you can solve, by basically taking the derivative and taking a step along the derivative a bunch of times. You can basically do the same thing to calculate the matrix square root, and so here it is. It's Newton's method, but because it's for matrices it's slightly more complicated, so it's the Newton-Schulz method, and I'm not going to go through it; it's basically the same idea. You go through up to 100 iterations, and you basically do something like travelling along that kind of derivative, and then you say, okay, well, the result times itself ought to equal the original matrix, so let's subtract the matrix times itself and see whether the absolute value of the difference is small. Okay, so that's basically how we do a matrix square root. So now that we have, strictly speaking, implemented it from scratch, we're allowed to use the one that already exists. PyTorch doesn't have one, sadly, so we have to use the one from scipy, scipy.linalg. So this is basically going to give us a measure of similarity between the two covariance matrices, and then here's the measure of similarity between the two mean vectors, which is just the sum of squared errors. And then, basically for reasons that are interesting but it's just normalization, we use what's called the trace, which is the sum of the diagonal elements: we add the traces of the two covariance matrices and subtract two times the trace of this matrix square root thing. And that's called the Fréchet Inception Distance. So, a bit hand-wavy on the math, because I don't think it's particularly relevant to anything, but you can see a number which represents how similar this is for the samples compared to some real data.
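Here is a small sketch of that Fréchet distance between two sets of stats, using the standard formula and scipy's matrix square root in place of the Newton-Schulz version used in the notebook. The function name mirrors the transcript; everything else is illustrative.

```python
import numpy as np
import scipy.linalg
import torch

def calc_fid(stats1, stats2):
    m1, c1 = stats1
    m2, c2 = stats2
    c1, c2 = c1.numpy(), c2.numpy()
    # matrix square root of the product of the two covariance matrices
    covmean = scipy.linalg.sqrtm(c1 @ c2, disp=False)[0].real
    mean_term = ((m1 - m2) ** 2).sum().item()          # squared distance between the means
    trace_term = np.trace(c1) + np.trace(c2) - 2 * np.trace(covmean)
    return mean_term + trace_term

s1 = (torch.randn(512), torch.eye(512))
s2 = (torch.randn(512), torch.eye(512))
print(calc_fid(s1, s2))     # identical covariances, so only the mean term contributes
```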
Now, it's weird that it's called the Fréchet Inception Distance when we've done nothing to do with Inception. Well, the reason why is that people do not normally use a custom classifier trained on their own data, like our data_aug2 model. They normally use a more famous model, the Inception model, which was an ImageNet-winning model from Google from a few years ago. There's no reason whatsoever that Inception is a good model to use for this. It just happens to be the one which the original paper used, and as a result everybody now uses it, not because they are sheep, but because you want to be able to compare your results with other people's results. But perhaps we actually don't. Actually, we want to compare our results with our own other results, and we're going to get a much more accurate metric if we use a model that's good specifically at recognizing fashion. So that's why we're using this. Very, very few people bother to do this; most people just pip install pytorch-fid, or whatever it's called, and use Inception. But it's actually better, unless you're comparing to papers, to use a model that you've trained on your data and that you know is good at it. So I guess this is not a FID, it's a... well, maybe the F now stands for Fashion MNIST, I don't know what it stands for. Something I wanted to bring up: two other caveats of FID, especially in papers. The first is that FID is dependent on the number of samples that you use. It's more accurate if you use more samples and less accurate if you use fewer samples. It's actually biased, so if you use fewer samples it's too high, specifically. Yeah. So in papers you'll see them report how many samples they used, and so even comparing to other papers, and comparing between different models and different things, you want to make sure that you're comparing with the same number of samples. Otherwise, you know, it might just be high because they used fewer samples or something like this. So you want to make sure that's comparable. And then the other thing, which is kind of a side effect of using the Inception network in these papers, is the fact that it works at size 299 by 299, which is the size that the Inception model was trained at. So when you're applying this Inception network for measuring this distance, you're going to be resizing your images to 299 by 299, which in different cases may not make much sense. So in our case, we're working with 32 by 32 or 28 by 28 images. These are very small images, and we resize them up to 299. Or, in other cases, and this is now kind of an issue with some of these latest models, you have these large 512 by 512 or 1024 by 1024 images, and then you're shrinking these images to 299 by 299, and you're losing a lot of the detail and quality in those images. So it's actually become a problem with some of these latest papers: when you look at the FID scores and how they're comparing them, and then visually when you see them, you can notice, oh yeah, these are much better images, but the FID score doesn't capture that as well, because you're actually using these much smaller images. So there are a bunch of different caveats. FID, you know, it's nice and simple for this sort of comparison, but you have to be aware of all these different caveats of this metric as well. Excellent segue, because we're going to look at exactly those two things right now. And in fact, there is a metric that compares the two distributions in a way that is not biased, so it's not necessarily higher or lower if you use more or fewer samples, and it's called the KID, the Kernel Inception Distance. It's actually significantly simpler to calculate than the Fréchet Inception Distance. Basically what you do is you create a bunch of groups, a bunch of partitions, and you go through each of those partitions and you grab a few of your x's at a time and a few of your y's at a time, and then you calculate something called the MMD, the maximum mean discrepancy, which is here. Again, the details don't really matter: we basically do a matrix product, and we actually take the cube of it (this K is the kernel), and we do that for the first sample compared to itself, the second compared to itself, and the first compared to the second.
We then normalize them in various ways, add the two within-set terms together and subtract the cross term. And this one actually does not use the stats, it doesn't use the means and covariance matrices; it uses the features directly. And the actual final result is basically the mean of this calculated across different little batches. Again, the math doesn't really matter as to exactly why all these are exactly what they are, but it's going to give you, again, a measure of the similarity of these two distributions. At first I was confused as to why more people weren't using this, because people don't tend to use it, and FID has this nasty bias problem. And now that I've been using it for a while, I know why, which is that it has a very high variance, which means when I call it multiple times, with just samples with different random seeds, I get very different values. And so I actually haven't found it useful at all. So we're left in a situation where, yeah, we don't actually have a good unbiased metric, and I think that's the truth of where we are. And even if we did, all it would tell you is how similar the distributions are to each other; it doesn't actually tell you whether they look any good, really. So that's why pretty much all good papers have a section on human testing. But I've definitely found this fairly useful for comparing fashion images, particularly because humans are good at looking at, say, faces at a reasonably high resolution and noticing, oh, that eye looks kind of weird, but we're not good at looking at 28 by 28 fashion images. So it's particularly helpful for stuff that our brains aren't well suited to. So I basically wrapped this up into a class which I call ImageEval, for evaluating images. And what you're going to do is pass in a pre-trained classifier model and your DataLoaders, which is the thing that we're going to use to calculate the stats for the real images, so that's going to be, you know, the data loaders that were in this learner, the real images. And what it's going to do in this class, and again, this is just copying and pasting the previous lines of code and putting them into a class, is call capture_preds to get our features for the real images, and then we can also calculate the stats for the real images. And so then we can call fid, which calls calc_fid, the thing we already had, passing in the stats for the real images and the calculated stats for the features from our samples, where the features are the thing that we've seen before: we pass in our samples (any random y value is fine, so I just have a single tensor there) and call capture_preds. So we can now create an ImageEval object, passing in our classifier, passing in our data loaders with the real data, and any other callbacks you want. And if we call fid, it takes about a quarter of a second, and 33.9 is the FID for our samples. Okay, then KID. KID is going to be on a very different scale: it's only 0.05, so KID is generally much smaller than FID. So I'm mainly going to be looking at FID.
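For reference, here is a sketch of the KID/MMD calculation Jeremy described a moment ago, using the standard cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 and an unbiased MMD estimate averaged over several random partitions. This follows the usual KID formulation; the notebook's exact grouping and names may differ.

```python
import torch

def poly_kernel(a, b):
    d = a.shape[1]
    return (a @ b.T / d + 1) ** 3            # cubic polynomial kernel

def mmd2(x, y):
    kxx, kyy, kxy = poly_kernel(x, x), poly_kernel(y, y), poly_kernel(x, y)
    m, n = len(x), len(y)
    # drop the diagonal for the unbiased within-set terms
    term_x = (kxx.sum() - kxx.diag().sum()) / (m * (m - 1))
    term_y = (kyy.sum() - kyy.diag().sum()) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

def calc_kid(feats1, feats2, n_blocks=10, block_size=100):
    # average the MMD estimate over several random partitions of the features
    vals = []
    for _ in range(n_blocks):
        i = torch.randperm(len(feats1))[:block_size]
        j = torch.randperm(len(feats2))[:block_size]
        vals.append(mmd2(feats1[i], feats2[j]))
    return torch.stack(vals).mean()

print(calc_kid(torch.randn(1000, 512), torch.randn(500, 512)))
```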
And so here's what happens if we call FID on sample 0, and then sample 50, and then sample 100, and so forth, all the way up to 900, and then we also do samples 975, 990 and 999. And you can see that over time our samples' FIDs improved, so that's a good little test. There's something curious about the fact that they stopped improving about here, so that's interesting. I have not seen anybody plot this graph before. I don't know if Jono or Tanishk, if you guys have, but I feel like it's something people should be looking at, because it's really telling you: is your sampling making consistent improvements? And to clarify, this is like the predicted denoised sample at the different stages during sampling, right? Yes, exactly. If I was to stop now and just go straight to the predicted x0, what would the FID be? So I just wanted to check our samples. Yeah, we save the x0 hat at each time step. Yep, yep, exactly. Same for KID, and I was hoping that they would look the same, and they do, so that's encouraging, that KID and FID are basically measuring the same thing. And then something else that I haven't seen people do, but I think it's a very good idea, is to take the FID of an actual batch of data. Okay, and so that tells us how good we could get. Now, that's a bit unfair because the sizes are different: our data is 512, our sample is 256. But anyway, it's a pretty huge difference. And then, yeah, the second thing that Tanishk talked about, which I thought I'd actually show, is what it takes to get a "real" FID using this Inception network. I didn't particularly feel like re-implementing the Inception network, so I guess I'm cheating here: I'm just going to grab it from pytorch-fid. But there's absolutely no reason to study the Inception network, because it's totally obsolete at this point. And as Tanishk mentioned, it wants 299 by 299 images, which you can actually just ask it to resize for you via resize_input. It also expects three-channel images, so what I did is I created a wrapper for an Inception v3 model that, when you call forward, takes your batch and replicates the channel three times, so that's basically creating a three-channel version of a black and white image just by replicating it three times. And again, this is good flexing of your PyTorch muscles: try to make sure you can replicate this, that you can get an Inception model working on your Fashion MNIST samples. And yeah, then from there we can just pass that to our ImageEval instead. And so on our samples that gives 63.8, and on a real batch of data it gets 27.9. And I find this a good sign that this is much less discriminative than our real Fashion MNIST classifier, because that's only a difference of a ratio of three or so, whereas our FID for real data using the fashion classifier was 6.6. I think that's pretty encouraging, you know. Yeah, so that is that, and we now have a FID, more specifically, we now have an ImageEval class. Did you guys have any questions or comments about that before we keep going? No, just that pretty much every other FID you'll see reported is going to be, you know, set up for CIFAR-10: tiny 32 by 32 pixels resized up to 299 and fed to Inception, which was trained on ImageNet, not CIFAR, you know. But bear in mind that, once again, this is a slightly weird metric, and even things like the image resizing algorithms, like PyTorch versus TensorFlow, might be slightly different, or, you know, if you saved your images as JPEGs and then reloaded them, your FID might be twice as bad. Yeah, it makes a big difference. So the takeaway from all of this that I get is that it's really only useful when everything's the same.
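Going back to the Inception wrapper described just above, a sketch of it might look like the following, using the InceptionV3 feature extractor from pytorch-fid. The class name and the resize_input behaviour are as I recall that library; treat the exact API details as an assumption.

```python
import torch
import torch.nn as nn
from pytorch_fid.inception import InceptionV3     # pip install pytorch-fid

class IncepWrap(nn.Module):
    def __init__(self):
        super().__init__()
        self.m = InceptionV3(resize_input=True)    # resizes inputs to 299x299 internally
    def forward(self, x):
        x = x.repeat(1, 3, 1, 1)                   # replicate 1 channel -> 3 channels
        return self.m(x)[0]                        # InceptionV3 returns a list of feature maps

feats = IncepWrap()(torch.rand(8, 1, 28, 28))
print(feats.shape)                                 # roughly (8, 2048, 1, 1) pooled features
```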
Like, using the same backbone model, the same approach, and the same number of samples, then you can compare apples to apples. But, yeah, for one set of experiments a FID of 30 might be good, because of the way everything's set up, and for another that might be terrible. So comparing to a paper or whatever isn't that easy. So maybe the approach is that if you're doing your own experiments, these sorts of metrics are good, but if you're going to compare to other models, it's best to rely on human studies. Yeah, I think that's kind of the approach or mindset that we should be having when it comes to this. Yeah, or both, you know. But yeah, so we're going to see this is going to be very useful for us, and we're just going to use the same setup pretty much all the time: we're going to use the same set number of samples, and we're going to use the same Fashion MNIST specific classifier. So the first thing I wanted to do was fix our bug. And to remind you, the bug was that we were feeding into our UNet, in DDPM v2 and the original DDPM notebook, images that go from 0 to 1. And, yeah, that's wrong. Like, nobody does that; everybody feeds in images that are from minus 1 to 1. So that's very easy to fix. Just to ask, why is that a bug? Why is it a bug? I mean, everybody knows it's a bug because that's what everybody does. I've never seen anybody do anything else, and it's very easy to fix. So I fixed it by adding this to DDPM v2, and I re-ran it, and it didn't work. It made it worse. And this was the start of a few horrible days of pain, because when you fix a bug and it makes things worse, that generally suggests there's some other bug somewhere else that has somehow offset your first bug. And so I basically went back through every other notebook, at every cell, and I did find at least one bug elsewhere, which is that we hadn't been shuffling our training sets the whole time. So I fixed that, but it's got absolutely nothing to do with this. And I ended up going through everything from scratch three times, rerunning everything three times, checking every intermediate output three times. So days of depressing and annoying work, and no progress at all. At which point I then asked Jono's question to myself more carefully, and provided a less flippant response to myself, which was: well, I don't know why everybody does this, actually. So I asked Tanishk and Jono, and I was like, have you guys seen any math, papers, whatever, that this particular input range is based on? And yeah, you guys were both like, no, I haven't, it's just what everybody does. So at that point it raised the possibility that, okay, maybe what everybody does is not the right thing to do. And is there any reason to believe it is the right thing to do? Given that it seemed like fixing the bug made things worse, maybe not. But then it's like, well, okay, we are pretty confident, from everything we've learned and discussed, that having centered data is better than uncentered data. So having data that goes from 0 to 1 clearly seems weird. So maybe the issue is not that we've changed the center, but that we've scaled it down: rather than having a range of 2, it's got a range of 1. So at that point I did something very simple, which was this: I subtracted 0.5, so now, rather than going from 0 to 1, it goes from minus 0.5 to 0.5.
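The fix itself is a one-liner in the dataset transform: shift ToTensor's [0, 1] output down by 0.5 so the inputs are centred at zero with a range of 1. A sketch, with an assumed column name:

```python
import torchvision.transforms.functional as TF

def transform(b):
    # was: TF.to_tensor(o)          -> values in [0, 1]
    # now: TF.to_tensor(o) - 0.5    -> values in [-0.5, 0.5], centred, range of 1
    b["image"] = [TF.to_tensor(o) - 0.5 for o in b["image"]]
    return b
```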
And so the theory here then was, okay, if our hypothesis is correct, which is that the negative 1 to 1 range has no foundational reason for being, and we've accidentally hit on something, which is that a range of 1 is better than a range of 2, then this should be better still, because this is a range of 1 and it's centered properly. And so this is DDPM v3, and I ran that, and yes, it appeared to be better. And this is great, because now that I've got FID, I was able to run FID on DDPM v2 and on DDPM v3, and it was dramatically, dramatically better. And in fact, I was running a lot of other experiments at the time, which we'll talk about soon, and all of my experiments had totally fallen apart when I "fixed" the bug, and once I did this, all the things that I thought weren't working suddenly started working. So this is often the case, I guess: bugs can highlight accidental discoveries, and the trick is always to be careful enough to recognize when that's happened. Some people might remember the story that this is how the noble gases were discovered: a chemistry experiment went wrong and left behind some strange bubbles at the bottom of the test tube, and most people would just be like, whoops, bubbles. But people who were careful enough actually went, no, there shouldn't be bubbles there, let's test them carefully. It's like, they don't react. Again, most people would be like, oh, that didn't work, the reaction failed. But if you're really careful, you'll be like, oh, maybe the fact they don't react is the interesting thing. So, yes, being careful. To be fair though, Jeremy, when you say things like "it didn't work" or "it was worse": when you first showed us this, you kind of said the images looked fine, the FID was slightly worse but it was okay, and if you trained it longer it eventually got better. There were some things where sampling occasionally went wrong, one image in 100 or something like that, but it wasn't like everything completely fell apart. The metrics were slightly worse than expected. And if you were doing the run-and-gun, try-a-bunch-of-things approach, it's like, oh well, I'll just double my training time, a few runs go by, you look at the Weights and Biases stats later and, oh, that seems like it's better now, we just needed to train for longer, and we have infinite GPUs and lots of money, so who would notice this? So the fact that you picked up on it showed that you had this deep intuition for where it should be at this stage in training versus where it was, and what the samples looked like, and you had the FID as well to say, okay, I would have expected a FID of 9 and I'm getting 14, what's up here? And that was enough to start asking these questions and to dig down to find where this came from. I mean, definitely, I drive the people I work with crazy. I don't know why you guys haven't gone crazy yet, but it's this kind of: no, I need to know exactly why, you know, this is not exactly what we expected. But yeah, this is why. When something's mysterious and weird, it means that there's something you didn't understand, and that's an opportunity to learn something new. So that's what we did. And so that was quite exciting, because, yeah, going to minus 0.5 to 0.5 made the FID better still. And I definitely moved from this frame of mind of, you know, total depression, I was so upset, I still remember when I spoke to Jono, I was just so upset, to suddenly, oh my gosh, we're actually onto something. So I started experimenting more, with a bit more confidence at this point, I guess.
And one thing I started looking at was our schedule. You know, we'd always been copying and pasting this standard set of stuff, and I started questioning everything: why is this the standard, why are these numbers here? I don't see any particular reason why those numbers were there, and I thought we should maybe experiment with them. So to make it easier, I created a little function that would return a schedule. Now, you could create a new class for a schedule, but something that's really cool is there's a thing in Python called SimpleNamespace, which is a lot like a struct in C; it basically lets you wrap up a little bunch of keys and values as if it's an object. So I return this little SimpleNamespace, which contains our alphas, our alpha bars and our sigmas, for our normal linear schedule, with the 0.02 beta max that we always use. And then there's another paper which mentions an alternative approach, a cosine schedule, which is where you basically set alpha bar equal to the cosine of t as a fraction of big T, times pi over 2, squared. And if you make that your alpha bar, you can then basically work backwards to calculate what alpha must have been. So we can create a schedule for this cosine schedule as well. And, yeah, this cosine schedule is, I think, pretty widely recognized as being better than the linear schedule, so I thought, okay, it would be interesting to look at how they compare. And in fact, really all that matters is the alpha bar: the alpha bar is the total amount of noise that you're adding. So in DDPM, when we do Noisify, it's alpha bar that we're actually using, the amount of the image, and 1 minus alpha bar, the amount of noise. Exactly, yeah. So I just printed those out for the normal linear schedule and this cosine schedule, and you can really see the linear schedule really sucks badly: it's got a lot of time steps where it's basically about zero, and that's something we can't really do anything with, you know. Whereas the cosine schedule is really nice and smooth, and there aren't many steps which are nearly zero or nearly one. So I was kind of inclined to try using the cosine schedule, but then I thought, well, it would be easy enough to get rid of this big flat bit by just decreasing beta max; that would be another thing we could do. So I tried, oh, sorry, first of all I should mention that the other thing that's really important is the slope of these curves, because that's how much things are stepping during the sampling process. And so here's the slope of the linear and the cosine, and you can see the cosine slope is really nice, right? You have this nice smooth curve, whereas the linear one is just a disaster. So, yeah, if I change beta max to 0.01, that actually gets you nearly the same curve as the cosine. So I thought that was very interesting. It kind of made me think, why on earth does everybody always use 0.02 as a default? And so we actually talked to Robin, who was one of the two lead authors on the Stable Diffusion paper, and we talked about all of these things, and he said, oh yeah, we noticed, not exactly this, but we experimented with everything, and we noticed that when we decreased beta max we got better results. And so actually Stable Diffusion uses a beta max of 0.012. I think that might be a little bit higher than they should have picked, but it's certainly a lot better than the default. So it was interesting talking to Robin, seeing that all of these kinds of experiments and things that we tried out, they had been there as well and noticed the same things. And the input range as well: they have this magical factor of 0.18215 or whatever that they scale the latents by, and if you ask why, they're like, oh yeah, we wanted the latents to be roughly unit range or whatever. But that's also reducing the range of your inputs to a reasonable value, I think.
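Going back to the schedule helper described above, here is a sketch of what returning the alphas, alpha bars and sigmas in a SimpleNamespace might look like, for both the usual linear betas and a cosine alpha bar of cos(t/T * pi/2)^2. The function names, the sigma choice, and the beta_max=0.01 comparison are illustrative assumptions rather than the notebook's exact code.

```python
import math
import torch
from types import SimpleNamespace

def linear_sched(betamin=0.0001, betamax=0.02, n_steps=1000):
    beta = torch.linspace(betamin, betamax, n_steps)
    alpha = 1. - beta
    return SimpleNamespace(a=alpha, abar=alpha.cumprod(dim=0), sig=beta.sqrt())

def abar_cos(t, T):
    return (t / T * math.pi / 2).cos() ** 2        # cosine alpha-bar from the transcript

def cos_sched(n_steps=1000):
    t = torch.linspace(0, n_steps - 1, n_steps)
    abar = abar_cos(t, n_steps)
    alpha = abar / torch.cat([torch.ones(1), abar[:-1]])   # recover per-step alpha from abar
    return SimpleNamespace(a=alpha, abar=abar, sig=(1 - alpha).sqrt())

lin, lin_small, cos = linear_sched(), linear_sched(betamax=0.01), cos_sched()
print(lin.abar[-1], lin_small.abar[-1], cos.abar[-1])      # how much signal remains at t=T
```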
Exactly: we independently discovered this idea, and they had independently discovered it too. Yeah, exactly. We'll be talking more about what's actually going on with that, maybe next lesson. Anyway, so here are the curves as well; they're also pretty close. So at this point I was thinking, well, I'd like to change as little as possible, so I'm going to keep using a linear schedule, but I'm just going to change beta max to 0.01 for my next version of DDPM. So that's what I've got here: a linear schedule with beta max 0.01. And so that I wouldn't really have to change any of my code, I then just put those into the same variable names that I've always used, and the noisify is exactly the same as it always has been, so now I just repeat everything that I've done before. When I show a batch of noised data, I can already see that there are more actually recognisable images, which I think is very encouraging; previously almost all of them had been pure noise, which is not a good sign. So now I just train it exactly the same as DDPM v2, and save this as fashion_ddpm3. Oh, and the other thing I've done here is, since this seemed to be working pretty well, I decided to keep going even further: I actually doubled all of my channels from before, and I also increased the number of epochs, because things were going so well I wanted to see how well they could go. So we've got a bigger model trained for longer, so it takes a few minutes; that's what the 25 here is, the number of epochs. Sampling is exactly the same as it always has been, so I create 512 samples, and here they are, and they definitely look great to me; I'm not sure I could tell whether these are real samples or generated samples. But luckily we know we can test them: we can load up our classifier, delete the last two layers, pass that to ImageEval, and get a FID for our samples, and it's 8. And I chose 512 for a reason, because that's our batch size, so I can compare like with like: the FID for the actual data is 6.6. So this is hugely exciting to me; we've got down to a FID that is nearly as good as real images. So I feel like, in terms of image quality for small unconditional sampling, we're pretty much done. And so at this point I was like, okay, can we make it faster at the same quality? I just wanted to experiment with a few really obvious ideas. In particular, I thought: we're calling this a thousand times, which means we're calling the model a thousand times, and that's slow, and most of the time you just move a tiny bit, so the model output is pretty much the same; the noise being predicted is pretty much the same. So I did something really obvious (well, I thought it was really obvious), which is I decided to only call the model every third time, and maybe also on the last 15 steps to help it fine-tune; I don't know if that's necessary. Other than that it's exactly the same, so now this is basically three times faster, and yeah, the samples look basically the same. The FID is 9.78 versus 8.1, and that's within the normal variance of FID, so I don't know, you'd have to run this a few times or use bigger samples, but this is basically saying, yeah, you probably don't need to call the model a thousand times.
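A rough sketch of that "only call the model sometimes" idea is below. This is not the notebook's exact code: `ddpm_update` stands in for whatever single DDPM denoising step you already have, the model is assumed to take `(x, t)` like the course UNet wrapper, and the cadence (every 3rd step, plus the last 15) just follows the description above:

```python
import torch

@torch.no_grad()
def sample_skip(model, sz, n_steps=1000, call_every=3, always_last=15):
    "DDPM sampling, but reuse the previous noise prediction on most steps."
    x_t = torch.randn(sz)
    noise_pred = None
    for t in reversed(range(n_steps)):
        if noise_pred is None or t % call_every == 0 or t < always_last:
            # only call the (slow) model on some steps; otherwise reuse the last prediction
            noise_pred = model(x_t, torch.full((sz[0],), t))
        x_t = ddpm_update(x_t, noise_pred, t)   # assumed: your existing one-step DDPM update
    return x_t
```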
I did something else slightly weird, which is I basically created a different schedule for how often we call the model: for the first few hundred time steps just call it every 10, then for the next chunk every 9, then every 8, and so forth, and just for the last 100 call it every step. So that makes it even faster, and the samples look good. It's definitely worse now, but it's still not bad. So yeah, I kind of felt like, alright, this is encouraging; and this stuff, before we fixed the minus-one-to-one thing, looked really bad, which is why I was thinking my code was full of bugs. So at this point I'm thinking, okay, we can create extremely high quality samples using DDPM; what's the best paper out there for doing it faster? And the most popular paper for doing it faster is DDIM, so I thought we might switch to this next. So we're now at the point where we're not actually going to retrain our model at all. If you noticed, with these different sampling approaches I didn't retrain the model at all. We're just saying: okay, we've got a model, the model knows how to estimate the noise in an image; how do we use that, by calling it multiple times, to denoise using iterative refinement, as Jono calls it? And DDIM is another way of doing that. So again I'm going to show you how I built my own DDIM from scratch, and I kind of cheated, which is: there's already an existing one in Diffusers, so I decided I'll use that first and make sure everything works, and then I'll try and reimplement it from scratch myself. That's what I like to do when there's an existing thing that works. And it's been really good to have my own DDIM from scratch, because now I can modify it, and I've made the code much more concise than the Diffusers version. So, we had created this class called UNet, which passed the tuple of x's through as individual parameters and returned the .sample, but, not surprisingly, given that this comes from Diffusers and we want to use the Diffusers schedulers, the Diffusers schedulers assume this has not happened: they want the x as a tuple and they expect to find the thing called .sample. So here's something crazy: when we save this thing, it doesn't really know anything about the code; it just knows that it's from a class called UNet. So we can actually lie: we can say, oh yeah, that class called UNet, it's actually the same as UNet2DModel, with no other changes, and Python doesn't know or care. So we can now load up this model, and it's going to use that class. This is where it's useful to understand how Python works behind the scenes; it's a very simple programming language. So we've now got a model which we've trained, but it's just going to use the .sample on it, which means we can use it directly with the Diffusers schedulers. We'll start by repeating what we already know how to do, which is to use the DDPM scheduler, so we have to tell it what beta we used to train. Then we can grab some random data and say, okay, we're going to start at time step 999. So we create a batch of data and then predict the noise, and the way the Diffusers thing works is you call scheduler.step, and that's the thing which does those lines: it calculates x t minus 1 given x t and the predicted noise. That's what scheduler.step does, and that's why you pass in the predicted noise, the time step, and x t, and it gives you a new, slightly less noisy, set of images.
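Here's a minimal sketch of that class-aliasing trick plus one Diffusers DDPM step. The file name, tensor shapes and beta value are illustrative, and the alias only works this simply when the original UNet class was defined at the notebook's top level (which is how torch's pickle will look it up); the Diffusers calls themselves (`UNet2DModel`, `DDPMScheduler`, `.sample`, `.step(...).prev_sample`) are the real API:

```python
import torch
from diffusers import UNet2DModel, DDPMScheduler

# torch.save pickled the whole module, recording only the class *name* "UNet".
# Pointing that name at diffusers' UNet2DModel before loading means the weights
# come back as a plain UNet2DModel, whose forward returns an object with .sample.
UNet = UNet2DModel
model = torch.load('fashion_ddpm3.pkl', map_location='cpu', weights_only=False)

sched = DDPMScheduler(beta_end=0.01)          # match the beta max used in training
x_t = torch.randn(4, 1, 32, 32)               # a batch of pure noise (shape illustrative)
t = 999
noise_pred = model(x_t, torch.tensor([t] * 4)).sample
x_t = sched.step(noise_pred, t, x_t).prev_sample   # one denoising step: x_t -> x_{t-1}
```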
So I ran that as usual, cell by cell first, to make sure I understood how it all worked, and then I copied those cells, merged them together, and chucked them in a loop. So this is now going to go through all the time steps, use a progress bar so we can see how we're going, get the noise prediction, call step, and append. So this is just DDPM, but using Diffusers, and not surprisingly it gives us basically the same very nice results that we got from our own DDPM. And we can now use the same code we've used before to create our image evaluator, and I decided we're now going to go right up to 2048 images at a time; I found that's big enough that the FID is reasonably stable. So we're now down to 3.7 for our FID, whereas the data itself has a FID of 1.9. So again it's showing that our DDPM output is very nearly unrecognisably different from real data, going by its distribution of those activations. So then we can switch to DDIM by just saying DDIMScheduler, and with DDIM you can say: I don't want to do all thousand steps, I just want to do 333 steps, i.e. every third one. So that's basically a bit like this sample-skipping of doing every third step, but DDIM, as we'll see, does it in a smarter way. And here's exactly the same code, basically, as before, but I put it into a little function, so I can pass in my model, I can pass in the scheduler, and then there's a parameter called eta, which is basically how much noise to add; one means just add all the noise. So this is now about three times faster, and yeah, the FID is basically the same, which is encouraging. At 200 steps the FID is basically the same; at 100 steps, okay, the FID is getting worse; and then 50 steps, and then 25 steps. It's interesting, when you get down to 25 steps, what does it look like? You can see that they're kind of too smooth: they don't have interesting fabric textures so much, or logos or patterns, as much as these ones, which have got a lot more texture to them. That's kind of what tends to happen: you can still get something out pretty fast, but that's how they suffer. Okay, so how does DDIM work? Well, DDIM is nice; in my opinion it actually makes things a lot easier than DDPM. There's basically an equation from the paper, which Tanishk will explain shortly, but basically what you do is... I've actually grabbed the sample function from here and split it out into two bits: one bit that handles the time steps, creates that random starting point, loops through, finds what my current alpha bar is, gets the noise prediction, and then basically does the same as sched.step: it calls some function, and that function has been pulled out. So this allows me to create my own different steps. So I created a DDIM step, and basically all I did was take this equation and turn it into code; actually, this one is the second equation from the paper. Now it's a bit confusing, because the notation here is different: what DDPM calls alpha bar, this paper calls alpha, so you've got to look out for that. So basically you'll see I've got here x t, and one minus alpha bar, which we've got to call beta bar, so beta bar square root times the noise; this here is the neural net, so this here is the predicted noise. And here my next x t is... oh sorry, yes, here's my alpha bar t minus 1, square root, times this, and you can see here it says "predicted x naught", so here's my predicted x naught, plus the square root of (beta bar t minus 1, minus sigma squared), again times the noise, which is the same thing as here, and then plus a bit of random noise, which we only add if we're not at the last step.
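Here's a minimal sketch of a DDIM step of that shape, written with the DDPM-style names used in this course (abar for alpha bar). The function and argument names are ours, not necessarily the notebook's; `abar_t` and `abar_t1` are the alpha bars at the current and previous (kept) time steps, and `sig` is the amount of fresh noise to add (0 makes it deterministic):

```python
import torch

def ddim_step(x_t, noise, abar_t, abar_t1, sig):
    "One DDIM update: predict x0, step towards it, optionally add a bit of noise back."
    # predicted x0: rearrange  x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*noise
    x0_hat = (x_t - (1 - abar_t) ** 0.5 * noise) / abar_t ** 0.5
    # "direction pointing to x_t", shrunk to leave room for sig^2 of fresh noise
    dir_xt = (1 - abar_t1 - sig**2) ** 0.5 * noise
    x_t1 = abar_t1 ** 0.5 * x0_hat + dir_xt
    if sig > 0: x_t1 = x_t1 + sig * torch.randn_like(x_t1)   # skip on the last step
    return x_t1
```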
So I can call that, and rather than saying 100 steps, I said skip 10 steps at a time, so it's basically going to be 100 steps, and you can see here this actually happened to do a bit better for my 100 steps; it's not bad at all. And getting to this point has been a bit of a lifesaver, to be honest, because I can now run a batch of 2048 samples, and I can sample them in under a minute, which doesn't feel painful. So I'm now at a point where I can actually get a pretty good measure of how I'm doing in a pretty reasonable amount of time, and I can easily compare. And I've got to admit, between a FID of 5 and 8 and 11, I can't necessarily tell the difference by eye, so for fashion MNIST I think FID is better than my eyes for this, as long as I use a consistent sample size. So yeah, Tanishk, did you want to talk a bit about the ideas of why we do this, or where it comes from, or what the notation means? Can I say a little bit before we do that? What you have there, Jeremy, is a screenshot from the paper, and then code that tries to follow it as closely as possible, and the difference that makes for people is huge. I've got a little research team that I'm doing some contract work with, and the fact that it's called alpha in the DDIM paper and alpha bar elsewhere, and then in the code that they were copying and pasting from it was called a and b for alpha and beta... You can get things kind of working by copying and pasting, and it all just sort of works, but just spending that time to literally take the two screenshots of equations 14 and 16 from the paper, put them in there, and rewrite the code with some comments and things to say "this is what this is, this is that part from the equation"... The look of pain on their faces when I said "by the way, did you notice that it's called alpha there and alpha bar there?"; they were like, "yes, how could they do that?". You could just tell how many hours had been spent grinding through and asking "what's wrong here?". And building this stuff in notebooks, like we're doing with miniai, is such a good idea, because the next engineer to come along and work on it can see the equation right there, and you can add prose and so on. So I think nbdev works particularly well for this kind of development. Yeah. Before I talk about this, in the context of all of these differing notations, I just wanted to mention that I recently created this meme, which I thought was valid: each paper basically has a different notation for diffusion models, they all try to come up with their own universal notation, and it just keeps proliferating. To me it's clear we should all just use one. Yes, exactly, we need to implement diffusion models in APL somehow. Alright, so the paper Jeremy implemented here is the Denoising Diffusion Implicit Models paper, and if you look at the paper, the notation can again be a little bit intimidating, but when we walk through it we'll see it's not too bad actually. So I'll just bring up some of the important equations, and also compare and contrast the notation and equations of DDPM with those of DDIM. Not only is it not too bad, I actually discovered that the DDIM notation and equations are a lot easier to work with than DDPM's, so I've found my life is better since I discovered DDIM.
Yes, I think a lot of people prefer to use DDIM as well. So, let's see where we are. In both DDIM and DDPM we have this same sort of equation; this equation is exactly the same, and it's telling us the predicted denoised image. (Can you see my pointer? Just to confirm, by the way, the little double-headed arrow in the top right: if you click that, do you get more room for us to see what's going on? Oh yeah. Okay, that works much better.) So we have our predicted noise: our model is predicting the noise in the image. It is also passed the time step, but that's just omitted here; it's basically implied along with the x t, but our model also takes in the time step. So it's predicting the noise in this x t, our noisy image, and we are trying to remove that noise; that's what this whole term here does, remove the noise. Because the noise that we're predicting is unit variance noise, we have to scale it appropriately to remove it from our noisy image: we scale the noise and subtract it out of the noisy image, and that's how we get our predicted denoised image. And I think we derived this one before by looking at the equation for x t in the noisify function and rearranging it to solve for x0. Yes, that's basically what this is. So the idea is: instead of starting with x0 and some noise and getting an x t, we're doing the opposite, where we have the noise and we have x t, so how can we get x0? That's what this equation is; that's the predicted x0, our predicted clean image, and this equation is the same for both DDPM and DDIM. But these distributions are what's different between DDPM and DDIM. We have this distribution which tells us: if we have x t, which is our current noisy image, and x0, which is our clean image, then what is some intermediate noisy image in that process? That's x t minus 1, so we have a distribution for that, and it tells us how to get such an image. In the DDPM paper they define some distribution and explain the math behind it, but basically you have a Gaussian distribution with some mean and variance, and it's again a sort of interpolation between your original clean image and your noisy image, and that gives you your intermediate, slightly less noisy image. So given a clean image and a noisy image, it gives you a slightly less noisy image. And so the sampling procedure that we do with DDPM is basically: predict the noise, predict the x0, and then plug it into this distribution to give you your slightly less noisy image.
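For reference, that rearrangement can be written out as follows, in the DDPM-style notation used in the course (alpha bar for the cumulative product, epsilon-theta for the model's noise prediction):

```latex
% the "noisify" forward process:
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
% rearranged to recover the predicted clean image from the predicted noise:
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
```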
So maybe it's worth drawing that out. Let's say, and I'm just making up an example here, I'm showing a one-dimensional example: a point on a curve, so it's one-dimensional but still sitting in a 2D space. Let's say you have a point on this curve that represents an actual image that you want; this is where your distribution of actual images would lie, and that's what you want to estimate. So this algorithm that we've been seeing says: okay, we take some random point, some random point that we choose when we start out, and what we do is we learn this function, the score function, to take us to this manifold. But it's only going to be accurate in some area, so we get an estimate of the score function, and it tells us the direction to move in; it gives us the direction to predict our denoised image. So let's say your score function is actually, in reality, some curve, a curve that points towards your data; that's your score function. And the score function basically means your gradient. Yes, yes, it's a gradient. So we are again doing some form of, in this case I guess you would say gradient ascent, because you're not really minimising anything; you're maximising the likelihood of that data point being a point you want to go towards. So you're doing this sort of gradient ascent process: you're following the gradient to get there. So when we estimate epsilon and predict our noise, what we're doing is getting the score value here, and then we can follow that, and we follow it to some point (I'm kind of exaggerating here), and this point will now represent our x0 hat. And in reality that's maybe not going to be an actual point on the data distribution; it's not going to be a very good estimate of a clean image at the beginning, but it's the only estimate we have at this point, and we have to follow it all the way to some place. So this is where we follow it to, and then we want to find some sort of x t minus 1; that's what our next point is, and that's what our second distribution tells us. It basically takes us back to maybe some point here, and now we can re-estimate the score function, or equivalently do this prediction of noise, and it may be a more accurate estimate of the score function, and maybe we go somewhere here, and then we re-estimate and get another point, and then we follow it. So it's this iterative process where we're trying to follow the score function to our end point, and in order to do so we first have to estimate our x0 hat, then basically add back some noise, get a new estimate, follow it, add back a little bit more noise, and keep estimating. So that's what we're doing here in these two steps: we have our x0 hat, and then we have this distribution, and that's how we do it with regular DDPM. That's maybe where the breaking it up into two steps comes from. I don't think the DDPM paper really clarifies that or talks about it too much, but the DDIM paper really hammers that point home, I think, especially in its update equation. So that's DDPM, but then with DDIM... okay, go ahead. With DDPM, the one thing is that you look at your prediction and use that to make a step, but you also add back some additional noise that's always fixed; there's no parameter to control how much extra noise you add back at each step. Right, exactly.
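As an aside, the connection Tanishk is gesturing at here is usually written as follows in the score-based modelling literature (this is background, not something from the course notebook): the score is the gradient of the log density of the noisy data, and for a noise-predicting diffusion model it is, up to scaling, the negative of the predicted noise:

```latex
s_\theta(x_t, t) \;=\; \nabla_{x_t} \log p(x_t)
\;\approx\; -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
```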
So, let's see: basically you won't be exactly at this point, you'll be in that general vicinity, and adding that noise also helps because you don't want to fall into specific modes, where it's like, oh, this is the most likely data point; you want to add some noise so you can explore other data points as well. So the noise can also help, and that's something you really can't control with DDPM, and it's something that DDIM explores a little bit further, in terms of the noise, and even trying to get rid of the noise altogether. So with the DDIM paper, the main difference is literally this one equation; that's really all it is, in terms of changing this distribution where you predict the less noisy image. And as you can see, you have this additional parameter now, which is sigma, and sigma controls how much noise, like we were just mentioning, is part of this process. You can, for example, set sigma to zero, and then you can see the variance would be zero, and so this becomes a completely deterministic process; if you want, this can be completely deterministic. That's one aspect of it, and that's the reason it's called DDIM and not DDPM: because it's not probabilistic anymore, it can be made deterministic, so the name was changed for that reason. But the other thing is, you would think that you've kind of changed the model altogether with a new distribution, so wouldn't you have to train a different model for this purpose? It turns out the math works out such that the same model objective works with this distribution as well; in fact I think that's what they were setting out from the very beginning, asking what other models we can get with the same objective. And so this is what they're able to do: you can introduce this new parameter, in this case controlling the stochasticity of the model, and you can still use the exact same trained model that you had. What this means is that this is actually just a new sampling algorithm, and not anything new with the training itself; it's just, like we talked about, a new way of sampling the model. So given this equation, you can then rewrite your x t minus 1 term, and again we're doing the same sort of thing, where we split it up into predicting the x0 and then adding back the direction pointing to x t; and also, if you need to add a little bit of noise back in, like Jono was saying, you can do so with this extra term here, and sigma controls that term. And again, like we said, looking at the DDIM equation versus the DDPM equation, you have to be careful that the alphas here are referring to alpha bars in the DDPM notation; that's the other caveat. And if you set this sigma t to a particular value, it gives you back DDPM. So sometimes, instead, they will write it with, as Jeremy mentioned, this eta, where basically sigma is equal to eta times this coefficient. Sorry, let me just go back: you have an eta here, so this is where your eta would go, and if it's one it becomes regular DDPM, and if it's zero, of course, that's the deterministic case. So this is where the eta that all these APIs use comes from; in the code that we have, and the code that Jeremy was showing, they have eta equal to one, which of course they say corresponds to regular DDPM.
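Written out in one place, the DDIM update Tanishk is describing, together with how eta scales sigma, looks like this. To avoid the alpha versus alpha bar trap mentioned above, this uses the DDPM-style alpha bar notation throughout:

```latex
x_{t-1} \;=\; \sqrt{\bar\alpha_{t-1}}\,
  \underbrace{\left(\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted }x_0}
  \;+\; \underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t,t)}_{\text{direction pointing to }x_t}
  \;+\; \sigma_t\, z, \qquad z\sim\mathcal{N}(0,I)

\sigma_t \;=\; \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,
  \sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}
  \qquad (\eta=0:\ \text{deterministic DDIM},\quad \eta=1:\ \text{DDPM-like noise})
```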
And this is actually where the eta goes in the equation. So, finally: you could just pass in sigma, right? If you weren't trying to match anything, you could say, well, we have this parameter sigma that controls the amount of noise, so let's just take sigma as an argument. But for convenience they said, let's create this new thing, eta, where zero means sigma is equal to zero (which, if you look at the equation, works out), and one means we match the amount of noise that's in vanilla DDPM. And that gives you a nice scale, so you could say eta equals 2, or eta equals 0.7, or whatever, but it's a meaningful unit where one equals the same as the previous reference work. Well, it's also convenient because it's sigma t, which is to say that at different time steps (unless you choose eta equals zero, where it doesn't matter) you probably want different amounts of noise, and this is a reasonable way of scaling that noise. Then the last thing of importance, which is of course one of the reasons we were exploring this in the first place, is being able to do this sort of rapid sampling. The basic idea is that you can define a similar distribution, where again the math works out similarly, over some subset of the diffusion steps; in this case it uses a tau variable. For example, if you take a subset of 10 diffusion steps, then tau 1 would just be 0, tau 2 would be 100, and you keep going all the way up to a thousand, so you'd get 10 diffusion steps. That's what they're referring to with this tau variable here. And you can do a similar equation and a similar derivation to show that this distribution again meets the objective you used for training, and you can now use it for faster sampling, where basically all you have to do is select the appropriate alpha bars. (Sorry, this one I've written out so that the alpha bar here is the regular alpha bar we've been talking about; it's a little bit confusing switching between notations.) Basically you have this distribution, and then you just select the appropriate alpha bars, and the math follows through the same way to give you an appropriate sampling process. So I guess that makes accelerated sampling a lot simpler. Are there any other comments you guys had? Well, the key for me is that in this equation we only need one parameter, which is the alpha bar (or alpha, depending on which notation it is), and everything else is calculated from that. So we don't need what DDPM calls alpha or beta anymore, and that's more convenient for doing this kind of smaller number of steps, because we can jump straight from a time step to an alpha bar. And it's also particularly convenient with the cosine schedule, because you can calculate the inverse of the cosine schedule function, which means you can also go from an alpha bar to a t. So it's really easy to say: what would alpha bar be 10 time steps before this one? It's just calling a function; we don't need anything else.
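A tiny sketch of that back-and-forth between t and alpha bar for the cosine schedule; the function names here are ours, not the notebook's, but the math is just the schedule from earlier and its inverse:

```python
import math

# abar(t) = cos(t/T * pi/2)^2 is monotonic, so it's easy to invert:
def abar(t, T):        return math.cos(t / T * math.pi / 2) ** 2
def inv_abar(ab, T):   return math.acos(math.sqrt(ab)) * 2 / math.pi * T

ab = abar(500, 1000)        # alpha-bar at step 500 (= 0.5)
t  = inv_abar(ab, 1000)     # ... and back to (roughly) 500
```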
And actually the original cosine schedule paper has to fuss around with various epsilon-style small numbers that they add to things to avoid getting weird numerical problems, and when we only deal with alpha bar, all that stuff also goes away. So looking at the DDIM code, it's simpler code with fewer parameters than our DDPM code, and of course it's dramatically faster, and it's also more flexible, because we've got this eta thing we can play with. Yes, that's the other thing: this idea of controlling stochasticity is something that's interesting to explore, and we've been exploring it a bit now, and I think we'll continue to, in terms of deterministic versus stochastic sampling. It's worth talking about the sigma in the middle equation you've got there. You've got the sigma t times the random noise, and intuitively it makes sense that if you're adding random noise there, you want to move correspondingly less back towards x t, which is your noisy image. That's why you've got the 1 minus alpha bar t minus 1, minus sigma squared, and you're taking the square root of that; so you're effectively subtracting sigma t from the direction pointing to x t as you add it to the random noise, or vice versa. So everything's there for a reason. Yes. And the predicted x0, that entire equation, we've derived previously, and it remains the same in pretty much any diffusion model methodology. Well, as long as you're predicting the noise; we'll be talking about some places where it's going to change, probably next week, where you're predicting something other than the noise. Yes, if you're predicting the noise, it'll be the same. So I think we'll wrap it up here, so that we leave ourselves plenty of time to cover the new research directions next lesson in more detail. In terms of where we're at: just like we hit a point a few weeks ago where, okay, we can really predict classes for fashion MNIST, I think we're there now for generation. We can do stable-diffusion-style sampling and, except for the unet architecture, for unconditional generation we can now basically do fashion MNIST well enough that it's almost unrecognisably different from the real samples. And DDIM is the scheduler that the original stable diffusion paper used, so we're actually about to go beyond stable diffusion for our sampling and unet training now. So I think we're definitely meeting our stretch goals so far, and all from scratch, with Weights and Biases experiment logging. If you wanted to have some fun, there's no reason you couldn't have a little callback that instead logs things into a SQLite database, and then you could write a little front end to show your experiments; that would be fun as well. Yeah, you could do all sorts of things: send yourself a text message when the loss gets good enough. Alright, well, thanks guys, that was really fun. Thanks, everybody. Alright, bye. Okay, talk to you later then. Goodbye.