All right. Hi, gang. Here we are in lesson 21, joined by the legends themselves, Jono and Tanishk. Hello. Hello. And today, you'll be shocked to hear that we are going to look at a Jupyter notebook. Amazing, right? We're going to look at notebook 22. This is a pretty quick, pretty simple improvement to our DDPM/DDIM implementation for Fashion MNIST. It's all the same so far, except that I've made one quite significant change. Some of the changes we'll be making today are all about making life simpler, and they reflect the way the papers have been taking things. It's interesting to see how the papers have not only made things better, they've made things simpler. One of the things I've noticed in recent papers is that there's no longer a concept of n_steps, which is something we've always had before and which always bothered me a bit, this capital T thing. You know, this t over T: it's basically saying this is timestep number, say, 500 out of 1000. So it's timestep 0.5. Why not just call it 0.5? And the answer is, well, we can. We talked last time about the cosine scheduler. We didn't end up using it because I came up with an idea which was simpler and nearly the same, which is just to change our beta max. But in this next notebook, we're going to say, let's use the cosine scheduler, but let's try to get rid of the n_steps thing and the capital T thing. So here is abar again, and now I've got rid of the capital T. I'm going to assume that your timestep is between 0 and 1, and it basically represents what percentage of the way through the diffusion process you are. So 0 would be all noise and 1 would be... no, sorry, the other way around. 0 would be all clean and 1 would be all noise. So it's how far through the forward diffusion process you are. Other than that, this is exactly the same equation we've already seen.
And I realized something else, which is kind of fun, which is that you can take the inverse of that, so you can calculate t. We would first take the square root, then take the inverse cos, and then multiply by 2 over pi (that is, divide by pi over 2). So it's interesting: now the alpha bar is not something we look up in a list. It's something we calculate with a function from a float. And interestingly, that means we can also calculate t from an alpha bar. So noisify has changed a little. Now when we get the alpha bar for our timestep, we don't look it up, we just call the function. And the timestep is now a random float between 0 and 1. Actually between 0 and 0.999; I'm sure there's a function I could have chosen to generate a float in this range, but I just clamped it because I was lazy and couldn't be bothered looking it up. Other than that, noisify is exactly the same. We're still returning the xt, the timestep, which is now a float, and the noise. The noise is the thing we're going to try to predict, our dependent variable, and this tuple here is our inputs to the model. All right, so here is what that looks like. Now when we look at the input to our U-Net training process, you can see we've got a t of 0.05, so 5% of the way through the forward diffusion process it looks like this, and 65% of the way through it looks like this. So now the timestep, and basically the process, is more of a continuous timestep and a continuous process. Whereas before we had these discrete timesteps, here we get just any random value between 0 and 1. Yeah, it's also more convenient to have a function to call. Yeah, I find it makes life a little bit easier. So the model is the same, the callbacks are the same, the fitting process is the same.
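To make the "function from a float" idea concrete, here is a minimal sketch of the cosine alpha bar and its inverse as described above. It uses plain Python floats rather than the notebook's tensors, and the name `inv_abar` is an assumption chosen for illustration.

```python
import math

def abar(t):
    """Cosine schedule: alpha bar at continuous time t in [0, 1].
    t = 0 is a clean image (alpha bar 1), t = 1 is pure noise (alpha bar 0)."""
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    """Inverse of abar: square root, then inverse cos, then multiply by 2/pi."""
    return math.acos(math.sqrt(a)) * 2 / math.pi

t = 0.65
a = abar(t)           # alpha bar computed from t, no lookup table needed
t_back = inv_abar(a)  # and t recovered from that alpha bar
print(f"t={t} -> abar={a:.4f} -> t={t_back:.4f}")
```

Because both directions are closed-form, nothing in this sketch depends on a fixed number of steps.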
And something which is kind of fun is that we can now create a little denoise function. We can take this batch of data that we generated, the noisified data, so here it is again, and we can denoise it. We know the t for each element, obviously. Remember, t is different for each element now. We can therefore calculate the alpha bar for each element, and then we can just undo the noisification to get the denoised version. And if we do that, here's what we get. This is great, right? It shows you what actually happens when we run a single step of the model on variously partially noised images. This is something you don't see very often, I guess because not many people are working in these kinds of interactive notebook environments where it's really easy to do this kind of thing. But I think it's really helpful for getting a sense of, okay, if you're 25% of the way through the forward diffusion process, this is what it looks like when you undo that. If you're 95% of the way through, this is what happens when you undo that. So you can see here it's basically saying, oh, I don't really know what the hell's going on, so it's just a noisy mess. Yeah, I guess my feeling from looking at this is that I'm impressed. This 45% noise thing looks like all noise to me, and it's found the long-sleeved top. And it's actually pretty close to the real one; I looked it up, and we might see it later. There's a little bit more of a pattern here, but it even gives a sense of the pattern. So it shows you how impressive this is. This one is 35%. You can kind of see there's a shoe there, and it's really picked up the shoe nicely. So these are very impressive models in one step, in my opinion. Okay, so sampling is basically the same, except that now, rather than using the range function to create our timesteps, we're using linspace to create our timesteps.
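The noisify and denoise steps described above can be sketched as follows. This is a NumPy stand-in for the notebook's PyTorch code, not the notebook itself: the batch shape, the random inputs, and the use of the true noise in place of a model prediction are all assumptions made so the example runs on its own.

```python
import numpy as np

def abar(t):
    # Cosine schedule evaluated elementwise on an array of times in [0, 1].
    return np.cos(t * np.pi / 2) ** 2

def noisify(x0, rng):
    """Return ((xt, t), eps): the model inputs and the noise to be predicted."""
    n = x0.shape[0]
    t = np.clip(rng.random(n), 0.0, 0.999)   # one continuous t per image
    eps = rng.standard_normal(x0.shape)
    a = abar(t).reshape(-1, 1, 1, 1)         # broadcast over channel/height/width
    xt = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps
    return (xt, t), eps

def denoise(xt, t, eps_pred):
    """Undo the noisification in one step, given a noise estimate."""
    a = abar(t).reshape(-1, 1, 1, 1)
    return (xt - np.sqrt(1 - a) * eps_pred) / np.sqrt(a)

rng = np.random.default_rng(0)
x0 = rng.uniform(-0.5, 0.5, size=(4, 1, 8, 8))  # hypothetical batch of "images"
(xt, t), eps = noisify(x0, rng)
# Assumption: the true noise stands in for the model's prediction here,
# so the reconstruction is exact; a real model would only approximate it.
x0_hat = denoise(xt, t, eps)
```

With a trained model, `eps_pred` would be the model's output for `(xt, t)`, and the quality of `x0_hat` at each `t` is what the single-step pictures in the lesson are showing.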
So our timesteps start at, you know, if we did 1000 steps, 0.999, and they end at 0, and they're linearly spaced with this number of steps. Other than that, abar we now calculate, and the next abar is going to be abar at whatever the current step is minus 1 over steps. So if you're doing 100 steps, that would be minus 0.01. This is just stepping through linearly. And that's actually it for changes. If we just do DDIM for 100 steps, that works really well. We get an FID of three, which is actually quite a bit better than we had with 100 steps for our previous DDIM. So this definitely seems like a good sampling approach. And I know Jono is going to talk a bit more shortly about some of the things that can make for better sampling approaches. But yeah, we can definitely see it making a difference here. Did you guys have anything you wanted to say about this before we move on? No, but it is a nice transition towards some of the other things we'll be looking at, to start thinking about how we frame this. And it's also a good idea: the original DDPM paper has these 1000 timesteps and a lot of people followed that, but you don't have to be bound to that, and maybe it is worth breaking that convention. I know Tanishk made that meme about, you know, 15 competing standards for notation. But sometimes it's helpful to reframe it. Okay, time goes from 0 to 1. That can simplify some things and maybe complicate others. But it's nice to think about how you can reframe stuff. And in fact, where we will head today, by the time we get to notebook 23, we will see even simpler notation. And simpler notation generally comes because, I think, over time people understand better what the essence of the problem and the approach is, and then that gets reflected in the notation.
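As a rough illustration of the sampling schedule described above, this sketch builds the linearly spaced timesteps and the pair of alpha bars each step would use. It is plain Python rather than the notebook's linspace call, and the helper name `timesteps` and the clamping of the final "next" time to 0 are assumptions.

```python
import math

def abar(t):
    return math.cos(t * math.pi / 2) ** 2

def timesteps(steps):
    """Times linearly spaced from 1 - 1/steps down to 0, both inclusive."""
    start = 1 - 1 / steps
    if steps == 1:
        return [start]
    return [start * (1 - i / (steps - 1)) for i in range(steps)]

steps = 100
ts = timesteps(steps)
schedule = []
for t in ts:
    # The next alpha bar is abar one step earlier in time (t - 1/steps);
    # assumption: clamp at 0 so the last step targets a fully clean image.
    t_next = max(t - 1 / steps, 0.0)
    schedule.append((t, abar(t), abar(t_next)))
```

Each tuple holds the current time, its alpha bar, and the alpha bar the sampler is stepping towards, which is all a DDIM-style update needs from the schedule.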
Okay, so the next one I wanted to share is an idea we've been working on for a while, and it's some new research. So partly, I guess, this is an interesting insight into how we do research. This is 22_noise-pred. And the basic idea of this was, well, actually, I've got to take you through it to see what the basic idea is. What I'm going to do is create Fashion MNIST as before, but I'm going to create a different kind of model. I'm not going to create a model that predicts the noise given the noised image and t. Instead, I'm going to try to create a model which predicts t given the noised image. So why did I want to do that? Well, partly, well, entirely, because I was curious. When I looked at something like this, I thought it was pretty obvious roughly how much noise each image had. And so I thought, why are we passing in both the noised image and the amount of noise, or the t, when we call the model, given that I would have thought the model could figure out how much noise there is? So I wanted to check my contention, which is that the model could figure out how much noise there is. I thought, okay, let's create a model that tries to figure out how much noise there is. So I created a different noisify. This noisify grabs an alpha bar t randomly, and it's just a random number between 0 and 1. You want one per item in the batch. Then, after randomly grabbing an alpha bar t, we noisify in the usual way. But now our independent variable is the noised image and the dependent variable is alpha bar t. So we're going to try to create a model that can predict alpha bar t given a noised image. Okay, so everything else is the same as usual, and we can see an example. You've got alpha_bar_t.squeeze().logit(). Oh, yeah, that's true. The alpha bar t goes between 0 and 1, so we've got a choice. I mean, we don't have to do anything.
But normally, if you've got something between 0 and 1, you might consider putting a sigmoid at the end of your model. But I felt like the difference between 0.999 and 0.99 is very significant. So if we take the logit, then we don't need the sigmoid at the end anymore. It will naturally cover the full range of, you know, it'll be centered at zero and it'll cover the normal kind of range of numbers. And it will also treat equal ratios as equally important at both ends of the spectrum. So my hypothesis was that using the logit would be better. I did test it, and it was actually very dramatically better. Without this logit here, my model didn't work well at all. So this is an example of where thinking about these details is really important, because if I hadn't done this, I would have come away from this bit of research thinking, oh, I was wrong, we can't predict the noise amount. So thanks for pointing that out, Jono. That's why, in this example of a mini-batch, you can see that the numbers can be negative or positive. Zero would represent an alpha bar of 0.5. So here 3.05 is not very noisy at all, whereas negative one is pretty noisy. So the idea is that, given this image, you would have to try to predict 3.05. One thing I was kind of curious about, and it's always useful to know, is: what's the baseline? What counts as good? Because often people will say to me, oh, I created a model and the MSE was 2.6. And I'll say, well, is that good? Well, it's the best I can do. But is it good? Is it better than random? Is it better than predicting the average? So in this case, I was just like, okay, what if we just predicted a constant? Actually, this is slightly out of date. I should have said zero here rather than 0.5, but never mind, close enough. This is from before I did the logit thing.
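A small sketch of why the logit target helps, using only the standard library. The function names are the usual mathematical ones, not taken from the notebook, and the printed values are there to show how the logit stretches the ends of the 0 to 1 range.

```python
import math

def logit(p):
    """Map a probability-like value in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: map a real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# 0.99 and 0.999 differ by only 0.009 as raw alpha bars,
# but by more than 2 units once they are expressed as logits.
for a in (0.5, 0.99, 0.999):
    print(f"alpha_bar={a}  logit={logit(a):.3f}")

# Reading the mini-batch values from the lesson back as alpha bars:
print(f"logit 3.05 -> alpha_bar {sigmoid(3.05):.3f}")
print(f"logit -1.0 -> alpha_bar {sigmoid(-1.0):.3f}")
```

A logit of zero corresponds to an alpha bar of 0.5, which matches the reading of the mini-batch values given in the lesson.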
So I was basically looking at what the loss is if you just always predict a constant, which, as I said, should be zero here; I haven't updated it. And it's like, oh, that would give you a loss of 3.5. Another way to do it is you could just put MSE here and look at the MSE loss between 0.5 and, for just a single mini-batch, the alpha bar logits. So if we're getting something that's about three, then we basically haven't done any better than random. In this case, this model doesn't actually have anything to learn. It always returns the same thing. So we can just call fit with train=False just to find the loss. These are just a couple of ways of quickly finding a loss for a naive baseline model. One thing that PyTorch will thankfully warn you about is that if you use MSE and your inputs and targets have different shapes, it will broadcast and give you probably not the result you would expect, and it will give you a warning. One way to avoid that is just to use .flatten() on each. So this kind of flattened MSE is useful both to avoid the warning and to avoid getting weird errors, or sorry, weird results. We use that for our loss. The model is the model that we always use, so that's kind of nice. We just use our same old model. Nothing changes, even though we're doing something totally different. Oh, well, okay, that's not quite true. One difference is that we just have one output now, because this is now a regression model that's just trying to predict a single number. And our learner now uses MSE as the loss. Everything else is the same as usual. So we can go ahead and train it, and you can see the loss is already much better than three, so we're definitely learning something. And we end up with a 0.075 mean squared error.
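The shape pitfall and the constant baseline described above can be sketched with NumPy, which broadcasts the same way but, unlike PyTorch, does so silently. The names `mse` and `flat_mse`, the batch size, and the uniformly drawn alpha bars are assumptions for illustration.

```python
import numpy as np

def mse(pred, targ):
    return np.mean((pred - targ) ** 2)

def flat_mse(pred, targ):
    # Flatten both sides so shapes (n, 1) and (n,) cannot broadcast to (n, n).
    return mse(pred.reshape(-1), targ.reshape(-1))

rng = np.random.default_rng(0)
alpha_bar = rng.uniform(0.001, 0.999, size=64)
targ = np.log(alpha_bar / (1 - alpha_bar))   # logit targets, shape (64,)

# A model with one output per item typically returns shape (64, 1).
perfect = targ.reshape(-1, 1)                # "perfect" predictions, wrong shape
diff_shape = (perfect - targ).shape          # broadcasts to (64, 64)
wrong = mse(perfect, targ)                   # not zero, despite perfect predictions
right = flat_mse(perfect, targ)              # zero, as it should be

# Naive baseline: always predict the constant 0 (an alpha bar of 0.5).
baseline = flat_mse(np.zeros((64, 1)), targ)
```

For alpha bars drawn uniformly between 0 and 1, the logit follows a standard logistic distribution, whose variance is pi squared over 3, roughly 3.29. A constant-zero predictor should therefore land near that value and a constant 0.5 about 0.25 higher, which is in line with the "about three" and 3.5 figures quoted above.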
That's pretty good considering there's a pretty wide range of numbers we're trying to predict here. So we're going to save that as noise prediction on sigma. So we save that model. We can take a look at how it's doing by grabbing our one batch of noised images and putting it through our t model. Actually, it's really an alpha bar model, but never mind, call it a t model. Then we can take a look at what it's predicted for each one and compare it to the actual value for each one. You can see here it said, oh, I think this is about 0.91, and actually it is 0.91. Here it looks like about 0.36, and yeah, it is actually 0.36. Overall, you can see, 0.72, and it's actually 0.72, so it's right. This one's 0.02 off. But yeah, my hypothesis was correct, which is that we can predict the thing that we were putting in manually as input. There were a couple of reasons I was interested in checking this out. The first was just, well, wouldn't it be simpler if we weren't passing in the t each time? Why pass in the t each time? But it also felt like it would open up a wider range of ways we can do sampling. The idea of doing sampling by precisely controlling the amount of noise that you try to remove each time, and then assuming you can remove exactly that amount of noise each time, feels limited to me. So I want to try to remove this constraint. Having built this model, I thought, okay, I think we don't need to pass t in. Let's try it. So what I then did is I replicated the 22 cosine notebook. I just copied it and pasted it in here, but I made a couple of changes. The first is that noisify doesn't return t anymore. So there's no way to cheat. We don't know what t is. And that means that the U-Net now doesn't get t. It's actually going to be passed zero every time. So it has no ability to learn from t, because it doesn't get t.
So it doesn't really matter what we pass in. We could have changed the U-Net to remove the conditioning on t, but for research this is just as good for finding out. And it's good to be lazy when doing research. There's no point doing something a fancy way when you can do it a quick and easy way, before you even know if it's going to work. So that's the only change. We can then train the model and check the loss. The loss here is 0.034, and previously it was 0.033. So interestingly, maybe it's a tiny bit worse at that, but it's very close. Okay, so we'll save that model. Then for sampling, I've got exactly the same DDIM step as usual, and my sampling is exactly the same as usual, except that now when I call the model, I have no t to pass in. So we just pass in this. I mean, I still know t, because I'm still using the usual sampling approach, but I'm not passing it to the model. And we can sample, and what happens is actually pretty garbage. 22 is our FID. And as you can see here, some of the images are still really noisy. So I totally failed. That's always a little discouraging, when you think something's going to work and it doesn't. But my reaction, if I think something's going to work and it doesn't, is to think, well, I'm just going to have to do a better job of it. It ought to work. So I tried something different. I thought, okay, since we're not passing in the t, we're basically asking, how much noise should you be removing? It doesn't know exactly. So it might remove a little bit more noise than we want, or a little bit less noise than we want. And we know from the testing we did that sometimes it's out by, in this case, 0.02. And I guess if you're out consistently, sometimes it's going to end up not removing all the noise. So the change I made was to the DDIM step, which is here.
And let me just copy this and get rid of the commented-out sections, just to make it a bit easier to read. Okay. So this is the normal DDIM step. Step one is the same, so don't worry about that, because it's the same as we've seen before. But what I did was I actually used my t model. I passed the noised image into my t model, which is actually an alpha bar model, to get the predicted alpha bar. And remember, this is the predicted alpha bar for each image, because we know from here that sometimes it did a pretty good job, but sometimes it didn't. So I felt like, okay, we need a predicted alpha bar for each image. What I then discovered is that sometimes it could be really too low. So I wanted to make sure it wasn't too crazy. I found the median of all the predicted alpha bars for a mini-batch, and I clamped each prediction to not be too far away from the median. And then when I computed my x0 hat, rather than using alpha bar t, I used the estimated alpha bar t for each image, clamped to be not too far away from the median. That way it was updating based on the amount of noise that actually seems to be left behind, rather than the assumed amount of noise that should be left behind if we assume it has removed the correct amount. And then everything else is the same. So when I did that, whoa, it made all the difference. And here it is. They are beautiful pieces of clothing. So 3.88 versus 3.2. That's possibly close enough. I'd have to run it a few times; my guess is maybe it's a tiny bit worse, but it's pretty close.
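The median clamp described above might look something like the following sketch. The lesson does not state how far from the median a prediction is allowed to be, so the `factor=2.0` bound, the function names, and the per-pixel scalar form of the x0 estimate are all assumptions made for illustration.

```python
import math
from statistics import median

def clamp_to_median(abar_pred, factor=2.0):
    """Clamp each predicted alpha bar to within [median/factor, median*factor].
    Assumption: the factor-of-2 band is hypothetical, not the lesson's value."""
    med = median(abar_pred)
    lo, hi = med / factor, min(med * factor, 1.0)
    return [min(max(a, lo), hi) for a in abar_pred]

def x0_hat(xt, noise_pred, abar_est):
    """Estimate the clean value using the (clamped) estimated alpha bar
    instead of the alpha bar implied by the sampling schedule."""
    return (xt - math.sqrt(1 - abar_est) * noise_pred) / math.sqrt(abar_est)

# Hypothetical per-image predictions from the alpha bar model for one batch:
preds = [0.30, 0.28, 0.33, 0.001, 0.31, 0.97]
clamped = clamp_to_median(preds)   # the two outliers are pulled towards the median
```

The point of the clamp is only to stop one or two wild predictions from blowing up the division by the square root of alpha bar; the well-behaved predictions pass through unchanged.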
But this definitely gives me some encouragement. Even though this is something I just did in a couple of days, whereas the with-t approaches have been developed since 2015 and we're now in 2023, I would expect it's quite likely that these kinds of no-t approaches could eventually surpass the t-based approaches. One thing that definitely makes me think there's room to improve is that if I plot the FID, or the KID, for each sample during the reverse diffusion process, it actually gets worse for a while. And I'm like, okay, well, that's a bad sign. I have no idea why that's happening. But it's a sign that if we could improve each step, one would assume we could get better than 3.8. So yeah, do you guys have any thoughts about that, or questions or comments? Maybe just to highlight the research process a little bit: it wasn't this linear thing of, oh, here's this issue, it's not performing as well as we thought, oh, here's the fix. This was multiple days of discussing, with Jeremy saying, you know, I'm tearing my hair out, do you guys have any ideas? And, oh, what about this? And, oh, I noticed in the DDIM paper they do this clamping, maybe that'll help. So there's a lot of back and forth, and also a lot of, you saw the code that was commented out there, printing xt.min, xt.max, alpha bar pred, just seeing, oh, okay, my average prediction is about what I expect, but sometimes the min or the max goes, you know, two, three, eight, 16, 50, 12 million, infinity. There'd be one or two little baddies that would just kind of blow up. Yeah. So there's that kind of debugging and exploring and printing the results.
As for additional discussions about this idea, I kind of said to you guys before lesson one of part two that it feels to me like we shouldn't need the t thing. So it's actually been bubbling away in the background for months. Yeah. And I guess we should also mention we have tried this: a friend of ours trained a no-t version of Stable Diffusion for us, and we did the same sort of thing. I trained a pretty bad t predictor and it sort of generates samples. So we're not focusing on that large-scale stuff yet. But it is fun, every now and again, to take an idea from Fashion MNIST and try it out on some bigger models and see, okay, this does seem like maybe it'll work. And down the line, the future plan is to actually spend the time training a proper model and see how well that does, if it seems interesting. You say a friend of ours; we can be more specific. It's Robin, one of the two lead authors of the Stable Diffusion paper, who has actually been fine-tuning a real Stable Diffusion model without t, and it's looking super encouraging. So that'll be fun to play with. We'll have to train a t predictor for that and see how it looks. Yeah. All right. So I guess the other area we've been talking about doing some research on is this weird thing that came up over the last few weeks, which was our bug in the DDPM implementation, where we accidentally weren't scaling the input range from minus one to one. It turned out that going from minus one to one wasn't a very good idea anyway, and so we ended up scaling it to be from minus 0.5 to 0.5.
And Jono and Tanishk have managed to actually find a paper. Well, I say find a paper; a paper has come out in the last 24 hours which has coincidentally cast some light on this, and it also cites a paper that we weren't aware of, which was not released in the last 24 hours. So Jono, are you going to tell us a bit about that? Yeah, sure, I can do that. So it's funny, this was such perfect timing, because I actually got up early this morning planning to run experiments with the different input scalings and the cosine schedule that Jeremy was showing and some of the other schedules we look at. It might be nice for the lesson to have a little plot of what the FID is with these different solvers and input scalings, but it was going to be a lot of work, and I was not looking forward to doing the grunt work. And then Tanishk sent me this paper, which AK had just tweeted out, because he reviews everything that comes up on arXiv every day: "On the Importance of Noise Scheduling for Diffusion Models". This is by a researcher on the Google Brain team who has also done a really cool recent paper on something called a recurrent interface network, which is outside the scope of this lesson but also worth checking out. So in this paper, they set out to study noise scheduling and the strategies that you take for that. They want to show, number one, that noise scheduling is crucial for performance and the optimal schedule depends on the task; that when you increase the image size, the noise schedule that you want changes; and that scaling the input data by some factor is a good strategy for working with this. And that's what we've been talking about, right? Yeah, that's what we've been doing, where we said, oh, do we scale from minus 0.5 to 0.5, or minus 1 to 1, or do we normalize? And they demonstrate the effectiveness by training a really good high-resolution model on ImageNet, a class-conditional model. The samples look great. Yeah, amazing examples. We'll show you one later. So I really like this paper.
It's very short and concise, and it gets all the information across. They introduce it here: we have this noising process, our noisify function, where we have the square root of something times x, plus the square root of 1 minus that something times the noise. Here they use gamma, gamma of t, which is often used for the continuous-time case. So instead of the alpha bar and the beta bar schedule for a thousand timesteps, there'll be some function gamma of t that tells you what your alpha bar should be. Okay, so our function is actually called abar, but it's the same thing. Yeah, same thing. It takes in a timestep from 0 to 1, and that's used to noise the image. Interestingly, what they're showing here is actually something that we had discovered, and that I'd been complaining about: my DDIMs with an eta of less than one weren't working, which is to say, when I added extra noise to the image, it wasn't working. And what they're showing here is, oh, yeah, duh, if you use a smaller image, then adding extra noise is probably not a good idea. Yeah. And they make a lot of reference in this paper to information being destroyed and to signal-to-noise ratios. That's really helpful for thinking about this, because it's not something that's obvious: at 64 by 64 pixels, adjacent pixels might have much less in common, whereas with the same amount of noise added at a much higher resolution, the noise kind of averages out and you can still see a lot of the image. So that's one thing they highlight: the same noise level at different image sizes might be a harder or easier task. And so they investigate some strategies for this. They look at different noise schedule functions. We've seen the original version from the DDPM paper, we've seen the cosine schedule, and I think the next thing that Jeremy's going to show us is a sigmoid-based schedule.
They show the continuous-time versions of those, and they plot how you can change various parameters to get these different gamma functions, or in our case the alpha bar, where we start at all image, no noise at t equals zero and move to all noise, no image at t equals one. But the path that you take is going to be different for these different classes of functions and parameters. And the signal-to-noise ratio, or rather the log signal-to-noise ratio, which is what this shows, is going to change over that time as well. So that's one of the knobs we can tweak. If we're saying our diffusion model isn't training that well and we think it might be related to the noise schedule, one of the things you can do is try different noise schedules, either changing the parameters within one class of noise schedule or switching from a linear to a cosine to a sigmoid schedule. And then the second strategy is kind of what we were doing in those experiments, which is just to add some scaling factor to x0. We were accidentally using a b of 0.5. Exactly. So that's a second dial that you can tweak: keeping your noise schedule fixed, maybe just scale x0, which is going to change the ratio of signal to noise. And I think 4(c) there is what we were accidentally doing. Yes. Yeah, exactly. And so let's see if we can get to... oh yeah, so that again changes the signal-to-noise ratio you get for different scalings. So that's fine. They also have a compound strategy that combines some of those things. And this is the important part: they do their experiments. They have a nice table investigating different schedules, cosine schedules and sigmoid schedules, with the best results in bold, and you can see that for 64 by 64 images versus 128 versus 256, the best schedule is not always the same. So that's important finding number one: depending on what your data looks like, using a different schedule might be optimal.
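To see how the input scaling factor acts as a second dial, here is a minimal sketch of the log signal-to-noise ratio under a cosine gamma with the input multiplied by b. It assumes the noising form x_t = sqrt(gamma) * b * x0 + sqrt(1 - gamma) * noise with roughly unit-variance data, and the function names are hypothetical.

```python
import math

def gamma_cosine(t):
    # Same role as abar in the lesson: fraction of signal variance kept at time t.
    return math.cos(t * math.pi / 2) ** 2

def log_snr(t, b=1.0, gamma=gamma_cosine):
    """Log signal-to-noise ratio at time t for inputs scaled by b.
    Assumption: x0 has about unit variance, so signal power is b^2 * gamma."""
    g = gamma(t)
    return math.log(b * b * g / (1 - g))

# Halving the inputs (b = 0.5) lowers the log SNR by the same amount at every t,
# so each timestep is effectively noisier even though the schedule is unchanged.
for t in (0.25, 0.5, 0.75):
    print(f"t={t}: b=1 -> {log_snr(t):+.3f}   b=0.5 -> {log_snr(t, b=0.5):+.3f}")
```

Under these assumptions, scaling by b shifts the whole log SNR curve by 2 ln b, which is why changing the input range and changing the noise schedule can have similar effects on training.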
There's no one true best schedule; there's no one value of, you know, beta min and beta max that's just magically the best. Likewise for this input scaling: at different sizes, with whatever schedules they tested, different values were optimal. So it's just a really great illustration, I guess, that this is another design choice that's implicitly or explicitly part of your diffusion model training and sampling: how are you dealing with the noise schedule, what schedule are you following, and what scaling are you doing with your inputs? By using this thinking and doing these experiments, they come up with a kind of rule of thumb for how to scale the image based on image size, and they show that as they increase the resolution, they can still maintain really good performance, where previously it was quite hard to train a really high-resolution pixel-space model. They are able to do that. They get some advantage from their fancy recurrent interface network, but still, it's kind of cool that they can say, look, we get state-of-the-art, high-quality 512 by 512 or 1024 by 1024 samples on class-conditional ImageNet, using this approach to really consider how we train and how many steps we need to take. One of the other things in this table is that they compare it to previous approaches: oh, we used a third of the training steps with otherwise the same settings, and we get better performance, just because we've chosen that input scaling better. So that's the paper. Really nice, great work to the team. And that was... I love that you got up in the morning and thought, oh, it's going to be a hassle training all these different models I need to train for different input scalings and different sampling approaches, I'll just look at Twitter first. And then you looked at Twitter, and there was a paper saying, hey, we just did a bunch of experiments for different noise schedules and input scalings. Yeah.
Does your life always work that way, Jono? It seems quite blessed. Yeah, it's very lucky like that. Yeah, if you wait long enough, someone else will do it. That's why the time when AK starts posting on Twitter is my favorite hour of the day. I just wait for all the papers to be posted. Oh, well, thank you for that. So let me switch to notebook 23, because this notebook is actually largely an implementation of some ideas from this paper that everybody tends to just call Karras. It's unfair, because there are other authors, but I will do it anyway: the Karras paper. The reason we're going to look at this is that in this paper, the authors take a much more explicit look at the question of input scaling. Their approach was apparently not to accidentally put a bug in their code, then take it out and find it worked worse, and then just put it back in again. Their approach was actually to think, how should things be? So that's an interesting approach to doing things, and I guess it works for them. So that's fine. I think our approach is more entertaining. No, exactly. Our approach is much more fun, because you never quite know what's going to happen. So in their approach, they actually tried to say, okay, given all the things that are coming into our model, how can we have them all nicely balanced? We will skip back and forth between the notebook and the paper. The start of this is all the same, except that now we are actually going to do minus one to one, because we're not going to rely on accidental bugs anymore. Instead we're going to rely on the Karras paper's carefully designed scaling. I say that, except that I put a bug in this notebook as well. One of the things that's in the Karras paper is the standard deviation of the actual data, which I calculated for a batch. However, this used to say minus 0.5. I used to do the minus 0.5 to 0.5 thing.
And so this is actually the standard deviation of the data from when it was still minus 0.5 to 0.5. So this is actually half the real standard deviation. For reasons I don't yet understand, this is giving me better scaled results. So this actually should be 0.66. So there's still a bug here and the bug still seems to work better, so we've still got some mysteries involved, and we're going to leave this. So it's actually not 0.33, it's actually 0.66. So the basic idea of this paper... actually, let me have a little think. Okay, now we'll start here. The basic idea of this paper is to say, you know what, sometimes maybe predicting the noise is a bad idea. You can either try and predict the noise or you can try and predict the clean image, and each of those can be a better idea in different situations. If the model's given something which is nearly pure noise and is then asked to predict the noise, that's basically a waste of time, because the whole thing's noise. If you do the opposite, which is you try to get it to predict the clean image, well, then if you give it an image that's nearly clean and ask it to predict the clean image, that's nearly a waste of time as well. So you want something where, regardless of how noisy the image is, it's an equally difficult problem to solve. And so what Karras et al. do is they use this new thing called c_skip, which is a number that's basically saying, you know what we should do for the training target is not just predict the noise all the time, not just predict the clean image all the time, but predict kind of a lerped version of one or the other depending on how noisy it is. So here Y is the clean image and N is the noise, so Y plus N is the noised image. And so if c_skip was zero, then we would be predicting the clean image.
And if c_skip was one, we would be predicting Y minus (Y plus N), which is just the noise, up to sign. And so you can decide, by picking a different c_skip, whether you're predicting the clean image or the noise. And as you can see from the way they've written it, they make this a function of sigma. Now this is where we've got to a point where we've got a much simpler notation. There's no more alpha bars, no more alphas, no more betas, no more beta bars. There's just a single thing called sigma. Unfortunately sigma is doing the job that alpha bar used to do: it tells us how much noise there is. Right. So we've simplified it, but we've also made things more confusing by using an existing symbol for something totally different. So this plays the part of alpha bar. Okay. So there's going to be a function that says, depending on how much noise there is, we'll either predict the noise, or we'll predict the clean image, or we'll predict something between the two. So in the paper, they showed this chart where they basically said, okay, let's look at the loss to see how good we are, with a trained model, at predicting when sigma is really low, so when there's very little noise, or when sigma is in the middle, or when sigma is really high. And they basically said, you know what, when it's nearly all noise or nearly no noise, we're basically not able to do anything at all. We're basically good at doing things when there's a medium amount of noise. So the first thing we need to do is to figure out what sigmas we are going to send to this thing. And they said, okay, well, let's pick a distribution of sigmas that matches this red curve here, as you can see. And this is a normally distributed curve where the axis is on a log scale, so this is actually a log-normal curve. So to get the sigmas that they're going to use, they picked a normally distributed random number and then they exponentiated it. And this is called a log-normal distribution.
And so they used a mean of minus 1.2 and a standard deviation of 1.2. So zero is one standard deviation above the mean, which means that about one sixth of the time they're going to be getting a number that's bigger than zero here. And e to the zero is one. So about one sixth of the time, they're going to be picking sigmas that are bigger than one. And so here's a histogram I drew of the sigmas that we're going to be using. It's nearly always, you know, less than five, but sometimes it's way out here. And it's quite hard to read these histograms. So this really nice library called Seaborn, which is built on top of Matplotlib, has some more sophisticated and often nicer looking plots. And one of them is called a KDE plot, which is a kernel density plot. It's a histogram, but it's smooth. Okay. And I clipped it at 10 so that you could see it better. So you can basically see that the vast majority of the time it's going to be somewhere, you know, about 0.4 or 0.5, but sometimes it's going to be really big. So our noisify is going to pick a sigma using that log-normal distribution. And then we're going to get the noise as usual. But now we're going to calculate c_skip, right? Because we're going to do that thing we just saw: we're going to find something between the clean image and the noise. So what do we use for c_skip? We calculate it here. And what we do is we say, what's the total amount of variance at some level of sigma? Well, the noise contributes sigma squared; that's the definition of the noise level. But we also have the variance of the data itself, right? So if we add those two together, we'll get the total variance. And so what the Karras paper said to do is to take the variance of the data divided by the total variance and use that for c_skip. So that means that if your total variance is really big, in other words it's got a lot of noise, then c_skip is going to be really small. So if you've got a lot of noise, then this bit here will be really small.
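The sigma sampling and the c_skip formula described here can be sketched in a few lines. This is only a sketch, not the notebook's code: the names `sample_sigma` and `c_skip` and the default `sig_data=0.66` are assumptions made for illustration (the talk notes that the notebook itself keeps 0.33), and plain Python floats stand in for tensors.

```python
import math
import random

SIG_DATA = 0.66  # assumed data standard deviation (the notebook keeps 0.33)

def sample_sigma(rng, mean=-1.2, std=1.2):
    """Draw one noise level: exp of a normal sample, i.e. a log-normal draw."""
    return math.exp(rng.gauss(mean, std))

def c_skip(sigma, sig_data=SIG_DATA):
    """Variance of the data divided by the total variance (data plus noise)."""
    return sig_data**2 / (sigma**2 + sig_data**2)

rng = random.Random(0)
sigmas = [sample_sigma(rng) for _ in range(100_000)]
frac_above_one = sum(s > 1 for s in sigmas) / len(sigmas)
print(f"fraction of sigmas above 1: {frac_above_one:.3f}")

# The three noise levels used as examples in the talk.
for s in (0.06, 0.64, 4.53):
    print(f"sigma={s:4.2f}  c_skip={c_skip(s):.3f}")
```

A sigma above 1 needs the normal draw to exceed 0, which is one standard deviation above the mean of -1.2, so the simulated fraction should land near 16%. The c_skip values run from close to 1 at low noise to close to 0 at high noise.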
So that means if there's a lot of noise, try to predict the original image, right? That makes sense, because predicting the noise would be too easy. If there's hardly any noise, then the total variance will be hardly more than the data variance, right? So c_skip will be close to one. And so if there's hardly any noise, then try to predict the noise. And that's basically what this c_skip does. So it's a kind of slightly weird idea, which is that our target, the thing we're trying to predict, is not the original image and it's not the noise, but it's somewhere between the two. And I found the easiest way to understand that is to draw a picture of it. So here are some examples of noised input with various sigmas, and remember, sigma is our alpha bar now. So here's an example with very little noise, 0.06. And in this case, the target is predict the noise, right? So that's the hard thing to do, is predict the noise. Whereas here's an example, 4.53, which is nearly all noise. So for nearly all noise, the target is predict the image, right? And then for something which is between the two, like here, 0.64, the target is predict some of the noise and some of the image. So that's the idea of Karras. And so what this does is it's making the problem to be solved by the U-Net equally difficult, regardless of what sigma is. It doesn't solve our input scaling problem. It solves our kind of difficulty scaling problem. To solve the input scaling problem, they do it... I just wanted to make one quick note. This sort of idea of interpolating between the noise and the image is similar to what's called the v-objective as well. It's quite similar to what Karras et al. have, and that's also now been used in a lot of different models. For example, Stable Diffusion 2.0 was trained with this sort of v-objective.
So people are using this sort of methodology and getting good results. And yeah, so it's an actual practical thing that people are doing. So I just wanted to make a note of that. Yeah, as is the case with basically all papers created by NVIDIA researchers, of which this is one, it flies under the radar and everybody ignores it. The v-objective paper, the senior author was Tim Salimans, which is Google, right? Yeah. And anything from Google and OpenAI, everybody listens to. So yeah, although Karras I think has done the more complete version of this, and in fact the v-objective was almost mentioned in passing in the distillation paper, that's the one that everybody has ended up looking at. But I think this is the more... Yeah, I think what happened with the v-objective is not many people paid attention to it. I think folks like Kat and Robin and these sorts of folks were actually drawing attention to that v-objective in that Google Brain paper. But then also this paper did a much more principled analysis of this sort of thing. So yeah, I think it's very interesting how sometimes even these sort of side notes in papers, that maybe people don't pay much attention to, can actually be quite important. Yeah, yeah. So okay, the noised input as usual is the input image plus the noise times the sigma. And then, as we discussed, we decide what our target is. But then we actually take that noised input and we scale it up or down by this number. And the target, we also scale up or down by this number. And those are both calculated in this thing as well. So here's c_out and here's c_in. Now I just wanted to show one example of where these numbers come from, because for a while they all seemed pretty mysterious to me.
And I felt like I'd never be smart enough to understand them, particularly because they were explained in the mathematical appendix of this paper, which are always the bits I don't understand, until I actually try to, and then it tends to turn out they're not so bad after all, which is certainly the case here. I think it was B something. So B.6, I think, is that the one? Oh yeah. So appendix B.6, which does look pretty terrifying. But if you actually look at, for example, what we were just looking at, c_in, how do they calculate it? So c_in is this. Now this is the variance of the noise, this is the variance of the data, add them together to get the total variance, take the square root, and that's the total standard deviation. So it's just the inverse of the total standard deviation, which is what we have here. Where does that come from? Well, they just said, you know what, the inputs for a model should have unit variance. Now we know that, we've done that to death in this course. So they said, well, right, the input to the model is the clean data plus the noise, times some number we're going to calculate. And we want the variance of that to be one. Okay. So the variance of the clean images plus the noise is equal to the variance of the clean images plus the variance of the noise. Okay. So if we want the variance to be one, then divide both sides by this and take the square root. And that tells us that our multiplier has to be one over this. That's it. So it's literally, you know, classical math. The only bit you have to know is that the variance of two things added together is the sum of their variances, as long as they're independent, which the data and the noise are, and that's not rocket science either. And in this context, why we want to do this: when we looked at those sigmas that you were plotting, the distribution, you've got some that are fairly low, but you've also got some where the standard deviation sigma is like 40. Right. So the variance is super high. Yes.
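The unit-variance argument for c_in is easy to check numerically. The sketch below assumes, purely for illustration, that the clean data is Gaussian with standard deviation 0.66 (real pixel data is not Gaussian) and uses plain Python floats; `c_in` and `scaled_var` are names chosen here, not taken from the notebook.

```python
import math
import random
import statistics

def c_in(sigma, sig_data):
    """One over the total standard deviation of the noised input."""
    return 1 / math.sqrt(sigma**2 + sig_data**2)

rng = random.Random(0)
sig_data = 0.66  # assumed data standard deviation
scaled_var = {}
for sigma in (0.1, 1.0, 40.0):
    scale = c_in(sigma, sig_data)
    # Noised input = clean value + sigma * unit noise, then scaled by c_in.
    xs = [scale * (rng.gauss(0, sig_data) + sigma * rng.gauss(0, 1))
          for _ in range(50_000)]
    scaled_var[sigma] = statistics.pvariance(xs)
    print(f"sigma={sigma:5.1f}  c_in={scale:.4f}  variance after scaling={scaled_var[sigma]:.3f}")
```

Whatever the noise level, the scaled input should come out with variance close to one, including at a sigma of 40, where c_in is roughly 1/40.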
And so we don't want to feed something with standard deviation 40 into our model. We would like it to be closer to unit variance. So we're thinking, okay, well, if you divide by roughly 40, that would scale it down. But then we've also got some extra variance from our data. So it's like, yeah, exactly, 40 plus a little bit for the variance of the data. We want to scale back down by that to get unit variance. Yeah. I mean, I love this paper because it's basically just doing what we spent weeks doing. I feel like everything that we've done that's improved every model has always been one thing, which is: can we get mean zero, variance one inputs to our model and for all of our activations. And then the only other thing is to include enough compute by adding enough layers and enough activations. Those two things seem to be all that matters. Basically. Well, I guess ResNets added an extra cool little thing to that, which is to make it even smoother by giving us a kind of identity path. So yeah, basically trying to make things as smooth as possible and as equal everywhere as possible. So yeah, this is what they've done. They did that for the inputs, and then they've also done it for the outputs. And for the outputs it's basically the same idea, they have basically the same kind of analysis to show that. And so with this, now, we've got our noised input. We've got the target, which is a linear mix somewhere between x0 and the noise. We've got the scaling of the output and we've got the scaling of the input. So now for the inputs to our model, we're going to have the scaled noised input and the sigma, and we're going to have the target, which is somewhere between the image and the noise. And so yeah, I've never seen anybody draw a picture of this before.
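Putting the three scalings together, a noisify step for a single value might look like the sketch below. It is not the notebook's code: `scalings` and `noisify` are illustrative names, `sig_data=0.66` is an assumed default, the `c_out` expression is the one given in the Karras et al. paper, and scalar floats stand in for image tensors.

```python
import math

def scalings(sigma, sig_data=0.66):
    """Return c_skip, c_out, c_in for one noise level."""
    total_var = sigma**2 + sig_data**2
    c_skip = sig_data**2 / total_var
    c_out = sigma * sig_data / math.sqrt(total_var)  # as given in the paper
    c_in = 1 / math.sqrt(total_var)
    return c_skip, c_out, c_in

def noisify(y, noise, sigma, sig_data=0.66):
    """Return the scaled noised input and the scaled training target."""
    c_skip, c_out, c_in = scalings(sigma, sig_data)
    noised = y + sigma * noise
    target = (y - c_skip * noised) / c_out
    return c_in * noised, target

y, noise = 0.5, -0.3  # one made-up clean value and one made-up noise draw
for sigma in (0.01, 0.64, 80.0):
    _, target = noisify(y, noise, sigma)
    print(f"sigma={sigma:5.2f}  target={target:+.3f}  "
          f"-noise={-noise:+.3f}  y/sig_data={y / 0.66:+.3f}")
```

At very low sigma the target sits next to the (negated) noise, at very high sigma it sits next to the clean value divided by the data standard deviation, and in between it is a blend of the two.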
So it was really cool, being in a notebook, being able to see, oh, that's what they're doing, you know. So yeah, have a good look at this notebook to see exactly what's going on, because I think it gets you a really good intuition around what problem it's trying to solve. So then I actually checked that the noised input has a standard deviation of one. The mean's not zero. And of course, why would it be? The only thing Karras cared about was having the variance be one. We could easily adjust the input and output to have a mean of zero as well. And that's something I think we or somebody should try, because I think it does seem to help a bit, as we saw with that generalized ReLU stuff we did. But it's less important than the variance. And the same with the target: it's got a standard deviation of about one. And yeah, this is where if I change this to the correct value, which is 0.66, then actually it's further away from one both here and here, quite a lot further away. And maybe that's because, well, we know the data's not Gaussian distributed; pixel data definitely isn't Gaussian distributed. So this bug turned out better. Okay, so the U-Net's the same, the initialization's the same. This is all the same. Train it for a while. We can't compare the losses, right, because our target's different. But what we can do is create a denoise function that just takes the equation we had in noisify and solves for x0. So we've got to multiply by c_out and then add c_skip times the noised input. Here it is: multiply by c_out, add noised input times c_skip. Okay, so we can denoise. So let's grab our sigmas from the actual batch we had. Let's calculate c_skip, c_out and c_in for the sigmas in our mini batch. Let's use the model to predict the target given the noised input and the sigmas, and then denoise it. And so here's our noised input, which we've already seen, and here's our predictions.
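The "solve for x0" step can be checked with a round trip. In the sketch below, `make_oracle` builds a hypothetical stand-in for the trained model that returns the exact training target for one known clean value; it exists only to show that `denoise` inverts the target formula. All names and the `sig_data=0.66` default are assumptions, and scalars stand in for tensors.

```python
import math

def scalings(sigma, sig_data=0.66):
    """Return c_skip, c_out, c_in for one noise level."""
    total_var = sigma**2 + sig_data**2
    return (sig_data**2 / total_var,
            sigma * sig_data / math.sqrt(total_var),
            1 / math.sqrt(total_var))

def denoise(model, noised, sigma, sig_data=0.66):
    """Undo noisify: scale the model output by c_out, add c_skip * noised input."""
    c_skip, c_out, c_in = scalings(sigma, sig_data)
    return model(noised * c_in, sigma) * c_out + noised * c_skip

def make_oracle(y, sig_data=0.66):
    """Hypothetical perfect model for one known clean value y (illustration only)."""
    def oracle(scaled_input, sigma):
        c_skip, c_out, c_in = scalings(sigma, sig_data)
        noised = scaled_input / c_in          # undo the input scaling
        return (y - c_skip * noised) / c_out  # the exact training target
    return oracle

y, noise = 0.5, -0.3
for sigma in (0.06, 0.64, 4.53):
    noised = y + sigma * noise
    recovered = denoise(make_oracle(y), noised, sigma)
    print(f"sigma={sigma:4.2f}  noised={noised:+.3f}  recovered={recovered:+.3f}")
```

With a perfect target the clean value comes back at every noise level. With a real model, the quality of the recovered image depends on how well the target was predicted.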
And these are absolutely remarkable, in my opinion. Yeah, like this one here, I can barely see it. Do you know what it's really found? Look at the shirt. There's the shirt here. It's actually really finding the little thing on the front. And let me show you, here's what it should look like. Right. And in cases where the sigma's pretty high, like here, you can see it's really saying, I don't know, maybe it's shoes, but it could be something else. Is it shoes? Yeah, it wasn't shoes. But at least it's kind of got the bulk of the pixels in the right spot. Yeah, something like this one is 4.5. It has no idea what it is. It's like, oh, maybe it's shoes, maybe it's pants, you know, and it turned out that it is shoes. Yeah. So I think that's fascinating, how well it can do. And then the other thing I did, which I thought was fun, was I just did a sigma of 80, which is actually what they do when they're doing sampling from pure noise. That's what they consider the pure noise level. So I just created some pure noise and denoised it just for one step. And so here's what happens when you denoise it for one step. And you can see it's kind of overlaid all the possibilities. I can see a pair of shoes here, a pair of pants here, a top here. And sometimes it's more confident that the noise is actually a pair of pants, and sometimes it's more confident that it's actually shoes. But you can really get a sense of how, from pure noise, it starts to make a call about what this noise is actually covering up. And this is also the bit where I feel I'm the least convinced when it comes to diffusion models: this first step of going from pure noise to something, and trying to have a good mix of all the possible somethings. I don't know, it feels a bit hand-wavy to me. It clearly works quite well, but I'm not sure if we're getting the full range of possibilities.
And I feel like some of the papers we're starting to see are starting to say, you know what, maybe this is not quite the right approach. And maybe later in the course, we'll look at some of the ones that look at what we call VQ models and tokenized stuff. Anyway, I thought it was pretty interesting to see these pictures, which, yeah, I've never seen any pictures like this before. So I think this is a fun result from doing all this stuff in notebooks step by step. Okay, so sampling. One of the nice things with this is that the sampling becomes much, much, much simpler. And I should mention that a lot of the code that I'm using, particularly in the sampling section, is heavily inspired by, and some of it's actually copied and pasted from, Kat's k-diffusion repo, which is, I think I mentioned before, some of the nicest generative modeling code, or maybe the nicest generative modeling code, I've ever seen. It's really great. So before we talk about the actual sampling, the first thing we need to talk about is what sigma we use at each reverse time step. And in the past, we've always, well, nearly always, done something which I think has always felt sketchy as well, which is we've just linearly gone down the sigmas or the alpha bars or the t's. So here, when we were sampling in the previous notebook, we used linspace. I always felt like that was questionable. And I felt like at the start, it was just noise anyway, so who cares? So in DDPM v3, I experimented with something that I thought intuitively made more sense. I don't know if you remember this one, but I actually said, oh, for the first 100 time steps, let's actually only run the model every 10 steps. And then for the next 100, let's run it every nine steps, for the next 100, let's run it every eight steps. So basically at the start, be much less careful. And so Karras actually ran a whole bunch of experiments. And they said, yeah, you know what?
At the start of sampling, you know, you can start with a high sigma, but then step to a much lower sigma in the next step, and then a much lower sigma in the next step. And then the further along you are, the smaller and smaller the steps you take, so that you spend a lot more time fine-tuning carefully at the end, and not very much time at the start. Now, this has its own problems. And in fact, a paper just came out today, which we probably won't talk about today, but maybe another time, which talked about the problem that these very early steps are the bit where you're trying to create a composition that makes sense. Now for Fashion MNIST, we don't have much composing to do. It's just a piece of clothing. But if you're trying to do an astronaut riding a horse, you've got to think about how all those pieces fit together. And this is where that happens. And so I do worry that the Karras approach is maybe not giving that enough time. But as I've said, that's really the same as this first step from pure noise. That whole piece feels a bit wrong to me. But aside from that, I think this makes a lot of sense, which is that, yeah, in sampling you should jump by big steps early on and small steps later on, and make sure that the fine details are just so. So that's what this function does: it creates this schedule of reverse diffusion sigma steps. It's a bit of a weird function in that it works with the rho-th root of sigma, where rho is seven. So the seventh root of sigma is basically what it's scaling on. But the answer to why it's that is because they tried it, and it turned out to work pretty well. Do you guys remember where this was? This is the truncation error analysis, D.1. So this image here, and thanks to Tanishq for reminding me where this is, shows FID as a function of rho. So it's basically what root we are taking. And they basically said, if you take the fifth root or higher, it seems to work, basically.
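A schedule of this kind takes only a few lines. The sketch below follows the description above (evenly spaced in the seventh root of sigma, starting at 80); the function name `sigmas_karras`, the `sigma_min=0.002` default, and the trailing zero are assumptions for illustration rather than values stated in the talk.

```python
def sigmas_karras(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """n noise levels from sigma_max down to sigma_min, evenly spaced in
    sigma ** (1 / rho), followed by a final 0 for the last denoising step."""
    max_inv_rho = sigma_max ** (1 / rho)
    min_inv_rho = sigma_min ** (1 / rho)
    sigmas = []
    for i in range(n):
        ramp = i / (n - 1)  # goes from 0 to 1
        sigmas.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas + [0.0]

sigs = sigmas_karras(10)
steps = [a - b for a, b in zip(sigs, sigs[1:])]
print("sigmas:", " ".join(f"{s:.3f}" for s in sigs))
print("steps: ", " ".join(f"{d:.3f}" for d in steps))
```

Printing the step sizes shows the behaviour described above: very large jumps in sigma at the start and tiny ones at the end.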
So yeah, that's a perfectly good way to do things: just try things and see what works. And you'll notice they tried things, just like we love, on small data sets. Not as small as ours, because we're the kings of small data sets, but small-ish: CIFAR-10, ImageNet 64. That's the way to do things. Somebody, it might even have been the CEO of Hugging Face, the other day tweeted something saying only people with huge amounts of GPUs can do research now. And I think that totally misunderstands how research is done, which is that research is done on very small data sets. That's the actual research. And then when you're all done, you scale it up at the end. I think we're kind of pushing the envelope in terms of, yeah, how much can you do? And yeah, we've covered this main substantive path of diffusion models history step by step, showing every improvement and seeing clear improvements across all the papers, using nothing but Fashion MNIST running on a single GPU in like 15 minutes of training or something per model. So yeah, you definitely don't need lots of GPUs. Anyway, okay, so this is the sigma we're going to jump to. So the denoising is going to involve calculating c_skip, c_out and c_in, calling our model with the c_in-scaled data and the sigma, and then scaling it with c_out and then adding the c_skip part. Okay, so that's just undoing the noisify. So check this out, this is all that's required to do one step of denoising for the simplest kind of sampler, which is called Euler. So we basically say, okay, what's the sigma at time step i, what's sigma two, the sigma at the next time step. And now when I'm talking about time step, I'm really talking about the step from this function, right? So this is a sampling step, yeah. Okay, so then we denoise using the function, and then we say, okay, well, just send back whatever you were given, plus move a little bit in the direction of the denoised image.
So the direction is x minus denoised, so it's the noise; that's the gradient, as we discussed right back in the first lesson of this part. So we take the noise, and if we divide it by sigma, we get a slope: how much noise is there per sigma? And then the amount that we're stepping is sigma two minus sigma one. So take that slope and multiply it by the change, right? So that's the distance to travel at this step. Or you could also think of it this way, and I know this is a very obvious algebraic change, but if we move this over here, you could also think of this as being: of the total amount of noise, what percentage is the change in sigma we're doing? Okay, well, that's the amount we should step. Right, so there's two ways of thinking about the same thing. So again, this is just, you know, high school math. Well, I mean, actually, my seven-year-old daughter has done all these things: it's plus, minus, divide and times. So we're going to need to do this once per sampling step. So here's a thing called sample, which does that. It's going to go through each sampling step, call our sampler, which initially is going to be sample Euler, with that information, add it to our list of results, and do it again. So that's it, that's all the sampling is. And of course, we need to grab our list of sigmas to start with. So I think that's pretty cool. And at the very start, we need to create our pure noise image. And the amount of noise we start with has got a sigma of 80. Okay, so if we call sample using sample Euler, we get back some very nice looking images, and believe it or not, our FID is 1.98. So this extremely simple sampler, three lines of code plus a loop, has given us an FID of 1.98, which is clearly substantially better than our cosine schedule one. Now, we can improve it from there.
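The Euler step and the sampling loop can be tried without a trained model. The sketch below swaps in a toy denoiser: it assumes the data is Gaussian with standard deviation 0.66, for which the ideal denoiser is simply c_skip times the input and the exact end point of the sampling path is known. That Gaussian assumption, the scalar state, the `sigma_min=0.002` default and all function names are choices made for this illustration, not the notebook's setup.

```python
import math

SIG_DATA = 0.66  # assumed data standard deviation

def sigmas_karras(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels evenly spaced in sigma ** (1 / rho), plus a final 0."""
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def toy_denoise(x, sigma):
    """Stand-in for the trained model: ideal denoiser for N(0, SIG_DATA**2) data."""
    return x * SIG_DATA**2 / (sigma**2 + SIG_DATA**2)

def sample_euler(x, sigs, i, denoise):
    """One Euler step from sigs[i] to sigs[i + 1]."""
    sig, sig2 = sigs[i], sigs[i + 1]
    denoised = denoise(x, sig)
    return x + (x - denoised) / sig * (sig2 - sig)

def sample(sampler, denoise, noise, steps=100, sigma_max=80.0):
    """Start from pure noise at sigma_max and apply the sampler once per step."""
    sigs = sigmas_karras(steps, sigma_max=sigma_max)
    x = noise * sigma_max
    for i in range(len(sigs) - 1):
        x = sampler(x, sigs, i, denoise)
    return x

noise = 1.0  # one made-up unit-noise draw
result = sample(sample_euler, toy_denoise, noise)
# Exact end point of the same path for this toy denoiser.
exact = noise * 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
euler_err = abs(result - exact) / exact
print(f"euler={result:.4f}  exact={exact:.4f}  relative error={euler_err:.4f}")
```

The relative error printed at the end is the discretisation error of Euler's method on this toy problem; it says nothing about FID on real images.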
So one potential improvement: you might have noticed we added no new noise at all, right? This is a deterministic sampler; there's no rand anywhere here. So we can do something called an ancestral Euler sampler, which does add rand. So we basically do the denoising in the usual way, but then we also add some rand. And what we do need to make sure is, given that we're adding a certain amount of randomness, we need to remove that amount of randomness from the step that we take. So I won't go into the details, but basically there's a way of calculating how much new randomness and how much just going back in the existing direction we do. And so there's the amount in the existing direction and there's the amount in the new random direction. And you can just pass in eta, which, when we pass it into here, is going to scale that. So if we scale it by half, so basically half of it is new noise and half of it is going in the direction that we saw we should go, that makes it better still, again with 100 steps. And just to make sure I'm comparing to the same, yep, 100 steps. Okay, so that's fair, like with like. Okay, so that's adding a bit of extra noise. Now, something that I think we might have mentioned back in the first lesson of this part is something called Heun's method. And Heun's method is something which we can pictorially see here. To decide where to go, we basically say, okay, where are we right now? At our current point, what's the direction? So we take the tangent line, the slope, right? That's basically all it does, is it takes the slope. So it's, oh, here's the slope, you know, okay. And so if we take that slope, that would take us to a new spot. And then at that new spot, we can calculate a slope as well. And at the new spot, the slope is something else. So that's it here, right? And then you say, okay, well, let's go halfway between the two.
And let's actually follow that line. And so basically it's saying, okay, each of these slopes is going to be inaccurate. But what we could do is calculate the slope of where we are, the slope of where we're going, and then go halfway between the two. I actually find it easier to look at in code, personally. I'm just going to delete a whole bunch of stuff that's totally irrelevant to this conversation. So take a look at this compared to Euler. So here's our Euler, right? So the first line is exactly the same. Right? Then the denoising is exactly the same. Right? And then this step here is exactly the same. I've actually just done it in multiple steps for no particular reason. And then they say, okay, well, if this is the last step, then we're done. So actually the last step is Euler. But then what we do is we say, well, okay, for an Euler step, this is where we'd go. What does that look like if we denoise it? So this calls the model the second time. Right? And where would that take us if we took an Euler step there? If we took an Euler step there, what's the slope? And so what we then do is we say, oh, okay, well, just like in the picture, let's take the average. Okay, so let's take the average and then use that for the step. So that's all the Heun sampler does: it just takes the average of the slope where we're at and the slope where the Euler method would have taken us. And notice that it called the model twice for a single step. So to be fair, since we've been taking 100 steps with Euler, we should take 50 steps with Heun, right? Because it's going to call the model twice. And still, that is now, whoa, we beat one, which is pretty amazing. And so we keep going. Check this out. We can even go down to 20. This is actually doing 40 model evaluations. And this is better than our best Euler, which is pretty crazy.
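The Heun step can be compared with Euler on a problem where the exact answer is known. As a stated assumption, the sketch below uses a toy denoiser (ideal for Gaussian data with standard deviation 0.66) instead of a trained U-Net, a scalar state, and illustrative names and defaults throughout; the comparison is at equal model evaluations, 100 Euler steps against 50 Heun steps.

```python
import math

SIG_DATA = 0.66  # assumed data standard deviation

def sigmas_karras(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels evenly spaced in sigma ** (1 / rho), plus a final 0."""
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def toy_denoise(x, sigma):
    """Stand-in for the trained model: ideal denoiser for N(0, SIG_DATA**2) data."""
    return x * SIG_DATA**2 / (sigma**2 + SIG_DATA**2)

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    return x + (x - denoise(x, sig)) / sig * (sig2 - sig)

def sample_heun(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    d = (x - denoise(x, sig)) / sig                    # slope where we are
    x_euler = x + d * (sig2 - sig)                     # where Euler would land
    if sig2 == 0:
        return x_euler                                 # the last step is plain Euler
    d2 = (x_euler - denoise(x_euler, sig2)) / sig2     # slope at the landing point
    return x + (d + d2) / 2 * (sig2 - sig)             # step along the average slope

def sample(sampler, denoise, noise, steps, sigma_max=80.0):
    sigs = sigmas_karras(steps, sigma_max=sigma_max)
    x = noise * sigma_max
    for i in range(len(sigs) - 1):
        x = sampler(x, sigs, i, denoise)
    return x

noise = 1.0
exact = noise * 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
euler_err = abs(sample(sample_euler, toy_denoise, noise, steps=100) - exact) / exact
heun_err = abs(sample(sample_heun, toy_denoise, noise, steps=50) - exact) / exact
print(f"Euler, 100 steps (100 model calls): relative error {euler_err:.5f}")
print(f"Heun,   50 steps (~100 model calls): relative error {heun_err:.5f}")
```

On this toy problem the averaged slope tracks the exact path more closely than Euler at the same number of model calls, which is the same pattern as the FID comparison described above, though the toy error is not an FID.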
Now, something which you might have noticed is kind of weird about this, or kind of silly about this, is that we're calling the model twice just in order to average the results. But we already have two model results without calling it twice, because we could have just looked at the previous time step. And so something called the LMS sampler does that instead. And the LMS sampler, if I call it with 20, it actually literally does 20 evaluations. And actually it beats Euler with 100 evaluations. And so LMS, I won't go into the details too much; it didn't actually fit into my little sampling framework very well, so I basically largely copied and pasted Kat's code. But the key thing it does is, look, it gets the current sigma, it does the denoising, it calculates the slope, and it stores the slope in a list, right? And then it removes the oldest one from the list. So it's keeping a list of up to, in this case, four at a time. And it then uses up to the last four to basically guess at the curvature of this and take the next step. So that's pretty smart. And yeah, so I think if you wanted to do super fast sampling, it seems like a pretty good way to do it. And I think, Jono, you were telling me, or maybe it was Pedro who was saying, that this was very popular, but people have started to move towards a new sampler which is a bit similar, called the DPM++ sampler, something like that. Yeah. Yeah. Yeah. But I think it's the same idea. I think it kind of keeps a list of recent results and uses that. I'll have to check it more closely. Let's look at the code. Yeah. That's a similar idea. It's like, if it's done more than one step, then it's using some history to infer the next thing. Yeah. The history here doesn't make a huge amount of sense, I guess, from that perspective. I mean, it still works very well. This makes more sense.
So then we can compare: if we use an actual mini batch of data, we get about 0.5. So yeah, I feel like this is quite a stunning result, to get very close to real data in terms of FID, really with 40 model evaluations. And nearly the entire thing here comes from making sure we've got unit variance inputs, unit variance outputs, and kind of equally difficult problems to solve in our loss function. Yeah. Plus having that different schedule for sampling that's completely unrelated to the training schedule. I think that was one of the big things with Karras et al.'s paper: they could also apply this to existing diffusion models that have been trained by other papers, so we can use our sampler and in fewer steps get better results without any of the other changes. And yeah, I mean, they do a little bit of rearranging equations to get the other papers' versions into their c_skip, c_in, c_out framework. But then, yeah, it's really nice that these ideas can be applied more widely. So for example, I think Stable Diffusion, especially version one, was trained with DDPM-style training, epsilon objective, whatever, but you can now get these different samplers and different sampling schedules and things like that, and use them to sample it in 15 or 20 steps and get pretty nice samples. Yeah. You know, another nice thing about this paper, and in fact it's the name of the paper, Elucidating the Design Space of Diffusion-Based Generative Models, is that they've looked at various different papers and approaches and tried to say, oh, you know what, these are all doing the same thing when we parameterize things in this way. And if you fill in these parameters, you get this paper, and with these parameters, you get that paper. And then they found a better set of parameters, which was very nice to code because it really actually ended up simplifying things a whole lot.
And so if you look through the notebook carefully, which I hope everybody will, you'll see that the code is really clear and simple compared to all the previous ones, in my opinion. I feel like with every notebook we've done from DDPM onwards, the code's got easier to understand and the results have got better. And just to clarify how this connects with some of the previous papers that we've looked at: for example, DDIM is again this sort of deterministic approach, similar to the Euler method sampler that we were just looking at, which was completely deterministic. And then something like the Euler ancestral sampler that we were looking at is similar to the standard DDPM approach, which was a more stochastic approach. So it's nice to see, again, these connections between the different papers, and how they can all be expressed in this common framework. Yeah. Thanks, Tanish. So we're definitely now at the point where we can show you the U-Net next time. And unless any of us come up with interesting new insights on the unconditional diffusion training and sampling process, we might be putting that aside for a while. Instead, we're going to be looking at training a good quality U-Net from scratch. And we're going to look at a different dataset to do that, as we're starting to scale things up a bit, as Jono mentioned in the last lesson. So we're going to be using a 64 by 64 pixel ImageNet subset called Tiny ImageNet, and we'll start looking at some three-channel images. I'm sure we're all sick of looking at black and white shoes, so now we get to look at cliff dwellings and trolleybuses and koala bears and, yeah, 200 different things. So that'll be nice. Yeah. All right. Well, thank you, Jono. Thank you, Tanish. That was fun as always. And yeah, next time will be lesson 22.
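The deterministic versus stochastic distinction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's code: the toy 1-D Gaussian denoiser, the geometric sigma schedule, and the names `ancestral_split` and `sample` are all made up for the example. The split of each step into a deterministic part down to `sig_down` plus fresh noise of size `sig_up` is one common way ancestral samplers are written, with an `eta` knob where 0 recovers plain Euler.

```python
import math
import random

SIGMA_DATA = 0.5  # std of the toy 1-D Gaussian "dataset" (an assumption)

def denoise(x, sig):
    """Ideal denoiser for zero-mean Gaussian data with std SIGMA_DATA."""
    return x * SIGMA_DATA**2 / (SIGMA_DATA**2 + sig**2)

def ancestral_split(sig_from, sig_to, eta=1.0):
    """Split a step so that sig_down**2 + sig_up**2 == sig_to**2."""
    sig_up = min(sig_to, eta * math.sqrt(
        sig_to**2 * (sig_from**2 - sig_to**2) / sig_from**2))
    sig_down = math.sqrt(sig_to**2 - sig_up**2)
    return sig_down, sig_up

def sample(x, sigs, eta=0.0, seed=0):
    """eta=0: deterministic Euler steps. eta=1: Euler ancestral steps."""
    rng = random.Random(seed)
    for sig, sig_next in zip(sigs[:-1], sigs[1:]):
        d = (x - denoise(x, sig)) / sig           # slope at the current sigma
        sig_down, sig_up = ancestral_split(sig, sig_next, eta)
        x = x + d * (sig_down - sig)              # deterministic part
        x = x + rng.gauss(0.0, 1.0) * sig_up      # re-inject fresh noise
    return x

# Hypothetical geometric schedule from 80 down to about 0.003, then 0.
sigs = [80.0 * 0.7**i for i in range(30)] + [0.0]

print("euler, seed 1:    ", sample(5.0, sigs, eta=0.0, seed=1))
print("euler, seed 2:    ", sample(5.0, sigs, eta=0.0, seed=2))
print("ancestral, seed 1:", sample(5.0, sigs, eta=1.0, seed=1))
print("ancestral, seed 2:", sample(5.0, sigs, eta=1.0, seed=2))
```

With `eta=0` the starting point fully determines the result, which is the DDIM-like behaviour. With `eta=1` every step adds new randomness, so the same starting point gives different samples, which is closer in spirit to DDPM.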
Bye. Lesson 22? This was lesson 22. Oh, no way. Okay. You're right. Thank you. See ya. Bye-bye.