Hi everybody and welcome to the last lesson of part two. Greetings, Jono, and greetings to Nish. How are you guys doing? Good thanks. Doing well, excited for the last lesson. It's been an interesting, fun journey. Yeah, I should explain. We're not quite completing all of stable diffusion in this part of the course. There's gonna be one piece left for the next part of the course, which is the CLIP embeddings. And that's because CLIP is NLP. And so in the next part of the course, we will be looking at NLP. So we will end up finishing stable diffusion from scratch, but we're gonna have to have a significant diversion. And what we thought was, given everything that's happened with GPT-4 and stuff since we started this course, we thought it makes more sense to delve into that quite deeply quite soon, and delay CLIP as a result. So hopefully people will feel comfortable with that decision. But I think we'll have a lot of, yeah, exciting NLP material coming up. So that's the rough plan. All right, so I think what we might do is maybe start by looking at a really interesting and quite successful application of pixel-level diffusion, by applying it not to pixels that represent an image but to pixels that represent a sound, which is pretty crazy. So maybe Jono, of course it's gonna be Jono, he does the crazy stuff, which is great. So Jono, show us your crazy, and crazily successful, approach to diffusion for pixels of sounds, please. Sure thing. Right, so this is gonna be a little bit of a show and tell. Most of the code in the notebook is just copied and pasted from, I think, notebook 30. But we're gonna be trying to generate something other than just images. So specifically I'm gonna be loading up a dataset of bird calls. These are just like short samples of, I think, 10 different classes of birds calling. And so we need to understand like, okay, well, this is a totally different domain, right? This is audio. If you look at the data, like let's look at an example of the data.
This is coming from a Hugging Face dataset. So that line of code will download it automatically if you haven't got it before, right? Yeah, yeah. So this will download it into a cache and then sort of handle a lot of it. You created this dataset, right? Did you, is this already a dataset you found somewhere else, or you made it, or what? This is a subset that I made from a much larger dataset of longer call recordings from an open website called Xeno-Canto. So they collect all of these sound recordings from people. They have experts who help identify what birds are calling. And so all I did was find the audio peaks, like where is there most likely to be a bird call, and clip around those, just to get a smaller dataset of things where there's actually something happening. Not a particularly amazing dataset in terms of like the recordings have a lot of background noise and stuff, but a fun small audio one to play with. Yeah, and so when we talk about audio, you've got a microphone somewhere, it's reading like a pressure level essentially in the air with these sound waves, and it's doing that some number of times per second. So we have a sample rate, and in this case the data has a sample rate of 32,000 samples per second. So every second... This is a waveform that's being approximated as lots of little up-across, up-across, up-across kind of steps, basically, is that right? Yeah, and so that's great for capturing the audio, but it's not so good for modeling, because we now have 32,000 values per second in this one big 1D array. And so yeah, you can try and find models that can work with that kind of data, but what we're gonna do is a little hack, and we're instead gonna use something called a spectrogram. So illustrating... Is the main issue with the original data that it's too big and slow to work with? It's too big, but also you have some... Like some sound waves are at 100 hertz, right?
So they're going up and down 100 times a second, and some are at 1,000, and some are at 10,000, and often there's background noise that can have extremely high frequency components. And so if you're looking just at the waveform, there's lots and lots of change second to second, and there's some very long-range dependencies, of like, oh, it's generally high here, it's generally low there, and so it can be quite hard to capture those patterns. And so part of it is it's just a lot of samples to deal with, but part of it also is that it's not like an image, where you can just do like convolution, and things nearby each other tend to be related, or something like that. It's quite tricky to disentangle what's going on. And so we have this idea of something called a spectrogram. This is a fancy 3D visualization, but it's basically just taking that audio and mapping time on one axis. So you can see as time goes by we're moving along the X axis, and then on the Y axis is frequency. And so the peaks here show like intensity at different frequencies. And so if I make a pure note, you can see that that is being mapped in the frequency domain. But when I'm talking, there's lots and lots of peaks, and that's because our voices tend to produce a lot of overtones. So if I go, you can see there's a main note, but there's also these subsequent notes. And if I play something like a chord, you can see there's maybe three main peaks, and then each of those have these harmonics as well. So it captures a lot of information about the signal. And so we're gonna turn our audio data into something like this, where even just visually, if I play a bird call, you can see this really nice spatial pattern. And the hope is, if we can generate that, and then if we can find some way to turn it back into audio, then we'll be off to the races. And so yeah, that's what I'm doing in this notebook. I'm leaning on the Mel class from diffusers' pipelines.audio_diffusion module.
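The time-on-one-axis, frequency-on-the-other idea described above can be sketched with `torch.stft`. This is a minimal illustration, not the lesson's actual code: the `n_fft` and `hop` values are made-up choices, and it's a plain log-magnitude spectrogram rather than a mel one.

```python
import torch

# One second of a pure 440 Hz tone at the bird-call dataset's sample rate.
sr = 32_000
t = torch.arange(sr) / sr
wave = torch.sin(2 * torch.pi * 440 * t)

# Short-time Fourier transform: slide a window along the waveform and
# measure intensity per frequency bin, giving an image-like 2-D array.
n_fft, hop = 1024, 256
spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)
mag = spec.abs()                  # intensity at each (frequency, time) bin
log_mag = torch.log(mag + 1e-5)   # log scale, closer to perceived loudness

print(mag.shape)  # (n_fft//2 + 1 frequency bins, number of time frames)
```

For the pure tone, almost all the energy lands in one frequency bin (440 Hz falls near bin 14, since each bin spans sr/n_fft = 31.25 Hz); speech or a chord would light up several bins plus their harmonics, exactly the pattern being described.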
And so within the realm of spectrograms there's a few different ways you can do it. So this diagram is from the torchaudio docs, via the notebook from the Hugging Face Diffusion Models class. So we had that waveform, that's those raw samples, and we'd like to convert that into what they call the frequency domain, which is things like these spectrograms. And so you can do a normal spectrogram or a power spectrogram or something like that, but we often use something called a mel spectrogram, which is exactly the same idea. It's actually probably what's being visualized here, and it's something that's designed to map the frequency ranges into a range that's like tied to what human hearing is based on. And so rather than trying to capture all frequencies from zero hertz to 40,000 hertz, a lot of which we can't even hear, it focuses in on the range of values that we tend to be interested in as humans. And also it does like a transformation into kind of like a log space, so that the intensities, highs and lows, correspond to loud and quiet for human hearing. So it's very tuned for the types of audio information that we actually might care about, rather than the tens of kilohertz that only bats can hear. Okay, so we're gonna rely on a class to abstract this away, but it's gonna basically give us a transformation from waveform to spectrogram. And then it's also gonna help us go from spectrogram back to waveform. And so let me show you my data. I have this to_image function. It's gonna take the audio array. It's going to use the Mel class to handle turning that into spectrograms. And the class also does things like it splits it up into chunks based on, you can set like a desired resolution. I'd like a 128 by 128 spectrogram; it says, okay, great, I know you need 128 frequency bins for the frequency axis and 128 steps on the time axis. So it kind of handles that converting and resizing. And then it gives us this audio_slice_to_image function.
So that's taking a chunk of audio and turning it into the spectrogram. And it also has the inverse. So our dataset is fairly simple. We're just referencing our original audio dataset, but we're calling that to_image function, and then we're turning it into a tensor, and we're mapping it to minus 0.5 to 0.5, similarly to what we've done with like the grayscale images in the past. So if you look at a sample from that data, we now have, instead of an audio waveform of 32,000 samples, or 64,000 if it's two seconds, we now have this 128 by 128 pixel spectrogram, which looks like this. And it's grayscale; this is just matplotlib's colors. But we can test out going from the spectrogram back to audio using the image_to_audio function that the Mel class has. And that should give us a sense of what it sounds like. Now this isn't perfect, because the spectrogram shows the intensity at different frequencies, but with audio you've also got to worry about something called the phase. And so this image_to_audio function is actually behind the scenes doing a kind of iterative approximation with something called the Griffin-Lim algorithm. So I'm not gonna try and describe that here, but it's just approximating, it's guessing what the phase should be. It's creating a spectrogram, it's comparing that to the original, it's updating that, doing something iterative, very similar to like an optimization thing, to try and generate an audio signal that would produce the spectrogram which we're trying to invert. So just to clarify, so my understanding of what you're saying is that the spectrogram is a lossy conversion of the sound into an image, and specifically it's lossy because it tells you the kind of intensity at each point, but it's not, it's kind of like, is it like the difference between a sine wave and a cosine wave? Like they're just shifted in different ways and we don't know how much it's shifted. So coming back to the sound, you do have to get that shifting, the phase, correct.
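The iterate-and-project phase recovery just described can be sketched as a toy Griffin-Lim loop. This is an assumption-laden simplification, not diffusers' actual implementation: start from a random phase guess, resynthesise audio, re-analyse it, keep the recovered phase, and snap the magnitude back to the target each round.

```python
import torch

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=32):
    # mag: target magnitude spectrogram, shape (freq_bins, frames)
    window = torch.hann_window(n_fft)
    angle = torch.rand_like(mag) * 2 * torch.pi        # random phase guess
    for _ in range(n_iter):
        # impose the target magnitude, synthesise a waveform...
        wave = torch.istft(mag * torch.exp(1j * angle),
                           n_fft, hop_length=hop, window=window)
        # ...then re-analyse it and keep only the new phase estimate
        angle = torch.stft(wave, n_fft, hop_length=hop, window=window,
                           return_complex=True).angle()
    return torch.istft(mag * torch.exp(1j * angle),
                       n_fft, hop_length=hop, window=window)

# Round-trip a pure tone: magnitude spectrogram in, audio back out.
sr = 32_000
wave = torch.sin(2 * torch.pi * 440 * torch.arange(sr) / sr)
mag = torch.stft(wave, 1024, hop_length=256, window=torch.hann_window(1024),
                 return_complex=True).abs()
recovered = griffin_lim(mag)
```

The recovered waveform has the right spectrogram but a guessed phase, which is exactly why the round-tripped audio in the notebook sounds slightly off.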
And so it's trying to guess something, and it sounds like it's not doing a great guess, from the thing you showed. The original audio is also not that amazing, but yes, for the spectrogram-back-to-audio task, these dotted lines are like highlighting that this is, yeah, an approximation. And there are deep learning methods now that can do that better, or at least that sound much higher quality, because you can train a model somehow to go from this image-like representation back to an audio signal, but we just use the approximation for this notebook. Okay, so now that we can represent our data as like a grayscale 128 by 128 pixel image, everything else becomes very much the same as the previous diffusion models examples. We're gonna use this noisify function to add different amounts of noise. And so we can see now we have our spectrograms but with varying amounts of noise added. We can create a simple diffusion model, just copying and pasting the rest, but with one extra layer with very few channels, to go from 128 down to 64 and on down, no attention, just, I think, pretty much copied and pasted from notebook 30, and train it for, in this case, 15 epochs. It took about, I don't know. Oh, this is interesting, are you using simple diffusion? Yes, so specifically this is the simple diffusion model that I think I've already introduced, maybe not. I think we briefly looked at it, so maybe remind us of what it does, yeah. Oh, okay, yeah. So we have some number of down blocks with a specified number of channels, and then the key insight from simple diffusion was that you often want to concentrate the compute in the sort of middle, at the low resolution. So that's these mid blocks. And they're transformers. Yes, yeah. And so we can stack some number of those, and then the corresponding up path, and this is a U-Net. So we're passing in the features from the down path as we go through those up blocks.
And so we're gonna take an image and time step. We can embed the time step. We're going to go through our down blocks, saving the results. We're gonna go through the mid blocks. There we go, through the mid blocks. Yeah, and before that, you've also got the embedding of the locations, that self.le, the learnable embeddings, using scale and shift, as I remember. Right, so this is preparing it to go through the transformer blocks by adding some learnable embeddings. Cool. Right, and then we're reshaping it to be effectively a sequence, since that's how we'd written our transformer, to expect a 1D sequence of embeddings. And so once you've gone through those mid blocks, we reshape it back, and then we go through the up blocks, passing in also our saved outputs from the down path. Yeah, so it's a nice model. You can really control how many parameters and how much compute you're using just by setting what the number of features or channels is at each of those down block stages, and how many mid blocks you're going to stack. And so if you want to scale it up, it's quite easy to do. I mean, just add more mid blocks, maybe add more channels to the down and up paths. So it's a very easy model to tweak to get a larger or smaller model. One fun thought, Jono, is simple diffusion only came out a couple of months ago, and I think ours might be the first publicly available code for it; I don't think the authors released the code. I suspect this is probably the first time maybe it's ever been used to generate audio before. Possibly, yeah, I guess. I know a couple of people who've at least privately done their implementations. When I asked the author if he was releasing code, he said, oh, but it's simple, it's just a bunch of transformer blocks, you can do it yourself, I'll release it eventually. Or maybe not; I'm not maligning them, but they were like, oh, you can see the pseudo code, it's pretty easy. Yeah, it's pretty easy. Yeah, cool.
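The shape of the model just walked through can be sketched schematically: conv down blocks that save skip features, transformer blocks at the lowest resolution on a flattened 1D sequence, then up blocks that consume the skips. This is an illustrative toy, not the lesson's notebook code; the class name, channel counts, and block counts are made up, and the time-step and learnable position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinySimpleDiffusion(nn.Module):
    def __init__(self, chans=(8, 16, 32), n_mid=2):
        super().__init__()
        # down path: each conv halves the resolution and saves a skip
        self.downs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip((1,) + chans[:-1], chans))
        # mid path: stack transformer blocks at the lowest resolution
        d = chans[-1]
        self.mids = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=2 * d,
                                       batch_first=True)
            for _ in range(n_mid))
        # up path: each block takes [features, skip] concatenated
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(2 * c_in, c_out, 4, stride=2, padding=1)
            for c_in, c_out in zip(reversed(chans),
                                   tuple(reversed(chans[:-1])) + (1,)))

    def forward(self, x):
        skips = []
        for d_blk in self.downs:
            x = torch.relu(d_blk(x)); skips.append(x)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (batch, h*w, channels)
        for m_blk in self.mids:
            seq = m_blk(seq)
        x = seq.transpose(1, 2).reshape(b, c, h, w)  # back to an image
        for u_blk in self.ups:
            x = u_blk(torch.cat([x, skips.pop()], dim=1))
        return x

model = TinySimpleDiffusion()
out = model(torch.randn(2, 1, 128, 128))        # e.g. noisy spectrograms
```

Scaling up is then just the tweak described: a longer `chans` tuple for more capacity in the down/up paths, or a bigger `n_mid` to concentrate more compute at low resolution.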
So it trains, the loss goes down, as we hoped. Sampling is exactly the same as generating images normally, and that's gonna give us the spectrograms. So I'm using DDIM sampling with a hundred steps. And to actually listen to these samples, we then are just going to use that image_to_audio function again to take our grayscale image. And in this case, actually, it expects a PIL image, so I first converted it to PIL and then turned that back into audio. And so we can play some of the generated samples. Wow. That's so cool. I don't know that I could guarantee what bird is making these calls, and some of them are better than others. Yep. Some of the original samples sound like that, so. Exactly, yeah. So, yeah, that's generating fake bird calls with spectrogram diffusion. There's projects that do this on music. So the Riffusion project, it's actually based on text. And yeah, there's various other like pre-trained models that do diffusion on spectrograms to produce music clips or voice or whatever. I may have frozen. Riffusion is actually this stable diffusion model that's fine-tuned specifically for the spectrogram generation, which I find very impressive. It's like a model that was originally for text-to-image can instead also generate these spectrograms. I guess there's still some useful information in the sort of text-to-image model that kind of generalizes, or can still be used for text-to-audio. So I found that a very interesting, impressive application as well. Also, Riffusion is an awesome name. Indeed it is, yeah. And I guess since it's a latent model, that leads us onto the next topic, right? I was just gonna say we've got a natural segue there, yes. So if we want to replicate Riffusion, then we'll need latents. Yeah, so the final non-NLP part of stable diffusion is this ability to use the more compressed representation created by a VAE, called latents, instead of pixels.
So we're gonna start today by creating a VAE, taking a look at how it works. So to remind you, as we learned back in the first lesson of this part, part two, the VAE model converts the 256 by 256 pixel three channel image into, is it 64 by 64 by four? It'll be 32 if it's 256; it's 512 that goes to 64. Oh, 512 to 64, okay. So 256 would do a 32 by 32 by four. So dramatically smaller, which makes life so much easier, which is really nice. Having said that, simple diffusion does the first few, in fact all, of the down sampling pretty quickly, and all the hard work happens at a 16 by 16 anyway. So maybe with simple diffusion it's not as big a deal as it used to be, but it's still very handy, particularly because for us folks with more normal amounts of compute, we can take advantage of all that hard work that the Stability AI computers did for us by creating the stable diffusion VAE. So that's what we're gonna do today. But first of all, we're gonna create our own. So let's do a VAE using fashion MNIST. So the first stuff is just the normal imports. One thing I am gonna do for this simple example, though, is I'm gonna flatten the fashion MNIST pixels into a vector to make it as simple as possible. Okay, so we're gonna end up with vectors of length 784, because 28 by 28 is 784. We're gonna create a single hidden layer MLP with 400 hidden and then 200 outputs. So here's a linear layer. So it's a sequential containing a linear and then an optional activation function and then an optional normalization. We'll update init_weights so that we initialize linear layers as well. So before we create a VAE, which is a variational autoencoder, we'll create a normal autoencoder. We've done this once before and we didn't have any luck. In fact, we were so unsuccessful that we decided to go back and create a learner and come back a few weeks later once we knew what we were doing. So here we are, we're back. We think we know what we're doing.
So we're just gonna recreate an autoencoder just like we did some lessons ago. So there's gonna be an encoder, which is a sequential which goes from our 784 inputs to our 400 hidden, then a hidden layer with our 400 hidden, and then an output layer from the 400 hidden to the 200 outputs of the encoder. So there we've got our latents. And then the decoder will go from those 200 latents to our 400 hidden, have our hidden layer, and then come back to our 784 inputs. All right, so we can optimize that in the usual way using Adam, and we'll do it for 20 epochs. It runs pretty quickly because it's quite a small dataset and quite a small model. And so what we can then do is we can grab a batch of our X, or actually we grabbed the batch of X earlier, way back here. So I've got a batch of images and we can put it through our model, pop it back on the CPU, and we can then have a look at our original mini-batch, and we have to reshape it to 28 by 28 because we previously had flattened it. So there's our original, and then we can look at the result after putting it through our model, and there it is. And as you can see, it's very roughly regenerated. And so this is not a massive compression; it's compressing it from 784 to 200. And it's also not doing an amazing job of recreating the original details. But this is the simplest possible autoencoder. So it's doing, it's a lot better than our previous attempt. So that's good. So what we could now do is we could just generate some noise, and then we're not even going to do diffusion. We're just going to go and say, like, okay, we've got a decoder, so let's just decode that noise and see what it creates. And the answer is not anything great. I mean, I could kind of recognize that might be the start of a shoe, maybe that's the start of a bag, I don't know, but it's not doing anything amazing. So we have not successfully created an image generator here. But there's a very simple step we can do to make something that's more like an image generator.
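The plain autoencoder described above can be sketched in a few lines. This is a minimal stand-in, not the notebook's code: the `lin` helper and the choice of SiLU as the activation are assumptions, and initialization, normalization, and the training loop are left out.

```python
import torch
import torch.nn as nn

def lin(ni, nf, act=True):
    # a linear layer with an optional activation, like the lesson's helper
    layers = [nn.Linear(ni, nf)]
    if act:
        layers.append(nn.SiLU())
    return nn.Sequential(*layers)

nh, nl = 400, 200                     # hidden size and latent size
ae = nn.Sequential(
    lin(784, nh), lin(nh, nh), lin(nh, nl),              # encoder -> latents
    lin(nl, nh), lin(nh, nh), lin(nh, 784, act=False))   # decoder -> logits

x = torch.rand(16, 784)               # a batch of flattened 28x28 images
recon = ae(x)                         # reconstruction logits, same shape as x
```

The final layer has no activation because the loss is binary cross entropy with logits, so the sigmoid is applied inside the loss rather than in the model.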
The problem is that these 200, this vector of length 200 we're creating, there's no particular reason that things that are not in the dataset are going to create items of clothing. We haven't done anything to try to make that happen. We've only tried to make this work for the things in the dataset. And therefore when we just randomly generate a, you know, a vector of length 200, or 16 vectors of length 200 in this case, and then decode them, there's no particular reason to think that they're going to create something that's recognizable as clothing. So the way a VAE tries to fix this is, we've got the exact same encoder as before, except it's just missing its final layer. Its final layer has been moved over to here; I'll explain why there's two of them in a moment. So we've got the inputs to hidden, the hidden to hidden, and then the hidden to the latents. The decoder is identical, okay: latents to hidden, hidden to hidden, hidden to inputs. And then just as before, we call the encoder. But we do something a little bit weird next, which is that we actually have two separate final layers. We've got one called mu for the final layer of the encoder, and one called lv, which stands for log variance. So our encoder has two different final layers. So we're going to call both of them, okay. So we've now got two encoded 200-long lots of latents. What do we do with them? What we do is we use them to generate random numbers. And the random numbers have a mean of mu. So when you take a random zero-one, so this creates zero-one random numbers, that means mean zero, standard deviation one, then if we add mu to it, they now have a mean of mu, or approximately. And if you multiply the random numbers by e to the power of half the log variance, right, so given this log of variance, this is going to give you the standard deviation. So this is going to give you a standard deviation of e to the half lv, and a mean of mu. Why the half?
It doesn't matter too much, but if you think about it, the standard deviation is the square root of the variance. So when you take the log, you can move that square root into the multiplication as a half, because of the log trick. That's why we've just got the half here, instead of the square root, which would be to the power of a half. So this is just, yeah, this is just the standard deviation. So you've got the standard deviation times normally distributed random noise, plus mu. So we end up with normally distributed numbers. We're going to have 200 of them for each element of the batch, where they have a mean which is the result of one final layer, and a log variance which is the result of the other final layer. And then finally, we pass that through the decoder as usual. I'll explain later why we pass back three things, but for now we're just worried about the fact that we pass back the result of the decoder. So what this is going to do is, the result of calling encode is going to be a little bit random. On average, it's still generating exactly the same as before, which is the result of a sequential model, an MLP with one hidden layer. But it's also going to add some randomness around that, right? So here's the bit which is exactly the same as before (this is the same as calling encode before), but then here's the bit that adds some randomness to it. And the amount of randomness is also itself random. Okay, so then that gets run through the decoder. Okay, so if we now just trained that, right? Using the result of the decoder and using, I think we didn't use MSE loss, we used binary cross entropy loss, which we've seen before. So if you've forgotten, you should definitely go back and re-watch that, from part one, or we've done a bit of it in part two as well: binary cross entropy loss. "With logits" means that you don't have to worry about doing the sigmoid; it does the sigmoid for you.
So if we just optimized this using BCE now, you would expect, and I believe this although I haven't checked, that it would basically take this layer here and push its outputs hugely negative, as a result of which it would have essentially no variance at all. And therefore it would behave exactly the same as the previous autoencoder. Does that sound reasonable to you guys? Yeah, okay. So that wouldn't help at all, because what we actually want is we want some variance. And the reason we want some variance is we actually want to have it generate some latents which are not exactly our data. They're around our data, but they're not exactly our data. And when it generates latents that are around our data, we want them to decode to the same thing; we want them to decode to the correct image. And so as a result, if we can train that, right, something that does include some variation and still decodes back to the original image, then we've created a much more robust model. And then that's something that we would hope, when we say, okay, well now decode some noise, it's gonna decode to something better than this. So that's the idea of a VAE. So how do we get it to create a log variance which doesn't just collapse? Well, we have a second loss term. It's called the KL divergence loss; we've got a function called KLD loss. And what we're gonna do is, our VAE loss is gonna take the binary cross entropy between the actual decoded bit, so that's input zero, and the target. Okay, so this is exactly the same binary cross entropy as before. And we're going to add it to this KLD loss, the KL divergence. Now KL divergence, the details don't matter terribly much. What's important is, when we look at the KLD loss, it's getting passed the input and the targets. But if you look, it's not actually using the targets at all. So if we pull out the input into its three pieces, which is our predicted image, our mu and our log variance, we don't use the predicted image either.
So the BCE loss only uses the predicted image and the actual image. The KL divergence loss only uses mu and log variance. And all it does is it returns a number which says, for each item in the batch, is mu close to zero, and is the variance close to one. How does it do that? Well, for mu, it's very easy: mu squared. So if mu is close to zero, then minimizing mu squared does exactly that, right? If mu is one, then mu squared is one. If mu is minus one, mu squared is one. If mu is zero, mu squared is zero. That's the lowest you can get for a square. Okay, so we've got a mu squared piece here. And we've got a dot mean. So that's just basically taking the mean of all the mus. And then there's another piece, which is we've got log variance minus e to the power of log variance. So if we look at that, let's just grab a bunch of numbers between negative three and three, and do number minus e to the power of that number. And I'm just gonna pop in the one plus and the 0.5 times as well; they don't matter much. And you can see that's got a minimum at zero. So when that's at its minimum of zero, e to the power of the log variance (which is what we're gonna be using in our forward method, actually half times it, but that's okay) is gonna be e to the power of zero, which is one. So this is gonna be minimized where lv.exp() equals one. So therefore this whole piece here will be minimized when mu is zero and lv is also zero, and so therefore e to the power of lv is one. Now, the reason that it's specifically this form is basically because there's a specific mathematical thing called the KL divergence, which compares how similar two distributions are. And so the normal distribution can be fully characterized by its mean and its variance.
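Putting those pieces together, the two-part loss can be sketched as follows. The function names and the tuple layout of `inp` are illustrative (the lesson's code may organise this differently), but the KLD formula matches the description: a mu-squared term and a `lv - lv.exp()` term, zero exactly when the mean is zero and the variance is one.

```python
import torch
import torch.nn.functional as F

def kld_loss(inp, tgt):
    _, mu, lv = inp                 # targets are ignored, as noted above
    # -0.5 * (1 + lv - mu^2 - e^lv): minimised at mu == 0, lv == 0
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).mean()

def bce_loss(inp, tgt):
    # only uses the predicted image (input zero) and the actual image
    return F.binary_cross_entropy_with_logits(inp[0], tgt)

def vae_loss(inp, tgt):
    return bce_loss(inp, tgt) + kld_loss(inp, tgt)

mu = torch.zeros(4, 200); lv = torch.zeros(4, 200)
k0 = kld_loss((None, mu, lv), None)        # mean 0, variance 1: minimised
k1 = kld_loss((None, mu + 1, lv), None)    # any nonzero mean is penalised
```

Both loss pieces take `(input, targets)` with the same signature, which is what lets them be added together into one loss and also reported separately as metrics.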
And so this is actually more precisely calculating the similarity, specifically the KL divergence, between the distribution with the actual mu and lv that we have, and a distribution with a mean of zero and a variance of one. But you can see, hopefully, why conceptually we have this mu.pow(2), and why we have this lv minus lv.exp() here. So that is our VAE loss. Did you guys have anything to add to any of that description? So maybe to highlight, the objective of this is to say, rather than having it so that the exact point that an input is encoded to decodes back to that input, we're saying, number one, the space around that point should also decode to that input, because we're gonna try and force some variance. And number two, the overall variance should be like, yeah, the overall space that it uses should be roughly zero mean and unit variance. Right, so instead of being able to map each input to an arbitrary point, and then decode only that exact point back to an input, we're now mapping them to a restricted range, and we're saying that not just each point but its surroundings as well should also decode back to something that looks like that image. And that's trying to condition this latent space to be much nicer, so that any arbitrary point within that range will hopefully map to something useful. Which is a harder problem to solve, right? So we would expect, given that this is exactly the same architecture, we would expect its ability to actually decode would be worse than our previous attempt, because it's a harder problem that we're trying to solve, because we've got random numbers in there as well now. But we're hoping that its ability to generate images will improve. Thanks, Jono.
Okay, so I actually asked Bing about this, which is more of an example of like, I think now that we've got GPT-4 and Bing and stuff, I find they're pretty good at answering questions. I wanted to explain to students what would happen if the variance of the latents was very low, or what if it was very high? So why do we want it to be one? And I thought, like, oh gosh, this is hard to explain. So maybe Bing can help. And I actually thought it did pretty well, so I'll just say what Bing said. So Bing says, if the variance of the latents is very low, then the encoder distribution would be very peaked and concentrated around the mean. So that was the thing we were describing earlier: if we had trained this without the KLD loss at all, right, it would probably make the variance zero. And so therefore the latent space would be less diverse and expressive, and limit the ability of the decoder to reconstruct the data accurately, and make it harder to generate new data that's different from the training data, which is exactly what we're trying to do. And if the variance is very high, then the encoder distribution would be very spread out and diffuse. The latents would be more noisy and random, making it easier to generate new data that's unrealistic or nonsensical. Okay, so that's why we want it to be exactly at a particular point. So when we train this, we can just pass VAE loss as our loss function, but it'd be nice to see how well it's going at reconstructing the original image, and how it's going at creating zero-one distributed data, separately. So what I ended up doing was creating just a really simple thing called FuncMetric, which I derived from the capital-M Mean class, just trying to find it here, from torcheval.metrics. So they've already got something that can just calculate means. So obviously this stuff's all very simple, and we've created our own metrics class ourselves back a while ago.
But since we're using torcheval, I thought this is useful to see how we can create a custom metric, where you can pass in some function to call before it calculates the mean. So you might remember that the way torcheval works is it has this thing called update, which gets passed the input and the targets. So I add to the weighted sum the result of calling some function on the input and the targets. So we want two new metrics. One we're going to print out as KLD, which is a FuncMetric on KLD loss, and one which we'll print out as BCE, which is a FuncMetric on BCE loss. And so when we create the learner, the loss function it uses is VAE loss, but we're going to pass in as metrics this list of additional metrics to print out. So it's just going to print them out. And in some ways it's a little inefficient, because it's going to calculate KLD loss twice and BCE loss twice, once to print it out and once to go into the actual loss function, but it doesn't take long for that bit, so I think that's fine. So now when we call learn.fit, you can see it's printing them all out. So the BCE that we got last time was 0.26, and this time it's 0.31, so yeah, it's not as good, because it's a harder problem and it's got randomness in it. And you can see here that the BCE and KLD are pretty similar in scale when it starts. That's a good sign. If they weren't, you know, I could always in the loss function scale one of them up or down, but they're pretty similar to start with, so that's fine. So we train this for a while, and then we can use exactly the same code for sampling as before. And yeah, as we suspected, its ability to decode is worse. It's actually not capturing the details at all, in fact, and the shoes got very blurry. But the hope is that when we call it on noise, call the decoder on random noise... ah, that's much better. It's not amazing, but we are getting some recognizable shapes.
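The metric idea described above, a running mean of some function applied to each batch's input and targets, can be sketched without any dependencies. The lesson's version subclasses torcheval's `Mean` class; this standalone version with the same `update`/`compute` shape is an illustrative approximation.

```python
import torch

class FuncMetric:
    # keep a batch-size-weighted running mean of fn(input, targets)
    def __init__(self, fn):
        self.fn, self.total, self.count = fn, 0.0, 0

    def update(self, inp, targ):
        n = targ.shape[0]                         # weight by batch size
        self.total += self.fn(inp, targ).item() * n
        self.count += n

    def compute(self):
        return self.total / self.count

# track mean squared error across two batches of different quality
mse = FuncMetric(lambda i, t: ((i - t) ** 2).mean())
mse.update(torch.zeros(8, 3), torch.ones(8, 3))   # every element off by 1
mse.update(torch.ones(8, 3), torch.ones(8, 3))    # a perfect batch
print(mse.compute())                               # 0.5
```

Passing `FuncMetric(kld_loss)` and `FuncMetric(bce_loss)` as metrics would then report the two halves of the VAE loss separately, which is exactly the scale comparison discussed above.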
So, you know, VAEs are not generally going to get results as good as diffusion models do, although actually if you train really good ones for a really long time, they can be pretty impressive. But yeah, even in this extremely simple, quick case we've got something that can generate recognizable items of clothing. Did you guys want to add anything before we move on to the stable diffusion VAE? Okay, so this VAE is very crappy. And as we mentioned, one of the key reasons to use a VAE is actually that you can benefit from all the compute time that somebody else has put into training a good VAE. Maybe just one thing: when we say a good VAE — the one that we've trained here is good at generating, because it maps down to this, let's say, 200-dimensional vector and then back up in a very useful way. And if you look at VAEs for generating, they'll often have a pretty small dimension in the middle, and it'll just be this vector that gets mapped back up. So a VAE that's good for generating is slightly different to one that's good for compressing. The stable diffusion one, as we'll see, still has this spatial component — it doesn't map it down to a single vector, it maps it down to a 64 by 64 or whatever, which is smaller than the original. But for generating, we can't just put random noise in there and hope a cohesive image will come out. So it's less good as a generator, but it is good because it has this compression and reconstruction ability. Cool, yeah, so let's take a look. Now, to demonstrate this we want to move to a more difficult task, because we want to show off how using latents lets us do stuff we couldn't do well before. So the more difficult task we're going to do is generating bigger images — specifically, generating images of bedrooms using the LSUN bedrooms dataset.
So LSUN is a really nice dataset which has many, many millions of images across 10 scene categories and 20 object categories. It's pretty rare for people to use the object categories, to be honest, but people quite often use the scene categories. They're a little painful to get, though — it can be extremely slow to download, and the website they come from is very often down. So what I did was put a subset of 20% of them onto AWS — they kindly provide some free dataset hosting for our students — and also, the original LSUN is in a slightly complicated format, an LMDB database, so I turned them into just normal images in folders. So you can download them directly from the AWS dataset site that they provided for us. I'm just using fastcore to save it, and then using Python's shutil to unpack the gzipped tar file. Okay, so that's given us — once that runs, which is going to take a long time... And it might be even more reliable to just do this in the shell with wget or aria2c or something, rather than doing it through Python. So this will work, but if it's taking a long time or whatever, maybe just delete it and do it in the shell instead. Okay, so then I thought, all right, how do we turn these into latents? Well, we could create a dataset in the usual ways. It's going to have a length, so we're going to grab all the files. glob is built into Python, and it will search for, in this case, `*.jpg` — and if you've got `**/`, that's going to search recursively, as long as you pass `recursive=True`. So we're going to search for all of the jpg files inside our data/bedroom folder. That's what this is going to do: it's going to put them all into the files attribute. And then when we get the ith item, it will find the ith file and read that image. This is torchvision's read_image — it's the fastest way to read a jpg image. People often use PIL, but it's quite hard to find a really well-optimized PIL build that's compiled to be really fast.
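The unpack step mentioned above can be sketched with just the standard library. The real notebook downloads the archive first (with fastcore, or wget/aria2c in a shell, as discussed); here `archive` is assumed to be whatever gzipped tar you downloaded, and the paths are placeholders.

```python
import shutil
from pathlib import Path

def unpack(archive, dest):
    # shutil.unpack_archive picks the right format from the
    # extension (.tar.gz / .tgz), so it handles gzipped tar files directly
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(archive, dest)
```

Usage would be something like `unpack('bedroom.tgz', 'data')`, leaving the images in normal folders under `data/`.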
Whereas the PyTorch vision team have created a very, very fast read_image, so that's why I'm using theirs. And if you pass in ImageReadMode.RGB, it'll automatically turn any one-channel black-and-white images into three-channel images for you, or if there are four-channel images with transparency, it'll convert those too. So this is a nice way to make sure they're all the same. And then this turns it into floats from 0 to 1. These images are generally very close to 256 by 256 pixels, so I just crop out the top 256 by 256 bit, because I didn't really care that much. And we do need them all to be the same size so that we can then pass them to the stable diffusion VAE encoder as a batch — otherwise it's going to take forever. So I can create a data loader. It's going to go through a bunch of them at a time — 64 at a time — and use however many CPUs I have as the number of workers, so it's going to do it in parallel. And the parallel bit is the bit that's actually reading the JPEGs, which would otherwise be pretty slow. So if we grab a batch, here's what it looks like. Generally speaking they're just bedrooms — although we've got one pretty risqué situation in a bedroom, on the whole they're safe for work. This is the first time I've actually seen an actual bedroom scene taking place, as it were. All right, so as you can see, this mini-batch — if I just grab the first 16 images — has three channels and 256 by 256 pixels. So that's how big that is for 16 images: 3,145,728, so about 3.1 million floats to represent this. Okay. So as we learned in the first lesson of part two, we can grab an autoencoder directly using diffusers, using from_pretrained. We can pop it onto our GPU, and importantly, we don't have to say `with torch.no_grad()` anymore if we call `requires_grad_(False)`. And remember this neat trick in PyTorch: if a method ends in an underscore, it changes the thing that you're calling it on in place.
So this is going to stop it from computing gradients, which would otherwise take a lot of time and a lot of memory. So let's test it out — let's encode our mini-batch. And just like Jono was saying, this has now made it much smaller: our batch of 16 is now four channels by 32 by 32. So if we compare the previous size to the new size, it's 48 times smaller. So that's 48 times less memory it's going to need, and it's also going to be a lot less compute for a convolution to go across that image. But it's no good unless we can turn it back into the original image. So let's just have a look at what it looks like first. Now, it's a four-channel image, so we can't naturally look at it. But what I can do is just grab the first three channels — and then they're not going to be between naught and one, so if I just do .sigmoid(), now they are. And so you can see that our risqué bedroom scene — you can still recognize it, right? Or this bedroom, this bed here — you can still recognize it. So the basic geometry is still clearly there, but it's clearly changed a lot as well. So, importantly, we can call decode on this 48-times-smaller tensor. And it's really, I think, absolutely remarkable how good it is — I can't tell the difference from the original. Let me zoom in a bit. Her face is a bit blurry — was her face always a bit blurry? Oh, it was always a bit blurry. First, second, third... oh, hang on. Did that used to look like a proper "ND"? Yeah, okay — so you can see, clearly there's an "ND" here, and now you can't see those letters. And this is actually a classic known issue for this particular VAE: it's not able to regenerate writing correctly at small font sizes. I think also — the faces here are already pretty low resolution, but even at a higher resolution, the faces probably would not be reconstructed properly either. Yeah, cool.
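Two small things worth making concrete here: the in-place `requires_grad_(False)` trick, shown on a tiny stand-in module (the real code loads the VAE with diffusers' `from_pretrained` and puts it on the GPU), and the 48× compression arithmetic.

```python
import torch
from torch import nn

# Stand-in for the pre-trained VAE. Methods ending in `_` work in place,
# so requires_grad_(False) switches off gradient tracking on every
# parameter — no `with torch.no_grad():` needed around inference.
vae = nn.Linear(4, 4)
vae.requires_grad_(False)

# The compression factor quoted above: 3x256x256 pixels down to 4x32x32 latents.
factor = (3 * 256 * 256) / (4 * 32 * 32)   # 48 times smaller
```

So each image goes from 3·256·256 = 196,608 floats to 4·32·32 = 4,096 floats — the factor of 48 being discussed.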
But overall, yeah, it's done a great job. A couple of other things I wanted to note. So like you mentioned, it's a factor-of-48 decrease — oftentimes people will refer mostly to the spatial resolution. Since it's going from 256 by 256 to 32 by 32, that's a factor of eight, so they'll sometimes write it as f8 or something like that, noting the spatial downsampling factor. So sometimes you may see it written out like that. And of course, the other thing is that's an eight-squared decrease in the number of pixels, which is interesting. Right, right. And then the other thing I wanted to note was that the VAE is also trained with a perceptual loss objective, as well as, technically, a discriminator — a GAN — objective. I don't know if you were going to go into that a little bit later. Yeah, let's talk about that now. So perceptual loss we've already discussed, right? So the VAE, when they trained it — I think this was trained by CompVis, right? Robin and gang, and stability.ai donated compute for that. And they went to... To be clear, actually, the VAE was trained separately, and it's actually trained on the OpenImages dataset. It was just this VAE that they trained by themselves on a small subset of data — but because the VAE is so powerful, it's able to be applied to all these other datasets as well. Okay, great. Yeah, so they would have had a KL divergence loss, and they would have had either an MSE or BCE loss — I think it might have been an MSE loss. They also had a perceptual loss, which is the thing we learned about when we talked about super-resolution: when they compared the output images to the original images, they would run them through an ImageNet-trained or similar classifier and confirm that the activations they got through that model were similar.
And then the final bit, as Tanishq was mentioning, is the adversarial loss, which is also known as a GAN loss. So a GAN is a Generative Adversarial Network. And the GAN loss — it's more specifically what's called a patch-wise GAN loss — takes a little section of an image, right? Let's simplify for a moment and imagine that they've pre-trained a classifier, where you can pass it a patch from a bedroom scene, and it goes into what's called the discriminator. And this is just a normal ResNet or whatever, which outputs something that either says "yep, the patch is real" or "nope, the patch is fake". So — sorry, I said it passes in two things; that was wrong. You just pass in one thing, and it returns whether it's real or fake — specifically, it gives you something like the probability that it's real. There is another version where you pass in two and it tells you which one is more real — I don't think that's what they used. Do you remember, Tanishq? Is it a relativistic GAN or a normal GAN? I think it's a normal one. Yeah — so the relativistic GAN is when you pass in two images and it says which is more real; the one that, if we remember correctly, they used is a regular GAN, which just tells you the probability that an image is real. And so you can train that by passing in real images and fake images, and having it learn to classify which ones are real and which are fake. So once you've got that model trained, then as you train your GAN, you pass the patches of each image into the discriminator — just called D here — and it's going to spit out the probability that each patch is real. And so if it spits out 0.1 or something, then you're like, oh dear, that's terrible.
Our VAE is spitting out pictures of bedrooms where the patches are easily recognized as not real. But the good news is that's going to generate derivatives, right? And those derivatives then tell you how to change the pixels of the generated image to trick the GAN better. And so what it'll do is use those derivatives, as per usual, to update our VAE. And the VAE, in this case, is called the generator — that's the thing that's generating the pixels. And so the generator gets updated to be better and better at tricking the discriminator. And after a while, what's going to happen is the generator gets so good that the discriminator gets fooled every time. And at that point, you can fine-tune the discriminator further by putting in your better generated images; and once your discriminator learns again how to recognize the difference between real and fake, you can use it to train the generator again. So there's this ping-ponging back and forth between the discriminator and the generator. Back when GANs were first created, people found them very difficult to train. And a method we developed at fast.ai — I don't know if we were the first to do it or not — was this idea of pre-training a generator just using perceptual loss, and then pre-training a discriminator to be able to catch out the generator, and then ping-ponging backwards and forwards between them after that: basically, whenever the discriminator got too good, train the generator more; any time the generator got too good, train the discriminator more. Nowadays that's pretty standard, I think. And so, yeah, this GAN loss — which basically says you're penalized for failing to fool the discriminator — is called an adversarial loss.
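The ping-pong just described can be sketched as a toy training loop. Everything here is a deliberately tiny stand-in (linear layers on 1-D "patches") so the alternation is visible: the discriminator D is trained to score real patches high and generated patches low, then the generator G is updated with the adversarial loss — penalized when D spots its fakes.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
G = nn.Linear(8, 8)                  # stand-in generator (the VAE decoder's role)
D = nn.Linear(8, 1)                  # stand-in patch discriminator (outputs a logit)
opt_g = torch.optim.SGD(G.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)

real, z = torch.randn(16, 8), torch.randn(16, 8)

# Discriminator step: push real patches toward "real" (1), fakes toward "fake" (0).
# detach() so this step doesn't update the generator.
d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(16, 1))
          + F.binary_cross_entropy_with_logits(D(G(z).detach()), torch.zeros(16, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: the adversarial loss — G is penalized for failing to fool D,
# i.e. it wants D to label its output as real (1).
g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In the real VAE training this adversarial term is one component of the total loss, alongside the MSE/perceptual/KLD terms, not the whole objective.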
To maybe motivate why you do this: if you just trained with a mean squared error, or even a perceptual loss, with such a high compression ratio the VAE tends to produce fairly blurry output, because it's not sure whether there's texture or not in this part of the image, or exactly where the edges will be — it's going from one four-dimensional latent pixel up to this whole patch of the image. And so it tends to be a little bit blurry and hazy, because it's hedging its bets — whereas that's something the discriminator can quite easily pick up: oh, it's blurry, it must be fake. And so the adversarial loss is kind of saying: even if you're not sure exactly where this texture goes, rather go with the sharper-looking texture that looks real than with some blurry thing that's going to minimize your MSE. And so it tricks it into producing this sharper, higher-resolution-looking output. Yeah — and I'm not sure if we're going to come back and train our own GAN at some point, but if you're interested in training your own GAN or a super-resolution GAN — I mean, nowadays we never really just use a GAN on its own; we use an adversarial loss as part of a training process — if you want to learn how to use adversarial loss in detail and see the code, the 2019 fast.ai course, lesson seven of part one, has a walkthrough. So we have sample code there, and maybe, given time, we'll come back to it. Okay. So, quite often people will call the VAE encoder while they're training a model, which to me makes no sense, right? Because the encoded version of an image never changes — unless you're using data augmentation and want to encode augmented images. I think it makes a lot more sense to do a single run through your whole training set and encode everything once. So naturally the question is: where do you save that? Because it's going to be a lot of RAM if you just leave it in RAM.
And also, as soon as you restart your computer, you've lost all that work. There's a very nifty file format you can use called a memory-mapped NumPy file, which is what I'm going to use to save our latents. A memory-mapped NumPy file is basically — what happens is you take the memory that NumPy would normally be using in RAM, and you map it onto the hard disk. That's what they mean by memory-mapped: there's a mapping between the memory in RAM and the file on disk, and if you change one, it changes the other, and vice versa. They're two ways of seeing the same thing. And so if you create a memory-mapped NumPy array, then when you modify it, it's actually modifying it on disk. But thanks to the magic of your operating system, it's using all kinds of clever caching to not make that slower than using a normal NumPy array. And it's very clever in that it doesn't have to store it all in RAM — it only keeps in RAM the bits you need at the moment, or have used recently. It's really nifty caching. So it's like magic, but it's your operating system doing that magic for you. So we're going to create a memory-mapped file using np.memmap. It's going to be stored somewhere on your disk — we're just going to put it here. So we say: create a memory-mapped file in this place; it's going to contain 32-bit floats; write the file; and the shape of this array is going to be the size of our dataset — 303,125 images, each one 4 by 32 by 32. Okay, so that's our memory-mapped file. And so now we're going to go through our data loader, one mini-batch at a time, and VAE-encode each mini-batch, and then grab the means from its latents — we don't want random numbers, we want the actual midpoints, the means. So this is using the diffusers version of the VAE. Then pop that onto the CPU after we're done.
And so that's going to be a mini-batch of size 64, as PyTorch tensors. So let's turn that into NumPy, because PyTorch doesn't have a memory-mapped thing as far as I'm aware, but NumPy does. And so now that we've got this memory-mapped array, everything from zero up to 64 — not including the 64 — that whole sub-part of the array is going to be set to the encoded version. So it looks like we're just changing it in memory, but because this is a magic memory-mapped file, it's actually going to save it to disk as well. So yeah, that's it, amazingly enough — that's all you need to create a memory-mapped NumPy array of our latents. When you're done, you have to call .flush() — that just says: anything that's only in cache at the moment, make sure it's actually written to disk. And then I delete it, because I just want to make sure that I then read it back correctly. So all of that only happens once, if the path doesn't exist; after that, the whole thing is skipped, and instead we call np.memmap again with the same path, the same data type, and the same shape, but this time mode='r' means we're reading it. And so let's check it — let's just grab the first 16 latents that we read, and decode them. And there they are, okay? So this is not a very well-known technique, I would say, sadly, but it's a really good one. You might be wondering: what about compression? Shouldn't you be zipping them or something? But remember, these latents are already — the whole point is — highly compressed. So generally speaking, zipping latents from a good VAE doesn't do much, because they almost look a bit like random numbers. Okay, so we've now saved our entire LSUN bedrooms subset — the 20% that I've provided — as latents. So we can now run it through... and this is the nice thing:
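The memmap workflow just walked through looks roughly like this — create the file once in write mode, assign batches into slices (which lands on disk), flush, then re-open read-only with `mode='r'`. The shapes here are small stand-ins for the real `(303125, 4, 32, 32)` array, and `np.random.rand` stands in for the VAE-encoded batches.

```python
import numpy as np
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), 'latents.npmm')
shape = (100, 4, 8, 8)

# mode='w+' creates the file on disk at the full size, writable
a = np.memmap(path, dtype=np.float32, mode='w+', shape=shape)
for i in range(0, shape[0], 25):
    # pretend each of these is a VAE-encoded mini-batch
    a[i:i + 25] = np.random.rand(25, *shape[1:]).astype(np.float32)
a.flush()      # make sure anything still in cache actually hits disk
del a

# re-open read-only; mode='r' means read the existing file
b = np.memmap(path, dtype=np.float32, mode='r', shape=shape)
```

Note the dtype and shape aren't stored in the file — it's raw bytes — so you have to pass the same ones when re-opening.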
We can use exactly the same process from here on as usual. So we've got the noisify in our usual collated version. Now, the latents have a standard deviation much higher than one, so if we divide by about five, that takes them back to a standard deviation of about one. I think in the paper they use a factor like 0.18, but this is close enough to give unit standard deviation. Then we can split it into a training and a validation set: just grab the first 90% for the training set and the last 10% for the validation set. So those are our datasets. We use a batch size of 128. So now we can use the DataLoaders class we created, with the get_dls we created — these are all things we've built ourselves — with the training set, the validation set, the batch size, and our collation function. So yeah, it's kind of nice. It's amazing how easy it is: a dataset has the same interface as a NumPy array or a list or whatever, so we can literally just use the NumPy array directly as a dataset, which I think is really neat. This is why it's useful to know about these foundational concepts — you don't have to start thinking, "oh, I wonder if there's some torchvision thing to use memory-mapped NumPy files", because, wait, they already provide the dataset interface: I don't have to do anything, I just use them. So that's pretty magical. So we can test that now by grabbing a batch, and this is being noisified — here we can see our noisified images. And here's something crazy: we can actually decode noisified images. So, you know, I guess this one wasn't noisified much, because it's a recognizable bedroom; and this is what happens when you just decode random noise; and something in between. So I think that's pretty fun. Yeah, this next bit is all just copied from our previous notebook: create a model, initialize it, train for a while.
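The data-prep steps just described — scale the latents to roughly unit standard deviation (the stable diffusion scaling constant is about 0.18; dividing by 5 is close enough), then take the first 90% as training data and the last 10% for validation — can be sketched as follows, with random numbers standing in for the real latents array.

```python
import numpy as np

# stand-in latents with std ~5, like the raw VAE outputs discussed above
lats = (np.random.randn(1000, 4, 32, 32) * 5).astype(np.float32)

lats = lats / 5.0                     # back to roughly unit standard deviation
n = int(len(lats) * 0.9)
train_ds, valid_ds = lats[:n], lats[n:]   # first 90% / last 10%
```

Because a NumPy array already supports `len()` and integer indexing — the whole dataset interface — `train_ds` and `valid_ds` can be handed straight to a DataLoader, which is the point being made above.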
So this took me a few hours on a single GPU — everything I'm doing is on a single GPU; literally nothing in this course, other than the stable diffusion models themselves, is trained on more than one GPU. The loss is much higher than usual, and that's not surprising, because it's trying to generate latent pixels, where it's much more precise exactly what value each one should be. There aren't lots of pixels whose neighbors are really similar, or where the whole background looks the same, or whatever — a lot of that redundancy has been compressed out. It's a more difficult thing to predict latent pixels. So now we can sample from it in exactly the same way we always have, using DDIM. But now we need to make sure that we decode what it sampled, because the thing it sampled is latents — that's what we asked it to learn to predict. And so now we can take a look, and we have bedrooms. Ah, and some of them look pretty good. I think this one looks pretty good. I think this one looks pretty good. This one — I don't have any idea what it is. And this one — clearly there are bedroomy bits, but there's something... I don't know, there are weird bits. So the fact that we're able to create 256 by 256 pixel images, where at least some of them look quite good, in a couple of hours — I can't remember exactly how long it took to train, but it's a small number of hours on a single GPU — is something that was not previously possible. And in a sense we're totally cheating, because we're using the stable diffusion VAE to do a lot of the hard work for us. But that's fine, you know, because that VAE knows how to create all kinds of natural images and drawings and portraits and oil paintings and whatever. So you can, I think, work in that latent space quite comfortably. Yeah. Do you guys have anything you wanted to add about that? Oh, actually, Tanishq, you've trained this for longer — I only trained it for 25 epochs.
How many hours did you train it for? Because you did 100 epochs, right? Yes, I did 100 epochs. I didn't keep track exactly, but I think it was about 15 hours on an A100. Okay — a better GPU. Yes, a single A100. Yeah, and the results — I'll show them. They're, I guess, maybe slightly better. No, they are definitely slightly better — the good ones are certainly slightly better. Yeah. Like the bottom-left one is better than any of mine, I think. So it's possible that at this point we just need to use more data, because we were using a 20% subset. So maybe having more of the data, to provide more diversity or something like that, might help. Yeah — or maybe... have you tried doing the diffusers one for 100 epochs? No, I'm using our code here. Yeah, so — all right, I'll share my screen if you want to stop sharing yours. So I do have — if we get around to this, maybe we can add the results back to this notebook — I do have a version that uses diffusers. Everything else is identical, 25 epochs, except for the model: for the previous one I was using our own UNet model, so I had to change the number of channels to four, and I think I might have increased the number of filters a bit. So then I tried using the diffusers UNet, just with whatever their defaults were. And so — what did I get here — 243; with diffusers I got a little bit better, 239. And yeah, I don't know if they're obviously better or not — like, this one is a bit weird, this one is a bit weird. I think another thing we could try is to do 100 epochs but use the diffusers number of channels and so forth, because I think the defaults they use in diffusers are not actually the same as stable diffusion. So maybe we could try a stable-diffusion-matched UNet for 100 epochs.
And if we get any nice results, maybe we can paste them in at the bottom to show people. Yeah. Yeah, cool. Do you guys have anything else to add at this point? All right. So I'll just mention one more thought, in terms of an interesting project people could play with. I don't know if this is too crazy — I don't think it's been done before — but my thought was: do you remember there was a huge difference in our super-resolution results when we used a pre-trained model, and when we used perceptual loss, but particularly when we used a pre-trained model? I thought we could use a pre-trained model here too, but we would need a model pre-trained on latents, right? We would want something where our down-sampling backbone was a model pre-trained on latents. And so I just wanted to show you what I've done — and anybody watching who wants to could try taking this further. I've just done the first bit, to give you a sense, which is: I pre-trained an ImageNet model — not Tiny ImageNet, but a full ImageNet model — on latents, as a classifier. And you could use this as a backbone, and also try maybe some of the other tricks that we found helpful, like having ResBlocks on the cross connections. These are all things that I don't think anybody's done before — I don't know, the scientific literature is vast, and I might have missed it, but I've not come across anybody doing these tricks. So obviously, one of the interesting parts of this, which is designed to be challenging, is that we're using bigger datasets now — but they're datasets that you can absolutely run on a single GPU: a few tens of gigabytes, which fits on any modern hard drive easily. So these are good tests of your ability to, like, move things around.
And if you're somewhere that doesn't have access to a decent internet connection or whatever, this might be out of the question — in which case, don't worry about it. But if you can, yes, try this, because it's good practice, I think, to make sure you can use these larger datasets. So, ImageNet itself you can actually grab from Kaggle nowadays. They call it the Object Localization Challenge, but it actually contains the full ImageNet dataset — or the version that's used for the ImageNet competition, which I think people generally call ImageNet-1k. You just have to accept the terms, because it has some distribution terms. Yeah, exactly — so you've got to sign in, then join the competition, and then, yeah, accept the terms. So you can then download the dataset — or you can also download it from HuggingFace; it'll be in a somewhat different format, but that'll work as well. I think I grabbed my version from Kaggle. So on Kaggle, it's just a zip file; you unzip it, and it creates an ILSVRC directory — which is, I think, what they call the competition: the ImageNet Large Scale Visual Recognition Challenge. Okay. So then inside there, there's a Data directory, and inside there, there's CLS-LOC, and that's actually where everything is going to be. So, just like before, I wanted to turn these all into latents. So in that directory, I created a latents sub-directory. And this time, partly just to demonstrate how these things work, I want to do it in a slightly different way. Okay. So again, we're going to create our pre-trained VAE, pop it on the GPU, and turn off gradients for it. And I'm going to create a dataset.
Now, one thing that's a bit weird about this is that because this is a really quite big dataset — it's got 1.3 million files — the thing where we go glob `**/*.JPEG` takes a few seconds, and particularly if you're doing this on, say, an AWS file system, it can take really quite a long time. On mine it only took about three seconds, but I don't want to wait three seconds. So a common trick for these kinds of big datasets is to create a cache, which is literally just a list of the files. That's what this is. So I decided that "zpickle" means a gzipped pickle. So what I do is: if the cache exists, we just gzip.open the file list; if it doesn't, we use glob exactly as before to find all the files, and then we also save a gzip file containing pickle.dump(files). pickle.dump is what we use in Python to take basically any Python object — a list of dictionaries, a dictionary of lists, whatever you like — and save it, and it's super fast. Right. And I use gzip with compresslevel=1 to basically say: compress it pretty well, but pretty fast. So this is a really nice way to create a little cache of the file list. The rest is the same as always. And so our get-item is going to grab the file, read it in, and turn it into a float. And what I did here — I'm being a little bit lazy — was just center-crop the middle. So let's say it was a 300 by 400 file: it's going to center-crop the middle 300 by 300 section and then resize it to 256 by 256, so they'll all be the same size. So yeah, we can now — oh, I managed to create the VAE twice; let's delete that one. So I can now just confirm I can grab a batch from that data loader and encode it. And here it is. And then decode it again — and here it is. So the first category must have been "computer" or something.
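The file-list cache trick just described can be sketched like this — glob once (slow on huge or networked file systems), then `pickle.dump` the list into a gzipped "zpickle" file; later runs just load the cache. Function and path names are illustrative.

```python
import gzip, pickle
from glob import glob
from pathlib import Path

def cached_files(root, cache):
    cache = Path(cache)
    if cache.exists():
        # fast path: load the cached file list
        with gzip.open(cache, 'rb') as f:
            return pickle.load(f)
    # slow path: recursive glob, then cache the result.
    # compresslevel=1 trades a little compression for speed.
    files = glob(f'{root}/**/*.JPEG', recursive=True)
    with gzip.open(cache, 'wb', compresslevel=1) as f:
        pickle.dump(files, f)
    return files
```

pickle handles basically any Python object, so the same pattern works for caching other slow-to-build metadata, not just file lists.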
So here, as you can see, the VAE is doing a good job of decoding pictures of computers. So I can do something very similar to what we did before. If we haven't yet created the destination directory: go through our data loader and encode each batch. And this time I'm not using a memory-mapped file — I'm actually going to save a separate NumPy file for each item. So go through each element of the batch, each item, and save it into the destination directory — the latents directory — giving it exactly the same relative path as the original, because that contains the folder which is, you know, the label. Make sure the directory we're saving into exists, and save it as a NumPy file. So this is another way to do it: a separate NumPy file for each item. Does that make sense so far? Okay, cool. So then I can create a thing called a NumPy dataset, which is exactly the same as our images dataset, except that to get an item, we don't have to open a JPEG anymore — we just call np.load. So this is a nice way to take something you've already got and change it slightly. Why did you do this, versus the memory-mapped file, Jeremy — just out of interest? Sorry? Why did you do this versus the memory-mapped file? Was it just to show a different way? Just to show a different way, yeah. Absolutely no particularly good reason, honestly. I like to demonstrate different approaches, and I think it's good for people's Python coding if you make sure you understand what all the lines of code do. They both work fine, actually. It's partly also for my own experimental interest — like, which one feels better? All right. So we create our training and validation datasets by grabbing all the NumPy files inside the training and validation folders.
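The one-file-per-item approach just described can be sketched as follows: each encoded item is saved as a small `.npy` file mirroring the source folder structure (which carries the label), and the dataset's get-item is then just `np.load`. Names here are illustrative stand-ins for the notebook's.

```python
import numpy as np
from pathlib import Path

def save_latent(arr, src_path, src_root, dest_root):
    # mirror the source's relative path (so the label folder is preserved),
    # swapping the extension for .npy
    rel = Path(src_path).relative_to(src_root)
    out = Path(dest_root) / rel.with_suffix('.npy')
    out.parent.mkdir(parents=True, exist_ok=True)
    np.save(out, arr)
    return out

class NumpyDS:
    """Same shape as an images dataset, but get-item is just np.load."""
    def __init__(self, files): self.files = list(files)
    def __len__(self): return len(self.files)
    def __getitem__(self, i): return np.load(self.files[i])
```

Compared with one big memmap, this costs more small files but lets you glob, shuffle, and split the latents exactly like the original image folders.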
And then I'm going to create a training data loader for the training data set, just to see what the mean and standard deviation are on the channel dimension. So I take the mean over every dimension except channel. And there it is. And as you can see, the mean and standard deviation are not close to zero and one. So we're going to store away that mean and standard deviation. We've seen transformed data sets before; this is just applying a transform to a data set. We're going to apply the normalization transform. In the past we've used our own normalization; TorchVision has one as well, so this is just demonstrating how to use TorchVision's version. But it's literally just subtracting the mean and dividing by the standard deviation. We're also going to apply some data augmentation. We're going to use the same trick we've used before for images that are very small, which is to add a little bit of padding and then randomly crop the original image size from that. So it's just shifting it slightly each time. And we're also going to use our random erasing. And it's nice that we did it all with broadcasting, because it's going to apply equally well to a four-channel image as it did to three channels, or I think originally one. Now, I don't think anybody, as far as I know, has built classifiers from latents before. So I didn't even know if this was going to work, so I visualized it. So we create a tfm_x and a tfm_y. For tfm_x, you can optionally add augmentation, and if you do, it applies the augmentation transforms. Now, this is going to be applied one image at a time, but some augmentation transforms expect a batch, so we create an extra unit axis on the front to make a batch of one, and then remove it again. And then tfm_y, very much like we've seen before, is going to turn those path names into IDs. So there are our validation and training transformed data sets.
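The stats-then-normalize step can be sketched in NumPy like this (the notebook does the equivalent in PyTorch; the function names here are illustrative):

```python
import numpy as np

def channel_stats(x):
    """Mean/std over every axis except channel, for a batch of NCHW latents."""
    axes = (0, 2, 3)
    return x.mean(axis=axes), x.std(axis=axes)

def normalize(x, mean, std):
    # subtract the mean and divide by the std, broadcasting over the channel axis
    return (x - mean[None, :, None, None]) / std[None, :, None, None]
```

After normalizing with its own stats, the data should have per-channel mean near zero and std near one.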
So that we can look at our results, we need a denormalization. So let's create our data loaders, grab mini batches, and show them. And I was very pleased to see that the random erasing actually works extremely nicely. You can see you get these kind of weird patches, but they're still recognizable. This is something I very, very often do to answer: is this thing I'm doing in computer vision reasonable? It's like, well, can my human brain recognize it? If I couldn't recognize this was a drilling platform myself, then I shouldn't expect a computer to be able to do it, or that this is a compass or whatever. I'm so glad we've got otters. So cute. And you can see the cropping it's done has also been fine. It's a little bit of a fuzzy edge, but basically it's not destroying the image at all; they're still recognizable. It's also a good example here of how difficult this problem is, like the fact that this is seashore, this surface, you know, but maybe you wouldn't guess seashore for an image in that category. And this could be food, but actually it's refrigerator. Okay. So our augmentation seems to be working well. So then, yeah, I basically just copied and pasted our basic pieces here, and I kind of wanted to have it all in one place just to remind myself of exactly what it is. So this is the pre-activation version of convolutions. The reason for that is that if I want this to be a backbone for a diffusion model or a U-Net, then I remember we found that pre-activation works best for U-Nets. So our backbone needs to be trained with pre-activation. So we've got a pre-activation conv, a res block, res blocks, and a model with dropout; this is all copied from previous notebooks. So I decided I wanted to try to use the basic trick we learned about from simple diffusion, of trying to put most of our work in the later layers.
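The pad-then-random-crop shift augmentation mentioned above can be sketched in NumPy for a single CHW image (a sketch only; the notebook uses its own batched, broadcasting implementation):

```python
import numpy as np

def shift_crop(img, pad=1, rng=None):
    """Pad a CHW image, then randomly crop back to the original size,
    which shifts the content slightly each time."""
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = img.shape
    # reflect-pad so the crop never sees hard black borders
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
    y = rng.integers(0, 2 * pad + 1)
    x = rng.integers(0, 2 * pad + 1)
    return padded[:, y:y + h, x:x + w]
```

The output always has the input's shape, so it slots into a transform pipeline without resizing.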
So the first layer just has one block, then two blocks, and then four blocks. And then I figured we might delete these final blocks; maybe they're going to end up being just for classification, and this might end up being our pre-trained backbone. Or maybe we keep them, I don't know. As I said, this hasn't been done before. So anyway, I tried to design it in a way that we can mess around a little bit with how many of these we keep. I also tried to use very few channels in the first blocks, and then I jump up for the channels where the work's going to be done: I jump from 128 to 512. So that's why I designed it this way. You know, I haven't even taken it any further than this. I don't know if it's going to be a useful backbone or not. I didn't even know if it was going to be possible to classify, though it seemed very likely, even based on the fact that you can still kind of recognize it; like I could probably recognize it's a computer, maybe. So I thought it was going to be possible. But yeah, this is all new. So that was the model I created. And then I trained it for 40 epochs, and you can see that after one epoch it was already 25% accurate. And that's it recognizing which one of a thousand categories it is. So I thought that was pretty amazing. And after 40 epochs, I ended up at 66%, which is really quite fantastic, because a ResNet-34 gets kind of like 73 or 74% accuracy when trained for quite a lot longer. So to me this is extremely encouraging: this is a really pretty good ResNet at recognizing images from their latent representations, without any decoding or whatever. So from here, if you want to, you guys could try building a better bedroom diffusion model or whatever you like; it doesn't have to be bedrooms. Actually, one of our colleagues, Molly... I'm just going to find it.
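The pre-activation convolution discussed above puts the normalization and activation before the conv, rather than after it. A minimal PyTorch sketch of that ordering (the class name, and the choice of BatchNorm and SiLU, are illustrative, not the notebook's exact code):

```python
import torch
from torch import nn

class PreactConv(nn.Module):
    """Pre-activation conv: norm -> activation -> conv,
    the ordering found to work best for U-Net-style backbones."""
    def __init__(self, ni, nf, ks=3, stride=1):
        super().__init__()
        self.norm = nn.BatchNorm2d(ni)
        self.act = nn.SiLU()
        self.conv = nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)

    def forward(self, x):
        # norm and activation come *before* the conv
        return self.conv(self.act(self.norm(x)))
```

With padding of ks//2, spatial size is preserved at stride 1 and halved at stride 2, so these stack the same way as the course's earlier conv blocks.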
So one of our colleagues, Molly, actually used... do you guys remember, was it the celeb faces that she used? So there's a CelebA-HQ data set that consists of images of faces of celebrities. And what Molly did was basically use this exact notebook, but with this faces data set instead. And this one's really pretty good, isn't it? They certainly look like celebrities, that's for sure. So yeah, you could try this data set or whatever; maybe try it with the pre-trained backbone, try it with ResNets and the cross connections, try it with all the tricks we used in super-resolution, try it with perceptual loss. Some folks we spoke to about the perceptual loss think it won't help with latents, because the underlying VAE was already trained with perceptual loss, but we should try, you know, or you guys should try all these things. Yeah, so be sure to check out the forum as well to see what other people have already tried here, because it's a whole new world. But it's just an example of the kind of fun research ideas we can play with. Yeah, what do you guys think about this? Are you surprised that we're able to quickly get this kind of accuracy from latents, or do you think this is a useful research path? What are your thoughts? Yeah, I think it's very interesting. I was going to say the latents are already like a slightly compressed, richer representation of an image, right? So it makes sense that that's a useful thing to train on. 66%... I think AlexNet is like 63% or something like that. So, you know, we were already at state-of-the-art from, what, eight years ago or whatever. Yeah, it's pretty cool. It might be more like 10 years ago; I know, time passes quickly. Yeah, I guess next year it will be 10 years ago.
But yeah, I'm kind of curious about the pre-training. The whole value for me of using a pre-trained network was that someone else has done lots and lots of compute on ImageNet to learn some features, and I'm going to use that. It's kind of funny to be like, oh, well, let's pre-train for ourselves and then try and use that. I'm curious how best you'd allocate that compute: if you've got 10 hours of GPU, should you just do 10 hours of training, versus like 5 hours of pre-training and 5 hours of training? I mean, based on our super-res thing, the pre-training was so much better. So that's why I'm feeling somewhat hopeful about this direction. Yeah, I'm really curious to see how it goes. I guess I was going to say, I think there are just a lot of opportunities for doing stuff in the latents. Here you're training a classifier as a backbone, but you could think of training classifiers on other things, for guidance or things like this. Yeah. Of course, we've done some experiments with that. I know Jono has his mid-U guidance approach for some of these sorts of things. But there are different approaches you can play around with here. Exploring in the latent space can make it computationally cheaper than having to decode every time you want to look at the image and then apply a classifier or some sort of guidance on the image. If you can do it directly in the latent space, there are a lot of interesting opportunities there as well. Yeah. And, you know, now we're showing that indeed. Style transfer, everything on latents. You can also distill some models; that's something I've done to make a latent CLIP, is just have it try and mirror an image-space CLIP.
And so for classifiers as well, you could distill an ImageNet classifier: rather than just having the label, you try and copy the logits. And then that's an even richer signal; you get more value per example. So then you can create your latent version of some existing image classifier or object detector, or a multimodal model like CLIP. I feel funny about this, because I'm both excited about simple diffusion on the basis that it gets rid of latents, but I'm also excited about latents on the basis that they get rid of most of the pixels. I don't know how I can be cheering for both, but somehow I am. I guess may the best method win. So, for the folks that are finishing this course: well, first of all, congratulations, because it's been a journey, particularly part two. It's a journey that requires a lot of patience and tenacity. If you've kind of zipped through by binging on the videos, that's totally fine, it's a good approach, but maybe go back now and do it more slowly: build it yourself and really experiment. But for folks who have got to the end of this and feel like, okay, I get it, more or less: do you guys have any sense of what kind of things make sense to do now? Where would you guys go from here? I think there are great opportunities in implementing papers as they come along. And I think at this stage... Wait, there'll be more papers? No way. Yeah. But also at this stage, I think we're already discussing research ideas, and we're in a solid position to come up with our own research ideas and explore them. So I think that's a real opportunity that we have here. Yeah, I will say, I think that's often best done collaboratively.
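The logit-distillation idea mentioned here, matching a teacher's softened logits instead of hard labels, is commonly written as a temperature-scaled KL divergence. A sketch (the function name and the temperature value are illustrative choices, not from the lesson):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL between the student's and teacher's softened distributions.
    Here the student would see latents and the teacher pixels."""
    s = F.log_softmax(student_logits / T, dim=1)
    t = F.softmax(teacher_logits / T, dim=1)
    # the usual T^2 factor keeps gradient scale comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * T * T
```

When the student exactly matches the teacher, the loss is zero; otherwise it's positive, and every class's logit contributes signal, which is the "more value per example" point above.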
So I'll just mention that fast.ai has a Discord, which, if you've got to this point, then you're probably somebody who would benefit from being there. And yeah, just pop your head in; there's an introductions thread to say hello, and maybe say what you're interested in or whatever, because it's nice to work with others, I think. I mean, both Jono and Tanishq I only know because of the Discord and the forums and so forth. So that would be wonderful. And we also have a generative channel, so anything related to generative models, that's the place. For example, Molly was posting some of her experiments in that channel, and I think other fast.ai members post their experiments too. So if you're doing anything generative-model related, that's a great way to get feedback and thoughts from the community. Yeah. I'd also say that if you're at the stage where you've finished this course, you actually understand how diffusion models work, you've got a good handle on what the different components of something like Stable Diffusion are, and you know how to wrangle data for training and all these things, you're so far ahead of most people who are building in the space. And I've got lots of companies and people reaching out to me to say: do you know anybody who has more than just "I know how to load Stable Diffusion and make an image"? Do you know someone who knows how to actually tinker with it and make it better? And if you've got those skills, don't feel like, oh, I'm definitely not qualified to apply. There's lots of stuff where just taking these ideas now, simple, sensible ideas that we've covered in the course, and saying, oh, actually maybe I could try that, maybe I could play with this... you know, take this experimentalist approach.
I feel like there are actually a lot of people who would love to have you helping them build the million and one little Stable Diffusion based apps or whatever that they're working on. Particularly, the thing we always talk about at fast.ai: if you can combine that with your domain expertise, whether it be from your hobbies or your work in some completely different field or whatever, there'll be lots of interesting ways to combine them. You're probably one of the only people in the world right now who understands your areas of passion or vocation as well as these techniques. And again, that's a good place to get on the forum or the Discord and start having those conversations, because it can be difficult when you're at the cutting edge, which you now are, by definition. All right, well, we'd better go away and start figuring out how on earth GPT-4 works. I don't think we're necessarily going to build the whole of GPT-4 from scratch, at least not at that scale, but I'm sure we're going to have some interesting things happening with NLP. And Jono, Tanishq, thank you so much. It's been a real pleasure. It was nice doing things with the live audience, but I've got to say, I really enjoyed this experience of doing stuff with you guys these last few lessons. So thank you so much. Yeah, thanks for having us. It was really, really fun. All right. Bye. See you in part three. Bye.