Hi, we're here for lesson 24, and once again, as is becoming a bit of a tradition, we're joined by Jono and Tanishk, which is always a pleasure. Hi Jono. Hi Tanishk. Hello.

So, are you guys looking forward to finally completing stable diffusion? At least unconditional stable diffusion. Well, actually no: conditional stable diffusion, everything except the CLIP bit, from scratch. We should be able to finish today, time permitting. Oh, that's exciting. That is exciting. All right, let's do it. Jump in any time you've got things to talk about.

We're going to start with notebook 26, the diffusion unet. What we're going to do in it is unconditional diffusion from scratch, and there aren't really too many new pieces if I remember correctly; all the stuff at the start we've already seen. When I wrote this, it was before I had noticed that the Karras approach was doing less well than the regular cosine schedule approach, so I'm still using the Karras noisify. This is all the same as the Karras notebook, which was notebook 23.

Okay, so we can now create a unet that is based on what diffusers has, which is in turn based on lots of other prior art. The code isn't at all based on it, but the structure is going to be the same as what you'll get in diffusers. The convolution we're going to use is the same as the final kind of convolution we used for Tiny Imagenet, which is a pre-activation convolution: the normalization and activation happen first, and the convolution itself happens at the end.

Then I've got a unet res block. I wrote this before I actually did the pre-act version of Tiny Imagenet, so I suspect it is quite possibly exactly the same as the Tiny Imagenet one. There's nothing specific to the unet here; it's really just a pre-act conv and then a pre-act res block. We've got the two convs as per usual, and the identity conv. There is one difference to the res blocks we've seen before, though: this res block has no option to do downsampling, no option for a stride. It's always stride one, which is our default. The reason is that when we get to the thing that strings a bunch of them together, which will be called a down block, that's where you have the option to add downsampling, and if you do, we add a stride-2 convolution after the res blocks. That's because that's how diffusers and stable diffusion do it.

I haven't studied this closely. Tanishk, Jono, I don't know if either of you know where this idea came from or why? I'd be curious. The difference is that normally we would have average pooling here in this connection, but this different approach is what we're using. Well, a lot of the history of the diffusers unconditional unet is about being compatible with the DDPM weights that were released, and the follow-on work from that; improved DDPM and the others all built on that same sort of unet structure, even though it's slightly unconventional if you're coming from a normal computer vision background.
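To make the pre-activation idea concrete, here's a minimal sketch of a pre-act conv and the stride-1 res block built from it. The names, the choice of norm and the defaults are assumptions for illustration, not the exact notebook code.

```python
import torch
from torch import nn

def pre_act_conv(ni, nf, ks=3, stride=1):
    # Pre-activation ordering: norm and activation first, the conv itself last.
    return nn.Sequential(
        nn.BatchNorm2d(ni),
        nn.SiLU(),
        nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2))

class UNetResBlock(nn.Module):
    # Always stride 1: downsampling is handled later, in the down block.
    def __init__(self, ni, nf):
        super().__init__()
        self.convs = nn.Sequential(pre_act_conv(ni, nf), pre_act_conv(nf, nf))
        # Identity path; a 1x1 conv only if the number of channels changes.
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)

    def forward(self, x):
        return self.convs(x) + self.idconv(x)
```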
And do you recall where the DDPM architecture came from? Some of the ideas came from earlier unets, but I don't know about DDPM specifically. Yeah, they had something called an efficient unet that was inspired by some prior work; I can't remember the lineage. But anyway, the diffusers one has since become configurable, so you can add parameters to control some of this stuff. We shouldn't assume that this is the optimal approach, I suppose. I'll dig into the history and try to find out what ablation studies have been done. For those of you who haven't heard of ablation studies: that's where you try a bunch of different ways of doing things, score which works better and which works less well, and create a table of all of those options. Where you can't find ablation studies for something you're interested in, that often means not many other options were tried, because researchers don't have time to try everything.

Okay, now the unet. If we go back to the unet that we used for super resolution (we'll just go back to our most basic version), what we did as we went down through the layers in the downsampling section was store the activations at each point into a list called layers, and then as we went through the upsampling we added those downsampling layers back into the upsampling activations. That's the basic structure of a unet. You don't have to add; you can also concatenate, and concatenating is probably more common nowadays. I think the original U-Net may have been concatenating, although for super resolution just adding seems pretty sensible. So we're going to concatenate.

But what we're also going to do is exercise our python muscles a little, to see interesting ways to make it a bit easier to turn different downsampling backbones into unets, and use that as an opportunity to learn a bit more python. So we're going to create something called a saved res block and a saved convolution. Our down blocks, which are blocks containing a certain number of res block layers followed by that optional stride-2 conv, are going to use saved res blocks and saved convs. These are going to be the same as a normal convolution and a normal unet res block, except that they remember their activations. The reason is that later on, in the unet, we're going to go through and grab those saved activations all at once into a big list, so we basically don't have to think about it.

To do that we create a class called SaveModule, and all SaveModule does is call forward to grab the res block or conv result, store it, and return it. Now that's weird, because hopefully you know by now that super() calls the thing in the parent class, and SaveModule doesn't have a parent class. So this is what's called a mixin, and it's using something called multiple inheritance. Mixins are, as it describes here, a design pattern, which is to say they're not particularly a part of python per se; they're a design pattern that uses multiple inheritance.
Now, what multiple inheritance is, is where you can say: this class, SavedResBlock, inherits from two things, SaveModule and UNetResBlock, and what that means is that all of the methods in both of these will end up in here. That would be simple enough, except we've got a bit of a confusion, which is that UNetResBlock contains forward and SaveModule contains forward. It's all very well just combining the methods from both of them, but what if they have the same method? The answer is that the one you list first, when it calls super().forward, is actually calling forward in the one listed after it. That's why it's a mixin: it's mixing this functionality into this functionality. So it's a unet res block where we've customized forward, so that it calls the existing forward and also saves the result.

You see mixins quite a lot in the python standard library; for example, the basic HTTP stuff and some of the basic threading stuff in the networking modules use multiple inheritance with this mixin pattern. With this approach, the actual implementation of SavedResBlock is nothing at all: pass means don't do anything, so this is literally just a class with no implementation of its own other than being a mixin of these two classes. A saved convolution is an nn.Conv2d with SaveModule mixed in. So what happens now is that we can call a SavedResBlock just like a unet res block, and a SavedConv just like an nn.Conv2d, but that object is going to end up with its activations inside the .saved attribute.

So now a downsampling block is just a sequential of saved res blocks. As per usual, the very first one goes from the number of input channels to nf, the number of filters, and after that the inputs are also equal to nf, because the first one has changed the number of channels. We do that for however many layers we have, and at the end of that process, as we discussed, we add to that sequential a saved conv with stride 2 to do the downsampling, if requested. So we end up with a single nn.Sequential for a down block.

An up block is going to look very similar, but instead of using an nn.Conv2d with stride 2, upsampling is done with a sequence of an upsampling layer (literally all that does is duplicate every pixel four times into a little 2x2 grid; nothing clever) followed by a stride-1 convolution, so it can adjust some of those pixels as necessary with a simple 3x3 conv. That's pretty similar to a stride-2 downsampling; it's the rough equivalent for upsampling. There are other ways of doing upsampling; this is just the one that stable diffusion does.
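Roughly, the mixin pattern and the down block being described look like this, building on the UNetResBlock sketch above; the exact signatures and defaults are assumptions.

```python
from torch import nn

class SaveModule:
    # Mixin: customise forward so the result is stashed in .saved before returning.
    def forward(self, x, *args, **kwargs):
        self.saved = super().forward(x, *args, **kwargs)
        return self.saved

class SavedResBlock(SaveModule, UNetResBlock): pass   # the body really is just `pass`
class SavedConv(SaveModule, nn.Conv2d): pass

def down_block(ni, nf, num_layers=2, add_down=True):
    # A sequential of saved res blocks, plus an optional stride-2 saved conv at the end.
    layers = [SavedResBlock(ni if i == 0 else nf, nf) for i in range(num_layers)]
    if add_down:
        layers.append(SavedConv(nf, nf, 3, stride=2, padding=1))
    return nn.Sequential(*layers)
```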
So an up block looks a lot like a down block, except that now, as before, we're going to create a bunch of unet res blocks. These are not saved res blocks; in the upsampling path of the unet we want to use the saved results rather than save new ones, so we just use normal res blocks. But what we're going to do now is, as we go through each res net, we're going to call it not just on our activations, but on our activations concatenated with whatever was stored during the downsampling path. A list of all of the things stored in the downsampling path will be passed to the up block, and pop will grab the last one off that list, concatenate it with the activations, and pass that to the res net. So we need to know how many filters, how many activations, there were in the downsampling path; that's stored here, the previous number of filters in the downsampling path, and the res block takes those as inputs in addition to the normal number. We do that for each layer as before, and then at the end we add an upsampling layer if it's been requested (it's a boolean). Okay, so that's the upsampling block. Does that all make sense so far? Yeah, it looks good. Okay.

So the unet now is going to look a lot like our previous unet. We start out, as we tend to, with a convolution, to allow us to create a few more channels. What we pass to our unet is: how many channels are in your input image and how many channels are in your output image (for normal full-color images that'll be three and three); how many filters there are for each of those res blocks, up blocks and down blocks; and, in the downsampling, how many layers there are in each block. So the conv goes from in_channels, which is three, to nf[0]. These are the default numbers of filters, and they're pretty big as you can see. So that's the number of channels we create, which is very redundant: this is a 3x3 conv, so it only has 3 by 3 by 3 equals 27 inputs, and 224 outputs. It's not doing useful computation in a sense; it's just giving it more space to work with down the line. I don't think that makes sense, but I haven't played with it enough to be sure. Normally we would do a few res blocks or something at this level to increase it more gradually, because this feels like a lot of wasted effort, but I haven't studied it closely enough to be sure.

Jeremy, just a tweak: these are the defaults, I think, for the unconditional unet in diffusers, but the stable diffusion unet actually has even more channels: 320, 640, and then 1,280 and 1,280. Cool, thanks for clarifying; and yes, the unconditional one is what we're doing right now. That's a great point. Okay.
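Before walking through the channel sizes, here's a rough sketch of the up block just described. It builds on the UNetResBlock sketch earlier; for simplicity it assumes every saved activation popped inside one block has prev_nf channels, whereas the notebook's actual channel bookkeeping is fiddlier.

```python
import torch
from torch import nn

class UpBlock(nn.Module):
    def __init__(self, ni, prev_nf, nf, num_layers=2, add_up=True):
        super().__init__()
        # Each res block sees the current activations concatenated with one set
        # of saved activations from the downsampling path.
        self.resnets = nn.ModuleList(
            UNetResBlock((ni if i == 0 else nf) + prev_nf, nf) for i in range(num_layers))
        # Upsampling: duplicate each pixel into a 2x2 grid, then a stride-1 conv.
        self.up = (nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(nf, nf, 3, padding=1))
                   if add_up else nn.Identity())

    def forward(self, x, saved):
        for resnet in self.resnets:
            x = resnet(torch.cat([x, saved.pop()], dim=1))  # pop: last saved, first used
        return self.up(x)
```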
So then we go through all of our numbers of filters. The first res block is actually 224 to 224, which is why we're keeping track of this stuff, and then the second res block is 224 to 448, then 448 to 672, and then 672 to 896. That's why we have to keep track of these things. So we have a sequential for our down blocks and we just add a down block for each. The very last one doesn't have downsampling, which makes sense, because there's nothing after it, so there's no point downsampling; other than that they all have downsampling.

Then we have one more res block in the middle. Is that the same as what we did before? We didn't have a middle res block in our original unet; what about this one, did we have any mid blocks? No, we didn't. Okay, so it's new, but it's just another res block that you do after the downsampling. Then we go through the reversed list of filters, adding up blocks, and then one convolution at the end to turn it from 224 channels back to three channels.

And so the forward is going to store the saved activations for the layers, just like we did back with the earlier unet, except we don't have to do it explicitly now: we just call the sequential model, and thanks to our automatic saving we can then just go through each of those blocks and grab their .saved. That's handy. We then call the mid block, which is just another res block, and then, for the ups, we just pass in that saved list, remembering that it's going to pop items off each time. Then the conv at the end, and that's it: that's our unconditional model. It's not quite the same as the diffusers unconditional model, because it doesn't have attention, which is something we're going to add next, but other than that it's the same.

Because we're doing a simpler problem, Fashion-MNIST, we'll use fewer channels than the default. Using two layers per block is standard. One thing to note, though, is that in the upsampling blocks it's actually going to be three layers, num_layers plus one. The reason is that, the way stable diffusion and diffusers do it, even the output of the downsampling is also saved. So if you have num_layers equal to two, there'll be two res blocks saving things here and one conv saving things here, so you'll have three saved cross connections. That's why there's an extra plus one here.

Okay, and then we can just train it using miniai as per usual. No, I didn't save it after I last trained it, sorry about that, so you'll just have to believe it trained. Okay, now... oh, actually, no, that model is missing something else important as well as attention. The other thing it's missing is that thing we discovered is pretty important, which is the time embedding. We already know that sampling doesn't work particularly well without time embedding, so I didn't even bother sampling this, and I didn't want to add all the stuff necessary to make that work a bit better; I thought, let's just go ahead and do time embedding.

There are a few ways to do time embedding, and the way it's done in stable diffusion is what's called sinusoidal embeddings. The basic idea (maybe we'll skip ahead to it) is that we're going to create a res block with embeddings, where forward is not just going to get the activations but is also going to get t, which is a vector that represents the embedding of each timestep.
Well, actually it'll be a matrix, because there's a whole batch, but for one element of the batch it's a vector. And it's an embedding in exactly the same way as when we did NLP: each token had an embedding, so the word "the" would have an embedding, the word "jono" would have an embedding, and the word "tanishk" could have an embedding. Although Tanishk would probably actually be multiple tokens, until he's famous enough that he's mentioned in nearly every piece of literature, at which point Tanishk will get his own token, I expect. That's how you know when you've made it.

So the time embeddings are the same idea: timestep zero will have a particular vector, timestep one will have a particular vector, and so forth. Or actually, we're doing Karras, so they're not timesteps one, two, three; they're actually sigmas, so they're continuous, but it's the same idea: a specific value of sigma, which, slightly confusingly, is what t is going to be, will have a specific embedding. Now, we want two values of sigma or t which are very close to each other to have similar embeddings, and if they're different from each other they should have different embeddings. We also want a lot of variety in the embeddings across all the possibilities. So how do we make that happen? The way we do it is with these sinusoidal timestep embeddings. Let's have a look at how they work.

You first have to decide how big you want your embeddings to be, just like in NLP: is the word "the" represented by eight floats, or 16 floats, or 400 floats, or whatever? Let's assume it's 16 for now. And let's say we're just looking at a bunch of timesteps between negative 10 and 10, and we'll do a hundred of them. We don't actually have negative sigmas or t, so it doesn't exactly make sense, but it doesn't matter; it gives you the idea.

Then we say: okay, what's the largest timestep, or the largest sigma, you could have? Interestingly, every single model I've found uses 10,000 for this, even though that number actually comes from the NLP transformers literature, and it's based on the idea of: what's the maximum sequence length we support? You could have up to 10,000 things in a document or a sequence. But we don't actually have sigmas that go up to 10,000. So I'm using the number that's used in real life in stable diffusion and all the other models, but as far as I can tell it's there purely as a historical accident, because it's the maximum sequence length the NLP transformers people thought they would need to support.

Okay. What we're then going to do is take e to the power of a bunch of things, so we need an exponent, and the exponent is going to be equal to (minus) the log of the max period, which is about nine, times the numbers between nought and one, eight of them. We said we want 16 dimensions; you'll see why we want eight of them and not 16 in a moment. So here are the eight exponents we're going to use, and then, not surprisingly, we take e to the power of that, for each of these eight things. We've also got the actual timesteps. So imagine these are the actual timesteps
we have in our batch: there's a batch of 100, and they contain this range of sigmas or timesteps. To create our embeddings, what we do is take an outer product of those exponentiated values and the timesteps. This uses a broadcasting trick we've seen before: we add a unit axis, axis zero, to one of them, and a unit axis, axis one, to the other, so when we multiply them together each one broadcasts across the other's axis, and we end up with something that's 100 by 8. It's basically a cartesian product: all the possible combinations of timestep and exponent, multiplied together. Here are a few of those different exponents for a few different values.

That's not very interesting yet: we haven't yet reached something where each timestep is similar to its next-door timestep. Over here these embeddings look very different from each other, and over here they're very similar. So what we then do is take the sine and the cosine of those. That is 100 by 8, and that is 100 by 8, and concatenating them together gives us 100 by 16.

That's a little bit hard to wrap your head around, so let's take a look. Across the hundred timesteps, or hundred sigmas, this one here is the first sine wave, this one is the second, this one is the third, and this one is the fourth, and the fifth. You can see that as you go up to higher indices, you're basically stretching the sine wave out. And then once you get up to index eight, you're back at the same frequency as this blue one, because now we're starting the cosines rather than the sines, and cosine is identical to sine, just shifted across a bit. So these two light blue curves are the same, and these two orange curves are the same; they're just shifted across. (I shouldn't say lines; curves.)

When we concatenate those all together, we can draw a picture of it. This picture is 100 pixels across and 16 pixels top to bottom, and if you picked out a particular point, for example in the middle here, for t equals zero (or sigma equals zero), one column is an embedding. The bright represents higher numbers and the dark represents lower numbers, and you can see that every column looks different, even though the columns next to each other look similar. So that's called a timestep embedding.

This is definitely something you want to experiment with. I've tried to do the plots I thought were useful to understand this, and Jono and Tanishk also had ideas about plots for these, which we've shown, but the only way to really understand them is to experiment. So then we can put that all into a function, where you just say: what are the timesteps, how many embedding dimensions do you want, and what's the maximum period?
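Jeremy's actual function is built from the cells above; as a reference point, a minimal sketch following the same steps might look like this (the function and argument names are assumptions, not necessarily the notebook's):

```python
import math, torch

def timestep_embedding(tsteps, emb_dim, max_period=10000):
    # Exponentially spaced frequencies; half the dims get sines, half get cosines.
    exponent = -math.log(max_period) * torch.linspace(0, 1, emb_dim // 2, device=tsteps.device)
    freqs = exponent.exp()                               # (emb_dim//2,)
    args = tsteps[:, None].float() * freqs[None, :]      # outer product -> (n, emb_dim//2)
    return torch.cat([args.sin(), args.cos()], dim=-1)   # (n, emb_dim)

emb = timestep_embedding(torch.linspace(-10, 10, 100), 16)   # 100 "timesteps", 16 dims
print(emb.shape)   # torch.Size([100, 16])
```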
And to write it, all I did was copy and paste the previous cells and merge them together, so you can see there's our outer product, and there's our concatenation of sine and cosine. If you have an odd-numbered embedding dimension you have to pad it to make it even; don't worry about that. So here's something you can now pass the actual timesteps (or sigmas) and the number of embedding dimensions, and you'll get back something like this. It won't be a nice curve, because the timesteps in a batch won't all be next to each other, but it's the same idea.

Which goes back to your comment about the max period being super high. You said adjacent ones are somewhat similar, because that's what we want, but there is some change. But if you look at this first hundred, something like half of the embeddings look like they don't really change at all, and that's because 50 to 100, on a scale of 0 to 10,000... you'd want those to be quite similar, because they're so early in the super long sequence these were designed for. Yeah, so here we've actually got wasted space. So here we use a max period of 1,000 instead (I've changed the figure size so you can see it better), and it's using up a bit more of the space. Or go to a max period of 10, and now it's using it much better. So based on what you're saying, Jono, I agree: it seems like it would be a lot richer to use these timestep embeddings with a suitable max period, or maybe you just wouldn't need as many embedding dimensions. I guess if you did use something very wasteful like this, but you used lots of embedding dimensions, it's still going to capture some useful ones. Thanks, Jono. This is one of those interesting little insights about things that are buried very deep in code, which I'm not sure many people look at.

Okay, so let's do a unet with timestep embedding in it. What do you do once you've got this column of embeddings for each element of the batch? There are a few things you can do with it. What stable diffusion does (I think this is correct; I'm not promising I remember all the details right) is to make the embedding dimension twice as big as the number of activations, and then we can use chunk to take that and split it into two separate variables; chunk is literally just the opposite of concatenate. One of them is added to the activations and one of them is multiplied by the activations, so this is a scale and a shift. We don't just grab the embeddings as-is, though, because each res block might want to do different things with them. So we have an embedding projection, which is just a linear layer that allows them to be projected, from the number of embeddings to two times the number of filters, so that the torch.chunk works.
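Here's a small runnable sketch of that scale-and-shift conditioning; the names and shapes are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

nf, n_emb = 32, 64
emb_proj = nn.Linear(n_emb, nf * 2)          # project the embedding to 2*nf values

x = torch.randn(8, nf, 16, 16)               # activations
t_emb = torch.randn(8, n_emb)                # time embeddings for the batch

emb = emb_proj(F.silu(t_emb))[:, :, None, None]   # unit axes so it broadcasts over h, w
scale, shift = emb.chunk(2, dim=1)                # chunk: the opposite of concatenate
x = x * scale + shift                             # (some implementations use 1 + scale)
```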
We also use an activation function called SiLU; this is the activation function that's used in stable diffusion. I don't think the details are particularly important, but it looks basically like a rectified linear with a slight curvy bit, it's also known as Swish, and it's just equal to x times sigmoid(x). Activation functions don't make a huge difference, but they can make things train a little better or a little faster, and Swish has been something that's worked pretty well, so a lot of people are using Swish, or SiLU. I always call it Swish. I think SiLU was actually the original name; the GELU paper, which also had SiLU, was where it was originally invented, and maybe people didn't quite notice, and then another paper called it Swish and everybody called it Swish, and then people were like, wait, that wasn't the original paper. So I guess I should try to call it SiLU.

Okay, other than that it's just a normal res block. We do our first conv, then we do our embedding projection of the activation function of the timestep embeddings, and that's going to be applied to every pixel, every height and width position, which is why we have to add unit axes on the height and width, so that it broadcasts across those two axes. We do our chunk to get the scale and shift, then we're ready for the second conv, and then we add it to the input, with an additional stride-1 conv if necessary, as we've done before, if we have to change the number of channels. Okay.

Now, because I like exercising our python muscles, I decided to use a second approach for the down block and the up block; I'm not saying which one's better or worse. We're not going to use multiple inheritance any more. Instead we're going to use something that's not even a decorator; it's just a function that takes a callable. What we're going to do now is use nn.Conv2d and EmbResBlock directly, but pass them to a function called saved. The function saved takes as input a callable, which could be a function or a module or whatever (in this case it's a module, an EmbResBlock or a Conv2d), and it returns a callable. The callable it returns is identical to the one passed in, except that it saves the result, the activations. Where does it save it? It saves it into a list in the second argument you pass to it, which is the block.

So the saved function: you pass it the module; we grab the forward from it and store that away to remember what it was; and then the function we want to replace it with, call it _f, takes some arguments and keyword arguments, calls the original module's forward passing in those arguments and keyword arguments, stores the result in the saved attribute, and then returns the result. We then replace the module's forward method with this function and return the module. Actually, I said callable, but it can't be any callable; it has to specifically be a module, because it's the forward that we're changing. The @wraps is just something from the python standard library that copies the documentation and so on from the original forward, so that it all looks like nothing's changed.
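A minimal sketch of that saved wrapper, assuming the block's .saved list has already been created elsewhere (as discussed next):

```python
from functools import wraps

def saved(m, blk):
    # Replace m.forward with a version that also appends the result to blk.saved.
    _orig = m.forward
    @wraps(_orig)                       # copies docs/metadata so nothing looks changed
    def _f(*args, **kwargs):
        res = _orig(*args, **kwargs)
        blk.saved.append(res)           # assumes blk.saved is an existing list
        return res
    m.forward = _f
    return m
```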
So where does this .saved list come from? I realize now that we could actually make this easier and automate it, but I didn't think of it at the time: you have to create the saved list here, in the down block. It actually would have made more sense, I think, for the saved function to say: if the saved attribute doesn't exist, create it, which would look something like: if not hasattr on the block for 'saved', set the block's saved to an empty list. If you did that, you wouldn't need this line any more. Anyway, I didn't think of that at the time, so let's pretend that's not what we do.

Okay. So now the downsampling conv and the res nets both contain saved versions of modules. We don't have to do anything to make that work; we just have to call them. We can't use sequentials any more, because we have to pass in the timestep embeddings to the res nets as well. It would be easy enough to create your own sequential for things with timesteps, which passes them along, but that's not what we're doing here. Maybe it makes sense for sequential to always pass along all the extra arguments, but I don't think that's how they work. And our up block is basically exactly the same as before, except we're now using EmbResBlocks instead, and just like before we're going to concatenate; that's all the same.

Okay, so a unet model with time embeddings. In the forward, the thing we're passing in now is a tuple containing the activations and the timesteps (or the sigmas in our case), so we split them out. Then we call that timestep embedding function we wrote, saying: these are the timesteps, and the number of timestep embedding dimensions we want is however many we asked for; we just set it equal to the first number of filters. That's all that happens there. Then we want to give the model the ability to do whatever it wants with those, to make them work the way it wants, and the easiest, smallest way to do that is to create a tiny little MLP. So we create a tiny MLP which takes the timestep embeddings and returns the actual embeddings to pass into the res blocks. The tiny MLP is just a linear layer with... hmm, that's interesting. My linear layer by default has an activation function; I'm pretty sure we should have act=None here. It should be a linear layer, then an activation, then a linear layer. So I think I've got a bug, which we'll need to fix and rerun.

It won't be the end of the world; it just means all the negatives will be lost here, which makes it only half as useful. That's not great. These are the kinds of things you've got to be super careful of: where do you have activation functions, where do you have batch norms, is it pre-activation or post-activation? It trains even if you make that mistake, and in this case it probably doesn't cost too much performance, but often it's a case of, oh, you've done something where you accidentally zeroed out all except the last few channels of the outputs of a block, or something like that, and when it trains anyway, it does the best it can with what's left. Which makes it very difficult;
so make sure you're not giving it those handicaps. It's not like you're making a CRUD app, where you know it's not working because it crashes, or because it doesn't show the username or whatever. Instead you just get slightly less good results, and since you haven't done it correctly in the first place, you don't know that they're less good results. There aren't really great ways to deal with this. It's really nice if you have an existing model to compare to or something like that, which is where Kaggle competitions work really well: if somebody's got a Kaggle result, that's a really good baseline, and you can check whether yours is as good as theirs.

All right, so that's what this MLP is for. The down and up blocks are the same as before, the conv out is the same as before. So we grab our timestep embedding (that's just the outer product passed through the sinusoidal sine and cosine), we pass that through the MLP, and then we call our downsampling blocks, passing in those embeddings each time. It's kind of interesting that we pass the embeddings in every time; I don't exactly know why we don't just pass them in at the start. In fact, in NLP these kinds of embeddings are, I think, generally just passed in at the start, so this is a curious difference, and I don't know if there have been ablation studies or whatever. Do you guys know, are there any popular diffusion-y or generative models with time embeddings that don't pass them in at every block, or is this pretty universal? Some of the fancier architectures, like recurrent interface networks and so on, just pass in the conditioning... oh, actually I'm not sure; maybe they do still do it at every stage. I think some of them just take in everything all at once up front and then do a transformer block or something like that, so I don't know if it's universal, but it definitely seems like all the unet-style ones have the timestep embedding going in at every stage. Maybe we should try some ablations to see if it matters. I guess it doesn't matter too much either way, but if you didn't need it at every step it would maybe save you a bit of compute, potentially.

So now the upsampling: you're passing in the activations, the timestep embeddings, and that list of saved activations. And now we have a non-attention stable diffusion unet. We can train that, and we can sample from it; I just copied and pasted all the sampling stuff from the Karras notebook. And there we have it: this is our first diffusion model from scratch. So we wrote every piece of code for this diffusion model? Yeah, I believe so. Obviously not in terms of optimized implementations of everything, but we've written our own version of everything here, I believe. A big milestone. I think so. And these FIDs are about the same as the FIDs we get from the stable diffusion unet; they're not particularly higher or lower. They bounce around a bit, so it's a little hard to compare, but they're basically the same. So that is an exciting step. Okay, that's probably a good time to have a five minute break.
Let's have a five minute break. ... Okay, normally I would say we're back, but only some of us are back. Jono's internet and electricity in Zimbabwe are not the most reliable things and he seems to have disappeared, but we expect him to reappear at some point, so we'll kick on regardless; hopefully Zimbabwe's infrastructure sorts itself out.

All right, so we're going to talk about attention, for a few reasons. Reason number one, very pragmatic: we said we would replicate stable diffusion, and the stable diffusion unet has attention in it, so we'd be lying if we didn't do attention. Number two: attention is one of the two basic building blocks of transformers. A transformer layer is attention attached to a one-hidden-layer MLP, and we already know how to create a one-hidden-layer MLP, so once we learn how to do attention we'll know how to create transformer blocks. Those are two good reasons. I'm not including the reason "our model is going to look a lot better with attention", because I actually haven't had any success seeing any diffusion models I've trained work better with attention. So, just to set your expectations: we are going to get it all working, but regardless of whether I use our implementation of attention or the diffusers one, it's not actually making things better. That might be because we need to use better types of attention than what diffusers has, or it might be because it's a very subtle difference that you only see on bigger images. I'm not sure; it's something we're still trying to figure out. This is all pretty new, and not many people have done the kind of diffusion ablation studies necessary to figure these things out. So that's just life. Anyway, there are lots of good reasons to know about attention; we'll certainly be using it a lot once we do NLP, which we'll be coming to pretty soon. And it looks like Jono is reappearing as well, so that's good.

Okay, so let's talk about attention. The basic idea of attention is that we have an image, and we're sliding a convolution kernel across that image. Obviously we've got channels as well, or filters, so this has those too. As we bring the kernel across, we're trying to figure out what activations we need to create to eventually, correctly, create our outputs. But the correct answer as to what's here may depend on something that's way over here, or something that's way over there. For example, if it's a cute little bunny rabbit and this is where its ear is, and there are two different types of bunny rabbit that have different shaped ears, it would be really nice to be able to see over here what its other ear looks like, for instance. With just convolutions, that's challenging. It's not impossible: we talked in part one about the receptive field, and as you get deeper and deeper in a convnet the receptive field gets bigger and bigger. But early on,
it probably can't see the other ear at all, so that information can't make it into those more texture-level layers. And later on, even though this might be in the receptive field of this point, the vast majority of the activations it's using come from the stuff immediately around it. So what attention does is let you take a weighted average of other pixels around the image, regardless of how far away they are. In this case, for example, we might be interested in bringing in at least a few of the channels of these pixels over here.

The way that attention is done in stable diffusion is pretty hacky, and known to be suboptimal, but it's what we're going to implement, because we're implementing stable diffusion; time permitting, maybe we'll look at some other options later. The kind of attention we're going to be doing is 1D attention. It was developed for NLP, and NLP deals with sequences, one-dimensional sequences of tokens. So to do attention stable diffusion style, we're going to take this image and flatten out the pixels: we take this row and put it here, then we take this row and put it here, so we've flattened the whole thing out into one big vector of all the pixels of row one, then all the pixels of row two, then all the pixels of row three. Or maybe it's column one, column two, column three; I can't remember if it's row-wise or column-wise, but it's flattened out, anyhow. For each image it's actually a matrix, which I'm going to draw a little bit 3D because we've got the channel dimension as well: the number across this way is going to be equal to the height times the width, and the number this way is going to be the number of channels.

Okay, so how do we decide which of these other pixels to bring in? What we do is basically create a weighted average of all of these pixels. Maybe these ones get a bit of a negative weight, these ones get a bit of a positive weight, and these get a weight somewhere in between. So each pixel (say we're doing this pixel here right now) is going to equal its original value, call it x_i, plus a weighted sum over all the other pixels, from zero up to the height times the width, of some weight times each pixel. The weights are going to sum to one, so that the scale of the pixel values isn't going to change. Well, that's not actually quite true: it could end up potentially twice as big, I guess, because it's being added to the original pixel. Attention itself is not the "x plus" part, but the way it's done in stable diffusion, at least, is that the attention output is added to the original pixel. So, now I think about it, we do need to think about how this is being scaled. Anyhow.
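Written out, the update being described is roughly the following (a reconstruction of the spoken formula, with $w_{ij}$ the attention weights):

$$x_i' \;=\; x_i \;+\; \sum_{j=1}^{h\cdot w} w_{ij}\,x_j, \qquad \sum_{j} w_{ij} = 1 .$$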
Well, that's not actually quite true It's going to end up potentially twice as big I guess because it's being added to the original pixel So attention itself is not with the x plus But the way it's done in stable diffusion at least is it's is that the attention is added to the original pixel So yeah, now I think about it Not need to think about how this is being scaled Anyhow So the big question is what value is to use for the weights and The way that we calculate those Is we do a We do a matrix product and so our For a particular pixel We've got You know the number of channels for that one pixel Um And what we do is we can compare that To all of the number of channels For all the other pixels So we've got kind of this is pixel Let's say x1 And then we've got pixel number x2 Right all those channels we can take the dot product Between those two things And that will tell us How similar They are And so one way of doing this would be to say like okay, well, let's take that dot product For every pair of pixels and that's very easy dot product to do Because that's just what the matrix product is equal to so if we've got H by w by c and then model play it by its transpose H by w nice Sorry It's said transpose and then totally fail to do transpose Um Model play by its transpose That will give us an H by w by H by w matrix So each pixel or the pixels are down here and for each pixel as long as these Add up to one Then we've got to wait for each pixel and it's easy to make these out up to one We could just take this matrix multiplication And take the sigmoid over the last dimension And that makes sorry not sigmoid Man, what's wrong with me? Softmax, right? Yeah And take the softmax over the last dimension And that will give me something that adds The sum equals one Okay, now the thing is It's not just that we want to find the places where They look the same where the channels are basically the same, but we want to find the places where they're like Similar in some particular way, you know, and so Some particular set of channels are similar in one to some different set of channels in another And so, you know in this case, we may be looking for the um pointy iridness You know, which actually represented by, you know, this this and this, you know, and we want to just find those So the way we do that is before we do this matrix product We first Put our matrix through Through a projection So we just basically put our matrix through a matrix multiplication Um This one so it's the same matrix, right? But we put it through two different projections And so that lets it pick two different kind of Sets of channels to focus on or not focus on before it decides, you know Oh, this pixel similar to this pixel in the way we care about And then actually we don't even just multiply it then by the original pixels. We also Put that Through a different projection as well. So there's these different projections well the projection one projection two And projection three and that gives it the ability to say like, oh, I want to compare these channels And, you know, these channels to these channels to find similarity and based on similarity I then want to pick out these channels Right both positive and negative weight. So that's why there's these three different projections And so the projections are called A U And V those are the projections and so they're all being passed the same Matrix and because they were all being passed the same matrix we call this self attention Okay, Jono to niche. 
I know you guys know this very well, but you also know it's really confusing. Do you have anything to add or change? Yeah, I like that you introduced this without resorting to the "let's think of these as queries" framing at all. Yeah, as we've noted, these are actually short for query, key and value, even though I personally don't find those useful concepts. And you'll note, on the scaling: you said we set it up so the weights sum to one, and then we'd need to worry about whether we're doubling the scale of x. Yes. But because of that third projection, aka V, it can learn to scale the thing that's added to x appropriately, so it's not just doubling the size of x; it's increasing it a little bit. Which is also why we scatter normalization in between all of these attention layers. But it's not as bad as it might be, because we have that V projection. Yeah, that's a good point. And the V projection is initialized such that it has a mean of zero, so on average it should start out by not messing with our scale.

Okay, I find it easier to think in terms of code, so let's look at the code; there's actually not much of it. (I think you've got a bit of background noise, Jono. Yes, that's better, thank you.) In terms of code: this is one of those cases where I wanted to get everything not just right, but identical to stable diffusion, so we can say we've made it identical to stable diffusion. I've actually imported the attention block from diffusers so we can compare, and it is so nice when you've got an existing version of something to compare to, to make sure you're getting the same results.

We're going to start by saying: let's say we've got a 16 by 16 pixel image at some deeper level of activations, with 32 channels and a batch size of 64. So NCHW. I'm just going to use random numbers for now, but these are reasonable dimensions for activations inside a batch-size-64 CNN or diffusion model or unet or whatever. The first thing we have to do is flatten these out, because, as I said, in 1D attention the 2D structure is just ignored. It's easy to flatten things out: you just say .view and pass in the dimensions you want, in this case the three dimensions 64, 32, and everything else; -1 means everything else. x.shape[:2]: obviously it would be easiest to just type 64, 32, but I'm trying to create something I can paste into a function later, so it's general. That's the first two elements, 64 and 32, and the star just inserts them directly in here, so it's 64, 32, -1, where -1 covers the 16 by 16.

Then, again because this is all stolen from the NLP world: in NLP they call this the sequence dimension, so I'm going to call it sequence too (by which we mean height times width), and sequence comes before channel, which is often called d, or dimension. So we transpose those last two dimensions, and we've now got batch by sequence (16 by 16) by channel-or-dimension: NSD, as in batch, sequence, dimension. Okay, we've got 32 channels, so we now need three different projections that go from 32 channels in to 32 channels out.
That's just a linear layer; and remember, a linear layer is just a matrix multiply plus a bias. So there are three of them, all randomly initialized to different random numbers, and we'll call them sq, sk and sv. They're just callables, so we can pass the exact same thing into all three (because we're doing self-attention) to get back our queries, keys and values, or Q, K and V. I just think of them as Q, K and V, because they're not really queries, keys and values to me. Then we do the matrix multiply by the transpose, and so for every one of the 64 items in the batch, for every one of the 256 pixels, there are now 256 weights. At least there would be if we had done the softmax, which we haven't yet.

So we can now put that into a self-attention module. As Jono mentioned, we want to make sure we normalize things, so we pop a normalization in here. We talked about group norm back when we talked about batch norm; group norm just normalizes over the channels split into a bunch of groups. Okay, so then we create our Q, K, V... yep, Jono? I was just going to ask: should those be bias=False, so they're only a matrix multiply, to strictly match the traditional implementation? No, because diffusers also does it this way; they have a bias in their attention blocks. Cool.

Okay, so we've got self.q, self.k and self.v as our projections, and to do 2D self-attention: we get n, c, h, w from our shape; we do our normalization; we do our flattening as discussed; we transpose the last two dimensions; we create our Q, K and V by doing the projections; and then we do the matrix multiply. We've got to be a bit careful here, because as a result of that matrix multiply we've changed the scale, by multiplying and adding all those things together. If we simply divide by the square root of the number of filters, it turns out that returns it to the original scale; you can convince yourself of that if you wish. We can now do the softmax across the last dimension, and then multiply by V, using a matrix multiply to do them all in one go. We didn't mention it before, but we then do one final projection, just to give it the opportunity to map things to some different scale, or shift them, if necessary; then we transpose the last two dimensions back to where they started, reshape back to the original shape, and add it back to the original input (remember I said it's going to be "x plus"). So this is self-attention, resnet style, if you like. Diffusers, if I remember correctly, does include the "x plus" in theirs, but some implementations, for example the PyTorch one, don't.

So that's a self-attention module, and all you need to do is tell it how many channels to do attention on; you need to tell it that because that's what we need for our four different projections, our group norm and our scale. I guess strictly speaking the scale doesn't have to be stored here; you could calculate it in forward, but either way is fine. Okay, so if we create a self-attention layer, we can call it on our little randomly generated tensor, and it doesn't change the shape, because we transpose and reshape it back, but we can see it's basically worked: it creates some numbers.
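Pulling those steps together, a rough single-head sketch of the module just described might look like this (the group count and other details are assumptions, not the exact notebook code):

```python
import math, torch
from torch import nn

class SelfAttention(nn.Module):
    def __init__(self, ni):
        super().__init__()
        self.norm = nn.GroupNorm(1, ni)        # number of groups is a guess here
        self.q, self.k, self.v = nn.Linear(ni, ni), nn.Linear(ni, ni), nn.Linear(ni, ni)
        self.proj = nn.Linear(ni, ni)          # final output projection
        self.scale = math.sqrt(ni)

    def forward(self, x):
        n, c, h, w = x.shape
        inp = x
        x = self.norm(x)
        x = x.view(n, c, -1).transpose(1, 2)         # n,c,h,w -> n,s,c with s = h*w
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(1, 2)) / self.scale  # n,s,s similarity scores
        x = attn.softmax(dim=-1) @ v                 # weighted sum of the v's
        x = self.proj(x)
        x = x.transpose(1, 2).reshape(n, c, h, w)
        return x + inp                               # "x plus": add back the input

sa = SelfAttention(32)
print(sa(torch.randn(64, 32, 16, 16)).shape)         # shape is unchanged
```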
How do we know if the numbers are right? Well, we can create a diffusers attention block; that will be randomly initialized too. They name things a bit differently: they have a query, key and value projection, an attention projection and a group norm, where we call ours q, k, v, proj and norm, but they're the same things. So we can just zip those tuples together (that takes the first pair, the second pair, the third pair, and so on) and copy the weight and the bias from our attention block to the diffusers attention block. Then we can check that they give the same value, which you can see they do, so this shows us that our attention block is the same as the diffusers attention block, which is nice.

Here's a trick which neither diffusers nor PyTorch uses, for reasons I don't understand: we don't actually need three separate projections here. We can create one projection from ni to ni times 3; that's basically doing three projections at once, so we could call it qkv. That gives us 64 by 256 by 96 instead of 64 by 256 by 32, because it's three sets, and then we can use chunk, which we saw earlier, to split it into three separate variables along the last dimension, to get our Q, K and V. Then we do the same thing: q @ k.transpose, etc. So here's another version of attention where we have just one projection for QKV, and we chunk it into separate Q, K and V, and it does the same thing; it's just a bit more concise, and should be faster as well. At least if you're not using some kind of XLA compiler or ONNX or Triton or whatever: for normal PyTorch this should be faster, because it's doing less back and forth between the CPU and the GPU.

All right, so that's basic self-attention. This is not what's done, basically ever, however, because in fact the question of which pixels I care about depends on which channels you're referring to. The channels that are about what color its ear is, as opposed to how pointy its ear is, might depend more on, say, whether this bunny is in the shade or in the sun, so maybe you mainly want to look at its body over here to decide what color to make the ear, rather than how pointy to make it. So different sets of channels need to bring in information from different parts of the picture, depending on which channel we're talking about. The way we do that is with multi-headed attention, and multi-headed attention turns out to be really simple, both in implementation and conceptually. What we do is say: let's come back to when we look at the channels here, and split them into four separate vectors, one, two, three, four. And let's do the whole dot product thing on just the first part with the first part, then the whole dot product thing with the second part and the second part, and so forth. So we're just going to do separate matrix multiplies for different groups of channels. The reason we do that is that it then allows different sets of channels to pull in information from different parts of the image. These different groups are called heads; I don't know why, but they are. Does that seem reasonable? Anything to add?
It's maybe worth thinking about why, with just a single head specifically, the softmax starts to come into play. We said it's like a weighted sum, able to bring in information from different parts and so on, but with softmax what tends to happen is that whatever weight is highest gets scaled up quite dramatically, so it's almost focused on just that one thing. And then, as you said Jeremy, different channels might want to refer to different things, and having just one single weight across all the channels means that signal is going to be focused on maybe only one or two things, as opposed to being able to bring in lots of different kinds of information for the different channels. Right. And you were pointing out... I was going to mention the same thing, actually. That's a good point. So you're mentioning the second important point about softmax: point one is that it creates something that sums to one; point two is that, because of the e-to-the-power, it tends to highlight one thing very strongly. So if we had single-headed attention, your point, I guess, is that it would end up basically picking nearly all of one pixel, which would not be very interesting. Okay, awesome.

Oh, I see why everything's got thick; I've accidentally turned my pen into a marker. Right, okay: multi-headed attention. I'll come back to the details of how it's implemented, but I'm just going to mention the basic idea first. This is multi-headed attention, and it's identical to before, except I've stored one more thing: how many heads do you want? And the forward is nearly all the same: this is identical, identical, identical, this is new, identical, identical, identical, new, identical, identical. So there are just two new lines of code, which might be surprising, that that's all we needed to make this work, and they're also pretty wacky, interesting new lines of code to look at.

Conceptually, what these two lines do is: first we do the projection, and then we take the number of heads. We're going to do four heads; we've got 32 channels and four heads, so each head is going to contain eight channels. We keep it as eight channels, not 32, and we make each batch four times bigger, because the images in a batch don't combine with each other at all; they're totally separate. So instead of having one image containing 32 channels, we turn it into four "images" containing eight channels each. And that's actually all we need, because remember, I told you we want each group of channels, each head, to have nothing to do with the others. If we literally turn them into different images, then they can't have anything to do with each other, because items in a batch don't interact at all. So this rearrange (I'll explain how it works in a moment) is basically saying: think of the channel dimension as being h groups of d, and rearrange it so that the batch dimension is instead n groups of h, and the channel dimension is now just d. So that would be eight, instead of four times eight. Then we do everything else exactly the same way as usual, but with the channels having been split into those groups, and at the end we go the other way: okay, we were thinking of the batch as being of size n by h; let's now think of the channels as being of size h by d again. That's what these two rearranges do.
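As a quick shape check of those two lines (einops needs a pip install, as mentioned later):

```python
import torch
from einops import rearrange

t = torch.randn(64, 256, 32)     # batch, sequence (h*w), channels
nheads = 4

th = rearrange(t, 'n s (h d) -> (n h) s d', h=nheads)    # (256, 256, 8): heads folded into batch
# ... q/k/v projections, softmax, weighted sum happen here exactly as before ...
out = rearrange(th, '(n h) s d -> n s (h d)', h=nheads)  # (64, 256, 32): heads back into channels
print(th.shape, out.shape)
```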
And then we do everything else exactly the same way as usual, but now the channels have been split into groups of h, into groups of four. And then after that, okay, well, we were thinking of the batches as being of size n by h; let's now think of the channels as being of size h by d. That's what these rearranges do.

So let me explain how these work in the diffusers code. I can't remember if I duplicated it or was just inspired by it, but they've got things called heads_to_batch and batch_to_heads which do exactly these things. For heads_to_batch they say: okay, you've got 64 per batch, by 256 pixels, by 32 channels; let's reshape it so you've got 64 images, by 256 pixels, by four heads, by the rest, which would be 32 over four, so eight channels. So it splits the heads out into a separate dimension. And then if we transpose these two dimensions, it'll be n by four, so n by heads, by sequence length, by minus one. And then we can reshape so those first two dimensions get combined into one. So that's what heads_to_batch does, and batch_to_heads does the exact opposite: it reshapes to bring the batch back out here, then heads by sequence length by d, and then transposes it back and reshapes it back again so that the heads get folded back into the channels.

So that's how to do it using just traditional PyTorch methods that we've seen before. But I wanted to show you this new-ish library called einops. It's inspired, as the name suggests, by Einstein summation notation, but it's absolutely not Einstein summation notation; it's something different. The main thing it has is this thing called rearrange, and rearrange is kind of a nifty rethinking of Einstein summation notation as a tensor rearrangement notation.

So we've got a tensor called t that we created earlier, 64 by 256 by 32, and what einops rearrange does is you pass it a specification string that says: turn this into this. The left side says I have a rank-three tensor, three dimensions, three axes, where the first dimension is of length n, the second dimension is of length s, and the third dimension, in parentheses, is of length h times d, where h is eight. And then I want you to move things around, so that nothing is broken and everything is shifted correctly into the right spots, so that each batch is now n times eight, n times h; the sequence length is the same; and d is now the number of channels. Previously the number of channels was h times d; now it's d, so the number of channels has been reduced by a factor of eight. And you can see it here: it's turned t from something of size 64 by 256 by 32 into something of size 64 times 8, by 256, by 32 divided by 8.

And this is really nice, because this one line of code, to me, is clearer and easier, and I liked writing it better than these lines of code. But what's particularly nice is that when I had to go the opposite direction, I literally took this, cut it, put it over here, and put the arrow in the middle. It's literally backwards, which is really nice, because we're just rearranging it in the other order. And so if we rearrange it in the other order, we take our 512 by 256 by 4 thing that we just created and end up with a 64 by 256 by 32 thing, which is what we started with, and we can confirm that every element of the result equals the original. So that shows me that my rearrangement has restored the original correctly.
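And here's a tiny self-contained version of that round trip, just to show the notation; it's my own toy cell using the same 64 by 256 by 32 shapes, not the notebook's.

```python
import torch
from einops import rearrange

t = torch.randn(64, 256, 32)                 # batch x sequence (pixels) x channels

# treat the 32 channels as 8 heads of 4 channels, and fold the heads into the batch
tb = rearrange(t, 'n s (h d) -> (n h) s d', h=8)
print(tb.shape)                              # torch.Size([512, 256, 4])

# the same spec written backwards undoes it exactly
t2 = rearrange(tb, '(n h) s d -> n s (h d)', h=8)
print(torch.equal(t, t2))                    # True
```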
Yeah, so, multi-headed attention: I've already shown you that it's the same thing as before, but pulling everything out into the batch for each head and then pulling the heads back into the channels. So we can do multi-headed attention with 32 channels and four heads, and check that it all looks okay.

PyTorch has all of that built in; it's called nn.MultiheadAttention. Be very careful, be more careful than me in fact, because I keep forgetting that it actually expects the batch to be the second dimension. So make sure you write batch_first=True to make batch the first dimension, and that way it'll be the same as diffusers. I mean, it might not be identical, but it should be almost the same, the same idea. And to make it self-attention you've got to pass in three things; the three things will all be the same for self-attention. This is the thing that's going to be passed through the q projection, the k projection and the v projection, and you can pass different things to those. If you pass different things to those, you'll get something called cross-attention rather than self-attention, which I'm not sure we're going to talk about until we do it in NLP.

Just on the rearrange thing: if you've been doing pure PyTorch and you're used to it, like you really know what transpose and reshape and whatever do, then it can be a little bit weird to see this new notation, but once you get into it, it's really, really nice. And if you look at the multi-headed self-attention implementation there, you've got .view and .transpose and .reshape. It's quite fun practice, if you're thinking this einops thing looks really useful, to take an existing implementation like this and ask: can I start replacing these individual operations, the .reshape or whatever, with the equivalent rearrange call, and then check the outputs are the same? That's what helped it click for me: oh, okay, I can start to express this; if it's just a transpose, then that's a rearrange with the last two dimensions swapped.

Yeah, and I only just started using this, and I've obviously had many years of using reshape, transpose and so on in Theano, TensorFlow, Keras, PyTorch and APL, and I would say within ten minutes I was like, oh, I like this much better. I find, for me at least, it didn't take too long to be convinced. It's not part of PyTorch or anything; you've got to pip install it, by the way. And it seems to be becoming super popular now, at least in the diffusion research crowd; everybody seems to be using einops, even though it's been around for a few years. And actually I put in an issue there and asked them to add Einstein summation notation as well, which they've now done, so it's kind of your one place for everything, which is great, and it also works across TensorFlow and other libraries as well, which is nice.
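Circling back to the nn.MultiheadAttention point from a moment ago, here's a quick sketch of using the built-in layer as self-attention, with batch_first=True as the important detail. The shapes and head count are the toy ones from above, not necessarily what the notebook settles on.

```python
import torch
from torch import nn

x = torch.randn(64, 256, 32)     # batch x sequence x channels

# batch_first=True so that the batch really is the first dimension
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

# passing the same tensor as query, key and value makes this self-attention;
# passing different tensors to the three arguments would give cross-attention instead
out, _ = mha(x, x, x)
print(out.shape)                 # torch.Size([64, 256, 32])
```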
Okay, so we can now add that to our unet. This is basically a copy of the previous notebook, except that I did this at the point where it turned out that cosine scheduling is better, so I'm back to the cosine schedule now. This is copied from the cosine schedule notebook, and we're still doing the minus 0.5 thing because we love it.

And this time I actually decided to export stuff into miniai.diffusion. Up to this point I felt like things were working pretty well, so I renamed the unet conv to pre-conv, since it's a better name. The time step embedding's been exported, upsample's been exported, and this is a pre-act linear version, exported. I tried using nn.MultiheadAttention and it didn't work very well for some reason; I haven't figured out why that is yet, so I'm using this self-attention which we just talked about, multi-headed self-attention. You know, the scale: we have to divide the number of channels by the number of heads, because the effective number of channels is divided across the heads. And instead of specifying the number of heads here, you specify attention channels. So if ni is 32 and attention channels is 8, then you calculate, yeah, here we've got four heads. That's what diffusers does; I think it's not what nn.MultiheadAttention does. And actually I think ni divided by (ni divided by attention channels) is just equal to attention channels, so I could probably have put that instead. Anyway, never mind.

Okay, so that's all copied in from the previous one. The only thing that's different here is that I haven't got the .view minus one thing here, so this is a 1D self-attention, and then 2D self-attention just adds the .view before we call forward, and then reshapes it back again afterwards. So we've got 1D and 2D self-attention.

Okay, so now our EmbResBlock has one extra thing you can pass in, which is attention channels. And if you pass in attention channels, we're going to create something called self.attention, which is a 2D self-attention layer with the right number of filters and the requested number of attention channels. And this is all identical to what we've seen before, except that if we've got attention, then we add it. Oh yeah, and the attention that I did here is the non-resnetty version, so we have to do x plus; that's more flexible, because you can then choose to have it or not have it this way. Okay, so that's an EmbResBlock with attention.

And so now our down block: you have to tell it how many attention channels you want, because the res blocks need that. The up block: you have to tell it how many attention channels you want, because again the res blocks need that. And so now, the unet model. Where does the attention go? Okay, we have to say how many attention channels you want, and then you say at which index block you start adding attention. And so then what happens is the attention is done here: each resnet has attention, and, as we discussed, you just do the normal res and then the attention.
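As a sketch of that 1D versus 2D distinction: the 2D version just flattens height and width into the sequence dimension, runs the 1D attention (the SelfAttentionMultiHead sketch from earlier), and reshapes back, and inside the res block it's added residually so it stays optional. The wiring and the names here are mine, not the notebook's exact code.

```python
import torch
from torch import nn
from einops import rearrange

class SelfAttention2D(nn.Module):
    "2D self-attention: flatten h*w into a sequence, attend, then reshape back."
    def __init__(self, ni, attn_chans):
        super().__init__()
        # like diffusers: specify channels per head, and derive the number of heads
        self.attn = SelfAttentionMultiHead(ni, nheads=ni // attn_chans)

    def forward(self, x):                               # x: (n, c, h, w)
        n, c, h, w = x.shape
        x = rearrange(x, 'n c h w -> n (h w) c')        # spatial dims become the sequence
        x = self.attn(x)
        return rearrange(x, 'n (h w) c -> n c h w', h=h)

x = torch.randn(8, 32, 16, 16)
attn = SelfAttention2D(32, attn_chans=8)                # 32 channels, 8 per head -> 4 heads
print((x + attn(x)).shape)                              # the non-resnetty "x = x + attention(x)" pattern
```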
Right, and if you put that in at the very start, let's say you've got a 256 by 256 image, then you're going to end up with this matrix here that's 256 times 256 on one side and 256 times 256 on the other side, and contains however many NF channels. That's huge, and you have to backprop through it, so you have to store all of that to allow the backprop to happen. It's going to explode your memory. So what happens is that basically nobody puts attention in the first layers. That's why I've added an attention start, which says at which block we start adding attention, and it's not zero, for the reason we just discussed.

Another way you could do this is to say at what grid size you should start adding attention. And generally speaking, people say that when you get to 16 by 16, that's a good time to start adding attention, although stable diffusion adds it at 32 by 32, because remember they're using latents, which we'll see very shortly, I guess in the next lesson. So it starts at 64 by 64, and then they add attention at 32 by 32. So again we're replicating stable diffusion here: stable diffusion uses attention starting at index one. So when we go self.downs.append, the down block has zero attention channels if we're not up to that block yet, and ditto on the up blocks, except that we have to count from the end for the up blocks. Now that I think about it, the mid block should have attention as well, so that's missing.

Yeah, so the forward actually doesn't change at all for attention; it's only the init. So we can train that. And previously, without attention, we got to 137, and with attention... oh, we can't compare directly, because we've changed from Karras to cosine. We can compare the sampling though. So we're getting, what are we getting, four, five, five, five. It's very hard to tell if it's any better or not because, again, the cosine schedule is better anyway. But yeah, when I've done direct comparisons I haven't managed to find any obvious improvements from adding attention, but it's doing fine; four is great.

All right. So then, finally, did either of you want to add anything before we go into the conditional model? I was just going to make a note, I guess just to clarify: with attention, part of the motivation was certainly to do this sort of spatial mixing, to bring in information from different parts of the image and mix it, but the problem is that if it's too early, where you have more individual pixels, then the memory is very high.
So it seems like you have to get that balance: you kind of want it to be early so you can do some of that mixing, but you don't want it to be too early, where the memory usage is too high. So there is certainly a balance in trying to find the right place to add attention into your network. I was just thinking about that, and maybe that's a point worth noting.

Yeah, for sure. There is a trick, which is what they do in, for example, vision transformers, or DiT, the diffusion transformer, which is that if you take, say, an eight by eight patch of the image and flatten it all out, or run it through some convolutional thing to turn it into a one by one by some large number of channels, then you can reduce the spatial dimension by increasing the number of channels, and that gets you down to a manageable size where you can then start doing attention as well. So that's another trick, patching, where you take a patch of the image and turn it into some embedding dimension, or however you like to think of it, but it's a one by one rather than an eight by eight or a 16 by 16. And that's why you'll see 32 by 32 patch models, like some of the smaller CLIP models, or a 14 by 14 patch for some of the larger ViT classification models, and things like that.

Yeah, that's mainly used when you have a full transformer network, I guess, whereas this is one where we're incorporating the attention into a convolutional network. So there are, I guess, different tricks for different sorts of networks.

Yeah. And I haven't decided yet if we're going to look at ViT or not; maybe we should, based on what you're describing. I was just going to mention, though, since you mentioned transformers: we've actually now got everything we need to create a transformer. Here's a transformer block with the embeddings built in, and a transformer block with embeddings uses exactly the same embeddings that we've seen before. Then we add attention, as we've seen before; there's a scale and shift; and then we pass it through an MLP, which is just a linear layer, an activation, a normalisation and a linear layer. For whatever reason, GELU, which is just another activation function, is what people always use in transformers, for reasons that I suspect don't quite make sense in vision, and everybody uses layer norm. Again, I was just trying to replicate an existing paper, but this is just a standard MLP.

So if we get rid of the embeddings, just to show you a true, pure transformer: here's a pure transformer block. It's just normalize, attention, add; normalize, multi-layer perceptron, add. That's all a transformer block is. And then what's a transformer network? A transformer network is a sequential of transformer blocks. And so in this diffusion model I replaced my mid block with a sequential of transformer blocks, so that is a transformer network. And this is another version in which I replace that entire thing with the PyTorch transformer encoder; it's just taken from PyTorch, and I just replaced it with that. So yeah, we've now built transformers.
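Here's a minimal sketch of that pure transformer block, and of stacking it into a "transformer network": normalize, attention, add; normalize, MLP, add, with layer norm and GELU as mentioned. The MLP layout (linear, activation, norm, linear) and all the sizes are my reading of the description above, so treat the details as assumptions.

```python
import torch
from torch import nn

class TransformerBlock(nn.Module):
    "Pure transformer block: norm -> attention -> add, then norm -> MLP -> add."
    def __init__(self, ni, nheads=4, mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(ni)
        self.attn  = nn.MultiheadAttention(ni, nheads, batch_first=True)
        self.norm2 = nn.LayerNorm(ni)
        self.mlp   = nn.Sequential(          # linear, activation, norm, linear
            nn.Linear(ni, ni * mult), nn.GELU(),
            nn.LayerNorm(ni * mult), nn.Linear(ni * mult, ni))

    def forward(self, x):                     # x: (batch, seq, channels)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]         # normalize, attention, add
        x = x + self.mlp(self.norm2(x))       # normalize, MLP, add
        return x

# a "transformer network" is just a sequential of these blocks
net = nn.Sequential(*[TransformerBlock(32) for _ in range(4)])
print(net(torch.randn(64, 256, 32)).shape)    # torch.Size([64, 256, 32])
```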
Okay, so why aren't we using them right now? And why did I just say I'm not even sure if we're going to do ViT, which is vision transformers? The reason is that transformers are doing something very interesting. Remember, we're just doing 1D versions here, right? So transformers take something where we've got a sequence, which in our case is pixels, height by width, but it's called the sequence, and everything in that sequence has a bunch of channels. I'm not going to draw them all, but you get the idea. And so for each element of that sequence, which in our case is just some particular pixel, and these are just the filters, channels, activations, whatever, activations I guess, what we're doing is: first we do attention. Remember, there's a projection for each, so it's mixing the channels a little bit, but putting that aside, the main thing it's doing is that each row is getting mixed together into a weighted average. And then, after we do that, we put the whole thing through a multi-layer perceptron, and what the multi-layer perceptron does is look at each pixel entirely on its own. So let's say this one: it puts that through linear, activation, norm, linear, which we call an MLP.

And so a transformer network is a bunch of transformer layers, so it's basically going attention, MLP, attention, MLP, attention, and so on, MLP. That's all it's doing. In other words, it's mixing together the pixels, or sequence positions, and then it's mixing together the channels, and then the sequence positions again, and then the channels again, repeating this over and over. Because of the projections being done in the attention, it's not just mixing the pixels, but it is largely mixing pixels, and so this combination is very, very flexible. It's flexible enough that it provably can approximate any convolution you can think of, given enough layers, enough time and the right learned parameters.

The problem is that for this to approximate a convolution requires a lot of data, a lot of layers, a lot of parameters and a lot of compute. So if you try to use this, this transformer network, this transformer architecture, and you pass images into it, so you pass an image in and try to predict, say from ImageNet, the class of the image, using SGD to find weights for these attention projections and MLPs, then if you do that on ImageNet you will end up with something that does indeed predict the class of each image, but it does it poorly. It doesn't do it poorly because it's not capable of approximating a convolution; it does it poorly because ImageNet, the entire ImageNet 1k, is not big enough for a transformer to learn how to do this. However, if you pass it a much bigger dataset, many times larger than ImageNet 1k, then it will learn to approximate this very well, and in fact it'll figure out a way of doing something like convolutions that is actually better than convolutions.

And so if you then take that, which is going to be called a vision transformer, or ViT, that's been pre-trained on a dataset much bigger than ImageNet, and you fine-tune it on ImageNet, you will end up with something that is actually better than a ResNet. And the reason it's better than a ResNet is that these attention and MLP layers, which when combined can approximate a convolution, these transformers... you know, convolutions are our best guess as to a good
way to represent the calculations we should do on images, but there are actually much more sophisticated things you could do, if you're a computer and you can figure these things out better than a human can. And so a ViT actually figures out things that are even better than convolutions. And so when you fine-tune on ImageNet using a ViT that's been pre-trained on lots of data, that's why it ends up being better than a ResNet.

So that's why the things I'm showing you are not ones that contain transformers in diffusion: to make that work would require pre-training on a really, really large dataset for a really, really long amount of time. So anyway, we might only come back to transformers, well, not for a very long time, when we do them in NLP; in vision, maybe we'll cover them briefly. They're very interesting to use as pre-trained models. The main thing to know about them is that a ViT, which is really successful when pre-trained on lots of data, which they all are nowadays, is a very successful architecture. But literally the ViT paper says: we wondered what would happen if we took a totally plain 1D transformer and made it work on images with as few changes as possible. So everything we've learned about attention today, and about MLPs, applies directly, because they haven't changed anything.

And one of the things you might realise that means is that you can't use a ViT that was trained on 224 by 224 pixel images on 128 by 128 pixel images, because all of these self-attention things are the wrong size... actually, let me take that back: it's not really the attention. The problem is that the position embeddings are the wrong size. Sorry, that's something I forgot to mention: in transformers, the first thing you do is always take these pixels and add a position embedding to them. That can be done lots of different ways, but the most popular way is identical to what we did for the time step embedding, the sinusoidal embedding, and so it's specific to how many pixels there are in your image. So yeah, that's an example of one of the things that makes ViTs a little tricky. Anyway, hopefully you get the idea that we've got all the pieces that we need.

Okay, so with that discussion I think that's officially taken us over time, so maybe we should do the conditional model next time. Actually, you know what, it's tiny; let's just quickly do it now. You guys got time? Yeah, okay. So let's finish by doing a conditional model. For a conditional model, we're going to basically say: I want something where I can say, draw me the number...
sorry, draw me a shirt, or draw me some pants, or draw me some sandals. So we're going to pick one of the 10 Fashion-MNIST classes and create an image of that particular class. To do that, we need to know what class each thing is, and we already know what class each thing is, because it's the y label, which, way back in the beginning of time, we set; it's just called the label. So that tells you what category it is.

So we're going to change our collation function. We call noisify as per usual, which gives us our noised image, our time step and our noise, but we're also going to add to that tuple what kind of fashion item this is. So the first tuple will be noised image, time step and label, and then the dependent variable, as per usual, is the noise. And so what's going to happen now, when we call our unet, which is now a conditioned unet model, is that the input is going to contain not just the activations and the time step, but also the label. That label will be a number between zero and nine. So how do we convert a number between zero and nine into a vector that represents that number? Well, we know exactly how to do that: an nn.Embedding. We did that lots in part one. So let's make it exactly the same size as our time embedding, so the number of activations in the embedding is the same as for the time step embedding, which is convenient. So now, in the forward, we do our time step embedding as usual, we pass the labels into our conditioned embedding, the time embedding we put through the embedding layers as before, and then we just add them together. That's it. So this now represents a combination of the time step and the fashion item, and everything else is identical in both parts. So all we've added is this one thing, and then we literally just sum it up, so we've now got a joint embedding representing two things.

And then, yeah, we train it. And interestingly, it looks like the loss, well, it ends up about the same, but you don't often see 0.031. It's a bit easier for it to do a conditional model, because you're telling it what it is; just that makes it a bit easier. So then, to do conditional sampling, you have to pass in what type of thing you want, from these labels. And then we create a vector just containing that number, repeated as many times as there are items in the batch, and we pass it to our model. So our model has now learned how to denoise something of type c, and so now if we say, trust me, this noise is a noised image of type c, it should hopefully denoise it into something of type c. That's all there is to it; there's no magic there. So yeah, that's all we have to do to change the sampling. We didn't have to change ddim_step at all; literally all we did was add this one line of code, and we added it there. So now we can say, okay, let's use class ID zero, which is t-shirt slash top, pass that to sample, and there we go: everything looks like t-shirts and tops.
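To show how small the change is, here's a sketch of just the conditioning piece: an nn.Embedding for the class label, summed with a sinusoidal timestep embedding of the same size, plus the repeated class-id vector you'd pass in at sampling time. The embedding size, the timestep_embedding helper and the class name are placeholders I've made up, not the notebook's actual ones.

```python
import math
import torch
from torch import nn

def timestep_embedding(t, emb_dim, max_period=10000):
    "Sinusoidal embedding of the timesteps, the same idea used for the time embedding earlier."
    half  = emb_dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, device=t.device) / half)
    args  = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class CondEmbedding(nn.Module):
    "Just the conditioning piece: a label embedding summed with the timestep embedding."
    def __init__(self, n_classes=10, n_emb=128):        # 10 Fashion-MNIST classes
        super().__init__()
        self.cond_emb = nn.Embedding(n_classes, n_emb)  # label (0-9) -> vector of size n_emb

    def forward(self, t, c):
        return timestep_embedding(t, self.cond_emb.embedding_dim) + self.cond_emb(c)

# the conditioned unet uses this joint embedding wherever the plain time embedding went before;
# at sampling time, repeat the class id you want across the batch and pass it in:
emb = CondEmbedding()
t = torch.randint(0, 1000, (64,))
c = torch.full((64,), 0)            # class 0 = t-shirt/top, repeated over the batch
print(emb(t, c).shape)              # torch.Size([64, 128])
```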
Yeah, okay, I'm glad we didn't leave that till next time, because we can now say we have successfully replicated everything in stable diffusion except for being able to create whole sentences, which is what we do with CLIP. We're getting really close. Yes, well, except that bit requires quite a lot of work, so I guess we might do that next, or, depending on how research goes... All right, we still need to do the latent diffusion part. Oh, good point, latents. Okay, we'll definitely do that next time. So let's see: yeah, we'll do a VAE and latent diffusion, which isn't enough for one lesson, so maybe some of the research I'm doing will end up in the next lesson as well. But yes, okay, thanks for the reminder. Although we've already kind of done autoencoders, so the VAE is going to be pretty easy. Well, thank you Tanishq and Jono, fantastic comments as always; glad your internet slash power reappeared, Jono. Back up. All right, thanks gang. Cool, thanks. That was great.