Hi everybody. Today we are covering lesson 23, and we're here with Johnno and Tanishq. How are you guys both doing?

Doing well, excited for another lesson.

Yeah, likewise.

Great. I shamefully have to start by admitting to a bug, which kind of messed things up in a sense, but I think what happened is actually really interesting. The bug was in notebook 23, the Karras notebook, and it's about measuring the FID. To recall, FID measures how similar a bunch of samples from a model are to a bunch of samples of real images, and that similarity is defined as a kind of distance between the distributions of the features in a classifier or some other kind of model. So that means that to get FID we have to load a model, and we have to pass it some data loaders so that it can calculate what the samples from real images look like.

Now the problem is that the data loaders I was passing actually had images whose pixels were between negative 0.5 and positive 0.5. But you might recall that this model I trained has pixels between negative one and one. So what this ImageEval class would have seen, and specifically this cmodel which we are getting the features from, is a whole bunch of unusually low-contrast images. They wouldn't really have looked like many things in the dataset, because in the dataset, I think particularly for Fashion-MNIST, things are pretty consistently normalized in terms of going all the way from zero to one, or negative one to one. Well, I guess zero to 255 in the original. So as a result, I think the features that came out of this would have been kind of weird. They might not necessarily have consistently said, oh, these are t-shirt features and these are shoe features, but they would have said, oh, this is a weird low-contrast image feature.

And then the shame continues, in that I added another bug on top of this bug, which is that when I then did the sampling, I didn't multiply by two. And the data that I trained it on was actually from the same data loaders, or specifically the same transform. Where did it come from? Yeah, the same transformi, not noisify, the same transformi, which previously went from negative 0.5 to 0.5. So I trained the model using this restricted input space as well, and therefore it was spitting out things that were between negative 0.5 and 0.5. And so the FID then said, wow, these are so similar: the samples are consistently giving features of low-contrast things, and all of the real samples are low-contrast things, so those are really similar. And that's how we got really low numbers. So those low numbers were wrong.

I was a bit surprised, I guess, that the Karras model was doing so much better, and it had certainly made me a big believer in the Karras model. But actually it's not doing so much better. Once we fix that, the FIDs are actually around 5, 6, 5, and the reals are two and a half. So to compare, we were getting some pretty good results in cosine.
Yeah, we were getting three, three, four, depending on how many steps we were doing with DDIM. So the result of this is this somewhat odd situation where the cosine model, where we accidentally scaled it to be negative 0.5 to 0.5 and then post-sampling multiplied by two, so we're not cheating like the Karras one used to be, is working better than Karras. That's a surprise to me, because I was thinking Karras was in theory optimally scaling things. But I guess the truth is it was scaling things to unit variance, and there's nothing in particular to say that's optimal scaling. So empirically we've found, kind of accidentally, a better way to scale things. Also, our dependent variable is different: it's not that Karras c mix combination, it's just the noise, the zero-one noise, the noise before it's multiplied by alpha.

Okay, so that's the bug. Anyway, I promised last time we would stop looking at Fashion-MNIST for a while.
So let's move on to Tiny ImageNet. The reason we're going to do this is that we're going to try to create U-Nets today, and I wanted to show an example of a nice U-Net we can create that combines a lot of the ideas we've been looking at. It's going to be a super-resolution U-Net, and doing super resolution on Fashion-MNIST isn't going to be very interesting, because the maximum training size we have is 28 by 28. So I thought we'd go a little bit bigger than that, to Tiny ImageNet, which is 64 by 64.

I found it quite difficult, actually, to find the Tiny ImageNet data, but eventually I discovered that it's still on the Stanford servers where it was originally created. It's just not linked to from anywhere. If this disappears, we will keep our forum and website up to date with other places to find it. Anyway, for now we can grab the URL from there and unpack it. shutil is a very handy little library inside the Python standard library, and one of the things it has is a very handy unpack_archive, which can handle zip files. It's going to put it in our data directory.

There are a few different ways we could process this, and I thought we might experiment with some things. I thought it wouldn't be a bad idea to try doing things the reasonably manual way, just to see what that looks like. Often this is the easiest way to do things, because it's a very well-defined set of steps. Step one is to create a dataset. A dataset is just literally something that has a length and that you can index into, so it has to have these two things defined. You don't have to inherit from anything; you just have to define these two things. Broadly speaking, in Python you generally don't have to inherit from things.
You just have to provide the methods that are expected. So our dataset is in a directory called tiny-imagenet-200, and then there's a train directory and a val directory for the training and the validation sets. In the train directory, and this is a pretty classic, normal thing, each category has its images in a separate folder, and specifically they're in an images subfolder. So what I wanted to do was start by grabbing all of the image files in path/train. The Python standard library has a glob function, which searches, recursively if asked to, for everything that matches a specification. This specification is path, slash, star dot JPEG, and then there's this star star here. I don't know why we need to say it twice, it's a bit weird, but to be recursive you both have to say recursive=True here and also put star star before the slash here. So that's going to give us a list of all the files inside path/train.

And so then if we index into that training dataset with zero, that will call __getitem__, passing an i of zero, and we will return a tuple. One item is the thing in self.files[i], which is this file, and then the label for it, which is the parent's parent's name. Okay, so there's a dataset that returns a tuple of two strings when you index into it. The first is the path of the image file, and the second is the name of the category it's in. These weird names are called WordNet categories.
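As a rough sketch of the dataset just described (not the notebook's exact code), the class below only defines a length and an index operation, and uses the parent's parent folder name as the label. The class name TinyDS matches the one used later in the lesson; the temporary directory and the two category folders are made up purely so the example runs on its own.

```python
from glob import glob
from pathlib import Path
import tempfile

class TinyDS:
    """Minimal dataset: no base class, just __len__ and __getitem__."""
    def __init__(self, path):
        self.path = Path(path)
        # Both '**' in the pattern and recursive=True are needed for a recursive search
        self.files = sorted(glob(str(self.path/'**/*.JPEG'), recursive=True))

    def __len__(self): return len(self.files)

    def __getitem__(self, i):
        f = self.files[i]
        # Layout is <path>/<category>/images/<file>.JPEG,
        # so the label is the name of the parent's parent directory
        return f, Path(f).parent.parent.name

# Hypothetical miniature copy of the train folder layout, for illustration only
root = Path(tempfile.mkdtemp())/'train'
for cat in ('n01443537', 'n01629819'):
    d = root/cat/'images'
    d.mkdir(parents=True)
    (d/f'{cat}_0.JPEG').touch()

ds = TinyDS(root)
path_str, label = ds[0]   # a tuple of two strings: file path and category id
```

Indexing returns plain strings at this stage; turning them into tensors is left to a separate transform step.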
They're codes that indicate concepts, basically, in English. One of the reasons I actually used this particular dataset is that it's going to force us to do some more data processing, which I think is good practice. That's because, weirdly, in the validation set, although it's in tiny-imagenet-200/val, which is the not-weird part, the weird part is that the images are not then in subdirectories organized by label. Instead there is a separate val_annotations.txt file, which looks like this. It says, for each file name, what category it is. It's also got the bounding box of whereabouts that is, but we're not going to be using that today.

So I decided to create a dictionary that would tell us, for each file, what category it's in. In this case here I'm doing something exactly like a list comprehension, but because it's not in square brackets, it's a generator comprehension, so it'll stream out the results. We're going to go through each line in this file and split on tab, so that's going to give us this, and then this, and then this, and then we're going to grab the first two. If you pass a list of lists, or list of tuples, or whatever, to dict, it will create a dictionary using these pairs as keys and values. So if we have a look, there it is. That's quite a nice, neat way to do it. And if you're not sure, you can just type dict, open brackets, and then hit Shift-Tab a couple of times, and it will show you the various options. You can see here I'm doing dict(iterable), because my generator is iterable, and it says, oh, that's exactly as if you'd created a dictionary and then gone for k, v in iterable: d[k] = v. So there's a nice little trick.
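To make the dict trick concrete, here is a small sketch. The two annotation lines are invented stand-ins in the tab-separated shape described above (file name, category, then bounding box numbers); the real file would be read line by line instead.

```python
# Made-up lines in the shape of the annotations file: name, category, bounding box
lines = [
    "val_0.JPEG\tn03444034\t0\t32\t44\t62",
    "val_1.JPEG\tn04067472\t52\t55\t57\t59",
]

# Generator comprehension (no square brackets): each item is a [name, category] pair
anno = dict(o.split('\t')[:2] for o in lines)

# Equivalent long-hand version, as the dict docstring describes it
anno_loop = {}
for k, v in (o.split('\t')[:2] for o in lines):
    anno_loop[k] = v
```

Both forms build the same mapping from file name to category; the bounding box fields are simply dropped by the slice.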
Okay. Now we need a dataset that works just like TinyDS, but the __getitem__ is going to label things differently. So I just inherited from TinyDS, which means we don't need to do __init__ or __len__ again. And then __getitem__ again is going to return the i-th file, but this time the label will not be the parent's parent's name; instead we look up the name of the file in the annotations dictionary. And so that works, and we can check the length works.

So then a fairly generally useful thing that I thought we'd create is something that lets us transform any dataset. Here's a class that you can pass a dataset, and you can pass it a transformation for the x, or the independent variable, and a transformation for the y. Both of them default to noop, that is, no operation, so it just doesn't change it at all. For a transformed dataset, the length is just the length of the original dataset, but when we call __getitem__ it'll grab the tuple from the dataset we passed in, and it will return that tuple but with transform x and transform y applied to it. Does that make sense so far? Great.

Okay, so I don't like working with these n030... things, but luckily the dataset has a WordNet IDs file in it. So if I just open it up... oh, sorry, this one actually is not quite going to help us. This is just a list of all of the WordNet IDs that they have images for. We could have got this by simply listing this directory, which would have told us all the IDs, but they've also got a text file containing all of them. So we can see that there are 200 categories. And that's useful because we're going to want to change n030 etc. into an int, and the way we can change it into an int is by simply saying, we'll call this one zero, and this one one, and so forth, right?
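The transformed-dataset wrapper described a moment ago could look roughly like the sketch below. The name TfmDS is the one used later in the lesson; the parameter names tfmx and tfmy, the noop helper, and the toy list standing in for a real dataset are assumptions for illustration.

```python
def noop(x):
    """No operation: return the input unchanged."""
    return x

class TfmDS:
    """Wrap any indexable dataset of (x, y) tuples and transform each side on access."""
    def __init__(self, ds, tfmx=noop, tfmy=noop):
        self.ds, self.tfmx, self.tfmy = ds, tfmx, tfmy

    def __len__(self): return len(self.ds)

    def __getitem__(self, i):
        x, y = self.ds[i]
        return self.tfmx(x), self.tfmy(y)

# A plain list already has a length and can be indexed, so it works as a dataset here
raw = [('a.JPEG', 'n01'), ('b.JPEG', 'n02')]
tds = TfmDS(raw, tfmx=str.upper)   # transform only x; y falls back to noop
```

Because the wrapper only relies on len and indexing, it can sit on top of either the training or the validation dataset without knowing which one it has.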
So the int-to-string, or id-to-string, version of this is literally this list, so zero will be that. The string-to-int version, which you do all the time, is basically enumerate, which gives us the index and the value for everything in the list. Those are going to be our keys and values, but actually we're going to invert it to become value colon key, and that's what str2id will be. Note here that we have a dictionary comprehension; you can tell because it's got curly brackets and a colon. We could have used that for this as well, a dictionary comprehension instead. But yeah, there are lots of ways of doing things, and none of them is any better or worse than any other.

Okay, so those were the tags, or whatever. Do we have the names for them, or is that something...?

Yes, the names I'm going to get to shortly. There's a words.txt.

All right, I grabbed one batch of data and took its mean and standard deviation, and I've just copied and pasted them in here for normalizing. So my transform x is going to read the image. If you read it as RGB, that's going to force it to be three channels, because actually some of them are one channel. Divide it by 255, so it'll be between zero and one, and then we normalize. And then for our y's, we go through str2id to get the id and just use that as our tensor. So doing it manually is actually pretty straightforward, because now we just pass those to our TfmDS, our transformed dataset. We can check that yi is a tensor, but we can look it up to get its value, and xi is an image tensor with three channels, so channel by height by width, as is normal for PyTorch. For showing images, it's nice to denormalize them.
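Two of the small pieces above, sketched in plain Python: the inverted enumerate that gives str2id, and a normalize/denormalize pair. The three IDs and the mean and standard deviation are placeholder values, not the numbers computed in the notebook, and scalars stand in for image tensors.

```python
# id -> string is just the list itself; a short made-up list for illustration
id2str = ['n01443537', 'n01629819', 'n01641577']

# string -> id: enumerate gives (index, value); the comprehension inverts it to value: index
str2id = {v: k for k, v in enumerate(id2str)}

# Placeholder statistics (the lesson takes them from one batch of real data)
mean, std = 0.45, 0.27

def norm(x):
    """Normalize a pixel value that is already scaled to the 0..1 range."""
    return (x - mean) / std

def denorm(x):
    """Undo norm, which is what you want before displaying an image."""
    return x * std + mean
```

The same two formulas apply unchanged to per-channel tensors, as long as mean and std broadcast over the channel dimension.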
So that's just denormalizing. And if we show the image that we just grabbed, it's a water jug, I guess.

All right, so now we can create a DataLoader for our training set. It's going to contain our transformed training dataset, and we pass in a batch size. This one has to be shuffled. Not sure why I put num_workers equals zero there; generally eight's pretty good if you've got at least eight cores. So we can now grab an x batch and a y batch and take a look at a denormalized image from there. There we've got a nice little kitty cat. So I think this is already looking better than Fashion-MNIST.

There's this words.txt file that they've also provided, and this is actually a list of the entire WordNet hierarchy. The top of the hierarchy is entity, and one of the entity types is a physical entity or an abstract entity, entities can be things, and so forth. So this is how WordNet is handled. It's quite a big file, actually. If we go through each line of that file and again split on tabs, that's what backslash t means, that's going to give us the WordNet ID and then the name of it. So now we can go through all of those, they call them synsets, and if the key is in our list of the 200 that we want, we'll keep it. We don't really want something like "causal agent, cause, causal agency". The first one generally seems to be the most normal.
So I just split on comma and grab the first one.

All right, so we could then go through our y batch and just turn each of those numbers into strings, look each of those up in our synsets and join them up, and then use those as titles to see our Egyptian cat, and our cliff, and our guacamole, a monarch butterfly, and so forth. You can see that this is going to be quite tricky, because a cliff versus a cliff dwelling, for instance, could be quite complicated. I have a feeling that a hundred of the categories might have come from the normal ImageNet, and then I think they might have picked a hundred that are designed to be particularly difficult or something, if memory serves correctly.

All right, so then we could define a tfm_batch function with the same basic idea, and that's just going to transform the x and the y in a batch. Oh yes, we're about to use that; I should move that down a bit, because we're not quite there yet. Okay, so before that we can create our data loaders. We created a get_dls back in an earlier lesson, which simply turns that into a data loader and that into a data loader, and this one gets shuffled and that one doesn't, and so forth. Oh, I see, this is where we do our num_workers. Cool.

All right, so then we want to add our data augmentation. I noticed that training a Tiny ImageNet model,
I mean, it's a much harder thing to do than Fashion-MNIST, and overfitting was actually a real challenge. I guess it's because 64 by 64 isn't that many pixels. So I found I really needed data augmentation to make much progress at all. Now, a very common data augmentation is called random resized crop, which is basically to pick one area inside the image, zoom into it, and make that your image. But for such low-resolution images that tends to work really poorly, because it's going to introduce a lot of blurring artifacts. So instead, for small images, I think it's better to add a bit of padding around them and then randomly pick a 64 by 64 area from that padded area. So it's just going to shift them slightly. It's not a lot of augmentation, but it's something. Then we do our random horizontal flips, and then we'll use that random erase thing that we created earlier. This is just something I was experimenting with.

So now we can use that batch transform callback, using tfm_batch and passing in those transforms. This capital T is torchvision.transforms. Because these are all nn.Modules, you can pass them to nn.Sequential to have each of them called one at a time in a row. There's nothing magic about this; it's just doing function composition. We could easily create our own. In fact, there's also transforms.Compose, which does the same thing.

Yeah, I was going to say, we've got a fastcore compose, which, as you can see, basically just says for f in funcs: x = f(x). I don't know, the torchvision Compose I think might be kind of the old way to do it. Is that right?
I'm not sure. I have a feeling maybe this is considered the better way now because it's scriptable. I'm not promising that, though. But yeah, it does basically the same thing.

Okay, so we can now create a model as usual. Basically, I copied the get_model with dropout, get_dropmodel, from our earlier Fashion-MNIST stuff, and I started with a kernel-size-five convolution and then a bunch of res blocks. So this is all what we're used to seeing before.

And so we can take a look. In this case, as quite often seems to be the case, we accidentally end up with no random erasing. Let's just run it again. Hmm, it really doesn't want to do random erasing. Here we go, so we can see it. There's this very small border, you can hardly see it sometimes, and a bit of random erasing. And all of the batch is being transformed or augmented in the same way, which is kind of okay. It's certainly faster. It can be a bit of a problem if you have one batch that has lots and lots of augmentation being done to it, so that it's really hard to recognize. That could cause the loss to be very high in that batch, and if you've been training for ages, that could jump you out of the smooth part of the loss surface.
That's the one downside of this. So I'm not going to say it's always a good idea to do augmentation at the batch level, but it can certainly speed things up a lot if you don't have heaps of CPUs.

All right, so we can use that summary thing we created. There's our model. Because we're doubling the number of channels as we're decreasing the grid size, the number of megaflops per layer is constant, so that's a pretty good sign that we're using compute throughout. Then we can train it with AdamW, mixed precision, and our augmentations. I did the learning rate finder and trained it for 25 epochs, and got nearly 60, 59 percent. This took quite a while, actually, to get close to 60 percent, I've got to admit. And you can see that the training set's already up to 91, so we're kind of on the verge of overfitting.

Okay, so then I thought, all right, how do we do better? I wanted to have a sense of how much better we could get, and I tend to like to look at Papers with Code, which is a site that shows papers with their code and also how good the results were. So this is image classification on Tiny ImageNet. At first I was pretty disheartened to see all these 90-plus things, but as I looked at the papers I realized something. The first thing I noticed is that these ticks here represent extra training data, so these are actually pre-trained models that are only fine-tuned on Tiny ImageNet. So that's a total cheat. And then I looked more closely at this one, and actually these are also using pre-trained data.
So Papers with Code is actually incorrect. The first one I could see that I could clearly replicate and make sense of was this one, so the highest one that I'm confident of is this 72 percent. And so then I wanted to get a sense of how much work there is to get from 60 percent to 70 percent, and how good this is.

So I opened up the paper, and here's Tiny ImageNet. This paper turns out to be about a new type of mixup data augmentation. This is the normal kind of mixup and this is their special kind of mixup. On a ResNet-18, I see they're getting 63, 64, 65 with various different types of mixup, and 64 or 65 for their special one. And then if they use much bigger models than we're using, they can get up to 66-ish. So that made me think, okay, this classifier is not bad, but there's clearly room to improve it. And I can't help myself; I always have to try to do better.

So this is a good opportunity to learn about a trick that is used in real ResNets. In a real ResNet, we don't just say how many filters, or channels, or activations per layer and then go through and do a stride-2 conv each time. Instead, you can also say the number of res blocks per downsampling layer. So this would say do three res blocks and then downsample, or downsample and then do three res blocks, or something like that. Or do three res blocks, the first or the last of which is a downsample, and then two res blocks with a downsample, and then two res blocks with a downsample. So this has got a total of one, two, three, four, five downsamples, but rather than having one, two, three, four, five res blocks, it's going to have three, four, five, six, seven, eight, nine res blocks.
So it's nearly twice as deep. The way we do that is we just replace the places where it was saying ResBlock with res_blocks, and that's just a sequential which goes through the number of blocks and creates a res block for each. You can do it a couple of ways. In this case I said, if it's the last one, then make it stride 2, otherwise stride 1. So it's going to be downsampling at the end of each set of res blocks. That's the only thing I changed: I changed ResBlock to res_blocks and passed in the number of blocks, which is this.

Okay, so the number of megaflops is now 710-ish, which is more than double, so it should have more opportunity to learn stuff, which also could be more opportunity to overfit. Again we do our LR find, and I just did 25 epochs, and I didn't actually add more augmentation. That got up to nearly 62, so that was a good improvement. And interestingly, it's not overfitting more; if anything it's overfitting less. There's something about its ability to actually learn this which is slowing it down or something.

So I thought it'd be nice to train it for longer, and I decided to add more augmentation. To do that I decided to use something called TrivialAugment, which is not a very well-known approach, but it deserves to be. It comes from Frank Hutter's lab. Frank Hutter is somebody who consistently creates extremely practical, useful improvements, with much less of the nonsense that we often see from some of the huge, well-funded labs. This one's a bit of a reaction to some previous approaches, such as one called AutoAugment and one called RandAugment. They might have both come from Google Brain.
I'm not quite sure. They used many, many thousands of TPU hours to optimize how each set of images is augmented. And what these guys did is they said, well, what if we don't do that, but we just randomly pick a different augmentation for each image? And that's what they did. Algorithm 1 is the procedure: pick an augmentation, pick an amount, do it. I feel like they're almost trying to make a point by writing out this algorithm here. And they basically find this is at least as good as, or often better than, the incredibly resource-intensive ones. The incredibly resource-intensive ones also require a different version for every dataset, which is why they describe this as tuning-free.

Rather nicely, and surprisingly for me, it's actually built into PyTorch. If we go to PyTorch's website and look up TrivialAugmentWide, they show you some examples of TrivialAugmentWide. We can create our own as well. Now the thing is, I found that doing this at a batch level worked poorly, and I think the reason is what I described earlier. I think sometimes it will pick a really challenging augmentation, and it'll totally mess up the loss function.
And if every single image in the batch is like that, then it'll shoot it off into the distant parts of the weight space. Which is a good excuse for me to show how to do augmentations at a per-item level.

Now, some of these require a PIL image, a Python Imaging Library image, not a tensor, so I had to change things around. We have to import Image from PIL, and we have to change our tfm_x now, and we're going to do the augmentations in there instead. We're going to pass in something that says, do you want to do augmentations? For the training set we're going to pass aug=True, and for the validation set we won't. Image.open is how you create a PIL image object. Then if we wanted augmentations, we do these augmentations, and then convert it into a tensor; torchvision has a to_tensor we can call. Then we can normalize it, and actually I decided just to use torchvision's normalize; either is fine, this one works well. And then again, if you want augmentation, do your random erase. If you remember, our random erase was designed to use zero-one distributed Gaussian noise, so you want that to happen after normalization. That's why I do it in this order.

So now we don't need to use the batch tfm thing. We're just doing it all directly in the dataset. You can see you can do data augmentation in very simple ways without almost any framework help. In fact, nothing's coming from a framework really; it's just this little TfmDS we made. And so now we just pass that into our data loaders, get_dls, and we don't need any augmentation callback. All right.
So now we can keep improving things by doing something called pre-activation ResNets. If we go back to our original ResNet, you might recall that the way we did it, we have this conv block which consists of two convolutions in a row, and the second one has no activation. To remind you what conv is, we first of all do a conv, then optionally a normalization, and then optionally our activation function. The second of those has act=None. So basically what this is saying is go convolution, norm, activation, convolution, norm. That's what self.convs is. And then this is the identity path, which does nothing at all if there's no downsampling or change of channels. And then we apply the final activation function to the whole thing.

So that was how the original res block was designed, which is a bit of an accident, because to be honest, when I wrote that I didn't bother looking at the paper; I just did whatever seemed reasonable in my head. But then, looking into it further, I looked at this slightly later paper by the same author as the ResNet paper, Kaiming He. Kaiming He drew this version here on the left. As you can see, it's conv, norm, ReLU, conv, norm, add, ReLU. And he basically pointed out, you know what, maybe that's not great, because the ReLU is being applied to the addition, so there isn't actually an identity path at all. So wouldn't it be nice if we could have a pure identity path?
And so to do that he proposed reordering things to go norm, ReLU, conv, norm, ReLU, conv, add. This is called a pre-act, or pre-activation, res block. So that means I had to redefine conv to do norm, then act, and then conv. My sequential now has the activation in both places. And then of course there's no activation happening in the res block, because it's all happening in the convs. Does that make sense?

That makes sense, yeah.

Cool. So this is now exactly the same, except we now need to have an activation and a batch norm after all those blocks, because previously each one finished with a norm and an activation, and now it starts with them, so we have to put these at the end. It also means we can't start with a res block anymore. If we started with a res block, it would have an activation function at the start, which would throw away half of our data, which would be a bad idea. So you've got to be a bit careful with some of the details.

Now you can see that each image is getting its own augmentation. This one's been sheared; it looks like it's a door or something. Gosh, it's really hard to tell what the hell it is. This one's been moved. It looks like this one's also been sheared. And you can also see they've got different amounts of random erase on them. So I thought I'd try training that for 50 epochs, and that got us to 65 percent, which is nearly as good as the normal mixup things that they're getting even on a ResNet-50.
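For reference, the pre-activation ordering described above (norm, ReLU, conv, norm, ReLU, conv, add) can be sketched as follows. This is a minimal illustration rather than the course's ResBlock: the names pre_conv and PreActResBlock are made up, BatchNorm is assumed as the normalization, and average pooling plus a 1x1 conv is assumed for the shortcut when the shape changes.

```python
import torch
from torch import nn

def pre_conv(ni, nf, ks=3, stride=1):
    """Pre-activation ordering: norm, then activation, then the convolution."""
    return nn.Sequential(
        nn.BatchNorm2d(ni), nn.ReLU(),
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2))

class PreActResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        # norm-ReLU-conv twice; the second conv does any downsampling
        self.convs = nn.Sequential(pre_conv(ni, nf), pre_conv(nf, nf, stride=stride))
        # Shortcut is a true identity unless channels or resolution change
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        # No activation after the addition, so the identity path stays untouched
        return self.convs(x) + self.idconv(self.pool(x))

same = PreActResBlock(16, 16)              # same shape in and out: pure identity shortcut
down = PreActResBlock(16, 32, stride=2)    # halves the grid, doubles the channels
xb = torch.randn(2, 16, 8, 8)
y_same, y_down = same(xb), down(xb)
```

Because each block now begins with a norm and an activation, a network built from these would still need a final norm and activation after the last block, and a plain convolution rather than a res block at the very start, as noted above.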
This is looking really good. I won't spend time on this, but one of the things I should also mention is that they trained all of these for 400 epochs. So I was kind of curious what would happen if we trained it a bit longer. I wasn't patient enough to train it for 400 epochs, but I thought I could do 200 epochs, so I just duplicated that last one and made it 200 epochs, and that got us to 67 and a half, which is better than any of their non-special mixups. So I think it just goes to show you can get genuinely state-of-the-art results. If we used their special mixup, that would be interesting to try as well, to see if we can match their results there. But we've built all this from scratch. We didn't do the data augmentation from scratch, because it's not very interesting, but other than that. So I think that's really cool.

So I know that you did some other experiments with the pre-activation?

Oh, right, yeah. When I saw the pre-activation success, I was quite enthusiastic about it, so I actually thought, oh, maybe I should go back and use it everywhere. But weirdly enough, it was worse for Fashion-MNIST, and worse with less data augmentation. I mean, maybe it's not that weird, because when He et al. introduced it, they said this is to train deeper models. There's a more pure identity path, and that more pure identity path should let the gradients flow through more easily, and so there should be a smoother loss surface. So I guess it makes sense that you don't really see the benefits on less deep models.

That's the bit I'm surprised by. Could you elaborate? Because it seems like that sort of justification should be true for smaller models, right?
Well, yeah, it does, but smaller models are going to have a less bumpy surface anyway. They've just got fewer dimensions to be bumpy on, and more importantly, they're less deep, so there's less room for gradients to explode exponentially. So they're not as sensitive. But yeah, I mean, I can see why they don't necessarily help as much, but I don't have any idea why they were worse, and they were quite consistently worse. Yeah, I find it quite interesting too. Yeah, it's quite curious. And it's interesting that when we do these experiments on things that nowadays are considered pretty fundamental and foundational, you all the time discover things that nobody seems to have noticed or written about. There's plenty of room for a more experimental kind of researcher to do experiments, and then go, oh, that's interesting, and then try and figure out what's going on. Yeah, I think a lot of researchers go in the opposite direction: they try to start with theoretical assumptions and then test them. Well, when I think about it, I feel like maybe a lot of the more successful folks, in terms of people who build stuff that actually gets used, are more experimental first, maybe. Okay, so shall we have a five minute break, since we're kind of on the hour? Sure. All right. So let's now look at notebook 25, superres. I've just copied a few things from the previous notebook: some transforms, and our datasets, and our denorm, and our tfm_batch, and our tfmx. Let me check we're using tfm_batch here. We're not even using tfm_batch. Let's get rid of that, because that's just confusing. Okay, so let's figure this out. What are we doing here? So we've got our two datasets. All right, so the goal of this is we're going to do super resolution, not classification.
So let's talk about what that means. What we're going to do is the independent variable will be scaled down to a 32 by 32 pixel image, and the dependent variable will be the original image. And so to do random crop within a padded image and random flips, both the independent and the dependent variable need to have had exactly the same random cropping and exactly the same flipping. Otherwise it can't say, oh, this is how you do super res to go from the 32 by 32 to the 64 by 64; it might be like, oh, it has to be flipped around and moved around. So for this kind of image reconstruction task, it's important to make sure that your augmentation is done in the same way on the independent and the dependent variable. So that's why we've put it into our dataset. This is something people often get confused about, and they don't know how to do it, but it's actually pretty straightforward if we do it this way. We just put it straight in the dataset, and it doesn't require any framework fanciness. Now, what I then did is add random erasing, just to the training set. The reason for that is I wanted to make the super resolution task a bit more difficult, which means sometimes it doesn't just do super resolution, but it also has to replace some of the deleted pixels with proper pixels. So it gives it a little bit more to do, which can be quite helpful. It's a data augmentation technique, and also something to give it more of an opportunity to learn what the pictures really look like. Okay, so with that in place, these are going to do the padding, random cropping and flipping, the training set will also add random erasing, and then we create data loaders from those. Would it make sense to use TrivialAugment here? TrivialAugment, did you say?
Yeah. Maybe, yeah. I don't particularly see a reason not to, but only if you found that overfitting was a problem, and if you did do it, you would do it to both the independent and dependent variables. So yeah, here you can see an example. The independent variables, in this case all of them actually, have some random erasing. The dependent doesn't, so it has to figure out how to replace that with that. And you can also see that this is very blocky and this is less blocky. That's because this has gone down to 32 by 32 pixels, and this one's still at 64 by 64. In fact, once you go down that far, the cat's lost its eyes entirely, so it's going to be quite challenging. It's lost its lines entirely. So super resolution is quite a good task to try to get a model to learn what pictures look like, because it has to figure out how to draw an eye, and how to draw a cat's whiskers, and things like that. Were you going to say something, Jono? Sorry. Oh, I was just going to point out that the datasets are also simpler, because you don't have to load the labels. So there's no difference between the train and the validation now. It's just finding all the images. Good point.
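As an editorial aside, the idea of putting the augmentation "straight in the dataset" so that input and target share the same crop and flip can be sketched as below. This is a minimal sketch, not the notebook's code: `crop_and_flip` and `PairedDS` are hypothetical names, the padding amount is an assumption, and images are assumed to be already loaded as float tensors of shape (C, H, W).

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

def crop_and_flip(img, pad=4):
    # Random crop inside a zero-padded copy, then a random horizontal flip
    c, h, w = img.shape
    padded = F.pad(img, (pad, pad, pad, pad))
    top = int(torch.randint(0, 2 * pad + 1, (1,)))
    left = int(torch.randint(0, 2 * pad + 1, (1,)))
    out = padded[:, top:top + h, left:left + w]
    if torch.rand(1).item() < 0.5:
        out = out.flip(-1)
    return out

class PairedDS(Dataset):
    # Hypothetical dataset: no labels, the target is the augmented picture itself
    def __init__(self, imgs, tfm_x=lambda x: x):
        self.imgs, self.tfm_x = imgs, tfm_x

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, i):
        y = crop_and_flip(self.imgs[i])  # random augmentation happens exactly once
        x = self.tfm_x(y.clone())        # the input is derived from the augmented target
        return x, y
```

The design point is that the random choices are made once per item, and the input-only degradation (downscaling, random erasing) is applied afterwards via `tfm_x`, so the pair can never disagree about the crop or the flip.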
Yeah, because the label, you know, the dependent variable, is just the picture. Okay, so TfmDS has a tfmx, which is only applied to the independent variable. The independent variable has applied to it this pair of steps: resize to 32 by 32, and then interpolate. What that actually does is it ends up still with a 64 by 64 image, but the pixels in that image are all doubled up. So that means it's still doing super resolution, but it's not actually going from 32 by 32 to 64 by 64; it's going from a 64 by 64 where all of the pixels are two by two blocks. And it's just a little bit easier that way. We could certainly create a U-Net that goes from 32 to 64, but if you have the input and output image the same size, it can make the code a little bit simpler. I originally started doing it without this interpolate thing, and then I decided it was just getting a little bit confusing, and there's no reason not to do it this way, frankly. Okay, so that's our task. And the idea is that if it does a good job of this, you could pass 64 by 64 images into it, and hopefully it might turn them into 128 by 128 images. Particularly if you trained it on a few different resolutions, you'd expect it to get pretty good at resizing things to a bunch of different resolutions.
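As an editorial aside, the "resize to 32 by 32, then interpolate" step can be sketched with `torch.nn.functional.interpolate`. This is a sketch under assumptions: the function name `pixelate` is hypothetical, and bilinear downscaling is an assumed choice, since the lecture does not say which resize mode was used.

```python
import torch
import torch.nn.functional as F

def pixelate(img, small=32):
    # img: float tensor of shape (C, H, W)
    h, w = img.shape[-2:]
    # Downscale to small x small (bilinear here is an assumption)
    low = F.interpolate(img[None], size=(small, small), mode='bilinear', align_corners=False)
    # Nearest-neighbour upsample back to H x W: every low-res pixel becomes a 2x2 block
    return F.interpolate(low, size=(h, w), mode='nearest')[0]
```

With a 64 by 64 input, the output is still 64 by 64, so the model's input and output shapes match, but it only carries 32 by 32 worth of information.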
You could even call it multiple times. But anyway, for this I was just doing it to demonstrate. We have in previous courses trained bigger ones, for longer, with larger images. One of the interesting things is they tend to not only do super resolution, but they often make the images look better, because the pixels it fills in, it fills in with what that image looks like on average, which tends to average out imperfections. So often these super resolution models actually improve image quality as well, funnily enough. Okay, so let's consider the dumb way to do things. We've seen a kind of dumb way to do things before, which is an autoencoder. So go in with low expectations here, because we've done an autoencoder before, and it was so bad it actually inspired us to create the learner, if you remember. That was back in notebook eight. So basically what we're going to do is have a model which looks a lot like previous models. It starts with a res block with kernel size five, and then it's got a bunch of res blocks with stride two. But then we're going to have an equal number of up blocks. What an up block does is, sequentially, first of all an UpsamplingNearest2d, which is actually identical to this, right? So it's going to just double all the pixels. And then we pass that through a res block. So it's basically a res block with a stride of a half, if you like. It's undoing a stride two; it's upsampling rather than downsampling. And then we'll have an extra res block at the end to get it down to three channels, which is what we need. Okay, so we can do our learning rate finder on that, and I just train it pretty briefly, for five epochs. So this model is basically trying to take the image that we start with and then really squeeze it into, I guess, a small
representation, and then try to bring that small representation back up to the full super resolution. Yeah, exactly right, Tanishq. And we could have done it without any of the stride two, you know; I guess we could have just had a whole bunch of stride one layers. There's a few reasons not to do it that way, though. One is obviously just that the computation requirements are very high, because the convolution has to scan over the image, and when you keep it at 64 by 64, that's a lot of scanning. Another is that you're never forcing it to learn higher level abstractions by recognizing how to use more channels on a smaller grid size to represent it. So yeah, it's the same reason that in classifiers we don't leave it at stride one the whole time: you end up with something that's inefficient and generally not as good. Exactly. Yep. Thanks for clarifying, Tanishq. Okay, so the loss goes down, and the loss function I'm using is just MSE here, right? So it's how similar each pixel is to the pixel it's meant to be. And so then I can call capture_preds to get the predictions and the targets and the inputs, or probabilities, targets and inputs, I can't remember now. So here's our input images; they're pretty low resolution. And, oh dear, here's our predicted images. So pretty terrible. So why is that? Well, basically it's like the problem we had with our earlier autoencoder. It's really difficult to go from a two by two or a four by four or whatever image into a 64 by 64 image. We're asking it to do something that's just really challenging, and that would require a much bigger model trained for a much longer amount of time. I'm sure it's possible. And in fact, latent diffusion, as we've talked about, has a model that kind of does exactly that. But in our case, there's no need to make it so complicated.
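As an editorial aside, the shape of the autoencoder described above (a kernel size five block, some stride two blocks, an equal number of up blocks, then a final layer down to three channels) can be sketched as follows. This is only a sketch: plain conv blocks stand in for the course's res blocks, and `conv_block`, `up_block`, `autoencoder` and the channel counts are assumptions for illustration.

```python
import torch
from torch import nn

def conv_block(ni, nf, stride=1, ks=3):
    # Stand-in for a res block (an assumption, to keep the sketch short)
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2),
        nn.BatchNorm2d(nf),
        nn.ReLU(),
    )

def up_block(ni, nf, ks=3):
    # Double every pixel, then convolve: "a stride of a half"
    return nn.Sequential(nn.UpsamplingNearest2d(scale_factor=2), conv_block(ni, nf, ks=ks))

def autoencoder(nfs=(32, 64, 128, 256)):
    layers = [conv_block(3, nfs[0], ks=5)]
    # Downsampling path: one stride-2 block per channel increase
    layers += [conv_block(nfs[i], nfs[i + 1], stride=2) for i in range(len(nfs) - 1)]
    # Upsampling path: an equal number of up blocks, channels in reverse
    layers += [up_block(nfs[i], nfs[i - 1]) for i in range(len(nfs) - 1, 0, -1)]
    # Final layer back to three channels
    layers += [nn.Conv2d(nfs[0], 3, 3, padding=1)]
    return nn.Sequential(*layers)
```

Everything the output needs has to pass through the small grid in the middle, which is why this design struggles to reproduce fine detail.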
We can actually do something dramatically easier, which is we can create a U-Net. U-Nets were originally developed in 2015, and they were originally developed for medical imaging, but they've been used very, very widely since. I was involved in medical imaging at the time they came out, and certainly they quite quickly got recognized in medical imaging. They took a little bit longer to get recognized elsewhere, but nowadays they're pretty universal, and they are used in Stable Diffusion. Some of the details don't matter here. This is the original paper, so let's focus on the broad idea. This thing here we're going to call the downsampling path. So in this case, they started with 572 by 572 images. It looks like they started with one channel images, and then, as we've seen, they took them down to 284 by 284 by 128, and then down to 140 by 140 by 256, and then down to 68 by 68 by 512, and 32 by 32 by 1024. So here's this downsampling path, right? And then the upsampling path is exactly what we've seen before, right? So we upsample and have some convs. I mean, in the original thing they didn't use ResNets or res blocks, they just used convs, but the idea is the same. But the trick is these extra things across here, these arrows, which are copy and crop. What we can do is, during the upsampling, we've got a 512 channel thing here. We can upsample to a 512 channel thing, and then put it through a conv to make it into a 256 channel thing. And then what we can do is copy across the activations from here. Now, they actually do things in a slightly weird way: where they're downsampling they had 136 by 136 pixels, and over here they have 104 by 104, so they crop out the center bit. That's because they basically weren't padding things. Nowadays we don't have to worry about that cropping.
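As an editorial aside, the sizes read off the diagram, and the reason for the cropping, follow from simple arithmetic: each unpadded 3x3 conv removes 2 pixels per side length, each level has two of them, pooling halves the size and upsampling doubles it. The helper below is a hypothetical sketch that reproduces those numbers for a 572 pixel input.

```python
def unet_sizes(size=572, depth=4):
    # Side lengths in a U-Net built from unpadded 3x3 convs
    skips = []
    for _ in range(depth):
        size -= 4           # two unpadded 3x3 convs lose 2 pixels each
        skips.append(size)  # activation saved for the skip connection
        size //= 2          # 2x2 max pool
    size -= 4               # two convs at the bottom
    crops = []
    for skip in reversed(skips):
        size *= 2                   # upsample
        crops.append((skip, size))  # the skip must be cropped from `skip` to `size`
        size -= 4                   # two more unpadded convs
    return skips, crops, size

skips, crops, out_size = unet_sizes()
print(skips)     # sizes of the saved activations on the way down
print(crops)     # (saved size, size it is cropped to) on the way up
print(out_size)  # final output side length
```

The pair `(136, 104)` in `crops` is the 136 by 136 versus 104 by 104 mismatch mentioned above. With padded convolutions every level keeps its size, so the saved and upsampled activations line up and no crop is needed.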
So what we do is we literally copy over these activations, and we then either concatenate or add. You can see in this case they're concatenating: see how there's the white bit and the blue bit? So they have concatenated the two lots together. Actually, I think what they did here is they went from a 52 by 52 by 512 to a 104 by 104 by 256, and I think that's what this little blue rectangle here is. And then they copied across the 104 by 104 by 256, and then put the two together to get a 104 by 104 by 512. And so of these activations, half are from the upsampling and half are from the downsampling, from earlier in this whole process. It might be easiest to understand why that's interesting when we get all the way back up to the top, where we've got this 392 by 392 thing. The thing we're copying across now is just two convolutions away from the original image. So for super resolution, for example, we want it to look a lot like the original image. In this case, we're actually going to have an entire copy of something very much like the original image that we can include in these final convolutions. And ditto here: we have something that's like the somewhat downsampled version we can use here, and the more downsampled version we can use here. So yeah, that's how the U-Net works. Do either of you guys have anything to add, like things that you found helpful to understand, or anything surprising? I mean, I guess one thing is that these days a lot of people tend to just add. So you've got the outputs from the down layer, which are the same shape as the inputs for the corresponding up block, and then they just add them. Yeah, particularly for super resolution, adding might make more sense than concatenating, because you're literally saying, oh, this little two by two bit is basically the right pixel, but it just has to be slightly modified on the edges.
Yeah, it also makes me think of a boosting sort of thing, where, if you think about the fact that a lot of information from the original is just being passed all the way across at that highest skip connection, then the rest of the network can be effectively producing an update to that, rather than having to recreate the whole image. It's just another way. It's like a ResNet, but the skip connections are jumping from the start to the end, and from a bit after the start to a bit before the end. And I guess a ResNet's a bit like boosting too. Hmm, yeah. Yeah, I mean, I was kind of going to say the same thing. Basically, I think, compared to the denoising autoencoder, where we saw the results were even worse than the original image, here I guess the worst it could be is basically the original image. So I guess it's a similar sort of intuition to the one behind the ResNet and how that works. So yeah. I mean, it could be worse, if these convs at the end are incapable of undoing what these convs did, which is one argument for maybe why there should also be a connection from here over to here, and maybe a few more convs after that, which is something I'm kind of interested in and not enough people do, in my opinion. Another thing to consider is that they've only got two convs down here, but at this point you have the benefit of only being 28 by 28. You know, why not do more computation at this point?
You know, so there's a couple of things that sometimes people will consider, but maybe not enough. So, let me try to remember what I did. So in my U-Net here, we've got the downsampling path, which is a list of res blocks. Now, a ModuleList is just like a Sequential, except it doesn't actually do anything. So then in the forward we have to go through the down path and do x = l(x) each time. So it's basically a Sequential that doesn't actually do anything. And the up path is exactly the same as we saw before: it's a bunch of up blocks. And then, like we saw before, the final one is going to have to go to three channels. But now, for our forward, since we're going to be copying this over here and copying this over here, we have to save it during the downsampling path. So we're going to save it in something called layers. I actually decided to do the little trick I mentioned, which is to save the very first input. So I save the very first input, and I then put it through the very first res block. And then we go through each layer in the downward path. There's actually no need at all for there to be an i, l here; it doesn't have to be enumerated, because we don't use i. Okay, so for each layer l in the downward path, append the activations, so that as we go through each one we're going to be able to copy them over later by saving them, and then call the layer. Okay, so how many layers have we got? There's n layers that we've stored away. So now we're going to go through the upsampling path, and again we're going to call each one. But before we do, we're going to actually do the thing that Jono mentioned, which is to add rather than concatenate, unless this is the very first upsampling layer.
There's nothing to copy there, right? So unless it's the very first upsampling layer, let's just add the saved activations, and then call the layer. And then right at the very end, we'll add back the very first layer, and then pass it through the very last res block. All right, maybe that last one should be concatenated. I'm not sure. Anyhow, this is what I did. Now, the next thing that I wondered about was how to initialize this. Basically what I wanted to do is initialize this so that when it's untrained, the output of the model would be identical to the input. Because a reasonable starting point for what this looks like following super resolution would be this, you know? That's a reasonable starting point. So I just created this little zero weights thing, which zeroes out the weights and biases of a layer. Right, so I created the model, and then I said, okay, let's look at the very end of the upsampling path, and I will call that the last res block. And so let's zero out the very last convolutions, and also the id connection. That means that whatever it does for all this, at the very end it's going to have nothing in there. This will be zero. So that means that this will be equal to layer zero. And then that means we also want to make sure that this doesn't change anything, so then we can just zero out the weights there. That's probably not quite right, is it? I guess I should have actually set those to an identity matrix. Maybe I'll try to do that later. But at least it's something that would be very easy for it to... I have a question, Jeremy. Yeah? This zero weights thing.
I see a lot of people do a thing where they instead multiply by 1e-3 or 1e-4, to make the weights really small but not completely zero. And I don't have a good intuition for whether, in some sense, having everything set to zero fires off some warnings that maybe this is going to be perfectly balanced on some saddle point, or it's not going to have any signal to work with, and whether very small but not quite zero weights might be better. Do you have an intuition for that? Or not so much intuition, but more empirical, or both? I don't think it's an issue. I think it comes from a lot of people's PhD supervisors and so on, who come from back in an era when they were doing linear regression with one layer or whatever. And in those cases, yeah, if all the weights are the same, then no learning can happen, because every weight update is identical. But in this case all the previous weights are different, so they all have different gradients, and there's definitely nothing to worry about. I mean, multiplying by a small number would work too; it's not a problem. But setting it to zeros is fine, and honestly, I have to stop myself, because I always have this natural inclination to not want to set them to zeros, because of years of being told not to. But there's no reason that should be a problem. All right. I just wanted to say that the U-Net code is very concise, and it's interesting to see that the basic idea is very simple. Yeah, it's helpful, I think, when we just get it into a little bit of code, isn't it?
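As an editorial aside, the forward pass and zero initialization walked through above can be sketched as follows. This is a simplified sketch, not the notebook's code: the `ResBlock` here has no normalization, and the names `conv`, `ResBlock`, `TinyUnet`, `zero_wgts` and `init_identity` are assumptions for illustration. The extra res block down to three channels sits at the end of the up path so that the very first input can be added before the final block.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=1, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    # Simplified res block without norm layers (an assumption for brevity)
    def __init__(self, ni, nf, stride=1, final_act=True):
        super().__init__()
        self.convs = nn.Sequential(conv(ni, nf), conv(nf, nf, stride=stride, act=False))
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2)
        self.act = nn.ReLU() if final_act else nn.Identity()
    def forward(self, x):
        return self.act(self.convs(x) + self.idconv(self.pool(x)))

def up_block(ni, nf):
    return nn.Sequential(nn.UpsamplingNearest2d(scale_factor=2), ResBlock(ni, nf))

class TinyUnet(nn.Module):
    def __init__(self, nfs=(32, 64, 128)):
        super().__init__()
        self.start = ResBlock(3, nfs[0])
        self.dn = nn.ModuleList([ResBlock(nfs[i], nfs[i + 1], stride=2) for i in range(len(nfs) - 1)])
        self.up = nn.ModuleList([up_block(nfs[i], nfs[i - 1]) for i in range(len(nfs) - 1, 0, -1)])
        self.up.append(ResBlock(nfs[0], 3))         # back down to three channels
        self.end = ResBlock(3, 3, final_act=False)  # final block, no activation
    def forward(self, x):
        layers = [x]                    # save the very first input
        x = self.start(x)
        for l in self.dn:
            layers.append(x)            # save activations for the skip connections
            x = l(x)
        n = len(layers)
        for i, l in enumerate(self.up):
            if i != 0: x = x + layers[n - i]  # add the saved activations
            x = l(x)
        return self.end(x + layers[0])  # add back the very first input

def zero_wgts(l):
    with torch.no_grad():
        l.weight.zero_()
        l.bias.zero_()

def init_identity(model):
    last_res = model.up[-1]
    zero_wgts(last_res.convs[-1][0])   # last conv of the last up-path block
    zero_wgts(last_res.idconv)         # and its id connection
    zero_wgts(model.end.convs[-1][0])  # last conv of the final block
```

In this sketch the final block has the same number of input and output channels, so its identity path is a true identity, and after `init_identity` the untrained model returns its input unchanged.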
Yeah. Thanks. That's very simple code too. Okay, so we do an lr_find and then we train. And you can see that previously our loss, even after five epochs, was .207, and in this case our loss after one epoch is .086, so it's obviously much easier. And we end up at .073. Okay, so we can take a look. There's our inputs, and there's our outputs. So it's actually better, rather than dramatically worse, now. So that's good. Yeah, some of it's actually not bad at all, I would say. This car definitely looks a little over-smoothed, I think you could say. And if we look at the kid's eyes, they still aren't great; in the original he's actually got proper pupils. So yeah, it's definitely not recreated the original, but given limited compute and limited data, the basic idea is not bad. I do worry about the poor koala. It didn't have eyes here, but it ought to have known there should be eyes, in a sense, and it didn't create any. And maybe it should have done a better job on the eyes.
So my feeling is, and this is a pretty common way of thinking about this, that when you use mean squared error, MSE, as your loss function on these kinds of models, you tend to get rather blurry results, because if the model's not sure what to do, it's just going to predict kind of the average. So one good way to fix that is to use perceptual loss. I think it was Jono who taught us about perceptual loss, wasn't it, when we did the style transfer stuff? So perceptual loss is this idea, which is kind of similar as well to the FID idea, that we could look at some intermediate layer of a pre-trained model and try to make sure that our output images have the same features as the real images. And in this case the real image, if we went to kind of midway through a ResNet, ought to be saying there should be an eye here, and in this case this would not represent an eye very well. So that should give it some useful feedback to improve how it draws an eye here. So that's the basic idea. To do perceptual loss we need a classifier model. I just used the little 25 epoch one; I don't know why, I guess maybe that's all I had trained at that time. So let's use the little 25 epoch model. Then we just grab a batch from the validation set, and we can try it out by calling the classifier model. And here I'm doing it in fp16, just to keep my memory use down. I don't think this .half() would be necessary, since I've got autocast. Anyway, never mind. Okay, this is the same code we had before for the synsets. So here is our images. I'm just looking at some of them. They're a bit weird, aren't they?
I mean, koala's sort of fine. You know, I wouldn't have picked this as a parking meter. I wouldn't have picked this as a bow tie. So basically what this is doing here is showing us the predictions. So the predictions are not amazing. Trolley bus, that looks right. This is weird: it's called this one a neck brace and this one a basketball, and that looks more like a neck brace. The Labrador retriever it's got right, the tractor it's got right, centipede's right, mushroom's right. These probably aren't punching bags. Okay, so you can see our classifier is okay, but it's not amazing. I think this was one with about 60% accuracy. But the important thing is it's got enough features to be able to do an okay job. I have no idea what this is, but I'm pretty sure it's not a goose. Okay, so the model was very simple: just a bunch of res blocks, three, four, five, and then at the end we've got our pooling, flatten, dropout, linear, batch norm. So what we're going to do, just to keep things simple, is grab, I think, the end of res block three. A simple way to do that is to go from range four to the end of the model and delete those layers. If we do that and then look at the model again, you can now see I've got zero, one, two, three, and that's it. So this model is going to return the activations after the fourth res block. For perceptual losses, I think we talked about how you could pick a couple of different places; there's various ways to do it. This is just the simplest. I didn't even have to use hooks or anything.
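As an editorial aside, deleting the tail of an `nn.Sequential` to turn a classifier into a feature extractor can be sketched like this. The architecture below is a hypothetical stand-in, not the course's classifier: plain conv blocks replace the res blocks, and the channel counts are chosen only so that a 64 by 64 input gives 256 channel, eight by eight features.

```python
import torch
from torch import nn

def block(ni, nf, stride=2):
    # Stand-in for a res block (an assumption)
    return nn.Sequential(nn.Conv2d(ni, nf, 3, stride=stride, padding=1), nn.BatchNorm2d(nf), nn.ReLU())

# Hypothetical classifier: five blocks, then pooling, flatten, dropout, linear, batch norm
cmodel = nn.Sequential(
    block(3, 32, stride=1), block(32, 64), block(64, 128), block(128, 256), block(256, 512),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Dropout(0.2), nn.Linear(512, 200), nn.BatchNorm1d(200),
)

# Delete everything from index 4 onwards, keeping layers 0, 1, 2 and 3
for _ in range(4, len(cmodel)):
    del cmodel[4]

cmodel.eval()
with torch.no_grad():
    feats = cmodel(torch.randn(2, 3, 64, 64))
print(feats.shape)
```

Calling the truncated model now returns intermediate activations directly, which is why no hooks are needed for this simplest form of perceptual loss.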
We can just call cmodel. So just to take a look at what this looks like, and again we're going to use mixed precision here, we can grab our y batch as before and put it through our classifier model, and this is now going to give us those intermediate level features. So what's the shape of the features? It's batch size, 1024, by the number of channels of that layer, by the height and width of that layer. So these are eight by eight by 256 features we're going to be using for the perceptual loss. When I was doing this, I wanted to check whether things were vaguely looking reasonable. I would expect that these features from the actual y should be similar to the features from the output of our model. So I thought, okay, if we took that model that we trained, then we would hope that the features from the result of the model were at least of the same sign as they are in the real images. So this is just me comparing that, and it's like, oh yeah, they are generally the same sign. These are just little checks that I was doing along the way. And then I also thought I'd look at the MSE loss along the way. There's no need to keep all those in there; it was just stuff I was doing to debug as I went. Well, not even debug: to identify ahead of time if there were any problems. So now we can create our loss function. Our loss function is going to be the MSE loss, just like before, between the input and the target, which is just what's being passed in here, plus the MSE loss between the features we get out of cmodel and the features we get from the actual target image. And so the features we can calculate for the target image. Now, the target image we're not going to be modifying at all.
So we do that bit with no gradient. But we do want to be able to modify the thing that's generating our input, the model we're actually trying to optimize, so we do have gradient for that. So in each case we're calling the classifier model, once on the target and once on the input, and those are giving us our features. Now, then we add them together, but they're not particularly similar numerically. They're on very different scales, and we wouldn't want it to focus entirely on one or the other. So I just ran it for an epoch or two to check what the losses looked like, and I noticed that the feature loss was about 10 times bigger. So my very hacky way was just to divide it by 10. Honestly, that detail doesn't tend to matter very much, in my opinion, so there's nothing wrong with doing it a rather hacky way. There are papers which suggest more elegant ways to handle it, which isn't a bad idea to save you a bit of time if you're doing a lot of messing around with this. Jeremy, I don't know if you know it, but with the new VAE decoder from Stability AI for the Stable Diffusion autoencoder, they trained it some with just mean squared error and some with mean squared error combined with the perceptual loss, and they had a scaling factor of times 0.1. So exactly, there you go. So the answer is 0.1; that's official. And Andrej Karpathy says that the correct learning rate to use is always 3e-4, so we're getting all this sorted out now. That's good. All right, so for my U-Net we're going to do the same stuff as before in terms of initializing it, do our lr_find, and train it for 20 epochs. And obviously the loss is not comparable, because this loss now incorporates the perceptual loss as well. And so this is one of the challenges with these things. It's like, is it better or worse?
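As an editorial aside, the combined loss just described (pixel MSE plus feature MSE divided by 10, with the target features computed without gradient) can be sketched as below. The factory name `make_comb_loss` and the `feat_scale` parameter are hypothetical, and freezing the feature model's parameters is an added assumption rather than something stated in the lecture.

```python
import torch
from torch import nn
import torch.nn.functional as F

def make_comb_loss(feat_model, feat_scale=0.1):
    # feat_model: a truncated classifier returning intermediate activations
    feat_model.eval()
    for p in feat_model.parameters():
        p.requires_grad_(False)  # assumption: the classifier itself is never updated

    def comb_loss(inp, tgt):
        with torch.no_grad():
            tgt_feat = feat_model(tgt).float()  # target features need no gradient
        inp_feat = feat_model(inp).float()      # gradient flows back to the super-res model
        pixel_loss = F.mse_loss(inp, tgt)
        feat_loss = F.mse_loss(inp_feat, tgt_feat)
        return pixel_loss + feat_loss * feat_scale  # "divide it by 10"

    return comb_loss
```

The scale factor only balances the two terms; as the speakers say, the exact value tends not to matter much.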
Well, we're just going to have to take a look and compare, I guess, and maybe I should copy over our previous model's images so we can compare. Okay, there's our inputs, there's our outputs. And yeah, look, he's got pupils now, which he didn't used to have. The koala still doesn't quite have eyeballs, but it's definitely less out-of-focus looking. Is there some sort of flipping going on? Yeah, some of them are going to be flipped, because this is copied from earlier. So there's flipping and cropping going on, so they won't be identical. You can also see the background was all just blurred before, whereas now it's got texture, and if we look at the real one, the real one has texture. So yeah, clearly the perceptual loss has improved matters quite significantly. There's an interesting thing here, which is that there's not really any metric we can use now, right? Because if we did mean squared error, the one that's trained against mean squared error would probably do better, but visually it looks worse. Yeah, and if we use something like FID, well, that's based on the features of a pre-trained network, so that would probably be biased towards the one that's trained using those features, the perceptual loss. And so you get back to this very old school thing of, well, actually, how we're choosing is just by looking and evaluating. Right. And when you speak to someone like Jason Antic, who's made a whole career out of image restoration and super resolution and colorization, a big part of his process even now is still looking at a bunch of images to decide whether something is better, rather than relying on these metrics. Yeah, some PhD student yelled at me on Twitter a few weeks ago for saying, look at this cool thing our student made, look, they look better. And he was like, don't you know there's rigorous ways to measure these things? This is not a rigorous approach at all.
It's like, PhD students, man, they've got all the answers. Can't have a human looking at a picture and deciding if they like it or not; that's insane. Well, I'm a PhD student, and I agree that we should be looking at it. Yeah, okay, some PhD students are better than others. That's fair enough. What's this? Oh, right. Okay. So, talking of cheating, let's do that. So we're going to do something which is kind of fast.ai's favorite trick, and has been since we first launched, which is gradually unfreezing pre-trained networks. In a sense, it seems a bit funny to initialize all of this down path randomly, because we already have a model that's perfectly capable of doing something useful on Tiny ImageNet images, which is this. So what if we took our U-Net, right, and for model.start, which to remind you is the res block right at the front, why don't we use the actual weights of the pre-trained model? And then for each of the bits in the downsampling path, why don't we use the actual weights from that as well? This is a useful way to understand how we can copy over weights, which is that any part of an nn.Module is itself an nn.Module, and an nn.Module has a state_dict, which is a thing you can then pass to load_state_dict to put it somewhere else. So this is going to fill in the whole res block called model.start with the whole res block which is pmodel[0]. So here's how we can copy across that starting one, and then all the down blocks are going to have the rest of it. So this is basically going to copy into our model: rather than having random weights, we're going to have all the weights from our pre-trained model. And then, since they're good at doing something (they're not good at doing super resolution, but they're good at doing something), why don't we assume that they're good at doing super resolution?
So turn off requires_grad. What that means is that if we now train, it's not going to update any of the parameters in the down blocks. I guess I should have actually set requires_grad to False for model.start too, now I think about it. And so this is the classic fine-tune approach from fastai, the library. We're going to do one epoch of just the upsampling path, and that gets it to a loss of 255. Now, our loss function hasn't changed, so that's totally comparable. Previously our first epoch was 385, and in fact after one epoch with frozen weights for the down path we've beaten this. Now this is, in a sense, totally cheating, but in a sense it's totally not. It's totally cheating because the thing we're trying to do is to generate, for the perceptual loss, intermediate layer activations which are the same as this, and so we're literally using that to create intermediate layer activations. So obviously that's going to work. But why is it okay to be cheating? Well, because that's actually what we want. To be able to do super-resolution, we need something that can recognize there's an eye here.
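As a rough sketch of what freezing does, assuming a toy two-part model (`down` and `up` are hypothetical stand-ins for the pre-trained down path and the randomly initialized up path), one optimizer step leaves the frozen weights untouched while the rest still trains:

```python
import torch
from torch import nn

torch.manual_seed(0)

down = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # "pre-trained" part
up = nn.Conv2d(8, 3, 3, padding=1)                               # randomly initialized part
model = nn.Sequential(down, up)

# Freeze the down path: autograd will not compute gradients for these parameters.
for p in down.parameters():
    p.requires_grad_(False)

down_before = down[0].weight.clone()
up_before = up.weight.clone()

# Only hand the trainable parameters to the optimizer.
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

x = torch.randn(4, 3, 16, 16)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
opt.step()

frozen_unchanged = torch.equal(down_before, down[0].weight)
up_changed = not torch.equal(up_before, up.weight)

# Unfreezing later is the reverse; a fresh optimizer is then needed
# so that it also sees the newly trainable parameters.
for p in down.parameters():
    p.requires_grad_(True)
```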
So we already have something that knows that there's an eye there. And in fact, interestingly, this thing trained a lot more quickly than this thing, and it turns out it's better at super-resolution than that thing, even though it wasn't trained to do super-resolution. I think that's because the signal, which is just "what is this?", is a really simple signal to use. So we do that, and then we can basically go through and set requires_grad equals True again. The basic idea here being that when you've got a bunch of random weights, which is the whole upsampling path, and a bunch of pre-trained weights, the downsampling path, don't start by fine-tuning the whole thing, because at the start it's going to be crap. So just train the random weights for at least an epoch, and then set everything to unfrozen. And then we'll do our 20 epochs on the whole thing, and so we go from 255 to 249, 207, 198. So it's improved a lot.

So, to verify, with using these weights and comparing that to the perceptual loss: the perceptual loss is looking at the upsampled, the super-resolution images, but we're incorporating the weights for the downsampling path. And so that's what we add to, I guess, the original, downgraded one. Right, although we are just adding them, so if you have zeros in the upsampling path then it's going to be the same. So it is very easy for it to get the correct activations in the upsampling path. And then it's kind of a bit weird, because it goes all the way back to the top, creates the image, and then goes into the cmodel, the classifier, again. But I think it's going to create basically the same activations. It's a bit confusing and weird. So yeah, I mean, it's not totally cheating, but it's certainly a much easier problem to solve.

Okay, so let's get our results again. So there's our inputs. Yeah, so that's looking pretty impressive.
So the kid definitely looks pretty reasonable now. The car looks pretty reasonable. We still don't have eyes for the koala, such is life, but definitely the background textures look way better. The candy store looks much better than it did. The medicine looks a lot better than it did. So yeah, I think it looks great.

Okay, so then we can get better still. This is not part of the original U-Net, but making better models is often about, like, where can we squeeze in more computation and give it opportunities to do things? And there's nothing in particular that says that this downsampling thing is exactly the right thing you need here. It's being used for two things: one is this conv, and one is this conv. But those are two different things, and so it's kind of having to learn to squeeze both purposes into one thing. So I had this idea (I'm sure lots of people have had this idea, but whatever), which is: why don't we put some res blocks in here, which I called cross connections, or cross convs. I decided that a cross conv is going to be just a res block followed by a conv. And so the U-Net I just copied and pasted, but now as well as the downs I've also got crosses, and the crosses are cross convs. So now, rather than just adding the layer, I add the cross conv applied to the layer. I really should have added a cross conv for this one as well, now I think about it. This is probably the one that wants it the most. Oh well, never mind, another time.

Okay, so now, again, we can definitely compare loss functions. So this was 198, and everything else was the same. I did the same thing, because the downsampling is the same, so we can still copy in the state dict and set requires_grad, and it's better: 189. Quite a lot better, really, because these are hard-to-get improvements. So let's see if we can notice anything. Hey, look, it's got an eye! Just. Yeah. So how about that?
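A minimal sketch of the cross-conv idea, assuming a heavily simplified one-level U-Net: the `ResBlock` here is a stripped-down stand-in rather than the course's res block, and `TinyCrossUnet` is a hypothetical name. The point it illustrates is only that the saved down-path activation passes through its own res block and conv before being added on the way up.

```python
import torch
from torch import nn

class ResBlock(nn.Module):
    # Simplified stand-in: two convs with an identity shortcut.
    def __init__(self, nf):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1))
    def forward(self, x):
        return x + self.convs(x)

def cross_conv(nf):
    # "A res block followed by a conv", applied to the skip connection.
    return nn.Sequential(ResBlock(nf), nn.Conv2d(nf, nf, 3, padding=1))

class TinyCrossUnet(nn.Module):
    def __init__(self, nf=8):
        super().__init__()
        self.start = nn.Conv2d(3, nf, 3, padding=1)
        self.dn = nn.Conv2d(nf, nf * 2, 3, stride=2, padding=1)   # halves resolution
        self.cross = cross_conv(nf)
        self.up = nn.Sequential(nn.Upsample(scale_factor=2),
                                nn.Conv2d(nf * 2, nf, 3, padding=1))
        self.end = nn.Conv2d(nf, 3, 3, padding=1)
    def forward(self, x):
        saved = self.start(x)
        x = self.up(self.dn(saved))
        # A plain U-Net skip would be `x + saved`; here the skip gets its own computation.
        return self.end(x + self.cross(saved))

out = TinyCrossUnet()(torch.randn(2, 3, 16, 16))
```

The design choice is that the down-path activation no longer has to serve both the next down layer and the skip connection in exactly the same form, at the cost of some extra parameters per level.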
At this point, it's almost quite difficult to see whether it's an improvement or not, but I think there's a bit of an eye on the koala. I think it's encouraging. So that's our super-res.

Oh man, the bad news is we're out of time. Okay. We didn't promise to do a diffusion U-Net this lesson. So we built a U-Net. We built a U-Net! Yes, we did, and we did super-resolution with it, and it looks pretty good.

I've got to admit I haven't thought about exercises for people to do. What would be useful things for people to try? Maybe they could create a U-Net; they could try and learn about segmentation, create a U-Net for segmentation. Oh, yeah, I was just going to say, there were a couple of places where we said, "Oh, I should have tried this and should have tried that." I think that's the obvious next step. I was going to say, style transfer is a good thing to do with a U-Net, I think. With style transfer you can actually set up a loss function so that you can create a U-Net that learns to create images that look like Van Gogh, for example. It's a totally different approach. It's a tricky one, I think. When I was playing with that, it almost helped to not have the skip connections at the highest resolutions; otherwise it just really wants to copy the input and modify it slightly. Interesting. Maybe doing, um, where... would be better there too. Oh, yes, that's a good point.

Cool. Well, we'll put some stuff up on the website with ideas, and I'm sure some students, hopefully by the time you watch this, will have some ideas on the forum of things they've tried too. Yeah, colorization is nice because the transform is just to grayscale and back. Oh, yes, that's a really good one, actually. Okay, so there's all kinds of de-crappification you could do, isn't there?
So if you want to keep it a bit more simple, yes, rather than doing these two lines of code you could just turn it into black and white. That's a great point. Or you could delete the center every time, to create something that learns how to fill in. Or maybe delete the left-hand side, and that way you'd get something that you can give a photo and it'll invent a little bit more to the left. Yeah, and then you could keep running it to generate a panorama. Another one you could do would be to, in memory or something, save it as a really highly compressed JPEG, and so then it would be something that would learn to remove JPEG artifacts, which then, for your old photos that you saved with crappy JPEG compression, you could bring them back to life. You could probably do, I guess, drawing to painting or something like this, by taking some paintings, passing them through some sort of edge detection, and using that as your starting point. Sounds interesting. Oh, what about watermark removal? You could use PIL or whatever to draw watermarks, text, whatever, over the top, which is quite useful for, you know, radiology images and stuff, which sometimes have personally identifiable information written on them, and you can just learn to delete it.

Yeah, okay, so lots of things people can do. That's awesome. Thanks for your ideas. Basically any image-to-image task. Super. All right, or just make the super-res better. Try it on full ImageNet if you like, if you've got lots of hard drive space. Thanks, Jono. Thanks, Tanishq. See you next time.
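A few of the degradations suggested above can be sketched with Pillow as follows. The function names and default values are hypothetical; each function takes a clean RGB image and returns a "crappified" input that a U-Net could be trained to restore.

```python
import io
from PIL import Image

def to_gray(img):
    # Colorization task: drop the color, keep three channels so shapes match the target.
    return img.convert("L").convert("RGB")

def erase_center(img, frac=0.5):
    # Inpainting task: black out a centered rectangle covering `frac` of each side.
    out = img.copy()
    w, h = out.size
    bw, bh = int(w * frac), int(h * frac)
    x0, y0 = (w - bw) // 2, (h - bh) // 2
    out.paste((0, 0, 0), (x0, y0, x0 + bw, y0 + bh))
    return out

def jpeg_crappify(img, quality=10):
    # Artifact-removal task: round-trip through a heavily compressed JPEG, all in memory.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage on a synthetic image; with a real dataset these would run inside the input transform.
img = Image.new("RGB", (64, 64), (200, 50, 50))
gray, holed, crappy = to_gray(img), erase_center(img), jpeg_crappify(img)
```

In each case the training target stays the original image and only the input is degraded, which is the same setup as the super-resolution example.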