Welcome to the last lesson, Lesson 14. We're going to be looking at image segmentation today, amongst other things. Before we do, a bit of show and tell from last week. Elena Harley did something really interesting: she tried finding out what would happen if you trained CycleGAN on just 300 or 400 images. I really like these projects where people just go to Google Image Search, using the API or one of the libraries out there (some of our students have created very good libraries for interacting with the Google Images API), and download a bunch of stuff they're interested in; in this case, some photos and some stained glass windows. With 300 or 400 photos of that, she trained a model. She trained a few different models, actually, and this is the one I particularly liked. As you can see, with quite a small number of images she gets some very nice stained glass effects. So I thought that was an interesting example of what you can do with a pretty small amount of readily available data, which she would have downloaded pretty quickly. There's more information about that on the forum if you're interested.

It's interesting to wonder what kinds of things people will come up with using this kind of generative model. It's clearly a great artistic medium. It's clearly a great medium for forgeries and fakery. I wonder what other kinds of things people will realize they can do with these generative models. I think audio is going to be the next big area, and also very interactive stuff. NVIDIA just released a paper showing an interactive photo repair tool where you just brush over an object and it replaces it with a deep-learning-generated replacement very nicely. Those kinds of interactive tools will be very interesting too.

So before we talk about segmentation, we've got some stuff to finish up from last time, which is that we looked at doing style transfer by directly optimizing pixels. As with most things in part two, it's not so much that I want you to understand style transfer per se; the key takeaway is the idea of optimizing your input directly and using activations as part of a loss function. So it's interesting to then look at what is effectively the follow-up paper (not from the same people, but the paper that came next in this sequence of vision generative models), this one from Justin Johnson and folks at Stanford. It actually does the same thing, style transfer, but in a different way. Rather than optimizing the pixels, we're going to go back to something much more familiar and optimize some weights. Specifically, we're going to train a model which learns to take a photo and translate it into a photo in the style of a particular artwork. So each convnet will learn to produce one kind of style.

Now, it turns out that on the way to that point there's an intermediate step which I actually think is more useful and takes us halfway there: something called super resolution. So we're going to start with super resolution, and then we'll build on top of it to finish off the convnet-based style transfer. Super resolution is where we take a low-res image (we're going to take 72 by 72) and upscale it to a larger image, 288 by 288 in our case, trying to create a higher-res image that looks as real as possible.
And so this is a pretty challenging thing to do, because at 72 by 72 there's not much information about a lot of the details. The cool thing is that we're going to do it in a way which, as we tend to do with vision models, is not tied to the input size, so you could totally take this model and apply it to a 288 by 288 image and get something that's four times bigger on each side, so 16 times bigger overall. And often it even works better at that level, because you're really introducing a lot of detail into the finer details, and you could print out a high-resolution print of something which earlier on was pretty pixelated.

So this is the notebook called Enhance. It is a lot like that CSI-style "enhance", where we take something where the information appears just not to be there and we invent it; but the convnet is going to learn to invent it in a way that's consistent with the information that is there, so hopefully it's inventing the right information. One of the really nice things about this kind of problem is that we can create our own dataset as big as we like without any labeling requirements, because we can easily create a low-res image from a high-res image just by down-sampling.

So something I would love some of you to try during the week would be other types of image-to-image translation where you can invent your dependent variable. For example, de-skewing: either recognize things that have been rotated by 90 degrees, or better still, things that have been rotated by five degrees, and straighten them. Colorization: make a bunch of images black and white, and learn to put the color back again. Noise reduction: do a really low-quality JPEG save, and learn to put it back to how it should have been. And so forth. Or maybe take something that's been reduced to a 16-color palette and put it back to a higher-color palette. I think these things are all interesting because they can be used to take pictures you may have taken back on crappy old digital cameras before they were high resolution, or old photos you've scanned in that have now faded, or whatever. It's a really useful thing to be able to do. And it's a good project because it's really similar to what we're doing here, but different enough that you'll come across some interesting challenges on the way, I'm sure.

So I'm going to use ImageNet again. You don't need to use all of ImageNet at all; I just happen to have it lying around. You can download the 1% sample of ImageNet from fast.ai, or you can use any set of pictures you have lying around, honestly. In this case, as I said, we don't really have labels per se, so I'm just going to give everything a label of 0 so we can use it with our existing infrastructure more easily. Now, because I'm pointing at a folder that contains all of ImageNet, I certainly don't want to wait for all of ImageNet to finish in order to run an epoch. So here, most of the time, I would set the keep percentage to 1 or 2%; I generate a bunch of random numbers and keep the rows where the random number is less than 0.02. That lets me quickly sub-sample my rows.

All right, so we're going to use VGG16, and VGG16 is something that we haven't really looked at in this class. But it's a very simple model where we take our normal, presumably three-channel, input.
And we basically run it through a number of 3 by 3 convolutions, and then from time to time we put it through a 2 by 2 max pool, and then a few more 3 by 3 convolutions, max pool, and so on. That's our backbone, I guess. Then we don't do an adaptive average pooling layer. After a few of these, we end up with the usual 7 by 7 grid; I think it's about 7 by 7 by 512, something like that. And rather than average pooling, we do something different: we flatten the whole thing. So that spits out a very long vector of activations, of size 7 times 7 times 512 if memory serves correctly. That gets fed into two fully connected layers, each of which has 4096 activations, and then one more fully connected layer which has however many classes.

So if you think about it, the weight matrix here is huge; it's 7 by 7 by 512 by 4096. And it's because of that weight matrix, really, that VGG went out of favor pretty quickly: it takes a lot of memory, it takes a lot of computation, and it's really slow. There's also a lot of redundancy going on here, because those 512 activations are not that specific to which of the 7 by 7 grid cells they're in, but when you have this entire weight matrix of every possible combination, it treats all of them uniquely. That can also lead to generalization problems, because there are just a lot of weights.

My view is that the approach used in every modern network, adaptive average pooling (which Keras folks know as global average pooling, or the adaptive concat pooling we use in fast.ai), which spits it straight down to a 512-long activation, throws away too much geometry. So to me, the correct answer is probably somewhere in between and would involve some kind of factored convolution or some kind of tensor decomposition, which maybe some of us can think about in the coming months. For now, anyway, we've gone from one extreme, the adaptive average pooling, to the other extreme, this huge flattened fully connected layer.

So there are a couple of things which are interesting about VGG that make it still useful today. The first is that there are more interesting layers going on at the start. With most modern networks, including the ResNet family, the very first layer is generally a seven by seven conv, stride two, or something similar, which means we throw away half the grid size straight away, and so there's little opportunity to use the fine detail, because we never do any computation with it. That's a bit of a problem for things like segmentation or super resolution models, because the fine detail matters; we actually want to restore it. The second problem is that the adaptive average pooling layer entirely throws away the geometry in the last few sections, which means that the rest of the model doesn't really have as much interest in learning the geometry as it otherwise might. And so for things which depend on position, any kind of localization-based approach or anything that requires generative modeling, it's going to be less effective.

So one of the things I'm hoping you're hearing as I describe this is that probably none of the existing architectures are actually ideal. We can invent a new one. And actually, I just tried inventing a new one this week, which was to take the VGG head and attach it to a ResNet backbone.
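To make that concrete, here's a rough sketch of what that kind of hybrid might look like in PyTorch. This is my own illustration rather than the exact code from the experiment; the layer sizes, dropout, and number of classes are just assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet backbone: everything up to (but not including) the average pool and fc head
backbone = nn.Sequential(*list(models.resnet34().children())[:-2])

# VGG-style head: keep the 7x7 geometry, flatten it, then big fully connected layers,
# instead of collapsing it with a global average pool
head = nn.Sequential(
    nn.Flatten(),                                   # 512 x 7 x 7 -> 25088
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),                          # however many classes you have
)

model = nn.Sequential(backbone, head)
x = torch.randn(2, 3, 224, 224)                     # ResNet-34 at 224x224 gives 512x7x7 features
print(model(x).shape)                               # torch.Size([2, 1000])
```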
And interestingly, I found I actually got a slightly better classifier than a normal ResNet, and it was also something with a little bit more useful information in it. It took, I don't know, five or ten percent longer to train, but nothing worth worrying about. I think maybe we could also, in ResNet, replace this very early convolution (as we've talked about briefly before) with something more like an Inception stem, which does a bit more computation. I think there's definitely room for some nice little tweaks to these architectures so that we can build models which are more versatile. At the moment, people tend to build architectures that just do one thing; they don't really think about what they're throwing away in terms of opportunity, because that's how publishing works. You publish "I've got the state of the art in this one thing", rather than "I've created something that's good at lots of things".

So for these reasons, we're going to use VGG today, even though it's ancient and it's missing lots of great stuff. One thing we are going to do, though, is use a slightly more modern version: a version of VGG where batch norm has been added after all the convolutions. In fast.ai, when you ask for a VGG network, you always get the batch norm one, because that's basically always what you want. So this is our VGG with batch norm. There's a 16 and a 19; the 19 is way bigger and heavier and isn't really any better, so no one really uses it.

Okay, so we're going to go from a 72 by 72 low-resolution input (size low res). We're going to initially scale it up by a factor of two, with a batch size of 64, to get 2 times 72, i.e. a 144 by 144 output. So that's going to be our stage one.

We'll create our own dataset for this, and it's very worthwhile looking inside the fastai.dataset module and seeing what's there, because just about anything you'd want, we probably have something that's almost what you want. In this case, I want a dataset where my x's are images and my y's are also images. There's already a files dataset we can inherit from where the x's are images, so I just inherit from that, copy and paste the get x, and turn that into a get y, so it just opens an image. So now I've got something where the x is an image and the y is an image, and in both cases what we're passing in is an array of file names.

I'm going to do some data augmentation. Obviously with all of ImageNet we don't really need it, but this is mainly here for anybody who's using smaller datasets, to make the most of them. Random dihedral refers to every possible 90-degree rotation plus optional left-right flipping, i.e. the dihedral group of eight symmetries. Normally we don't use this transformation for ImageNet pictures, because you don't normally flip dogs upside down, but in this case we're not trying to classify whether it's a dog or a cat, we're just trying to keep the general structure of it, so every possible flip is a reasonably sensible thing to do for this problem.

So we create a validation set in the usual way, and you can see I'm using a few more slightly lower-level functions here. Generally speaking, I just copy and paste them out of the fast.ai source code to find the bits I want. So here's the bit which takes an array of validation set indexes and one or more arrays of variables, and simply splits them up.
So in this case, it splits each of those into a training and a validation set, giving us our x's and our y's. Now, in this case, the x and the y are the same: our input image and our output image are the same file, and we're going to use transformations to make one of them lower resolution. So that's why these are the same thing.

Okay, so the next thing we need to do is create our transformations as per usual, and we're going to use this transform-y parameter like we did for bounding boxes, but rather than transform type dot coordinate, we're going to use transform type dot pixel. That tells our transformations framework that your y values are images with normal pixels in them, and so anything you do to the x you also need to do to the y: do the same thing. You also need to make sure any data augmentation transforms you use have that same parameter. You can see the possible transform types: basically you've got classification (which we'll use for segmentation in the second half of today), coordinates, no transformation at all, or pixel.

All right, so once we've got a dataset class and some x and y training and validation sets, there's a handy little method called get datasets, which basically runs that constructor over all the different things that you have, to return all the datasets you need in exactly the right format to pass to a model data constructor; in this case the image data constructor. So we're going back under the covers of fast.ai a little bit and building it up from scratch. In the next few weeks this will all be wrapped up and refactored into something you can do in a single step in fast.ai, but the point of this class is to learn a bit about going under the covers.

Something we've briefly seen before is that when we take images in, we transform them not just with data augmentation: we also move the channels dimension up to the start, subtract the mean, divide by the standard deviation, and so forth. So if we want to be able to display the pictures that have come out of our datasets or data loaders, we need to denormalize them. The model data object's dataset has a denorm function that knows how to do that, so I'm just going to give that a short name for convenience, and then create a function that can show an image from a dataset; if you pass in something saying this is a normalized image, it'll denorm it.

So we can go ahead and have a look at that. You'll see here we've passed in size low res as our size for the transforms, and size high res as (this is something new) the size-y parameter. So the two bits are going to get different sizes, and here you can see the two different resolutions of our x and our y for a whole bunch of fish. As per usual, plt.subplots creates our plots, and then we can use the different axes that come back to put things next to each other. We can then look at a few different versions of the data transformation, and you can see them being flipped in all different directions.

Okay, so let's create our model. We're going to have a small image coming in, and we want a big image coming out, so we need to do some computation between those two to calculate what the big image would look like. And essentially there are two ways of doing that computation.
We could first do some upsampling and then a few stride-one layers to do lots of computation, or we could first do lots of stride-one layers to do all the computation and then do some upsampling at the end. We're going to pick the second approach, because we want to do lots of computation on something smaller; it's much faster that way, and all of that computation gets leveraged during the upsampling process.

For upsampling, we know a couple of possible ways to do it: we can use transposed (or fractionally strided) convolutions, or we can use nearest-neighbor upsampling followed by a one by one conv. Then, for the "do lots of computation" section, we could just have a whole bunch of three by three convs, but in this particular case it seems likely that ResNet blocks are going to be better, because really the output and the input are very, very similar. So we want a flow-through path that allows as little fussing around as possible, apart from the minimal amount necessary to do our super resolution. If we use ResNet blocks, they have an identity path already. You can imagine the simplest version, something like a bilinear sampling approach: it could basically just go through identity blocks all the way through, then in the upsampling blocks just learn to take averages of the inputs, and get something that's not too terrible.

So that's what we're going to do. We're going to create something with five ResNet blocks, and then for each 2x scale-up we have to do, we'll have one upsampling block. They're all going to consist, as per usual, of convolution layers, possibly with activation functions after many of them. I like to put my standard convolution block into a function so I can refactor it more easily; as per usual, I don't bother passing in padding and just calculate it directly as kernel size integer-divided by two.

One interesting thing about our little conv block here is that there's no batch norm, which is pretty unusual for ResNet-type models. The reason there's no batch norm is that I'm stealing ideas from a fantastic recent paper which actually won a recent super resolution competition. To see how good this paper is: here's a previous state of the art, SRResNet, and what they've done here is zoom way in on an upsampled net or fence (this is the original), and you can see that in the previous best approach there's a whole lot of distortion and blurring going on, whereas in their approach it's nearly perfect. So this paper was a really big step up. They call their model EDSR, Enhanced Deep Residual Networks, and they did two things differently from the previous standard approaches. One was to take the ResNet blocks (this is a regular ResNet block) and throw away the batch norm. Why would they throw away the batch norm? Because batch norm changes stuff, and we want a nice straight-through path that doesn't change stuff. The idea basically is: if you don't want to fiddle with the input any more than you have to, don't force the block to calculate things like batch norm parameters. So throw away the batch norm. The second trick we'll see shortly. All right, so here's a conv with no batch norm.
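It's roughly this; a minimal sketch of that conv block (the notebook's exact argument names may differ slightly):

```python
import torch.nn as nn

def conv(ni, nf, kernel_size=3, actn=False):
    """3x3 conv with 'same' padding and no batch norm; optional ReLU afterwards."""
    layers = [nn.Conv2d(ni, nf, kernel_size, padding=kernel_size // 2)]
    if actn:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```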
And so then we're going to create a residual block containing, as per usual, two convolutions. And as you see in their approach, they don't even have a ReLU after their second conv, which is why I've only got an activation on the first one.

A couple of interesting things here. One is that this idea of having some kind of main ResNet path (conv, ReLU, conv) and then turning that into a res block by adding it back to the identity is something we do so often that I factored it out into a tiny little module called ResSequential, which simply takes a bunch of layers that you want to put into your residual path, turns them into a sequential model, runs it, and adds the result back to the input (there's a sketch of this just below). So with this little module, we can turn anything like conv-activation-conv into a ResNet block just by wrapping it in ResSequential. But that's not quite all I'm doing, because normally a res block just has the identity plus the residual path in its forward; I've also got a multiplier in there. What's res scale? Res scale is the number 0.1. Why is it there?

I'm not sure anybody quite knows, but the short answer is that the guy who invented batch norm also, somewhat more recently, did a paper in which he showed, I think for the first time, the ability to train ImageNet in under an hour. The way he did it was to fire up lots and lots of machines and have them work in parallel to create really large batch sizes. Now, generally when you increase the batch size by order n, you also increase the learning rate by order n to go with it, so very large batch size training generally means very high learning rate training as well. And he found that with these very large batch sizes of 8,000 plus, or even up to 32,000, at the start of training his activations would basically go straight to infinity. A lot of other people have found that too; we found it ourselves when we were competing in DAWNBench, on both the CIFAR and ImageNet competitions, where we really struggled to make the most of even the eight GPUs we were trying to take advantage of, because of these challenges with larger batch sizes.

So something that Christian, this researcher, found was that in the ResNet blocks, if he multiplied them by some number smaller than one, something like 0.1 or 0.2, it really helped stabilize training at the start. That's kind of weird, because mathematically it's identical: whatever I'm multiplying by here, I could just scale the weights by the opposite amount and get the same number. But we're not dealing with abstract math, we're dealing with real optimization problems, different initializations, learning rates, and whatever else, and the problem of weights shooting off to infinity is, I guess, really about the discrete and finite nature of computers in practice. So often these little tricks can make the difference; in this case we're just toning things down based on our initialization. There are probably other ways to do this. For example, one approach from some folks at NVIDIA called LARS (L-A-R-S), which I briefly mentioned last week, uses discriminative learning rates calculated in real time, basically looking at the ratio between the gradients and the activations to scale learning rates by layer.
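Here's the sketch I mentioned: ResSequential plus the residual block, using the conv helper from above. The 0.1 scaling is the EDSR trick just described; treat this as a close approximation of the notebook's code rather than a verbatim copy:

```python
import torch.nn as nn

class ResSequential(nn.Module):
    """Run some layers as a residual path and add the (scaled) result back to the input."""
    def __init__(self, layers, res_scale=1.0):
        super().__init__()
        self.res_scale = res_scale
        self.m = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.m(x) * self.res_scale

def res_block(nf):
    # conv -> ReLU -> conv, wrapped as a residual block and scaled by 0.1 as in EDSR
    return ResSequential([conv(nf, nf, actn=True), conv(nf, nf)], res_scale=0.1)
```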
And with LARS, they found they didn't need this scaling trick to train with much larger batch sizes; maybe a different initialization would be all that's necessary. The reason I mention this is not so much because I think a lot of you are likely to want to train on massive clusters of computers, but rather that I think a lot of you want to train models quickly, and that means using high learning rates and ideally getting super-convergence. And I think these are the kinds of tricks we'll need to get super-convergence working across more different architectures and so forth. Other than Leslie Smith, no one else is really working on super-convergence nowadays, apart from some fast.ai students. So these questions about how to train at very, very high learning rates: we're going to have to be the ones who figure them out, because as far as I can tell, nobody else cares yet. Looking at the literature around training ImageNet in one hour, or more recently training ImageNet in 15 minutes, these papers actually have some of the tricks that allow training at high learning rates. And here's one of them. Interestingly, other than the train-ImageNet-in-one-hour paper, the only other place I've seen this mentioned is this EDSR paper. And it's really cool, because people who win competitions, I find, are very pragmatic and well read; they actually have to get things to work. So this paper describes an approach which works better than anybody else's, and they did pragmatic things like throwing away batch norm and using this little scaling factor which almost nobody else seems to know about. Okay, so that's where the 0.1 comes from.

So our super resolution ResNet basically does a convolution to go from our three channels to 64 channels, just to richen up the space a little bit; then eight (actually eight, not five) of these res blocks. Remember, every one of these res blocks is stride one, so the grid size doesn't change, and the number of filters doesn't change either: 64 all the way through. We do one more convolution, then our upsampling by however much scale we asked for, then something I've added as a little idea, just one batch norm here, because it felt like it might be helpful to scale the last layer, and finally a conv to go back to the three channels we want. So you can see: lots and lots of computation, and then a little bit of upsampling at the end, just like we described. And also, just to mention, as I'm tending to do now, this whole thing is done by creating a list of layers and then turning that into a sequential model at the end, so my forward function is as simple as can be.

So the only other piece here is the upsampling, and the upsampling is a bit interesting because it's not doing either of the two things we mentioned. So let's talk a bit about upsampling. Here's a picture from the paper; not from the competition-winning paper, but from the original paper. They're saying, hey, our approach is so much better, but look at their approach: it's got artifacts in it. They just pop up everywhere, don't they? One of the reasons for this is that they use transposed convolutions, and we all know: don't use transposed convolutions. So here are transposed convolutions.
This is from that fantastic convolutional arithmetic paper that's also shown in the Theano docs. The blue is the original image, so we're going from a three by three image up to a five by five image, or six by six if we added a layer of padding. All a transposed convolution does is use a regular three by three conv, but it sticks zero pixels between every pair of input pixels. That makes the input image bigger, and when we run the convolution over it, it therefore gives us a larger output. But that's obviously stupid, because when we get here, for example, of the nine pixels coming in, eight of them are zero, so we're just wasting a whole lot of computation. And on the other hand, if we're slightly offset over here, then four of our nine are non-zero; but we only have one filter, one kernel, to use, and it can't change depending on how many zeros are coming in. So it has to be suitable for both, and that's just not possible, and we end up with these artifacts.

One approach we've learned to make it a bit better is to not put zeros there, but instead to copy this pixel's value to each of those three locations; that's just nearest-neighbor upsampling. That's certainly a bit better, but it's still pretty crappy, because now when we get to these nine, four of them are exactly the same number, and when we move across one, we've got a different situation entirely. And depending on where we are (in particular, over here there's going to be a lot less repetition), we again have this problem where there's wasted computation and too much structure in the data, and it's going to lead to artifacts again. So nearest-neighbor upsampling is better than transposed convolutions (better to copy the pixels than to replace them with zeros), but it's still not quite good enough.

So instead, we're going to do the pixel shuffle. The pixel shuffle is an operation from this sub-pixel convolutional neural network paper, and it's a little bit mind-bending, but it's kind of fascinating. We start with our input, and we go through some convolutions to create some feature maps, until eventually we get to layer i minus 1, which has n sub i minus 1 feature maps. We're going to do another three by three conv, and our goal is to go from a seven by seven grid, doing a three-times upscaling, up to a 21 by 21 grid. So what's another way we could do that? To make it simpler, let's just pick one face, just one filter: take the topmost filter and do a convolution over that, just to see what happens. And what we're going to do is use a convolution where the number of filters is nine times bigger than we strictly speaking need. So if we needed 64 filters, we're actually going to do 64 times 9 filters. Why is that? Here r is the scale factor, so 3, and r squared is 9, so here are the nine filters to cover one of these input layers, one of these input slices. We started with seven by seven and we turned it into seven by seven by nine, and the output that we want is (seven times three) by (seven times three). In other words, there are exactly as many activations here as there are activations there, so we can literally reshuffle these seven by seven by nine activations to create this (seven times three) by (seven times three) map.
And so what we're going to do is take one little tube here, the top-left cell across each of the nine channels, and put the purple one up in the top left, then the blue one one to the right, then the light blue one one to the right of that, then the slightly darker blue one in the middle of the far left, the green one in the middle, and so forth. So each of these nine cells in the top left ends up in this little three by three section of our grid. Then we take cell (2, 1), take all of its nine values, and move them to the next three by three part of the grid, and so on. And so we end up having every one of these seven by seven by nine activations inside this (seven times three) by (seven times three) image.

So the first thing to realize is: yes, of course this works, under some definition of works, because we have a learnable convolution here, and it's going to get gradients which will do the best job they can of filling in the correct activations such that this output is the thing we want. So the first step is to realize there's nothing particularly magical here. We can create any architecture we like, we can move things around however we want to, and our weights in the convolution will do their best to do what we ask. The real question is: is it a good idea? Is this an easier and more flexible thing for it to do than the transposed convolution, or the upsampling followed by a one by one conv? And the short answer is: yes, it is. The reason it's better, in short, is that the convolution here is happening in the low-resolution seven by seven space, which is quite efficient, whereas if we first upsampled and then did our conv, the conv would be happening in the 21 by 21 space, which is a lot more computation. And furthermore, as we discussed, there's a lot of replication and redundancy in the nearest-neighbor upsampled version. They actually show in this paper (in fact, I think they have a follow-up technical note where they provide some more mathematical detail about exactly what work is being done) that the work really is more efficient this way.

So that's what we're going to do. For our upsampling, we'll have two steps. The first is a three by three conv with r squared times more channels than we originally wanted, and then a pixel shuffle operation which moves everything in each grid cell into the little r by r grids located throughout here. And here it is; it's one line of code. Here's the conv from the number of input filters to the number of output filters times four, because we're doing a scale-two upsample, and two squared is four. That's our convolution, and then here is our pixel shuffle; it's built into PyTorch. Pixel shuffle is the thing that moves each thing into its right spot, and it will upsample by a scale factor of two. So we need to do that log base two of scale times: if scale is four, then we have to do it twice, to go two times two bigger. So that's what this upsample here does.

Great. Guess what? That does not get rid of the checkerboard patterns; we still have checkerboard patterns. So I'm sure in great fury and frustration, this same team from Twitter (I think this was back when they used to be a startup called Magic Pony that Twitter bought) came back again with another paper saying: okay, this time we've got rid of the checkerboard.
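Before we get to their fix, here's roughly what that upsample block looks like, using the conv helper from earlier; nn.PixelShuffle is the built-in PyTorch op, and the function name here is just illustrative:

```python
import math
import torch.nn as nn

def upsample(ni, nf, scale):
    """Upsample by `scale` (a power of two), one 2x step at a time: conv to 4x the channels,
    then PixelShuffle rearranges those channels into a 2x larger grid."""
    layers = []
    for _ in range(int(math.log2(scale))):
        layers += [conv(ni, nf * 4), nn.PixelShuffle(2)]
    return nn.Sequential(*layers)
```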
Okay, so why did we still have a checkerboard, as you can see here? The reason we still have a checkerboard, even after doing this, is that when we randomly initialize this convolutional kernel at the start, each of these nine pixels in this little three by three grid over here is going to be totally randomly different. But then the next set of pixels will be randomly different from each other, yet very similar to their corresponding pixel in the previous three by three section. So we're going to have repeating three by three patterns all the way across, and as we try to learn something better, it's starting from this repeating three by three starting point, which is not what we want. What we actually want is for these three by three pixels to be the same to start with, and to make them the same, we would need to make these nine channels the same for each filter. So the solution in this paper is very simple: when we initialize this convolution at the start, we don't initialize it totally randomly. We randomly initialize one of the r squared sets of channels, and then copy it to the other r squared sets, so they're all the same, and that way each of these three by threes is initially identical. That is called ICNR, and that's what we're going to use in a moment.

Before we do, let's take a quick look. We've got this super resolution ResNet, which just does lots of computation with lots of ResNet blocks, then does some upsampling and gets our final three channels out. Then, to make life faster, we're going to run this in parallel. One reason we want to run it in parallel is that Dorado told us that he has six GPUs, and this is what his computer looks like right now; I'm sure anybody who has more than one GPU has had this experience before. So how do we get these GPUs working together? All you need to do is take your PyTorch module and wrap it with nn.DataParallel. Once you've done that, it copies it to each of your GPUs and will automatically run it in parallel. It scales pretty well to two GPUs, okay to three GPUs, better than nothing to four GPUs, and beyond that performance starts to go backwards. By default, it'll copy it to all of your GPUs; you can pass in an array of GPU IDs instead, which is how you avoid getting in trouble. For example, I have to share our box with Yannet, and if I didn't put this here, she would be yelling at me right now, or maybe boycotting my class.

One thing to be aware of is that once you do this, it actually modifies your module. So if you now print out your module, where previously it was just an nn.Sequential, you'll find it's an nn.Sequential embedded inside an attribute called module. In other words, if you save something you had wrapped in nn.DataParallel and then try to load it back into something you hadn't wrapped, it'll say the keys don't match up, because one of them is embedded inside this module attribute and the other one isn't. It may also depend on which GPU IDs you had it copied to. So, two possible solutions. One is: don't save the module m, but instead save the module attribute, m.module, because that's the bit inside the nn.DataParallel wrapper. Or: always put it on the same GPU IDs, use DataParallel, and load and save it that way every time.
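Here's a minimal sketch of that pattern: wrapping with nn.DataParallel, restricting to particular GPU IDs, and the two saving options. The model, device IDs, and file names here are just examples:

```python
import torch
import torch.nn as nn

# stand-in for the super resolution model built above (any nn.Module works the same way)
sr_resnet = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True))

model = nn.DataParallel(sr_resnet, device_ids=[0, 1]).cuda()   # only use GPUs 0 and 1

# ... training happens here ...

# Option 1: save the unwrapped weights so they load into a plain, non-DataParallel model
torch.save(model.module.state_dict(), 'sr.pt')

# Option 2: save the wrapped model, and always wrap with the same device_ids before loading
torch.save(model.state_dict(), 'sr_dp.pt')
```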
That second approach is what I was using. This will be an easy thing for me to fix automatically in fast.ai, and I'll do it pretty soon, so it'll look for that module attribute and deal with it automatically; but for now, we have to do it manually. It's probably useful to know what's going on behind the scenes anyway. All right, so we've got our module. I find it runs 50 or 60% faster on a 1080 Ti; if you're running on Volta, it actually parallelizes a bit better. There are much faster ways to parallelize, but this is a super easy way.

So we create our learner in the usual way. We could use MSE loss here, which would just compare the pixels of the output to the pixels we expected, and we can run our learning rate finder and train it for a while. And here's our input, and here's our output, and you can see that what we've managed to do is train a very advanced residual convolutional network that has learned to blur things. Why is that? Well, because it's what we asked for. We said to minimize MSE loss between pixels, and really the best way to do that is just to average the pixels, i.e. to blur it. So that's why pixel loss is no good, and we want to use a perceptual loss instead.

So let's try perceptual loss. With perceptual loss, we're basically going to take our VGG network, and just like we did last week, find the block index just before each max pool. So here are the ends of each block of the same grid size, and if we print them out, as we'd expect, every one of those is a ReLU module. In this case, the last two blocks are less interesting to us: the grid size there is small enough, coarse enough, that it's not as useful for super resolution. So we're just going to use the first three. And just to save unnecessary computation, we're only going to keep those first 23 layers of VGG and throw away the rest. We'll stick it on the GPU. We're not going to be training this VGG model at all, we're just using it to compare activations, so we'll stick it in eval mode and set it to not trainable.

Just like last week, we'll use a SaveFeatures class to do a forward hook, which saves the output activations at each of those layers. And so now we've got everything we need to create our perceptual loss, or as I call it here, the FeatureLoss class. We're going to pass in a list of layer IDs, the layers where we want the content loss to be calculated, and a list of weights, one for each of those layers. We go through each of those layer IDs and create an object which has the forward hook function to store the activations. In our forward, we first call the forward pass of our model on the target, which is the high-res image we're trying to create; the reason we do that is that it calls the hook functions and stores the activations we want in our saved features. Now we're going to need to do the same for our convnet's output as well, so we need to clone those target activations, because otherwise the convnet output is going to go ahead and clobber what we already had. So then we do the same thing for the convnet output, which is the input to the loss function, and now we've got both sets of activations, and we can zip them all together along with the weights.
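Putting those pieces together, here's roughly what that SaveFeatures hook and FeatureLoss class look like; a sketch in current PyTorch rather than the notebook's exact code. It assumes vgg_model is the truncated nn.Sequential slice of VGG we kept, and the divide-by-100 pixel-loss weighting is just the kind of down-weighting described below:

```python
import torch.nn as nn
import torch.nn.functional as F

class SaveFeatures():
    """Forward hook that stashes a module's output activations."""
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output):
        self.features = output
    def remove(self):
        self.hook.remove()

class FeatureLoss(nn.Module):
    """Perceptual loss: L1 on VGG activations at several layers, plus a small pixel loss."""
    def __init__(self, vgg_model, layer_ids, layer_wgts):
        super().__init__()
        self.m = vgg_model                                   # frozen, eval-mode VGG slice
        self.sfs = [SaveFeatures(vgg_model[i]) for i in layer_ids]
        self.wgts = layer_wgts

    def forward(self, input, target):
        self.m(target)                                       # hooks now hold target activations
        targ_feat = [o.features.clone() for o in self.sfs]   # clone before they get clobbered
        self.m(input)                                        # hooks now hold the model-output activations
        loss = F.l1_loss(input, target) / 100                # small pixel loss, weighted way down
        for hook, targ, w in zip(self.sfs, targ_feat, self.wgts):
            loss += F.l1_loss(hook.features, targ) * w       # content loss at each chosen layer
        return loss
```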
So we've got inputs, targets, and weights, and then we do the L1 loss between the inputs and the targets and multiply by the layer weights. The only other thing I do is also grab the pixel loss, but weighted down quite a bit. Most people don't do this, and I haven't seen papers that do, but in my opinion it's maybe a little bit better, because you've got the perceptual content loss activation stuff, but at the really finest level it also cares about the individual pixels. Okay, so that's our loss function.

We create our super resolution ResNet, telling it how much to scale up by, and then we do our ICNR initialization of that pixel shuffle convolution. This is really very boring code; I actually stole it from somebody else. Literally all it does is say: okay, you've got some weight tensor x that you want to initialize; we're going to treat it as if it had the number of features divided by scale squared features. So this might be divided by two squared, i.e. four, because we actually just want to keep one set of them and then copy it four times. So we divide by four, create something of that size, initialize it with a standard Kaiming normal initialization, and then make scale squared copies of it. The rest is just moving axes around a little bit. So it returns a new weight matrix where each initialized sub-kernel is repeated scale squared times (there's a simplified sketch of this below). The details don't matter very much. All that matters is that I looked through the model to find the conv layer just before the pixel shuffle and stored it away, called ICNR on its weight matrix to get a new weight matrix, and copied that new weight matrix back into that layer.

So, as you can see, I went to quite a lot of trouble in this exercise to really try to implement all the best practices. I tend to do things at one extreme or the other: either I show you a really hacky version that only slightly works, or I go to the nth degree to make it work really well. And this is a version where I'm claiming that it's pretty much a state-of-the-art implementation: a competition-winning approach, or at least my re-implementation of one. The reason I'm doing that is that I think this is one of those rare papers where they actually get a lot of the details right, and I want you to get a feel for what it's like to get all the details right. And remember, getting the details right is the difference between that hideous blurry mess and this really quite exquisite result.

Okay. So we're going to have to do data parallel on that again. We set our criterion to be FeatureLoss using our VGG model, grab the first few blocks, and these are the sets of layer weights that I found worked pretty well. Do a learning rate finder, fit it for a while. I fiddled around for a while trying to get some of these details right, but here's my favorite part of the paper: what happens next. Now that we've done it for scale equals two, progressive resizing. Progressive resizing is the trick that let us get the best single-computer result for ImageNet training on DAWNBench: the idea of starting small and gradually making things bigger. I only know of two papers that have used this idea.
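Here's that simplified ICNR sketch; my own version of the idea rather than the notebook's helper (which shuffles axes around a bit differently), so treat the details as illustrative. conv_shuffle stands for the conv layer just before the pixel shuffle:

```python
import torch
import torch.nn as nn

def icnr(weight, scale=2, init=nn.init.kaiming_normal_):
    """ICNR: initialize one sub-kernel, then repeat it scale^2 times along the output-channel
    axis so each r x r block produced by the pixel shuffle starts out identical."""
    nf, ni, h, w = weight.shape                     # conv weight: (out_ch, in_ch, kh, kw)
    sub = torch.zeros(nf // (scale ** 2), ni, h, w)
    init(sub)                                       # e.g. Kaiming normal on the sub-kernel
    # each group of scale^2 consecutive output channels gets identical weights,
    # matching PixelShuffle's channel-to-pixel layout
    return sub.repeat_interleave(scale ** 2, dim=0)

# usage (conv_shuffle is hypothetical: whatever conv feeds the PixelShuffle in your model)
# conv_shuffle.weight.data.copy_(icnr(conv_shuffle.weight.data, scale=2))
```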
Those two papers are the progressive growing of GANs paper, which allows training of very high resolution GANs, and this EDSR paper. The cool thing about progressive resizing is not only that your earlier epochs, with images half the size on each side, are four times faster, and not only that you can make the batch size maybe three or four times bigger, but more importantly that the model is going to generalize better, because you're feeding it different sized images during training. So for ImageNet we were able to train with half as many epochs as most people: our epochs were faster, and there were fewer of them. So progressive resizing is something that, particularly if you're training from scratch (I'm not so sure it's as useful for fine-tuning transfer learning), you probably want to do nearly all the time.

So the next step is to go all the way back to the top and change to scale equals four, batch size 32, and restart. I saved the model before doing that, and that's why there's a little bit of fussing around in here with reloading, because what I need to do now is load my saved model back in, but there's a slight issue: I now have one more upsampling layer than I used to have. To go from a 2x to a 4x scale-up, my little loop is now looping through twice, not once, and therefore it has added an extra conv and an extra pixel shuffle. So how am I going to load in weights for a different network? The answer is that I use a very handy thing in PyTorch, which is basically what learn.load calls behind the scenes: load state dict. If I pass in the parameter strict equals false, it says: okay, if you can't fill in all of the layers, just fill in the layers you can. So after loading the model back in this way, we end up with something where all the layers that could be loaded have been, and the one conv layer that's new is randomly initialized. Then I freeze all my layers, unfreeze that upsampling part, use ICNR on my newly added layer, and then I can go ahead and train again. The rest is the same. So if you're trying to replicate this, don't just run the notebook top to bottom; realize it involves a bit of jumping around.

Okay, the longer you train, the better it gets. I ended up training it for about ten hours, but you'll still get very good results much more quickly if you're less patient. So we can try it out, and here is the result. Here is my pixelated bird, and look here, it's totally random pixels; and here's the upsampled version. It has literally invented coloration, but it figured out what kind of bird it is, and it knows what these feathers are meant to look like, so it has imagined a set of feathers which are compatible with these exact pixels. Which is genius. Same here: there's no way you can tell what these blue dots are meant to represent, but if you know that this kind of bird has an array of feathers here, you know that's what they must be, and then you can figure out where the feathers would have to be such that, when they were pixelated, they would end up in these spots. So it has literally reverse engineered, given its knowledge of this exact species of bird, how it would have to have looked to create this output. That is so amazing.
It also knows, from all the signs around it, that this area here was almost certainly blurred out, so it has actually reconstructed blurred vegetation. And if it hadn't done all of those things, it wouldn't have got such a good loss, because in the end it had to match the activations saying, oh, there's a feather over here and it's kind of fluffy looking and it's in this direction, and all that.

All right, that brings us to the end of super resolution. Don't forget to check out the Ask Jeremy Anything thread; we will do some Ask Jeremy Anything after the break. See you back here at quarter to eight.

Okay, so we are going to do Ask Jeremy Anything. Rachel will tell me the most voted up of your questions. Yes, Rachel. What are the future plans for fast.ai and this course? Will there be a part three? If there is a part three, I would really love to take it.

That's cool. I'm not quite sure; it's always hard to guess. I hope there'll be some kind of follow-up. Last year after part two, one of the students started up a weekly book club going through the Ian Goodfellow Deep Learning book, and Ian actually came in and presented quite a few of the chapters, and there was an expert who presented every chapter. That was a really cool part three. To a large extent, it'll depend on you, the community, to come up with ideas and help make them happen, and I'm definitely keen to help. I've got a bunch of ideas, but I'm nervous about saying them because I'm not sure which ones will happen and which ones won't; but the more support I have from you in making the things happen that you want to happen, the more likely they are to happen.

What was your experience like starting down the path of entrepreneurship? Have you always been an entrepreneur, or did you start out at a big company and transition to a startup? Did you go from academia to startups, or startups to academia?

No, I was definitely not in academia; I'm totally a fake academic. I started at McKinsey and Company, which is a strategy firm, when I was 18, which meant I couldn't really go to university, so I didn't really turn up. I spent eight years in business helping really big companies with strategic questions. I always wanted to be an entrepreneur and planned to only spend two years at McKinsey; the only thing I really regret in my life was not sticking to that plan and spending eight years instead. Two years would have been perfect. But then I went into entrepreneurship and started two companies in Australia, and the best part about that was that I didn't get any funding, so all the money I made was mine, and the decisions were mine and my partners'. I focused entirely on profit, product, customers, and service. Whereas here in San Francisco (and I'm glad I came; the two of us, Anthony and I, came here for Kaggle), we raised a ridiculous amount of money, $11 million, for a really new company. That was really interesting, but it's also really distracting: trying to worry about scaling, VCs wanting to see what your business development plans are, and not having any real need to actually make a profit. And I had a bit of the same problem at Enlitic, where I again raised a lot of money, $15 million, pretty quickly, and again, a lot of distractions.
So I think trying to bootstrap your own company, focusing on making money by selling something at a profit and then plowing that back into the company, worked really well. We were making a profit from three months in, and within five years we were making enough of a profit not just to pay all of us our wages, but also to see my bank account growing, and after ten years I sold it for a big chunk of money; not enough that a VC would be excited, but enough that I didn't have to worry about money again. So I think bootstrapping a company is something that people in the Bay Area, at least, don't seem to appreciate how good an idea it is.

If you were 25 years old today and still knew what you know, where would you be looking to use AI? What are you working on right now or looking to work on in the next two years?

You should ignore the last part of that, and I won't even answer it; it doesn't matter where I'm looking. What you should do is leverage your knowledge about your domain. One of the main reasons we run this course is to get people who have backgrounds in whatever (recruiting, oil field surveys, journalism, activism) to solve their own problems. It'll be really obvious to you what your problems are, and it'll be really obvious to you what data you have and where to find it, and those are all the bits that are really hard for everybody else. People who start out with "oh, I know deep learning, now I'll go and find something to apply it to" basically never succeed, whereas people who say "I've spent 25 years doing specialized recruiting for legal firms, and I know the key issue is this thing, and I know this piece of data totally solves it, so I'm just going to do that now, and I already know who to call to start selling it to" are the ones who tend to win. And if you've done nothing but academic stuff, then it's maybe more about your hobbies and interests; everybody has hobbies. The main thing I would say is: please don't focus on building tools for data scientists or software engineers to use, because every data scientist knows about the market of data scientists, whereas only you know about the market for analyzing oil survey well logs, or understanding audiology studies, or whatever it is that you do.

Given what you've shown us about applying transfer learning from image recognition to NLP, there looks to be a lot of value in paying attention to all of the developments that happen across the whole machine learning field, and if you were to focus on one area, you might miss out on great advances in other concentrations. How do you stay aware of all the advancements across the field while still having time to dig deep into your specific domains?

Yeah, that's awesome; that's kind of the message of this course. One of the key messages is that lots of good work is being done in different places, and people are so specialized that most people don't know about it. If I can get state-of-the-art results in NLP within six months of starting to look at NLP, then I think that says more about NLP than it does about me, frankly. So, yeah, it's kind of like the entrepreneurship thing.
It's like: you pick the areas that you know about and transfer the ideas across, as in "we could use deep learning to solve this problem", or in this case, "we could use this idea from computer vision to solve that problem". So for things like transfer learning, I'm sure there are a thousand opportunities for you to do in other fields what Sebastian and I did with NLP classification. The short answer to your question is that the way to stay on top of what's going on is to follow my feed of Twitter favorites. My approach is to follow lots and lots of people on Twitter, and literally every time I come across something interesting, I click favorite. There are two reasons I do it: the first is that when the next course comes along, I go through my favorites to find which things I want to study; the second is so that you can do the same thing. And then, which ones do you go deep into? It almost doesn't matter; I find every time I look at something, it turns out to be super interesting and important. So pick something where you feel solving that problem would actually be useful for some reason, and which doesn't seem to be very popular, which is kind of the opposite of what everybody else does. Everybody else works on the problems which everybody else is already working on, because they're the ones that seem popular. I can't quite understand that chain of thinking, but it seems to be very common.

Is deep learning overkill to use on tabular data? When is it better to use deep learning instead of machine learning on tabular data?

Is that a real question, or did you just put that there so that I would point out that Rachel Thomas just wrote an article about it? So yes, Rachel has just written about this, and Rachel and I spent a long time talking about it, and the short answer is we think it's great to use deep learning on tabular data. Actually, of all the rich, complex, important and interesting things that appear in Rachel's Twitter stream, covering everything from the genocide of the Rohingya through to the latest ethics violations in AI companies, the one that by far got the most attention and engagement from the community was her question about whether it's called tabular data or structured data. So yes, ask computer people how to name things and you'll get plenty of interest. There are some really good links in the post to work from Instacart and Pinterest and other folks who have done good work in this area. Any of you that went to the Data Institute conference will have seen Jeremy Stanley's presentation about the really cool work they did at Instacart.

Yes, Rachel. I relied heavily on lessons three and four from part one in writing this post, so much of it may be familiar to you. Yeah, and Rachel asked me while writing the post how to tell whether you should use a decision tree ensemble like GBM or random forests, or a neural net, and my answer is: I still don't know. Nobody I'm aware of has done that research in any particularly meaningful way, so there's a question to be answered there. I guess my approach has been to try to make both of those things as accessible as possible through the fast.ai library, so you can try them both and see what works. That's what I do.

Oh, and that was it for the top voted questions. Thank you. Okay, so just quickly, to go from super resolution to style transfer... oh.
Sorry, I think I missed the one on reinforcement learning. Reinforcement learning's popularity has been on a gradual rise in the recent past. What's your take on reinforcement learning? Would fast.ai consider covering some ground on popular RL techniques in the future? I'm still not a believer in reinforcement learning. I think it's an interesting problem to solve, but it's not at all clear that we have a good way of solving this problem. The problem really is the delayed credit problem. I wanna learn to play Pong, I move up or down, and three minutes later I find out whether I won the game of Pong. Which of the actions I took were actually useful? And so to me, the idea of calculating the gradients of the output with respect to those inputs, when the credit is so delayed, means those derivatives don't seem very interesting. I get this question quite regularly, and in every one of these four courses so far I've said the same thing. I'm rather pleased that recently there have finally been some results showing that basically random search often does better than reinforcement learning. What's basically happened is that very well funded companies with vast amounts of computational power throw all of it at reinforcement learning problems and get good results, and people then say, oh, it's because of the reinforcement learning, rather than the vast amounts of compute power. Or they use extremely thoughtful and clever algorithms, like a combination of convolutional neural nets and Monte Carlo tree search, like they did with the AlphaGo stuff, to get great results, and people incorrectly say, oh, that's because of reinforcement learning, when it wasn't really reinforcement learning at all. So I'm very interested in solving these kinds of more generic optimization type problems, rather than just prediction problems, and that's what these delayed credit problems really look like. But I don't think we've yet got good enough best practices that I have anything I'm ready to teach and say, I'm gonna teach you this thing because I think it's still gonna be useful next year. So we'll keep watching and see what happens. Okay, so we're gonna now turn the super resolution network into a style transfer network, and we'll do this pretty quickly. We basically already have something. Here's my input image, and I've got some loss function, and I've got some neural net again. But instead of a neural net that does a whole lot of compute and then does upsampling at the end, our input this time is just as big as our output. So we're gonna do some downsampling first, then our compute, and then our upsampling. So that's the first change we're gonna make: we're gonna add some downsampling, some stride two convolution layers, to the front of our network. The second change is in the loss. Rather than comparing the output to the input directly, we're gonna basically say our output image should still look like the input image by the end, and specifically we're gonna compare them by chucking them through VGG and comparing them at one of the activation layers. And then its style should look like some painting, which we'll do just like we did with the Gatys approach, by looking at the Gram matrix correspondence at a number of layers. So that's basically it, and it ought to be super straightforward. It's really just combining two things we've already done.
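To make that shape concrete, here is a minimal sketch, not the lesson's actual code, of the architecture just described: stride two convolutions to downsample, some residual blocks to do the compute, then upsampling back to the input size. The channel counts and number of res blocks are purely illustrative assumptions.

```python
import torch.nn as nn

def conv_block(ni, nf, stride=1, ks=3):
    # conv -> batch norm -> relu, keeping spatial size (unless stride=2)
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_block(nf, nf)
        self.conv2 = nn.Sequential(nn.Conv2d(nf, nf, 3, padding=1), nn.BatchNorm2d(nf))
    def forward(self, x):
        return x + self.conv2(self.conv1(x))       # identity plus the conv path

class StyleTransferNet(nn.Module):
    def __init__(self):
        super().__init__()
        # downsample: stride-2 convs at the front
        self.down = nn.Sequential(conv_block(3, 32),
                                  conv_block(32, 64, stride=2),
                                  conv_block(64, 128, stride=2))
        # the "compute": a stack of residual blocks
        self.res = nn.Sequential(*[ResBlock(128) for _ in range(5)])
        # upsample back to the input size, ending with 3 output channels
        self.up = nn.Sequential(nn.Upsample(scale_factor=2), conv_block(128, 64),
                                nn.Upsample(scale_factor=2), conv_block(64, 32),
                                nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, x):
        return self.up(self.res(self.down(x)))
```

The loss that drives this network is the content-plus-style loss described above, which we'll look at next.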
And so all this code at the start is identical, except we don't have high res and low res; we just have one size, 256. All this is the same, and my model's the same. One thing I did here is I did not use any kind of fancy best practices for this one at all, partly because there don't seem to be any; there's been very little follow-up on this approach compared to the super resolution stuff, and we'll talk about why in a moment. So you'll see this is much more normal looking. I've got batch norm layers, I don't have the scaling factor here, I don't have a pixel shuffle; it's just using a normal upsampling followed by a one by one conv, blah, blah, blah. So it's just more normal. One thing they mentioned in the paper is they had a lot of problems with zero padding creating artifacts, and the way they solved that was by adding 40 pixels of reflection padding at the start, so I did the same thing. And then they used zero padding in the convolutions in their res blocks. Now, if you've got zero padding in the convolutions in your res blocks, then the two parts of your res block won't add up anymore, because you've lost a pixel from each side on each of your two convolutions. So my ResSequential has become ResSequentialCenter, and I've removed the last two pixels on each side of those cells. Other than that, this is basically the same as what we had before. So then we can bring in our Starry Night picture, we can resize it, and we can throw it through our transformations. Just to make the method a little bit easier for my brain to handle, I took my style image, which after transformations is three by 256 by 256, and I made a mini batch of it. My batch size is 24, so 24 copies of it. That just makes it a little bit easier to do the batch arithmetic without worrying about some of the broadcasting. They're not really 24 copies; I used np.broadcast to basically fake 24. Okay, so just like before, we create our VGG and grab the last block. This time we're gonna use all of these layers, so we keep everything up to the 43rd layer. And so now our combined loss is going to add together a content loss for the third block, plus the Gram loss for all of our blocks, with different weights. And for the Gram loss, again going back to everything being as normal as possible, I've gone back to using MSE here. Basically what happened is I had a lot of trouble getting this to train properly, so I gradually removed trick after trick and eventually just went, okay, I'm just gonna make it as bland as possible. Last week's Gram matrix was wrong, by the way. It only worked for a batch size of one, and we only had a batch size of one, so that was fine. I was using matrix multiply, which meant that every item in the batch was being compared to every other item in the batch. You actually need to use batch matrix multiply, which does a matrix multiply per batch item. So that's something to be aware of there. Okay, so I've got my Gram matrices, I do my MSE loss between the Gram matrices, and I weight them by style weights. So I create my combined loss, passing in the VGG network, passing in the block IDs, passing in the transformed Starry Night image. And you'll see at the very start here, I do a forward pass through my VGG model with that Starry Night image so that I can save the features for it.
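Here is a rough sketch of the batched Gram matrix idea just mentioned, again illustrative rather than the notebook's exact code: torch.bmm does one matrix multiply per item in the batch, which is the fix for the problem described above, and the style loss is then just MSE between the Gram matrices.

```python
import torch
import torch.nn.functional as F

def gram(feats):
    # feats: VGG activations of shape (bs, ch, h, w)
    bs, ch, h, w = feats.shape
    x = feats.view(bs, ch, h * w)
    # bmm multiplies each (ch, h*w) matrix by its own transpose: one Gram matrix per item
    return torch.bmm(x, x.transpose(1, 2)) / (ch * h * w)

def gram_mse_loss(inp_feats, targ_feats):
    return F.mse_loss(gram(inp_feats), gram(targ_feats))
```

A plain matrix multiply on the flattened batch would mix activations from different images together, which is why the per-item bmm matters as soon as the batch size is bigger than one.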
Now notice, it's really important that I don't do any data augmentation here, because I've saved the style features for a particular non-augmented version, and if I augmented it, it might cause some minor problems. But that's fine, because I've got all of ImageNet to work with, so I don't really need to do data augmentation anyway. Okay, so I've got my loss function and I can go ahead and fit, and there's really nothing clever here at all. At the end, I set sum layers to false so I can see what each part looks like and check that they're reasonably balanced. And I can finally pop it out. So I mentioned that should be pretty easy, and yet it took me about four days, because I just found this incredibly fiddly to actually get to work. When I finally got up in the morning and said to Rachel, guess what, it trained correctly, Rachel was like, I never thought that was gonna happen. It just looked awful all the time, and it was really about getting exactly the right mix of content loss versus style loss, and the mix of the layers in the style loss. And the worst part was that it takes a really long time to train the damn CNN, and I didn't really know how long to train it before deciding it wasn't doing well. Should I just train it for longer, or what? And all the little details didn't just slightly change it; it would totally fall apart all the time. So I mention this partly to say, just remember that the final answer you see here came after me driving myself crazy all week, with it nearly always not working, until finally at the last minute it did. Even for things which seem like they couldn't possibly be difficult, because they're just combining two things we already have working. The other point is to be careful about how we interpret what authors claim. It was so fiddly getting this style transfer to work that after doing it, it left me thinking, why did I bother? Because now I've got something that takes hours to create a network that can turn any kind of photo into one specific style. It just seems very unlikely I would want that for anything. About the only use I could think of would be to do some arty stuff on a video, where I wanted to turn every frame into some style. It's an incredibly niche thing to want to do. But when I looked at the paper, there are tables saying, oh, we're a thousand times faster than the Gatys approach, which is just such an obviously meaningless thing to say, and such an incredibly misleading thing to say, because it ignores all the hours of training for each individual style. And I find this frustrating, because groups like this Stanford group clearly know better, or ought to know better, but I guess the academic community kind of encourages people to make these ridiculously grand claims, and it also completely ignores this incredibly sensitive, fiddly training process. This paper was so well accepted when it came out. I remember everybody getting on Twitter and being like, wow, these Stanford people have found this way of doing style transfer a thousand times faster. And clearly the people saying this were all top researchers in the field, but clearly none of them actually understood it, because nobody said, I don't see why this is remotely useful, and also I tried it and it was incredibly fiddly to get it all to work.
And so it's not until now, what is this, like 18 months later or something, that I'm finally coming back to it and thinking, wait a minute, this is kind of stupid. And I think that's the answer to the question of why people haven't done follow-ups on this to create really amazing best practices like they did with the super resolution part of the paper: because it's dumb. The super resolution part of the paper, on the other hand, is clearly not dumb, and it's been improved and improved and improved, and now we have great super resolution, and I think we can derive from that great noise reduction, great colorization, great slant removal, great interactive artifact removal, whatever else. So I think there are a lot of really cool techniques here, and it's also leveraging a lot of stuff that we've been learning and getting better and better at. Okay, so then finally, let's talk about segmentation. This is from the famous CamVid dataset, which is the classic example of an academic segmentation dataset. Basically, we start with a picture (they're actually video frames in this dataset), like here, and we have some labels, where they're not actually colors; each one has an ID, and the IDs are mapped to colors. So red might be one, purple might be two, light pink might be three. And all the buildings are one class, all the cars are another class, all the people are another class, and the road is another class. So what we're actually doing here is multi-class classification for every pixel. And you can see that sometimes that multi-class classification really is quite tricky, like these branches. Although sometimes the labels are really not that great; this is very coarse, as you can see. So here are traffic lights, and so forth. So that's what we're gonna do: segmentation. And it's a lot like bounding boxes, but rather than just finding a box around each thing, we're actually going to label every single pixel with its class. And really, that's actually a lot easier, because it fits our CNN style so nicely. We can basically create any CNN where the output is an N by M grid containing the integers from zero to C, where there are C categories, and then we can use cross entropy loss with a softmax activation, and we're done. So I could actually stop the class there, and you could go and use exactly the approaches you learned in lessons one and two, and you'd get a perfectly okay result. So the first thing to say is that this is not actually a terribly hard thing to do, but we're gonna try and do it really well. And let's start by doing it the really simple way. We're going to use the Kaggle Carvana competition; you can Google "Kaggle Carvana" to find it, and you can download it with the Kaggle API as per usual. Basically, there's a train folder containing a bunch of images, which is the independent variable, and a train masks folder that contains the dependent variable, and they look like this. Here's one of the independent variables, and here's one of the dependent variables. So in this case, just like cats and dogs, we're going simple; rather than doing multi-class classification, we're gonna do binary classification, but of course multi-class is just the more general version. You know, categorical cross entropy or binary cross entropy, there's no difference conceptually. So this is just zeros and ones, whereas this is a regular image.
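Just to make that "grid of integers plus cross entropy" point concrete, here is a tiny illustrative sketch (the shapes and class count are made up): the model outputs a (batch, classes, height, width) grid of logits, the target is a (batch, height, width) grid of integer class IDs, and PyTorch's cross entropy handles the per-pixel softmax and loss directly.

```python
import torch
import torch.nn.functional as F

n_classes = 32                                        # e.g. a CamVid-style label set
logits = torch.randn(4, n_classes, 128, 128)          # what a segmentation CNN would output
target = torch.randint(0, n_classes, (4, 128, 128))   # per-pixel integer class IDs
loss = F.cross_entropy(logits, target)                # per-pixel softmax + cross entropy
```

For the binary Carvana masks, the same idea applies with a single output channel per pixel and binary cross entropy instead.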
So in order to do this well, it would really help to know what cars look like, because really what we want to do is figure out that this is a car and that this is its orientation, and then put white pixels where we expect the car to be, based on the picture and our understanding of what cars look like. The original dataset came with these CSV files as well. I don't really use them for much other than getting a list of images from them. Each image, after the car ID, has a 01, 02, et cetera, of which I've printed out all 16 for one car, and as you can see, those numbers are basically the 16 orientations of one car. So there that is. I don't think anybody in this competition actually used this orientation information; I believe they all just treated the car images separately. These images are pretty big, like over a thousand by a thousand in size, and just opening the JPEGs and resizing them is slow, so I processed them all. Also, OpenCV can't handle GIF files, so I converted them. Yes, Rachel. The question: how would somebody get these masks for training initially, Mechanical Turk or something? Yeah, just a lot of boring work. Probably some tools that help you with a bit of edge snapping and stuff, so that the human can do it roughly and then just fine tune the bits it gets wrong. These kinds of labels are expensive. And so one of the things I really wanna work on is deep learning enhanced interactive labeling tools, because that's clearly something that would help a lot of people. So I've got a little section here that you can run if you want to (you probably want to), which converts the GIFs into PNGs; just open each one up with PIL and then save it as a PNG, because OpenCV doesn't have GIF support. And as per usual for this kind of stuff, I do it with a thread pool so I can take advantage of parallel processing, and then I also create separate directories, train-128 and train_masks-128, which contain the 128 by 128 resized versions. This is the kind of stuff that keeps you sane if you do it early in the process. So anytime you get a new dataset, seriously think about creating a smaller version to make life fast. Anytime you find yourself waiting on your computer, try and think of a way to create a smaller version. So after you grab it from Kaggle, you probably want to run this stuff, go away, have lunch, and come back when it's done, and you'll have these smaller directories. We're going to use the 128 by 128 pixel versions to start with. So here's a cool trick: if you use the same axes object to plot an image twice, and the second time you use alpha, which as you might know means transparency in the computer vision world, then you can actually plot the mask over the top of the photo. And so here's a nice way to see all the masks on top of the photos for all of the cars in one group. This is the same MatchedFilesDataset we've seen twice already, and this is all the same code we're used to. Here's something important though. If we had an image of a car in the training set and another image of that same car in the validation set, that would kind of be cheating, because it's the same car. So we use a contiguous set of car IDs, and since each set is a set of 16, we make sure it's evenly divisible by 16; that way our validation set contains different car IDs to our training set. This is the kind of stuff which you've got to be careful of. On Kaggle, it's not so bad.
You'll know about it, because you'll submit your result and get a very different score on the leaderboard compared to your validation set. But in the real world, you won't know until you put it in production, send your company bankrupt, and lose your job. So you might wanna think carefully about your validation set in that case. So here we're gonna use the classification transform type. It's basically the same as the pixel transform type, but if you think about it, with the pixel version, if we rotate a little bit, then we probably want to average the pixels in between the two, whereas for classification, obviously, we don't; we use nearest neighbor. So there's a slight difference there. Also for classification, lighting doesn't kick in and normalization doesn't kick in for the dependent variable. They're already square images, so we don't have to do any cropping. So here you can see different versions of the augmented images; they're moving around a bit and rotating a bit and so forth. I get a lot of questions during our study group and elsewhere about how to debug things and fix things that aren't working, and I never have a great answer other than: every time I fix a problem, it's because of stuff like this that I do all the time. I just always print out everything as I go, and the one thing that I screw up always turns out to be the one thing I forgot to check along the way. So the more of this kind of thing you can do, the better. If you're not looking at all of your intermediate results, you're gonna have troubles. Okay. So given that we want something that knows what cars look like, we probably wanna start with a pre-trained ImageNet network. So we're gonna start with ResNet34, and with ConvnetBuilder we can grab our ResNet34 and add a custom head. And the custom head is gonna be something that upsamples a bunch of times, and we're gonna do things really dumb for now, which is just ConvTranspose2d, batch norm, ReLU. And this is what I'm saying: any of you could have built this without looking at any of this notebook, or at least you have the information from previous classes; there's nothing new at all. At the very end we have a single filter, and that's gonna give us something which is batch size by one by 128 by 128, but we want something which is batch size by 128 by 128, so we have to remove that unit axis. So I've got a Lambda layer here. Lambda layers are incredibly helpful. This one simply removes that unit axis by indexing into it at zero. Without the Lambda layer, I would have had to create a custom class with a custom forward method and so forth, but by creating a Lambda layer that does the one custom bit, I can just chuck it in the Sequential, and that makes life easier. The PyTorch people are kind of snooty about this approach. The Lambda layer is actually something that's part of the fastai library, not part of the PyTorch library, and literally people on the PyTorch discussion board say things like, yes, we could give people this, and yes, it is only a single line of code, but then it would encourage them to use Sequential too often. So there you go. Okay, so this is our custom head.
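As a rough sketch of what a head like this might look like (illustrative channel sizes, not the notebook's exact code): a few ConvTranspose2d / BatchNorm / ReLU blocks to upsample, a final single-filter layer, and a small Lambda-style layer to drop the unit axis so the output is batch by height by width.

```python
import torch.nn as nn

class Lambda(nn.Module):
    """Wrap an arbitrary function as a layer so it can live inside nn.Sequential."""
    def __init__(self, f):
        super().__init__()
        self.f = f
    def forward(self, x):
        return self.f(x)

def up_block(ni, nf):
    return nn.Sequential(
        nn.ConvTranspose2d(ni, nf, 2, stride=2),   # double the grid size
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True),
    )

# illustrative channel sizes; the first number would match the backbone's output channels
custom_head = nn.Sequential(
    up_block(512, 256),
    up_block(256, 64),
    up_block(64, 16),
    nn.ConvTranspose2d(16, 1, 2, stride=2),        # single filter at the end
    Lambda(lambda x: x[:, 0]),                     # drop the unit axis: (bs,1,H,W) -> (bs,H,W)
)
```

The Lambda wrapper is the only non-standard bit; everything else is layers we've used many times before.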
So we're gonna have a ResNet34 that downsamples, and then a really simple custom head that very quickly upsamples, and hopefully that will do something, and we're gonna use accuracy with a threshold of 0.5 to print out metrics. And after a few epochs, we've got 96% accuracy. So is that good? Is 96% accuracy good? And hopefully the answer to that question is: it depends. What's it for? And the answer is, Carvana wanted this because they wanted to be able to take their car images and cut them out and paste them on exotic Monte Carlo backgrounds or whatever. That's Monte Carlo the place, not the simulation. So to do that, you need a really good mask. You don't wanna leave the rear view mirrors behind, or have one wheel missing, or include a little bit of background or something; that would look stupid. So you need something very good. So only having 96% of the pixels correct doesn't sound great, but we won't really know until we look at it. So let's look at it. There's the correct version that we wanna cut out, and that's the 96% accurate version. And when you look at it, you realize, oh yeah, getting 96% of the pixels accurate is actually easy, because all the outside bit is not car, and all the inside bit is car, and really the only interesting bit is the edge. So we need to do better. So let's unfreeze, because all we've done so far is train the custom head, and do more. And after a bit more, we've got 99.1%. So is that good? I don't know, let's take a look. And actually, no: it's totally missed the rear view mirror here, and missed a lot of it here, and it's clearly got an edge wrong here, and these things are totally gonna matter when we try to cut it out. So it's still not good enough. So let's try upscaling. And the nice thing is that when we upscale to 512 by 512 (make sure you decrease the batch size, because you'll run out of memory), there's quite a lot more information there for it to go on; here are the true ones, and this is all identical. So our accuracy increases to 99.4%, and things keep getting better, but we've still got quite a few little black blocky bits. So let's go to 1024 by 1024, down to a batch size of four. So this is pretty high res now. Train a bit more: 99.6, 99.8. And now, if we look at the masks, they're actually looking not bad. That's looking pretty good. So can we do better? And the answer is yes, we can. So we're moving from the Carvana notebook to the Carvana Unet notebook now. And the Unet network is quite magnificent. You see, with that previous approach, our pre-trained ImageNet network was being squished all the way down to seven by seven, and then expanded back out all the way up to the full size, 224 or whatever, which is much bigger than seven by seven, and 1024 is quite a bit bigger still. Which means it has to somehow store all the information about the much bigger version in the small version. And actually, most of the information about the bigger version was really in the original picture anyway, so it doesn't seem like a great approach, this squishing and unsquishing. So the Unet idea comes from this fantastic paper, where it was literally invented in the very domain specific area of biomedical image segmentation. But in fact, basically every Kaggle winner in anything even vaguely related to segmentation has ended up using Unet.
It's one of these things that everybody on Kaggle knows is the best practice, but in more academic circles, even now, a couple of years after it appeared, a lot of people still don't realize that this is by far the best approach. And here's the basic idea. Here's the downward path, where we basically start at 572 by 572 in this case, and then halve the grid size, halve the grid size, halve the grid size, halve the grid size. And then here's the upward path, where we double the grid size, double, double, double, double. But the thing that we also do is, at every point where we've halved the grid size, we copy those activations over to the upward path and concatenate them together. So you can see here, these red blobs are max pooling operations, the green blobs are upsampling, and these gray bits here are copying. So we copy and concat. In other words, the input image, after a couple of convs, is copied over to the output and concatenated together, and so now we get to use all of the information that's gone through all the down and all the up, plus also a slightly modified version of the input pixels, and a slightly modified version of the thing one step down from the input pixels, because they came out through here. So we have all of the richness of going all the way down and up, but also a slightly less coarse version, and a slightly less coarse version than that, and then this really simple version, and they can all be combined together. And that's Unet. Such a cool idea. So here we are in the Carvana Unet notebook. This is the same code as before, and at the start I've got a simple upsample version, just to show you the non-Unet version again. This time I'm gonna add in something called the Dice metric. Dice is very similar, as you'll see, to Jaccard, or IoU; it's basically intersection over union with a minor tweak. And the reason we're gonna use Dice is that it's the metric the Kaggle competition used. It's a little bit harder to get a high Dice score than a high accuracy, because it's really looking at what the overlap of the correct pixels is with your pixels, but it's pretty similar. So in the Kaggle competition, people that were doing okay were getting about 0.996 Dice, and the winners were about 0.997 Dice. So here's our standard upsample; this is all as before. And now we can check our Dice metric, and you can see with the Dice metric we're getting about 0.968 at 128 by 128, and that's not great. So let's try Unet. And I'm calling it Unet-ish, because as per usual I'm creating my own somewhat hacky version, trying to keep things as similar to what you're used to as possible and doing things that I think make sense. So there should be plenty of opportunity for you to make this more authentically Unet by looking at the exact grid sizes. You can see here the size is going down a little bit, so they're obviously not adding any padding, and then they've got some cropping going on here; there are a few differences. But one of the things is, because I want to take advantage of transfer learning, that means I can't quite use the original Unet. So here's another big opportunity: what if you create the Unet down path, then add a classifier on the end, and then train that on ImageNet?
And you've now got an ImageNet-trained classifier which is specifically designed to be a good backbone for Unet, and then you should be able to come back and get pretty close to winning this competition. It's actually a fairly recent competition, and that pre-trained network simply didn't exist before. But if you think about what YOLOv3 did, it's basically that: they created DarkNet, they pre-trained it on ImageNet, and then they used it as the basis for their bounding boxes. So again, this idea of pre-training things which are designed not just for classification but for other things is something that nobody's really done yet. And as we've shown, you can train ImageNet for 25 bucks in three hours now. And if people in the community are interested in doing this, hopefully I'll have credits I can help you with as well, so if you do the work to get it set up and give me a script, I can probably run it for you. For now, though, we don't have that, so we're gonna use ResNet. So we're basically gonna start with getBase. Base is our base network, and that was defined back up in this first section. getBase is gonna call whatever this is, and this is ResNet34. So we're gonna grab our ResNet34, and cutModel is the first thing that our ConvnetBuilder does: it basically removes everything from the adaptive pooling onwards, and that gives us back the backbone of ResNet34. So getBase is gonna give us back our ResNet34 backbone. And then we're gonna take that ResNet34 backbone and turn it into what I call a Unet34. What that's gonna do is save the ResNet that we passed in, and then we're gonna use a forward hook, just like before, to save the results at the second, fourth, fifth, and sixth blocks, which, as before, is basically just before each stride two convolution. Then we're gonna create a bunch of these things we're calling UnetBlocks. The UnetBlock needs to be told how many things are coming from the previous layer that we're upsampling, how many are coming across, and then how many we want to come out. And the amount coming across is entirely defined by whatever the base network was; whatever the downward path was, we need that many layers. And this is a little bit awkward, and actually one of our master's students here, Kerem, has created something called DynamicUnet, which you'll find in the fastai library, that calculates all of this for you and automatically creates the whole Unet from your base model. It's got some minor quirks still that I wanna fix; by the time the video is out, it'll definitely be working, and I will at least have a notebook showing how to use it, and possibly an additional video. But for now, you'll just have to go through and do it yourself. You can easily see it: once you've got a ResNet, you can just type in its name and it'll print out all the layers, and you can see how many activations there are in each block. Or you could even have it printed out for you for each block automatically. Anyway, I just did this manually. And so the UnetBlock works like this. You say, okay, I've got this many coming up from the previous layer, and I've got this many coming across; this is x, coming across from the downward path.
This is the amount I want coming out. Now, what I do is I say, okay, we're gonna create a certain number of channels from the upward path and a certain number from the cross path, and I'm gonna be concatenating them together. So let's divide the number we want out by two: our cross convolution takes the cross path and creates half the number we want out, and then the upward path is gonna be a ConvTranspose2d, because we want to upsample, and it creates the other half. And then at the end, I just concatenate those together. So I've got an upward sample, I've got a cross convolution, and I concatenate the two together. And that's all a UnetBlock is, and it's actually a pretty easy module to create. And then in my forward path, I need to pass to the forward of the UnetBlock the upward path and the cross path. The upward path is just wherever I'm up to so far, and the cross path is whatever the activations are that I stored on the way down. So as I come up, it's the last set of saved features that I need first, and then as I gradually keep going up further and further and further, eventually it's the first set of features. And so there are some more tricks we can do to make this a little bit better, but this is a good start. So if we try this: the simple upsampling approach looked horrible and had a Dice of 0.968. A Unet with everything else identical, except we've now got these UnetBlocks, has a Dice of 0.985. So we've kind of halved the error with everything else exactly the same. And more to the point, you can look at it: this is actually looking somewhat car-like, compared to our non-Unet equivalent, which is just a blob. Because trying to do this through the down and up paths alone is just asking too much, whereas when we actually provide the downward-path pixels at every point, it can actually start to create something car-ish. At the end of that, we call close to remove those saved (sfs) features that are taking up GPU memory, go to a smaller batch size and a higher size, and you can see the Dice coefficient really going up. And notice here, I'm loading in the 128 by 128 version of the network, so we're doing this progressive resizing trick again. So that gets us to 0.993, and then unfreezing gets us to 0.994, and you can see it's now looking pretty good. Go down to a batch size of four and a size of 1024, load in what we just did with the 512, and that takes us to 0.995; unfreeze, and that takes us to, we'll call it 0.996. And as you can see, that actually looks good. In accuracy terms it's 99.82%, and you can see this is looking like something you could just about use to cut out. At this point, there are a couple of minor tweaks we can do to get up to 0.997, but really the key thing then, I think, is maybe just a little bit of smoothing, or a little bit of post-processing. You can go and have a look at the Carvana winners' blogs and see some of these tricks, but as I say, the difference between where we're at, 0.996, and what the winners got, 0.997, is not heaps. And so really, the Unet on its own pretty much solves that problem. Okay, so that's it.
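Here is a sketch of the two pieces just described, a forward hook to stash the downward path activations, and the UnetBlock itself (upsample to half the output channels, a one by one conv on the cross path for the other half, then concatenate). Treat it as an approximation of the idea above rather than the notebook's exact code; the layer indices in the comment are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaveFeatures:
    """Forward hook that stashes a module's output so the up path can use it later."""
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, outp):
        self.features = outp
    def remove(self):
        self.hook.remove()

class UnetBlock(nn.Module):
    def __init__(self, up_in, x_in, n_out):
        super().__init__()
        up_out = x_out = n_out // 2
        self.x_conv = nn.Conv2d(x_in, x_out, 1)                        # cross path: 1x1 conv
        self.tr_conv = nn.ConvTranspose2d(up_in, up_out, 2, stride=2)  # upward path: upsample
        self.bn = nn.BatchNorm2d(n_out)

    def forward(self, up_p, x_p):
        up_p = self.tr_conv(up_p)   # upsample wherever we're up to so far
        x_p = self.x_conv(x_p)      # activations saved on the way down
        return self.bn(F.relu(torch.cat([up_p, x_p], dim=1)))

# The hooks would be attached to the backbone blocks just before each stride-two
# convolution, e.g. something like:
#   sfs = [SaveFeatures(backbone[i]) for i in (2, 4, 5, 6)]
# and each UnetBlock's forward then gets the current up-path activations plus the
# matching saved features, last-saved first.
```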
So the last thing I wanted to mention is to come all the way back to bounding boxes, because you might remember I said our bounding box model was still not doing very well on small objects. So hopefully you might be able to guess where I'm gonna go with this. For the bounding box model, remember how we spat out outputs of our model at different grid cells, and it was those earlier ones, with the small grid sizes, that weren't very good. So how do we fix it? Unet it! Let's have an upward path with cross connections, and then spit the outputs out of that, because now those finer grid cells have all of the information from that path and that path and that path and that path to leverage. Now, of course, this is deep learning, so that means you can't write a paper saying we just used Unet for bounding boxes; you have to invent a new word. So this is called feature pyramid networks, or FPNs. And literally, this is part of the RetinaNet paper... actually no, it's not from the RetinaNet paper, it was used in the RetinaNet paper; it was created in an earlier paper specifically about FPNs. And if memory serves correctly, they did briefly cite the Unet paper, but they kind of made it sound like it was this vaguely, slightly connected thing that maybe some people could consider slightly useful. But really, FPNs is Unets. I don't have an implementation of it to show you, but it'll be a fun thing maybe for some of us to try. Some of the students have already been trying to get it working well on the forums; I haven't yet. So yeah, an interesting thing to try. So I think a couple of things to look at after this class, as well as the other things I mentioned, would be playing around with FPNs and also maybe trying Kerem's DynamicUnet. They would both be interesting things to look at. All right, so you guys have all been through 14 lessons of me talking at you now, so I'm sorry about that. Thanks for putting up with me. I think you're gonna find it hard to find people who actually know as much about training neural networks in practice as you do. It'll be really easy for you to overestimate how capable all these other people are and underestimate how capable you are. So the main thing I'd say is: please practice. Just because you don't have this constant thing getting you to come back here every Monday night now, it's very easy to lose that momentum. So find ways to keep it: organize a study group or a book reading group, or get together with some friends and work on a project, or do something more than just deciding "I wanna keep working on X." Unless you're the kind of person who's super motivated and you know that whenever you decide to do something, it happens, and that's not me. I know that if I want something to happen, I have to say, yes, David, in October I will absolutely teach that course, and then it's like, okay, I'd better actually write some material. That's the only way I can get stuff to happen. So we've got a great community there on the forums. If people have ideas for ways to make it better, please tell me, and if you think you can help, say by creating some new forum or moderating it in some different way or whatever, just let me know. You can always PM me. And there are a lot of projects going on through GitHub as well, lots of stuff.
So yeah, I hope to see you all back here for something else, and thanks so much for joining me on this journey.