Welcome to lesson seven, the last lesson of part one. This will be a pretty intense lesson. Don't let that bother you, because partly what I want to do is give you enough things to think about to keep you busy until part two. In fact, for some of the things we cover today, I'm not going to tell you all of the details. I'll just point out a few things where I'll say, okay, we're not talking about that yet. Then come back in part two to get the details on those extra pieces. So today will be a lot of material, covered pretty quickly. It might require a few viewings and a few experiments to fully understand it all. That's kind of intentional: I'm trying to give you stuff to keep you amused for a couple of months. I wanted to start by showing some cool work done by a couple of students, Reshama and npatta01, who have developed an Android and an iOS app. Check out Reshama's post on the forum about that, because they have a demonstration of how to create both Android and iOS apps that are actually on the Play Store and on the Apple App Store. So that's pretty cool. They're the first ones I know of on the app stores that are using fastai. And let me also say a huge thank you to Reshama for all of the work she does for the fastai community, for the machine learning community generally, and for the women in machine learning community in particular. She does a lot of fantastic work, including providing lots of fantastic documentation and tutorials and community organizing and so many other things. So thank you, Reshama, and congrats on getting this app out there. We have lots of Lesson 7 notebooks today, as you see. The first notebook we're going to look at is Lesson 7 ResNet MNIST.
And what I want to do is look at some of the stuff we started talking about last week around convolutions and convolutional neural networks, and start building on top of them to create a fairly modern deep learning architecture, largely from scratch. When I say from scratch, I'm not going to re-implement things we already know how to implement, but use the pre-existing PyTorch bits for those. So we're going to use the MNIST dataset. URLs.MNIST has the whole MNIST dataset. Often we've done stuff with a subset of it. In there, there's a training folder and a testing folder. And as I read this in, I'm going to show some more details about pieces of the data block API so that you can see what's going on. Normally with the data block API we've kind of said blah, blah, blah, blah, blah and done it all in one cell. But let's do them one cell at a time. So the first thing you say is what kind of item list you have. In this case it's an item list of images. And then where are you getting the list of file names from? In this case, by looking in a folder recursively. You can pass in arguments that end up going to Pillow, because Pillow, or PIL, is the thing that actually opens the image for us. In this case these are black and white rather than RGB, so you have to use Pillow's convert_mode='L'. For more details, refer to the Python Imaging Library documentation to see what their convert modes are. But this one is going to be grayscale, which is what MNIST is. So inside an item list is an items attribute. The items attribute is the thing that you gave it. It's the thing that it's going to use to create your items. In this case the thing you gave it really is a list of file names. That's what it got from the folder. Okay. When you show images, normally it shows them in RGB. In this case we want to use a binary color map. So in fastai you can set a default color map.
For more information about cmap and color maps, refer to the matplotlib documentation. So this will set the default color map for fastai. Okay. Our image item list contains 70,000 items, and it's a bunch of images that are 1 by 28 by 28. Remember that PyTorch puts the channel first, so they're 1 channel, 28 by 28. You might think, well, why aren't they just 28 by 28 matrices rather than a 1 by 28 by 28 rank 3 tensor? It's just easier that way. All the Conv2d stuff and so forth works on rank 3 tensors, so you want to include that unit axis at the start. fastai will do that for you even when it's reading 1 channel images. So the .items attribute contains the thing that's read to build the image, which in this case is the file name. But if you index into an item list directly, you'll get the actual image object. And the actual image object has a show method. So there's the image. Once you've got an image item list, you then split it into training versus validation. You nearly always want validation. If you don't, you can use the .no_split method to create a kind of empty validation set. You can't skip it entirely. You have to say how to split, and one of the options is no split. So remember that's always the order: first create your item list, then decide how to split. In this case, we're going to do it based on folders. The validation folder for MNIST is called testing. In fastai parlance, we use the same parlance that Kaggle does, which is that the training set is what you train on. The validation set has labels, and you use it for testing that your model is working. The test set doesn't have labels, and you use it for doing inference, or submitting to a competition, or sending it off to somebody who's held out those labels for vendor testing or whatever. So just because a folder in your dataset is called testing doesn't mean it's a test set, right? This one has labels.
So it's a validation set. Okay, so if you want to do inference on lots of things at a time rather than one thing at a time, you want to use test= in fastai to say this is stuff which has no labels and which I'm just using for inference. Okay, so my split data is a training set and a validation set, as you can see. Inside the training set, there's a folder for each class. So now we can take that split data and say label_from_folder. So first you create the item list, then you split it, then you label it. And you can see now we have an x and a y, and the y's are category objects. A category object is just a class, basically. If you index into a label list, such as ll.train, which is a label list, you will get back an independent variable and a dependent variable, x and y. In this case the x will be an image object, which I can show, and the y will be a category object, which I can print. That's the number eight category, and there's the eight. The next thing we can do is add transforms. In this case, we're not going to use the normal get_transforms function, because we're doing digit recognition. With digit recognition, you wouldn't want to flip it left to right. That would change the meaning of it. You wouldn't want to rotate it too much. That would change the meaning of it. Also, because these images are so small, doing zooms and so on is going to make them so fuzzy as to be unreadable. So normally for small images of digits like this you just add a bit of random padding. So I'll use the random padding function, which actually returns two transforms: the bit that does the padding and the bit that does the random crop. So you have to use star to say put both of these transforms in this list. Now we can call transform. This empty array here refers to the validation set transforms, so there are no transforms for the validation set. Now we've got a transformed, labeled list. We can pick a batch size and call databunch.
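As a rough illustration of the pad-then-random-crop idea described above, here is a minimal sketch in plain Python. It is not fastai's implementation: the function name rand_pad_crop, the zero padding, and the padding width of 3 are all assumptions made for the example, and the library's own padding mode may differ.

```python
import random

def rand_pad_crop(img, padding=3, rng=random):
    """Zero-pad a 2D image on every side, then crop a random window
    with the original height and width (hypothetical helper)."""
    h, w = len(img), len(img[0])
    padded_w = w + 2 * padding
    # Rows of zeros above, the original rows with zeros left and right, zeros below.
    padded = [[0] * padded_w for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in img]
    padded += [[0] * padded_w for _ in range(padding)]
    # Pick the top-left corner of the crop at random inside the padded image.
    top = rng.randint(0, 2 * padding)
    left = rng.randint(0, 2 * padding)
    return [row[left:left + w] for row in padded[top:top + h]]

img = [[1] * 28 for _ in range(28)]          # stand-in for a 28x28 digit
out = rand_pad_crop(img, padding=3, rng=random.Random(0))
print(len(out), len(out[0]))                 # the crop keeps the 28x28 size
```

Each call shifts the digit by up to three pixels in each direction, which is why repeated draws of the same training item show it in slightly different positions.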
We can choose normalize. In this case, we're not using a pre-trained model, so there's no reason to use ImageNet stats here. If you call normalize like this without passing in stats, it will grab a batch of data at random and use that to decide what normalization stats to use. That's a good idea if you're not using a pre-trained model. Okay, so we've got a data bunch. In that data bunch is a dataset, which we've seen already. But what is interesting is that the training dataset now has data augmentation, because we've got transforms. plot_multi is a fastai function that will plot the result of calling some function for each cell of this row by column grid. In this case, my function just grabs the first image from the training set. Each time you grab something from the training set, it's going to load it from disk and transform it on the fly. People sometimes ask, how many transformed versions of the image do you create? And the answer is kind of infinite. Each time we grab one thing from the dataset, we do a random transform on the fly, so potentially every one will look a little bit different. You can see here that if we plot the result of that lots of times, we get eights in slightly different positions, because we did random padding. You can always grab a batch of data from the data bunch, because remember, a data bunch has data loaders, and data loaders are things that you grab a batch at a time from. So you can then grab an x batch and a y batch and look at their shape: batch size by channel by row by column. All fastai data bunches have a show_batch, which will show you what's in it in some sensible way. Okay, so that's a quick walkthrough of the data block API stuff to grab our data. So let's start out creating a simple CNN, a simple convnet. The input is 28 by 28.
When I'm creating architectures, I like to define a function which does the things that I do again and again. I don't want to keep calling it with the same arguments, because I'll forget or make a mistake. In this case, all of my convolutions are going to be kernel size 3, stride 2, padding 1. So let's just create a simple function to do a conv with those parameters. Each time I have a convolution, it's skipping over one pixel, so it's jumping two steps each time. That means that each time we have a convolution, it's going to halve the grid size. I've put a comment here showing what the new grid size is after each one. For the first convolution, we have one channel coming in, because, remember, it's a grayscale image with one channel. And then how many channels come out? Whatever you like. Remember, you always get to pick how many filters you create, regardless of whether it's a fully connected layer, in which case it's just the width of the matrix you're multiplying by, or in this case a 2D conv, where it's just how many filters you want. So I picked eight. This is stride 2, so the 28 by 28 image is now a 14 by 14 feature map with eight channels. Specifically, therefore, it's an 8 by 14 by 14 tensor of activations. Then we'll do batch norm, then we'll do ReLU. The number of input filters to the next conv has to equal the number of output filters from the previous conv. We can just keep increasing the number of channels, and because we're doing stride 2, it's going to keep decreasing the grid size. Notice here it goes from 7 to 4, because if you're doing a stride 2 conv over 7, it's going to be math.ceil of 7 divided by 2. Batch norm, ReLU, conv. We're now down to 2 by 2. Batch norm, ReLU, conv. We're now down to 1 by 1. So after this we have a feature map of, let's see, 10 by 1 by 1. Does that make sense? We've got a grid size of 1 now.
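The grid sizes quoted above can be checked with the standard output-size formula for a convolution. This is a small sketch in plain Python; the helper name conv_out_size is just for illustration.

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Output width of a convolution over an input of width n."""
    return (n + 2 * padding - kernel) // stride + 1

sizes = []
n = 28
for _ in range(5):            # five stride 2 convs in a row
    n = conv_out_size(n)
    sizes.append(n)

print(sizes)                  # [14, 7, 4, 2, 1]
```

With kernel size 3 and padding 1 this works out the same as rounding n / 2 up, which is why 7 becomes 4 rather than 3.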
So it's not a vector of length 10. It's a rank 3 tensor of 10 by 1 by 1. Our loss functions generally expect a vector, not a rank 3 tensor. So you can chuck Flatten at the end, and flatten just means remove any unit axes. That will make it just a vector of length 10, which is what we always expect. So that's how we can create a CNN. Then we can turn that into a learner by passing in the data, the model, the loss function and, optionally, some metrics. We're going to use cross entropy as usual. We can then call learn.summary and confirm: after that first conv we're down to 14 by 14, and after the second conv 7 by 7, then 4 by 4, 2 by 2, 1 by 1. The flatten comes out calling itself a Lambda, but as you can see it gets rid of the 1 by 1, and it's now just a length 10 vector for each item in the batch. So it's a 128 by 10 matrix for the whole mini batch. Just to confirm that this is working okay, we can grab that mini batch of x that we created earlier. There's our mini batch of x. Pop it onto the GPU and call the model directly. Remember, any PyTorch module we can treat as a function, and that gives us back, as we hoped, a 128 by 10 result. Okay, so that's how you can directly get some predictions out. LR find, fit one cycle, and bang. We already have a 98.6% accurate convnet, and this is trained from scratch, of course. It's not pre-trained. We literally created our own architecture. It's about the simplest possible architecture you can imagine, and it takes 18 seconds to train. So that's how easy it is to create a pretty accurate digit detector. So let's refactor that a little. Rather than saying conv, batch norm, ReLU all the time, fastai already has something called conv_layer, which lets you create conv, batch norm, ReLU combinations. It has various other options to do other tweaks to it, but the basic version is just exactly what I showed you. So we can refactor that like so. That's exactly the same neural net.
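For reference, here is a sketch of that kind of network written in plain PyTorch. It follows the description above (five stride 2 convs, batch norm and ReLU in between, a flatten at the end), but the intermediate channel counts 16, 32 and 16 are assumptions, and nn.Flatten stands in for fastai's own Flatten layer.

```python
import torch
from torch import nn

def conv(ni, nf):
    # Kernel size 3, stride 2, padding 1: halves the grid size each time.
    return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),   nn.BatchNorm2d(8),  nn.ReLU(),   # 14x14
    conv(8, 16),  nn.BatchNorm2d(16), nn.ReLU(),   # 7x7
    conv(16, 32), nn.BatchNorm2d(32), nn.ReLU(),   # 4x4
    conv(32, 16), nn.BatchNorm2d(16), nn.ReLU(),   # 2x2
    conv(16, 10), nn.BatchNorm2d(10),              # 1x1
    nn.Flatten(),                                  # (10, 1, 1) -> (10,)
)

xb = torch.randn(128, 1, 28, 28)   # a fake mini batch of grayscale digits
preds = model(xb)                  # a module can be called like a function
print(preds.shape)
```

The output has one row per item in the batch and one column per digit class, which is the shape a cross entropy loss expects.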
So let's just train it a little bit longer, and it's actually 99.1% accurate if we train it for all of a minute. So that's cool. How can we improve this? Well, what we really want to do is create a deeper network, and a very easy way to create a deeper network would be, after every stride 2 conv, to add a stride 1 conv, because the stride 1 conv doesn't change the feature map size at all. So you can add as many as you like. But there's a problem. The problem was pointed out in a very, very influential paper called Deep Residual Learning for Image Recognition by Kaiming He and colleagues, then at Microsoft Research. They did something interesting. They said, let's look at the training error. So forget generalization, even. Let's just look at the training error of a network trained on CIFAR-10. And let's try one network of 20 layers of just basic 3 by 3 convs, basically the same network I just showed you but without batch norm. Let's try a 20 layer one and a 56 layer one on the training set. The 56 layer one has a lot more parameters. It's got a lot more of these stride 1 convs in the middle. So the one with more parameters should seriously overfit. You would expect the 56 layer one to zip down to zero-ish training error pretty quickly. And that is not what happens. It is worse than the shallower network. When you see something weird happen, really good researchers don't go, oh no, it's not working. They go, that's interesting. So Kaiming He said, that's interesting. What's going on? And he said, I don't know. But what I do know is this: I could take this 56 layer network and make a new version of it which is nearly identical, but has to be at least as good as the 20 layer network. And here's how: every two convolutions, I'm going to add the input to those two convolutions to the result of those two convolutions.
So in other words, instead of saying output equals conv2 of conv1 of x, he's saying output equals x plus conv2 of conv1 of x. His theory was that 56 layers' worth of convolutions in that form has to be at least as good as the 20 layer version, because it could always just set conv2 and conv1 to a bunch of zero weights for everything except the first 20 layers, because the x, the input, could just go straight through. So this thing here is, as you see, called an identity connection. It's the identity function. Nothing happens at all. It's also known as a skip connection. So that was the theory, right? That's what the paper describes as the intuition behind this: what would happen if we created something which has to train at least as well as a 20 layer neural network, because it kind of contains that 20 layer neural network? There's literally a path where you can just skip over all the convolutions. And so what happened? What happened was he won ImageNet that year. He easily won ImageNet that year. And in fact, even today: when we got that record-breaking result on ImageNet training speed ourselves last year, we used this too. ResNet has been revolutionary. And here's a trick, if you're interested in doing research. Anytime you find some model for anything, whether it's medical image segmentation or some kind of GAN or whatever, and it was written a couple of years ago, they might have forgotten to put ResBlocks in. This is what we normally call a ResBlock. So replace their convolutional path with a bunch of ResBlocks and you'll almost always get better results faster. That's a good trick. At NeurIPS, which Rachel, David, Sylvain and I all just came back from, we saw a new presentation where they actually figured out how to visualize the loss surface of a neural net, which is really cool.
This is a fantastic paper, and anybody who's watching this, Lesson 7, is at a point where they will understand most of the important concepts in this paper. You can read this now. You won't necessarily get all of it, but I'm sure you'll get enough to find it interesting. The big picture was this one. Here's what happens if you draw a picture where X and Y are two projections of the weight space, and Z is the loss. As you move through the weight space, a 56 layer neural network without skip connections is very, very bumpy. That's why this got nowhere: it just got stuck in all these hills and valleys. The exact same network with identity connections, with skip connections, has this loss landscape. So it's kind of interesting how He recognized back in 2015 that this shouldn't happen, and here's a way that must fix it. And it took three years before people were able to say, oh, this is kind of why it fixed it. It reminds me of the batch norm discussion we had a couple of weeks ago: people realizing a little bit after the fact what's going on and why it helps. So in our code we can create a res block in just the way I described. We create an nn.Module. We create two conv layers. Remember a conv layer is Conv2d, batch norm, ReLU. Sorry, Conv2d, ReLU, batch norm. So create two of those, and then in forward we go conv1 of x, conv2 of that, and then add x. There's a res_block function already in fastai, so you can just call res_block instead, and you just pass in something saying how many filters you want. So there's the res block that I defined in our notebook. With that res block, I've just copied the previous CNN, and after every conv2, except the last one, I added a res block. So this has now got three times as many layers. It should be able to do more compute, right? But it shouldn't be any harder to optimize. So what happens?
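A minimal PyTorch sketch of the block just described follows. The conv_layer helper here is a simplified stand-in, not fastai's own function, and it assumes stride 1 and the conv, ReLU, batch norm ordering mentioned above. The last few lines illustrate the "at least as good" argument: if the final batch norm scale is zero, the convolutional path contributes nothing and the block passes its input straight through.

```python
import torch
from torch import nn

def conv_layer(nf):
    # Stride 1 conv -> ReLU -> batch norm; channels and grid size are unchanged.
    return nn.Sequential(
        nn.Conv2d(nf, nf, kernel_size=3, stride=1, padding=1, bias=False),
        nn.ReLU(),
        nn.BatchNorm2d(nf),
    )

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_layer(nf)
        self.conv2 = conv_layer(nf)

    def forward(self, x):
        # Identity (skip) connection: add the input to the result of two convs.
        return x + self.conv2(self.conv1(x))

block = ResBlock(8)
x = torch.randn(4, 8, 14, 14)
out = block(x)
print(out.shape)                       # same shape as the input

# Zero the scale of the last batch norm (its bias is already zero by default):
# the conv path now outputs zeros, so the block reduces to the identity.
nn.init.zeros_(block.conv2[2].weight)
identity_out = block(x)
print(torch.allclose(identity_out, x))
```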
Well, let's just refactor it one more time. Since I go conv2, res_block so many times, let's just pop that into a little mini sequential model here, and so I can refactor that like so. Keep refactoring your architectures if you're trying novel architectures, because you'll make fewer mistakes. Very few people do this. Most research code you look at is clunky as all hell, and people often make mistakes that way. So don't do that. You're all coders, so use your coding skills to make life easier. Okay, so there's my ResNet-ish architecture. LR find as usual, fit for a while, and I get 99.54%. That's interesting, because we've trained this literally from scratch with an architecture we built from scratch. I didn't look up this architecture anywhere. It was just the first thing that came to mind. But in terms of where that puts us, 0.45% error is around about the state of the art for this dataset as of three or four years ago. Now, today MNIST is considered a trivially easy dataset, so I'm not saying, wow, we've broken some records here. People have got beyond 0.45% error. But what I'm saying is that this kind of ResNet is a genuinely, extremely useful network still today. This is really all we use in our fast ImageNet training still. And one of the reasons as well is that it's so popular, so the vendors of the library spend a lot of time optimizing it, so things tend to work fast. Whereas some more modern style architectures, using things like separable or grouped convolutions, tend not to actually train very quickly in practice. If you look at the definition of res_block in the fastai code, you'll see it looks a little bit different to this. That's because I've created something called a MergeLayer. A MergeLayer is something which in the forward, just skip dense for a moment, says x plus x.orig. So you can see there's something ResNet-ish going on here.
What is x.orig? Well, if you create a special kind of sequential model called a SequentialEx, so this is like fastai's sequential extended, it's just like a normal sequential model, but we store the input in x.orig, right? And so this here is SequentialEx, conv layer, conv layer, MergeLayer. That will do exactly the same as this. So you can create your own variations of ResNet blocks very easily with just SequentialEx and MergeLayer. There's something else here, which is that when you create your MergeLayer, you can optionally set dense equals true. What happens if you do? Well, if you do, it doesn't go x plus x.orig. It goes cat of x and x.orig. In other words, rather than putting a plus in this connection, it does a concatenate. That's pretty interesting, because once you use concatenate instead of plus, it's not called a ResBlock anymore. It's called a dense block. And it's not called a ResNet anymore. It's called a DenseNet. The DenseNet was invented about a year after the ResNet. If you read the DenseNet paper, it can sound incredibly complex and different. But actually it's literally identical, except that plus here is replaced with cat. So you have your input coming into your dense block, right? And you've got a few convolutions in here. And then you've got some output coming out. And then you've got your identity connection. And remember, it doesn't plus. It concats. So if this is the channel axis, it gets a little bit bigger, right? And then we do another dense block. At the end of that, we have the result of the convolution as per usual. But this time, the identity block is that big, right? So you can see that with dense blocks, it's getting bigger and bigger and bigger.
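The difference between adding and concatenating can be sketched in a few lines of PyTorch. This is not fastai's SequentialEx or MergeLayer; the DenseBlock class and its growth parameter are hypothetical names used only to show how the channel axis grows while the original input stays intact.

```python
import torch
from torch import nn

class DenseBlock(nn.Module):
    """Concatenate the block's input with its conv output along the channel axis."""
    def __init__(self, ni, growth):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ni, growth, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(growth),
        )

    def forward(self, x):
        # A res block would return x + self.convs(x); a dense block concatenates.
        return torch.cat([x, self.convs(x)], dim=1)

x = torch.randn(2, 8, 7, 7)
block1 = DenseBlock(8, growth=4)     # 8 channels in, 12 out
block2 = DenseBlock(12, growth=4)    # 12 channels in, 16 out
y = block2(block1(x))

print(y.shape)                       # channels grew from 8 to 16
print(torch.equal(y[:, :8], x))      # the original input is still there, untouched
```

Because every block keeps everything that came before it, the activations grow with depth, which is where the memory cost of this design comes from.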
And kind of interestingly, the exact input is still here, right? So no matter how deep you get, the original input pixels are still there, and the original layer one features are still there, and the original layer two features are still there. So as you can imagine, DenseNets are very memory intensive. There are ways to manage this: from time to time, you can have a regular convolution that squishes your channels back down. But they are memory intensive. They have very few parameters, though. So for dealing with small datasets, you should definitely experiment with dense blocks and DenseNets. They tend to work really well on small datasets. Also, because it's possible to keep those original input pixels all the way down the path, they work really well for segmentation, right? Because for segmentation, you want to be able to reconstruct the original resolution of your picture, so having all of those original pixels still there is super helpful. So that's ResNets. One of the main reasons, other than the fact that ResNets are awesome, to tell you about them is that these skip connections are useful in other places as well. They're particularly useful in other ways of designing architectures for segmentation. In building this lesson, I keep trying to take old papers and imagine what that person would have done if they had access to all the modern techniques we have now, and I try to rebuild them in a more modern style. So I've recently been rebuilding this next architecture we're going to look at, called a U-Net, in a more modern style. I keep showing you this semantic segmentation paper with the state of the art for CamVid, which was 91.5. This week I got it up to 94.1 using the architecture I'm about to show you. So we keep pushing this further and further and further.
And it really was all about adding all of the modern tricks, many of which I'll show you today, some of which we'll see in part two. So what we're going to do to get there is use this U-Net. We've used a U-Net before. I've improved it a bit since then. We used it when we did the CamVid segmentation, but we didn't understand what it was doing. We're now in a position where we can understand what it was doing. The first thing we need to do is understand the basic idea of how you can do segmentation. If we go back to our CamVid notebook, you'll remember that basically what we were doing is taking these photos and adding a class to every single pixel. When you go data.show_batch for something which is a segmentation item list, it will automatically show you these color-coded pixels. So here's the thing. In order to color-code this as a pedestrian, but this as a bicyclist, it needs to know what it is. It needs to actually know that's what a pedestrian looks like, and it needs to know that's exactly where the pedestrian is, and this is the arm of the pedestrian and not part of their shopping basket. It needs to really understand a lot about this picture to do this task, and it really does do this task. When you look at the results of our top model, I can't see a single wrong pixel by eye. I know there are a few wrong, but it's that accurate. So how does it do that? The way we get these really, really good results is, not surprisingly, using pre-training. We start with a ResNet 34, and you can see that here: unet_learner, data, models.resnet34. If you don't say pretrained equals False, by default you get pretrained equals True, because, why not? So we start with a ResNet 34, which starts with a big image. In this case, this is from the U-Net paper now.
Their images started as one channel by 572 by 572. This is for medical imaging segmentation. After the stride 2 convs, they're doubling the number of channels to 128, and they're halving the size, so they're now down to 280 by 280. In this original U-Net paper, they didn't add any padding, so they lost a pixel on each side each time they did a conv. That's why you're losing these two. But basically: halve the size, and then halve the size, and then halve the size, and then halve the size, until they're down to 28 by 28 with 1,024 channels. So that's what the U-Net's downsampling path looks like. This is called the downsampling path. Ours is just a ResNet 34. You can see it here in learn.summary, right? This is literally a ResNet 34. You can see that the size keeps halving, the channels keep going up, and so forth. So eventually you've got down to a point where, if you use the U-Net architecture, it's 28 by 28 with 1,024 channels. With the ResNet architecture, with a 224 pixel input, it would be 512 channels by 7 by 7. So it's a pretty small grid size on this feature map. Somehow we've got to end up with something which is the same size as our original picture. So how do we do that? How do you do computation which increases the grid size? Well, we don't have a way to do that in our current bag of tricks. We can use a stride 1 conv to do computation and keep the grid size, or a stride 2 conv to do computation and halve the grid size. So how do we double the grid size? We do a stride half conv, also known as a deconvolution, also known as a transposed convolution. There is a fantastic paper called A Guide to Convolution Arithmetic for Deep Learning that shows a great picture of exactly what a 3 by 3 kernel stride half conv looks like. And it's literally this. If you have a 2 by 2 input, so the blue squares are the 2 by 2 input, you add not only two pixels of padding all around the outside, but you also add a pixel of padding between every pixel.
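The downsampling sizes quoted above for the original U-Net can be reproduced with a little arithmetic. This sketch in plain Python assumes two unpadded 3 by 3 convolutions per level (each losing one pixel per side) and models the halving step as plain division by two; the function name and its parameters are made up for the example.

```python
def unet_down_sizes(size=572, levels=5, convs_per_level=2):
    """Grid size at the end of each level of an unpadded U-Net style
    downsampling path (a sketch under the assumptions stated above)."""
    sizes = []
    for level in range(levels):
        if level > 0:
            size //= 2                 # the halving step between levels
        size -= 2 * convs_per_level    # each unpadded 3x3 conv loses 2 pixels
        sizes.append(size)
    return sizes

print(unet_down_sizes())   # [568, 280, 136, 64, 28]
```

The second entry is the 280 by 280 mentioned above, and the last one is the 28 by 28 grid at the bottom of the U.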
And so now if we put this 3 by 3 kernel here, and then here, and then here, you see how the 3 by 3 kernel is just moving across it in the usual way, and you will end up going from a 2 by 2 input to a 5 by 5 output. If you only added one pixel of padding around the outside, you would end up with a 3 by 3 output, right? So, sorry, 4 by 4. So this is how you can increase the resolution. This was the way people did it until maybe a year or two ago. That's another trick for improving things you find online, because this is actually a dumb way to do it. And it's kind of obvious it's a dumb way to do it, for a couple of reasons. One is that, have a look at this: nearly all of those pixels are white. They're nearly all zeros. So what a waste. What a waste of time. What a waste of computation. There's just nothing going on there. Also, when you get down to that 3 by 3 area, two out of the nine pixels are non-white, but in this one, one out of the nine is non-white. So there are different amounts of information going into different parts of your convolution. It just doesn't make any sense to throw away information like this, and to do all this unnecessary computation, and to have different parts of the convolution having access to different amounts of information. So what people generally do nowadays is something really simple. If you have, let's say, a 2 by 2 input with these as your pixel values, A, B, C, and D, and you want to create a 4 by 4, why not just do this? A, A, A, A, B, B, B, B, C, C, C, C, D, D, D, D. So I've now upscaled from 2 by 2 to 4 by 4. I haven't done any interesting computation, but now on top of that I could just do a stride 1 convolution, and now I have done some computation. This kind of upsample is called nearest neighbor interpolation. That's super fast, which is nice.
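To make the "mostly zeros" point concrete, here is a small sketch in plain Python that builds the zero-inserted, zero-padded grid for a 2 by 2 input and counts how many real input pixels each 3 by 3 window sees. The helper names are invented for the example, and the nonzero values 1 to 4 just stand in for real pixels.

```python
def zero_insert_and_pad(inp, pad=2):
    """Put one zero between neighbouring pixels and `pad` zeros around the edge."""
    n = len(inp)
    size = (2 * n - 1) + 2 * pad
    grid = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            grid[pad + 2 * i][pad + 2 * j] = inp[i][j]
    return grid

def real_pixels_per_window(grid, k=3):
    """For every k x k window position, count the nonzero (real) pixels."""
    out = len(grid) - k + 1
    return [[sum(1 for di in range(k) for dj in range(k)
                 if grid[i + di][j + dj] != 0)
             for j in range(out)]
            for i in range(out)]

grid = zero_insert_and_pad([[1, 2], [3, 4]])   # 7x7 grid with only 4 real pixels
counts = real_pixels_per_window(grid)          # 5x5 output positions
for row in counts:
    print(row)
```

Only 4 of the 49 cells hold real data, and the windows see 1, 2 or 4 real pixels depending on where they land, which is the uneven access to information described above.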
So you can do a nearest neighbor interpolation and then a stride 1 conv, and now you've got some computation which is actually using real values. There are no zeros here. This is kind of nice, because each window gets a mixture of A's and B's, which is what you would want, and so forth. Another approach is, instead of using nearest neighbor interpolation, you can use bilinear interpolation, which basically means that instead of copying A to all of those different cells, you take a weighted average of the cells around it. So for example, if you were looking at what should go here, you would go, oh, it's about three A's, two C's, one D and two B's, and you take the average. Not exactly, but roughly: just a weighted average. You'll find bilinear interpolation all over the place. It's a pretty standard technique. Any time you look at a picture on your computer screen and change its size, it's doing bilinear interpolation. So you can do that and then a stride 1 conv. That was what people were using. Well, that's what people still tend to use. That's as much as I'm going to teach you in this part. In part two, we'll learn what the fastai library is actually doing behind the scenes, which is something called a pixel shuffle, also known as sub-pixel convolutions. It's not dramatically more complex, but complex enough that I won't cover it today. It's the same basic idea. All of these things are basically letting us do a convolution that ends up with something that's twice the size. And so that gives us our upsampling path, right? That lets us go from 28 by 28 to 56 by 56 and keep on doubling the size. So that's good. And that was it until U-Net came along. That's what people did. And it didn't work real well, which is not surprising, because in this 28 by 28 feature map, how the hell is it going to have enough information to reconstruct a 572 by 572 output space?
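Nearest neighbor upsampling is simple enough to write in one line of plain Python. This sketch, with the made-up helper name upsample_nearest, repeats every pixel along both axes and then pulls out one 3 by 3 window to show that a following stride 1 convolution would see a mixture of real values and no zeros.

```python
def upsample_nearest(grid, scale=2):
    """Repeat every pixel `scale` times along both axes."""
    return [[value for value in row for _ in range(scale)]
            for row in grid for _ in range(scale)]

up = upsample_nearest([["A", "B"],
                       ["C", "D"]])
for row in up:
    print(" ".join(row))

# One 3x3 window of the upsampled grid, as a stride 1 conv would see it.
window = [row[1:4] for row in up[0:3]]
print(window)
```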
You know, that's a really tough ask. So you tended to end up with these things that lacked fine detail. So what Olaf Ronneberger et al. did was they said, hey, let's add a skip connection, an identity connection. And amazingly enough, this was before ResNets existed. So this was a really big leap, really impressive. But rather than adding a skip connection that skipped every two convolutions, they added skip connections where these gray lines are. In other words, they added a skip connection from the same part of the downsampling path to the same-sized bit in the upsampling path. And they didn't add. That's why you can see the white and the blue next to each other. They didn't add. They concatenated. So basically, these are like dense blocks, right? But the skip connections are skipping over larger and larger amounts of the architecture, so that over here, you've literally got, or nearly, the input pixels themselves coming into the computation of these last couple of layers. And so that's going to make it super handy for resolving the fine details in these segmentation tasks, because you've literally got all of the fine details. On the downside, you don't have very many layers of computation going on here, just four, right? So you'd better hope that by that stage, you've done all the computation necessary to figure out, is this a bicyclist or is this a pedestrian? But you can then add on top of that something saying, is this exact pixel where their nose finishes, or is that the start of the tree? So that works out really well. And that's a U-Net.
So this is the U-Net code from FastAI. And the key thing that comes in is the encoder. The encoder refers to that part. In other words, in our case, a ResNet 34. In most cases, they have this specific older-style architecture. But like I said, replace any older-style architecture bits with ResNet bits and life improves, particularly if they're pre-trained.
So that certainly happened for us. So we start with our encoder. So the layers of our U-Net are an encoder, then batch norm, then ReLU, and then middle conv, which is just conv layer, conv layer. Remember, conv layer is a conv, ReLU, batch norm in FastAI. And so the middle conv is these two extra steps here at the bottom. Just doing a little bit of computation. It's kind of nice to add more layers of computation where you can. So encoder, batch norm, ReLU, and then two convolutions. And then we enumerate through these indexes. What are these indexes? I haven't included the code, but basically we figure out what is the layer number where each of these stride 2 convs occurs, and we just store it in an array of indexes. So then we can loop through that, and we can basically say for each one of those points, create a U-Net block, telling it how many upsampling channels there are and how many cross-connection channels. These things here are called cross-connections. That's what I call them. So that's really where the main work is going on, in the U-Net block. As I said, there are quite a few tweaks we do, as well as the fact we use a much better encoder. We also use some tweaks in all of our upsampling, using this pixel shuffle with another tweak called ICNR. And then another tweak, which I just did in the last week, is to not just take the result of the convolutions and pass it across, but we actually grab the input pixels and make them another cross-connection. That's what this last cross is here. You can see we're literally appending a res block with the original inputs. You can see our merge layer. So really all the work is going on in the U-Net block. And the U-Net block has to store the activations at each of these downsampling points. And the way to do that, as we learned in the last lesson, is with hooks. So we put hooks into the ResNet 34 to store the activations each time there's a stride 2 conv. And so you can see here we grabbed the hook.
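To make the hook and cross-connection idea concrete, here is a minimal PyTorch sketch. It is not the FastAI U-Net block: the tiny two-layer encoder, the layer sizes, and the names `stored` and `save_skip` are all assumptions made for illustration, and plain nearest neighbor upsampling stands in for the pixel shuffle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder standing in for the ResNet 34 (hypothetical, much smaller).
encoder = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),   # 32x32 -> 16x16
    nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),  # 16x16 -> 8x8
    nn.ReLU(),
)

stored = {}

def save_skip(module, inputs, output):
    # Forward hook: keep the activations computed just before the next stride 2 conv.
    stored["skip"] = output

handle = encoder[1].register_forward_hook(save_skip)

x = torch.randn(1, 3, 32, 32)
bottom = encoder(x)                                          # (1, 16, 8, 8)

# Upsample the bottom of the "U" back to the size of the stored activations.
up = F.interpolate(bottom, scale_factor=2, mode="nearest")   # (1, 16, 16, 16)

# Cross-connection: batch norm the hooked activations, then concatenate on the channel axis.
skip = nn.BatchNorm2d(8)(stored["skip"])                     # (1, 8, 16, 16)
merged = torch.cat([up, skip], dim=1)                        # (1, 24, 16, 16)

# Two convolutions on top of the concatenated features.
convs = nn.Sequential(
    nn.Conv2d(24, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
)
out = convs(merged)
handle.remove()
print(merged.shape, out.shape)
```

Because the concatenation happens before the two convolutions, every output channel can mix information from the upsampling path and from the stored downsampling activations.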
And we grabbed the result of the stored value in that hook. And we literally just go torch.cat. So we concatenate the upsampled convolution with the result of the hook, which we chucked through batch norm. And then we do two convolutions to it. And actually, something you could play with at home is pretty obvious here. Any time you see two convolutions like this, there's an obvious question: what if we used a ResNet block instead? So you could try replacing those two convs with a ResNet block. You might find you get even better results. These are the kinds of things I look for when I look at an architecture: oh, two convs in a row, probably should be a ResNet block.
Okay, so that's U-Net. And it's amazing to think it preceded ResNet, it preceded DenseNet. It wasn't even published in a major machine learning venue. It was actually published in MICCAI, which is a specialized medical image computing conference. For years, actually, it was largely unknown outside of the medical imaging community. And what happened was Kaggle competitions for segmentation kept on being easily won by people using U-Nets. And that was the first time I saw it getting noticed outside the medical imaging community. And then, gradually, a few people in the academic machine learning community started noticing, and now everybody loves U-Net, which I'm glad about, because it's just awesome. So identity connections, regardless of whether they're a plus style or a concat style, are incredibly useful. They can basically get us close to the state of the art on lots of important tasks.
So I want to use them on another task now. And the next task I want to look at is image restoration. So image restoration refers to starting with an image, and this time we're not going to create a segmentation mask, but we're going to try and create a better image. And there are lots of versions of better. There could be a different image.
So, the kind of things we can do with this kind of image generation would be: take a low res image, make it high res; take a black and white image, make it color; take an image where something's been cut out of it and try and replace the cut out thing; take a photo and try and turn it into what looks like a line drawing; take a photo and try and make it look like a Monet painting. These are all examples of image-to-image generation tasks which you'll know how to do after this part of the class. So, in our case, we're going to try to do image restoration, which is going to start with low resolution, poor quality JPEGs with writing written over the top of them, and get the model to replace them with high resolution, good quality pictures in which the text has been removed.
Two questions? Okay, let's go. Why do you concat before calling conv2 of conv1, not after? Because if you did your convs before you concat, then there's no way for the channels of the two parts to interact with each other. Remember, in a 2D conv, it's really 3D, right? It's moving across two dimensions, but in each case, it's doing a dot product of all three dimensions of a rank three tensor, row by column by channel. So generally speaking, we want as much interaction as possible. We want to say, this part of the downsampling path and this part of the upsampling path, if you look at the combination of them, you find these interesting things. So generally, you want to have as many interactions going on as possible in each computation that you do.
How does concatenating every layer together in a DenseNet work when the size of the image feature maps is changing through the layers? That's a great question. So if you have a stride 2 conv, you can't keep DenseNetting, right?
So what actually happens in a DenseNet is you go dense block growing, dense block growing, dense block growing, so you're getting more and more channels. And then you do a stride 2 conv without a dense block. And so now it's kind of gone. And then you just do a few more dense blocks and then it's gone. So in practice, a dense block doesn't actually keep all the information all the way through, but just up until every one of these stride 2 convs. And there are various ways of doing these bottlenecking layers where you're basically saying, hey, let's reset. It also helps us keep memory under control, because at that point we can decide how many channels we actually want. Good questions. Thank you.
Right. So in order to create something which can turn crappy images into nice images, we need a dataset containing nice versions of images and crappy versions of the same images. So the easiest way to do that is to start with some nice images and crappify them. And the way to crappify them is to create a function called crappify, which contains your crappification logic. My crappification logic, and you can pick your own, is that I open up my nice image. I resize it to be really small, 96 by 96 pixels, with bilinear interpolation. I then pick a random number between 10 and 70. I draw that number into my image at some random location. And then I save that image with a JPEG quality of that random number. A JPEG quality of 10 is absolute rubbish. A JPEG quality of 70 is not bad at all. So I end up with high quality images and low quality images that look something like these. And so you can see this one, there's the image. And this is after transformation, so that's why it's been flipped. And you won't always see the number, because we're zooming into them. So a lot of the time the number is cropped out.
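The crappification recipe above can be sketched with Pillow as follows. This is an illustrative version, not the notebook's function: it works in memory rather than on files, it squashes the image to exactly 96 by 96 instead of preserving the aspect ratio, and the choice of drawing the number somewhere in the top-left quadrant in white is an assumption.

```python
import io
import random
from PIL import Image, ImageDraw

def crappify(img, size=96, rng=random):
    """Return (JPEG bytes of the degraded image, JPEG quality used)."""
    # 1. Shrink the nice image with bilinear interpolation.
    small = img.convert("RGB").resize((size, size), resample=Image.BILINEAR)
    # 2. Pick a random number between 10 and 70.
    quality = rng.randint(10, 70)
    # 3. Draw that number at a random location (assumed: top-left quadrant, white text).
    w, h = small.size
    xy = (rng.randint(0, w // 2), rng.randint(0, h // 2))
    ImageDraw.Draw(small).text(xy, str(quality), fill=(255, 255, 255))
    # 4. Save as JPEG using the same number as the quality setting.
    buf = io.BytesIO()
    small.save(buf, format="JPEG", quality=quality)
    return buf.getvalue(), quality

# Usage on a synthetic "nice" image, so the example needs no files on disk.
nice = Image.new("RGB", (400, 300), color=(120, 80, 200))
jpeg_bytes, q = crappify(nice)
crappy = Image.open(io.BytesIO(jpeg_bytes))
print(crappy.size, crappy.format, q)
```

Swapping the body of `crappify` for a different degradation (grayscale conversion, a black box, simulated creases) changes what the restoration model will learn to undo.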
So yeah, it's trying to figure out how to take this incredibly JPEG-artifacted thing with text written over the top and turn it into this. So I'm using the Oxford Pets dataset again, the same one we used in lesson one. There's nothing more high quality than pictures of dogs and cats. I think we can all agree with that. The crappification process can take a while, but FastAI has a function called parallel. And if you pass parallel a function name and a list of things to run that function on, it will run that function on them all in parallel. So this actually can run pretty quickly. The way you write this function is where you get to do all the interesting stuff in this assignment. Try and think of an interesting crappification which does something that you want to do, right? So if you want to colorize black and white images, you would replace it with black and white. If you want something which can take large cutout blocks of image and replace them with hallucinated image, add a big black box to these. If you want something which can take old family photo scans that have been folded up and have crinkles in them, try and find a way of adding dust prints and crinkles and so forth, right? Anything that you don't include in crappify, your model won't learn to fix, because every time it sees that in your photos, the input and output will be the same, so it won't consider that to be something worthy of fixing.
Okay, so we now want to create a model which can take an input photo that looks like that and output something that looks like that. So obviously what we want to do is use a U-Net, because we already know that U-Nets can do exactly that kind of thing, and we just need to pass the U-Net that data, okay? So our data is just literally the file names from each of those two folders. Do some transforms, data bunch, normalize. We'll use ImageNetStats because we're going to use a pre-trained model.
Why are we using a pre-trained model? Well, because if you're going to get rid of this 46, you need to know what probably was there, and to know what probably was there, you need to know what this is a picture of, right? Because otherwise how can you possibly know what it ought to look like? So let's use a pre-trained model that knows about these kinds of things. So we create our U-Net with that data. The architecture is ResNet34. These three things are important and interesting and useful, but I'm going to leave them to part two, okay? For now, you should always include them when you use a U-Net for this kind of problem. And this whole thing I'm calling a generator, okay? It's going to generate. This is generative modeling. There's not a really formal definition, but it's basically something where the thing we're outputting is like a real object, in this case, an image. It's not just a number. So we're going to create a generator learner, which is this U-Net learner, and then we can fit. We're using MSE loss, right? So in other words, what's the mean squared error between the actual pixel value that it should be and the pixel value that we predicted? MSE loss normally expects two vectors. In our case, we have two images. So we have a version called MSE loss flat, which simply flattens out those images into a big, long vector. There's never any reason not to use this. Even if you do have a vector, it works fine. If you don't have a vector, it'll also work fine. So we're already down to 0.05 mean squared error on the pixel values, which is not bad, after 1 minute 35. Like all things in FastAI, pretty much, because we're doing transfer learning by default, when you create this, it'll freeze the pre-trained part, and the pre-trained part of a U-Net is this part, the downsampling part. So let's unfreeze that and train a little more and look at that.
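The flattening idea behind MSE loss flat can be sketched in a few lines of PyTorch. This is a stand-in written for illustration, not the FastAI implementation, and the name `mse_loss_flat` is made up here.

```python
import torch
import torch.nn.functional as F

def mse_loss_flat(pred, target):
    """Flatten both tensors into long vectors, then take the mean squared error."""
    return F.mse_loss(pred.reshape(-1), target.reshape(-1))

# Two batches of "images": batch of 2, 3 channels, 4 by 4 pixels.
pred = torch.zeros(2, 3, 4, 4)
target = torch.ones(2, 3, 4, 4)
loss = mse_loss_flat(pred, target)
print(loss.item())  # every pixel is off by exactly 1, so the mean squared error is 1.0

# It behaves the same way when the inputs are already vectors.
vec_loss = mse_loss_flat(torch.tensor([1.0, 2.0]), torch.tensor([1.0, 4.0]))
print(vec_loss.item())  # (0**2 + 2**2) / 2 = 2.0
```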
So with four minutes of training, we've got something which is basically doing a perfect job of removing numbers. It's certainly not doing a good job of upsampling, but it's definitely doing a nice job. Sometimes when it removes a number, it maybe leaves a little bit of JPEG artifact, but it's certainly doing something pretty useful. And so if all we wanted to do was watermark removal, we'd be finished. We're not finished, because we actually want this thing to look more like this thing. So how are we going to do that? The reason that we're not making as much progress with that as we'd like is that our loss function doesn't really describe what we want, because actually the mean squared error between the pixels of this and this is very small, right? If you actually think about it, most of the pixels are very nearly the right color. But we're missing the texture of the pillow, and we're missing the eyeballs pretty much entirely, right? And we're missing the texture of the fur, right? So we want some loss function that does a better job than pixel mean squared error loss of saying, is this a good quality picture of this thing?
So there's a fairly general way of answering that question, and it's something called a generative adversarial network, or GAN. And a GAN tries to solve this problem by using a loss function which actually calls another model. Let me describe it to you. So we've got our crappy image, right? And we've already created a generator. It's not a great one, but it's not terrible, right? And that's creating predictions, like this. We have a high-res image like that, and we can compare the high-res image to the prediction with pixel MSE. Okay. We could also train another model which we would variously call either the discriminator or the critic. They both mean the same thing. I'll call it a critic.
We could try and build a binary classification model that takes all the pairs of the generated image and the real high-res image and learns to classify which is which. So look at some picture and say, hey, what do you think? Is that a high-res cat or is that a generated cat? How about this one? Is that a high-res cat or a generated cat? So just a regular standard binary cross-entropy classifier. We know how to do that already. So if we had one of those, we could fine-tune the generator, and rather than using pixel MSE as the loss, the loss could be how good are we at fooling the critic? So can we create generated images that the critic thinks are real? So that would be a very good plan, right? Because if the loss function is, am I fooling the critic, then it's going to learn to create images where the critic can't tell whether they're real or fake. So we could do that for a while, train a few batches. But the critic isn't that great. The reason the critic isn't that great is because it wasn't that hard. These images are really shitty, so it's really easy to tell the difference. So after we train the generator a little bit more using the critic as the loss function, the generator is going to get really good at fooling the critic. So now we're going to stop training the generator, and we'll train the critic some more on these newly generated images. Now that the generator is better, it's a tougher task for the critic to decide which is real and which is fake. So we'll train that a little bit more. And then once we've done that, and the critic's now pretty good at recognizing the difference between the better generated images and the originals, we'll go back and we'll fine-tune the generator some more using the better discriminator, the better critic, as the loss function. And so we'll just go ping-pong, ping-pong, backwards and forwards. That's a GAN. Well, that's our version of a GAN.
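The ping-pong schedule can be sketched with a toy PyTorch loop. Everything here is a stand-in chosen so the example runs in a second on a CPU: the "images" are 8-element vectors, the generator is a single linear layer rather than a U-Net, neither model is pre-trained, and the fixed five steps per turn are an arbitrary assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

generator = nn.Linear(8, 8)   # stand-in for the U-Net generator
critic = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
bce = nn.BCEWithLogitsLoss()  # ordinary binary cross-entropy on logits
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def get_batch(n=32):
    real = torch.randn(n, 8)                  # "high-res" targets
    crappy = real + 0.5 * torch.randn(n, 8)   # degraded inputs
    return crappy, real

def train_critic(steps):
    for _ in range(steps):
        crappy, real = get_batch()
        fake = generator(crappy).detach()     # the generator is not updated here
        preds = critic(torch.cat([real, fake]))
        labels = torch.cat([torch.ones(len(real), 1), torch.zeros(len(fake), 1)])
        loss = bce(preds, labels)             # classify real (1) versus generated (0)
        opt_c.zero_grad()
        loss.backward()
        opt_c.step()
    return loss.item()

def train_generator(steps):
    for _ in range(steps):
        crappy, _ = get_batch()
        preds = critic(generator(crappy))
        # The generator's loss is low when the critic labels its output as real.
        loss = bce(preds, torch.ones(len(crappy), 1))
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()                          # only generator weights are stepped
    return loss.item()

history = []
for _ in range(3):                            # ping-pong between the two models
    history.append((train_critic(5), train_generator(5)))
print(history)
```

In this sketch the generator is trained on the critic's verdict alone, which is the plan as described so far.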
I don't know if anybody's written this before. We've created a new version of a GAN, which is a lot like the original GANs, but we have this neat trick where we pre-train the generator and we pre-train the critic. I mean, GANs have been in the news a lot. They're a pretty fashionable tool. And if you've seen them, you may have heard that they're a real pain to train. But it turns out, we realized, that really most of the pain of training them was at the start. If you don't have a pre-trained generator and you don't have a pre-trained critic, then it's basically the blind leading the blind. The generator is trying to generate something which fools a critic, but the critic doesn't know anything at all, so it's basically got nothing to do. And then the critic is trying to decide whether the generated images are real or not, and that's really obvious, so it just does it. And so they don't go anywhere for ages. And then once they finally start picking up steam, they go along pretty quickly. So if you can find a way to generate things without using a GAN, like mean squared error pixel loss, and discriminate things without using a GAN, like predicting on that first generator, you can make a lot of progress.
So let's create the critic. To create just a totally standard FastAI binary classification model, we need two folders. One folder containing high res images. One folder containing generated images. We already have the folder with high res images, so we just have to save our generated images. So here's a teeny, tiny bit of code that does that. We're going to create a directory called image_gen. Pop it into a variable called path_gen. We've got a little function called save_preds that takes a data loader. And we're going to grab all of the file names. Because remember that in an item list, the dot items contains the file names if it's an image item list.
So here are the file names in that data loader's dataset. And so now let's go through each batch of the data loader. And let's grab a batch of predictions for that batch. And then reconstruct equals true means it's actually going to create FastAI image objects for each thing in the batch. And so then we'll go through each of those predictions and save them. And the name we'll save it with is the name of the original file, but we're going to pop it into our new directory. So that's it. That's how you save predictions. And so you can see I'm increasingly not just using stuff that's already in the FastAI library, but trying to show you how to write stuff yourself, right? And generally it doesn't require heaps of code to do that. And so if you come back for part two, lots of part two will be, here's how you use things inside the library, and of course, here's how we wrote the library. So increasingly writing our own code.
Okay, so save those predictions. And then let's just do a pil.image.open on the first one. And yep, there it is. Okay, so there's an example of a generated image. So now I can train a critic in the usual way. It's really annoying to have to restart your Jupyter notebook to reclaim GPU memory. So one easy way to handle this is if you just set something that you knew was using a lot of GPU memory to none, like this learner, and then just go gc.collect. That tells Python to do memory garbage collection. And after that, you'll generally be fine. You'll be able to use all of your GPU memory again. If you're using nvidia-smi to actually look at your GPU memory, you won't see it clear, because PyTorch still has a kind of allocated cache, but it makes it available. So you should find this is how you can avoid restarting your notebook.
Okay, so we're going to create our critic. It's just an image item list from folder in the totally usual way. And the classes will be image_gen and images.
We'll do a random split, because we want to know how well we're doing with the critic, so we need a validation set. We just label it from folder in the usual way. Add some transforms, data bunch, normalize. So we've got a totally standard classifier. So here's what some of it looks like. Here's one from the real images, real images, generated images, generated images. Okay, so it's got to try and figure out which class is which. So we're going to use binary cross entropy as usual. However, we're not going to use a ResNet here. And the reason we'll get into in more detail in part two, but basically when you're doing a GAN, you need to be particularly careful that the generator and the critic can't both push in the same direction and increase the weights out of control. So we have to use something called spectral normalization to make GANs work nowadays. We'll learn about that in part two. But anyway, if you say GAN critic, FastAI will give you a binary classifier suitable for GANs. I strongly suspect we probably can use a ResNet here. We just have to create a pre-trained ResNet with spectral norm. Hope to do that pretty soon. We'll see how we go. But as of now, this is kind of the best approach, this thing called GAN critic. A GAN critic uses a slightly different way of averaging the different parts of the image when it does the loss. So anytime you're doing a GAN at the moment, you have to wrap your loss function with adaptive loss. Again, we'll look at the details in part two. For now, just know this is what you have to do and it'll work. So other than that slightly odd loss function and that slightly odd architecture, everything else is the same. We can call that to create our critic. Because we have this slightly different architecture and slightly different loss function, we did a slightly different metric. This is the equivalent GAN version of accuracy for critics.
And then we can train it. And you can see it's 98% accurate at recognizing that kind of crappy thing from that kind of nice thing. And of course, we don't see the numbers here anymore, right? Because these are the generated images. The generator already knows how to get rid of those numbers that are written on top.
Okay. So let's finish up this GAN. Now that we have pre-trained the generator and pre-trained the critic, we need to get it to ping-pong between training a little bit of each. And the amount of time you spend on each of those things and the learning rates you use is still a little bit on the fussy side. So we've created a GAN learner for you, which you just pass your generator and your critic, which we've simply loaded here from the ones we just trained. And when you go learn.fit, it will do that for you. It'll figure out how much time to train the generator, and then when to switch to training the discriminator, the critic, and it'll go backwards and forwards. As for these weights here, what we actually do is we don't only use the critic as the loss function. If we only used the critic as the loss function, the GAN could get very good at creating pictures that look like real pictures, but they actually have nothing to do with the original photo at all. So we actually add together the pixel loss and the critic loss. And those two losses are on different scales, so we multiply the pixel loss by something between about 50 and about 200. Something in that range generally works pretty well.
Something else with GANs. GANs hate momentum when you're training them. It kind of doesn't make sense to train them with momentum, because you keep switching between generator and critic. So it's kind of tough. Maybe there are ways to use momentum, but I'm not sure anybody's figured it out. So this number here, when you create an Adam optimizer, is where the momentum goes. So you should set that to zero.
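The two settings just described, a weighted sum of critic loss and pixel loss and an Adam optimizer with its momentum term set to zero, can be sketched in plain PyTorch. The function name `generator_loss`, the single conv layer used as a generator, and the second beta of 0.99 are assumptions for illustration; the lecture only fixes the first beta at zero and the pixel weight at somewhere between about 50 and 200.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def generator_loss(critic_logits, output, target, pixel_weight=50.0):
    """Critic loss plus pixel loss, with the pixel term scaled up to a similar size."""
    # How far the critic is from calling the generated images real.
    fool = F.binary_cross_entropy_with_logits(critic_logits, torch.ones_like(critic_logits))
    # How far the generated images are from the actual targets.
    pixel = F.mse_loss(output, target)
    return fool + pixel_weight * pixel

gen = nn.Conv2d(3, 3, 3, padding=1)   # stand-in generator
# In Adam, the first beta plays the role of momentum, so it is set to zero here.
opt = torch.optim.Adam(gen.parameters(), lr=1e-4, betas=(0.0, 0.99))

target = torch.rand(2, 3, 8, 8)
output = gen(torch.rand(2, 3, 8, 8))
critic_logits = torch.zeros(2, 1)     # a critic that is completely unsure
loss = generator_loss(critic_logits, output, target)
loss.backward()
opt.step()
print(loss.item())
```

Without the pixel term, nothing in the loss would tie the generated picture to the particular photo that was fed in.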
So anyway, if you're doing GANs, use these hyperparameters. It should work. Okay, so that's what GAN learner does. And so then you can go fit, and it trains for a while. And one of the tough things about GANs is that these loss numbers are meaningless. You can't expect them to go down, right? Because as the generator gets better, it gets harder for the discriminator, the critic. And then as the critic gets better, it gets harder for the generator. So the numbers should stay about the same, right? So that's one of the tough things about training GANs: it's hard to know how they are doing. The only way to know how they are doing is to actually take a look at the results from time to time. And so if you put show image equals true here, it'll actually print out a sample after every epoch. I haven't put that in the notebook because it makes it too big for the repo, so you can try that. So I've just put the results at the bottom, and here it is. So pretty beautiful, I would say. We already knew how to get rid of the numbers, but we now don't really have that kind of artifact of where it used to be. And it's definitely sharpening up this little kitty cat quite nicely. It's not always great. There's some weird kind of noise going on here. Certainly a lot better than the horrible original. This is a tough job, to turn that into that. But there are some really obvious problems. Like here, these things ought to be eyeballs, and they're not. So why aren't they? Well, our critic doesn't know anything about eyeballs. And even if it did, it wouldn't know that eyeballs are particularly important. You know, we care about eyes. When we see a cat without eyes, it's a lot less cute. I mean, I'm more of a dog person. But it just doesn't know that this is a feature that matters, particularly because the critic, remember, is not a pre-trained network.
So I kind of suspect that if we replace the critic with a pre-trained network that's been pre-trained on ImageNet but is also compatible with GANs, it might do a better job here. But it's definitely a shortcoming of this approach. So we're going to have a break. Oh, question first. And then we'll have a break. And then after the break, I will show you how to find the cat's eyeballs again.
For what kind of problems do you not want to use U-Nets? Well, U-Nets are for when the size of your output is similar to the size of your input and kind of aligned with it. There's no point having cross connections if that level of spatial resolution in the output isn't necessary or useful. So, yeah, any kind of generative modeling, and segmentation is kind of generative modeling. It's generating a picture which is a mask of the original objects. So probably anything where you want the resolution of the output to be the same kind of fidelity as the resolution of the input. Obviously, something like a classifier makes no sense. In a classifier, you just want the downsampling path, because at the end you just want a single number that says, is it a dog or a cat, or what kind of pet is it, or whatever. Great, okay, so let's get back together at 5 past 8.
Just before we leave GANs, I'll just mention there's another notebook you might be interested in looking at, which is lesson 7 WGAN. When GANs started a few years ago, people generally used them to create images out of thin air, which I personally don't think is a particularly useful or interesting thing to do. But it's a good research exercise, I guess. So we implemented this WGAN paper, which was really the first one to do a somewhat adequate job somewhat easily. And so you can see how to do that with the FastAI library.
It's kind of interesting because the dataset we use is this LSUN bedrooms dataset, which we provided in our URLs, which just, as you can see, has bedrooms. Lots and lots and lots of bedrooms. And the approach, you'll see in the prose here that Sylvain wrote, the approach that we use in this case is to just say, can we create a bedroom? And so what we actually do is that the input to the generator isn't an image that we clean up. We actually feed the generator random noise. And so then the generator's task is, can you turn random noise into something where the critic can't tell the difference between that output and a real bedroom? And so we're not doing any pre-training here or any of the stuff that makes this fast and easy. So this is a very traditional approach, but you can see you still just go GAN learner, and there's actually a WGAN version, which is this kind of older style approach. You just pass in the data and the generator and the critic in the usual way, and you call fit. And you'll see, in this case we have show image on, after epoch one it's not creating great bedrooms, or two, or three. And you can really see that in the early days of these kinds of GANs, it doesn't do a great job of anything. But eventually, after a couple of hours of training, it's producing somewhat bedroom-ish things. So anyway, it's a notebook you can have a play with, and it's a bit of fun.
So I was very excited when we got FastAI to the point in the last week or so that we had GANs working in a way where, API-wise, they're far more concise and more flexible than any other library that exists. But also kind of disappointed with them. They take a long time to train and the outputs are still so-so, and so the next step was, well, can we get rid of GANs entirely? So the first step with that, I mean, obviously the thing we really want to do is come up with a better loss function.
We want a loss function that does a good job of saying this is a high-quality image without having to go to all the GAN trouble, and preferably it also doesn't just say it's a high-quality image, but it's an image which actually looks like the thing it's meant to. So the real trick here comes back to this paper from a couple of years ago, Perceptual Losses for Real-Time Style Transfer and Super-Resolution. Justin Johnson et al. created this thing they call perceptual losses. It's a nice paper, but I hate this term, because there's nothing particularly perceptual about them. I would call them feature losses. So in the FastAI library, you'll see this referred to as feature losses. And it shares something with GANs, which is that after we go through our generator, which they call the image transform net, and you can see it's got this kind of U-Net shaped thing. They didn't actually use U-Nets, because at the time this came out, nobody in the machine learning world knew about U-Nets. Nowadays, of course, we use U-Nets. But anyway, something U-Net-ish. I should mention, in these architectures where you have a downsampling path followed by an upsampling path, the downsampling path is very often called the encoder. As you saw in our code, actually, we call that the encoder. And the upsampling path is very often called the decoder. In generative models generally, including generative text models, neural translation, stuff like that, they tend to be called the encoder and the decoder. Two pieces. Anyway, so we have this generator, and we want a loss function that says, is the thing that it's created like the thing that we want? And so the way they do that is they take the prediction. Remember, y hat is what we normally use for a prediction from a model. We take the prediction and we put it through a pre-trained ImageNet network. So at the time that this came out, the pre-trained ImageNet network they were using was VGG.
It's kind of old now, but people still tend to use it because it works fine for this process. So they take the prediction and they put it through VGG, the pre-trained ImageNet network. It doesn't matter too much which one it is. And so normally the output of that would tell you, hey, is this generated thing, you know, a dog or a cat or an airplane or a fire engine or whatever, right? But in the process of getting to that final classification, it goes through lots of different layers. And in this case, they've color coded all the layers with the same grid size in the feature map with the same color. So every time we switch colors, we're switching grid size. So there's a stride 2 conv, or in VGG's case, they still used to use max pooling layers, which is kind of a similar idea. And so what we could do is say, hey, let's not take the final output of the VGG model on this generated image, but let's take something in the middle. Let's take the activations of some layer in the middle. So those activations, you know, might be a feature map of like 256 channels by 28 by 28, say. And so those 28 by 28 grid cells will kind of roughly, semantically say things like, hey, in this part of that 28 by 28 grid, is there something that looks kind of furry, or is there something that looks kind of shiny, or is there something that looks kind of circular, or is there something that kind of looks like an eyeball, or whatever. So what we do is we then take the target, so the actual y value, and we put it through the same pre-trained VGG network and we pull out the activations at the same layer, and then we do a mean squared error comparison. So it'll say like, okay, in the real image, grid cell (1, 1) of that 28 by 28 feature map is furry and blue and round shaped, and in the generated image, it's furry and blue and not round shaped. So it's kind of like an okay match.
So that ought to go a long way towards fixing our eyeball problem, because in this case, the feature map is going to say, there's eyeballs here, sorry, here, but there isn't here. So do a better job of that, please. Make better eyeballs. So that's the idea, okay? And so that's what we call feature losses, or Johnson et al. called perceptual losses. So to do that, we're going to use the lesson 7 super-res notebook. And this time, the task we're going to do is kind of the same as the previous task, but I wrote this notebook a little bit before the GAN notebook, before I came up with the idea of putting text on it and having a random JPEG quality. So the JPEG quality is always 60. There's no text written on top, and it's 96 by 96. And it's before I realized what a great word crappify is, so it's called resize. So here's our crappy images and our original images. Kind of a similar task to what we had before. So I'm going to try and create a loss function which does this. So the first thing I do is I define a base loss function, which is basically, how am I going to compare the pixels and the features? And the choices mainly are MSE or L1. It doesn't matter too much which you choose. I tend to like L1 better than MSE, actually. So I picked L1. So anytime you see base loss, we mean L1 loss. You could use MSE loss as well. So let's create a VGG model. So just using the pre-trained model. In VGG, there's an attribute called .features, which contains the convolutional part of the model. So here's the convolutional part of the VGG model, because we don't need the head, because we only want the intermediate activations. So then we'll chuck that on the GPU. We'll put it into eval mode because we're not training it. And we'll turn off requires grad because we don't want to update the weights of this model. We're just using it for inference, right, for the loss.
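As a rough illustration of the "eval mode plus requires grad off" step just described, here is a minimal sketch. It assumes a tiny hand-built convolutional stack as a hypothetical stand-in for VGG's .features (so that nothing has to be downloaded); the layer sizes and variable names are made up for the example and are not the notebook's.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the convolutional part of a pre-trained network
# (in the lesson this role is played by VGG's `.features`).
features = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)

features.eval()                  # inference behaviour for layers such as dropout or batch norm
for p in features.parameters():
    p.requires_grad_(False)      # no gradients are stored for these weights

# Plays the role of a generated image: it does need gradients.
x = torch.randn(2, 3, 32, 32, requires_grad=True)
out = features(x)
out.mean().backward()            # gradients still flow back through the frozen net to x
```

The point of the last two lines is that freezing the weights does not stop gradients from reaching the input, which is what lets a frozen network act as part of a loss function for the generator.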
So then let's enumerate through all the children of that model and find all of the max pooling layers, because in the VGG model, that's where the grid size changes. And as you can see from this picture, we kind of want to grab features from every time just before the grid size changes. So we grab layer i minus 1. So that's the layer before it changes. So there's our list of layer numbers just before the max pooling layers. And so all of those are ReLUs, not surprisingly. So those are where we want to grab some features from. And so we put that in blocks. It's just a list of IDs. So here's our feature loss class, which is going to implement this idea. So basically, when we call the feature loss class, we're going to pass it some pre-trained model. And so that's going to be called m_feat. That's the model which contains the features which we want our feature loss on. So we can go ahead and grab all of the layers from that network that we want the features for, to create the losses. So we're going to need to hook all of those outputs, because remember, that's how we grab intermediate layers in PyTorch, by hooking them. So this is going to contain our hooked outputs. So now in the forward of feature loss, we're going to call make features passing in the target. So this is our actual y, which is just going to call that VGG model and go through all of the stored activations and just grab a copy of them. And so we're going to do that both for the target, call that out_feat, and for the input, so that's the output of our generator, in_feat. And so now let's calculate the L1 loss between the pixels, because we still want the pixel loss a little bit. And then let's also go through all of those layers' features and get the L1 loss on them. So we're basically going through every one of these, the end of each block, and grabbing the activations and getting the L1 on each one.
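To make the steps above concrete, here is a simplified sketch of the idea, not the notebook's actual class. It assumes a small hand-built network as a hypothetical stand-in for the frozen VGG features, uses plain PyTorch forward hooks, and leaves out any per-layer weighting; the class name FeatureLossSketch and the layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for a frozen, pre-trained feature extractor.
m_feat = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
).eval()
for p in m_feat.parameters():
    p.requires_grad_(False)

# Index of the layer just before each max pooling layer (layer i - 1).
layer_ids = [i - 1 for i, layer in enumerate(m_feat.children())
             if isinstance(layer, nn.MaxPool2d)]


class FeatureLossSketch(nn.Module):
    def __init__(self, m_feat, layer_ids):
        super().__init__()
        self.m_feat = m_feat
        self.layer_ids = layer_ids
        self.stored = {}   # hooked outputs, keyed by layer index
        self.handles = [m_feat[i].register_forward_hook(self._make_hook(i))
                        for i in layer_ids]

    def _make_hook(self, i):
        def hook(module, inputs, output):
            self.stored[i] = output
        return hook

    def make_features(self, x, clone=False):
        self.m_feat(x)     # the forward pass fills self.stored via the hooks
        return [self.stored[i].clone() if clone else self.stored[i]
                for i in self.layer_ids]

    def forward(self, pred, target):
        out_feat = self.make_features(target, clone=True)   # features of the real image
        in_feat = self.make_features(pred)                   # features of the generated image
        # Pixel loss first, then one L1 loss per hooked layer.
        self.feat_losses = [F.l1_loss(pred, target)]
        self.feat_losses += [F.l1_loss(f_in, f_out)
                             for f_in, f_out in zip(in_feat, out_feat)]
        return sum(self.feat_losses)


loss_fn = FeatureLossSketch(m_feat, layer_ids)
pred = torch.rand(2, 3, 32, 32, requires_grad=True)   # pretend generator output
target = torch.rand(2, 3, 32, 32)
loss = loss_fn(pred, target)
loss.backward()
```

The target's features are cloned because the second forward pass overwrites what the hooks stored. Keeping the individual terms in a list is what makes it possible to report each layer's contribution separately.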
So that's going to end up in this list called feature losses, which I then sum up. And by the way, the reason I do it as a list is because we've got this nice little callback where, if you put them into a thing called .metrics in your loss function, it'll print out all of the separate layer loss amounts for you, which is super handy. So that's it. That's our perceptual loss or feature loss class. And so now we can just go ahead and train a U-Net in the usual way with our data and our pre-trained architecture, which is a ResNet 34, passing in our loss function, which is using our pre-trained VGG model. And this is that callback I mentioned, loss metrics, which is going to print out all the different layers' losses for us. These are two things that we'll learn about in part two of the course, but you should use them. LR find. I just created a little function called do fit that does fit one cycle and then saves the model and then shows the results. So as per usual, because we're using a pre-trained network in our U-Net, we start with frozen layers for the down-sampling path, train for a while, and as you can see, we get not only the loss, but also the pixel loss and the loss at each of our feature layers. And then also something we'll learn about in part two called gram loss, which I don't think anybody's used for super-res before as far as I know, but as you'll see, it turns out great. So that's eight minutes. So much faster than a GAN. And already, as you can see, this is our model's output, pretty good. So then we unfreeze and train some more and it's a little bit better. And then let's switch up to double the size, and so we need to also halve the batch size to avoid running out of GPU memory, and freeze again and train some more. So it's now taking half an hour, even better. And then unfreeze and train some more. So all in all, we've done about an hour and 20 minutes of training. And look at that. It's done it. It knows that eyes are important.
So it's really made an effort. It knows that fur is important. So it's really made an effort. So it started with something with JPEG artifacts around the ears and all this mess and eyes that are just kind of vague, light blue things, and it really created a lot of texture. This cat is clearly kind of looking over the top of one of those little clawing frames covered in fuzz. So it actually recognized that this thing is probably a kind of carpeting material, and it's created a carpeting material for us. So I mean, that's just remarkable. So talking of remarkable, we can now... So I've never seen outputs like this before without a GAN. So I was just so excited when we were able to generate this, and so quickly: one GPU, an hour and a half. So if you create your own crappification functions and train this model, you'll build stuff that nobody's built before, because nobody else that I know of is doing it this way. So there are huge opportunities, I think. So check this out. What we can now do is, instead of starting with our low res, I actually stored another set at size 256 which are called medium res. So let's see what happens if we upsize a medium res. So we're going to grab our medium res data, and here is our medium res stored photo, and so can we improve this? So you can see there's still a lot of room for improvement. Like you see the lashes here are very pixelated. The place where there should be hair here is just kind of fuzzy. So watch this area as I hit down on my keyboard. Bump. Look at that. It's done it. It's taken a medium res image and it's made a totally clear thing here. The fur's reappeared. Look at the eyeball. Let's go back. The eyeball here is just kind of a general blue thing. Here it's added all the right texture. So I just think this is super exciting. Here's a model I trained in an hour and a half using standard stuff that you've all learned about.
A U-Net, a pre-trained model, a feature loss function, and we've got something which can turn that into that, or this absolute mess into this. It's really exciting to think what you could do with that. So one of the inspirations here has been a guy called Jason Antic. Jason was a student in the course last year, and what he did very sensibly was decide to focus: he basically nearly quit his job and worked four days a week, or really six days a week, on studying deep learning, and as you should do, he created a kind of capstone project. His project was to combine GANs and feature losses together, and his crappification approach was to take color pictures and make them black and white. So he took the whole of ImageNet, created a black and white ImageNet, and then trained a model to recolorize it. He's put this up as DeOldify, and now he's got these actual old photos from the 19th century that he's turning into color, and what this is doing is incredible. Like look at this, the model thought, oh, that's probably some kind of copper kettle, so I'll make it copper colored, and oh, these pictures are on the wall, they're probably different colors to the wall, and maybe that looks a bit like a mirror, maybe it would be reflecting stuff outside. These things might be vegetables, vegetables are often red, let's make them red. It's extraordinary what it's done, and you could totally do this too. You can take our feature loss and our GAN loss and combine them. So I'm very grateful to Jason because he's helped us build this lesson, and it's been really nice because we've been able to help him too, because he hadn't realized that he can use all this pre-training and stuff. And so hopefully you'll see DeOldify in the next couple of weeks be even better at de-oldification, but hopefully you all can now add other kinds of decrappification methods as well.
So I like every course, if possible, to show something totally new, because then every student has the chance to basically build things that have never been built before, so this is kind of that thing. But between the much better segmentation results and these much simpler and faster decrappification results, I think you can build some really cool stuff. Did you have a question? Is it possible to use similar ideas to U-Net and GANs for NLP? For example, if I want to tag the verbs and nouns in a sentence or create a really good Shakespeare generator? Yeah, pretty much. We don't fully know yet. It's a pretty new area, but there's a lot of opportunities there and we'll be looking at some in a moment, actually. So I actually tried training this... Well, I actually tried testing this on this... Remember this picture I showed you on a slide last lesson? It's a really rubbishy-looking picture, and I thought, what would happen if we tried running this just through the exact same model? And it changed it from that to that. So I thought that was a really good example. You can see something it didn't do, which is this weird discoloration. It didn't fix it because I didn't crappify things with weird discoloration, right? So if you want to create really good image restoration, like I say, you need really good crappification. Okay, so here's what we've learned so far, right, in the course, some of the main things. So we've learned that neural nets consist of sandwiched layers of affine functions, which are basically matrix multiplications, or a slightly more general version, and nonlinearities, like ReLU.
And we learned that the results of those calculations are called activations, and the things that go into those calculations that we learn are called parameters, and that the parameters are initially randomly initialized or we copy them over from a pre-trained model, and then we train them with SGD or faster versions. And we learned that convolutions are a particular affine function that work great for auto-correlated data, so things like images and stuff. We learned about batch norm, dropout, data augmentation and weight decay as ways of regularizing models, and also that batch norm helps train models more quickly. And then today we've learned about Res/Dense blocks. We've learned a lot about image classification and regression, embeddings, categorical and continuous variables, collaborative filtering, language models and NLP classification, and then kind of segmentation, U-Net and GANs. So go over these things and make sure that you feel comfortable with each of them. If you've only watched this series once, you definitely won't. People normally watch it, you know, three times or so to really understand the detail. So one thing that doesn't appear here is RNNs. So that's the last thing we're going to do, RNNs. So for RNNs, I'm going to introduce a little kind of diagrammatic method here to explain them. And with the diagrammatic method, I'll start by showing you a basic neural net with a single hidden layer. A rectangle means an input. So that'll be batch size by number of inputs. An arrow means a layer, broadly defined, such as a matrix product followed by ReLU. A circle is activations. Okay, so in this case we have one set of hidden activations, and so given that the input was number of inputs, this here is a matrix of number of inputs by number of activations, so the output will be batch size by number of activations. It's really important you know how to calculate these shapes, right?
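The shape calculation just described can be checked directly. This is a small sketch with made-up sizes (64, 10 and 50 are illustrative, not taken from the lesson), using plain tensors rather than a real layer.

```python
import torch

bs, n_in, n_act = 64, 10, 50     # illustrative sizes only

x = torch.randn(bs, n_in)        # rectangle: the input, batch size by number of inputs
w = torch.randn(n_in, n_act)     # arrow: weight matrix, number of inputs by number of activations
h = torch.relu(x @ w)            # circle: hidden activations, batch size by number of activations

print(h.shape)
```

The inner dimension (number of inputs) disappears in the matrix product, which is why the activations end up as batch size by number of activations.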
So go learn.summary, lots, to see all the shapes. So then here's another arrow, so that means it's another layer, a matrix product followed by a non-linearity. In this case we're going to the output, so we use softmax, and then a triangle means an output. Okay, and so this matrix product will be number of activations by number of classes, so our output is batch size by number of classes. Okay, so let's reuse that key: remember, triangle is output, circle is activations, hidden state we also call that, and rectangle is input. So let's now imagine that we wanted to get a big document, split it into sets of three words at a time, and grab each set of three words and then try to predict the third word using the first two words. So if we had the data set in place we could grab word one as an input, chuck it through an embedding, right, which creates some activations, pass that through a matrix product and non-linearity, grab the second word, put it through an embedding, and then we could either add those two things together or concatenate them. Generally speaking, when you see two sets of activations coming together in a diagram, you normally have a choice of concatenate or add. And that's going to create a second bunch of activations, and then you can put it through one more fully connected layer and softmax to create an output. So that would be a totally standard, fully connected neural net with one very minor tweak, which is concatenating or adding at this point, which we could use to try to predict the third word from pairs of two words. Okay. So remember, arrows represent layer operations, and I removed on this one the specifics of what they are because they're always an affine function followed by a non-linearity. Okay. Let's go further. What if we wanted to predict word four using words one and two and three? It's basically the same picture as last time except with one extra input and one extra circle.
But I want to point something out, which is each time we go from rectangle to circle we're doing the same thing. We're doing an embedding, which is just a particular kind of matrix multiply where you have a one-hot encoded input. Each time we go from circle to circle we're basically taking one piece of hidden state, one set of activations, and turning it into another set of activations by saying we're now at the next word. And then when we go from circle to triangle we're doing something else again, which is we're saying let's convert the hidden state, these activations, into an output. So you can see I've colored each of those arrows differently. Arrows of the same color should probably use the same weight matrix, because they're doing the same thing. So why would you have a different set of embeddings for each word, or a different matrix to multiply by to go from this hidden state to this hidden state versus this one? So this is what we're going to build. So we're now going to jump into human numbers, which is the lesson 7 human numbers notebook. And this is a data set that I created which literally just contains all the numbers from one to 9,999 written out in English. And we're going to try and create a language model that can predict the next word in this document. It's just a toy example for this purpose. So in this case we only have one document. That one document is the list of numbers. So we can use a text list to create an item list with text in it for the training and the validation. In this case the validation set is the numbers from 8,000 onwards and the training set is 1 to 8,000. We can combine them together and turn that into a data bunch. So we only have one document, so train[0] is the document; you grab its .text, that's how you grab the contents of a text list, and here are the first 80 characters. It starts with a special token, xxbos. Anything starting with xx is a special fastai token; bos is the beginning of stream token.
It basically says this is the start of a document. It's very helpful in NLP to know when documents start so that your models can learn to recognize them. The validation set contains 13,000 tokens, so 13,000 words or punctuation marks, because everything between spaces is a separate token. The batch size that we asked for was 64, and then by default it uses something called BPTT of 70. BPTT, as we briefly mentioned, stands for back prop through time. That's the sequence length. So for each of our 64 document segments we split it up into lists of 70 words that we look at at one time. So what we do is, for the validation set, we grab this entire string of 13,000 tokens and then we split it into 64 roughly equal sized sections. People very, very often think I'm saying something different. I did not say they are of length 64. They're not. They're 64 roughly equally sized segments. So we take the first 1/64th of the document, piece 1. The second 1/64th of the document, piece 2. Okay? And then for each of those 1/64ths of the document we then split those into pieces of length 70. So let's now say, okay, for those 13,000 tokens, how many batches are there? Well, divide by batch size and divide by 70. So there's about 2.9 batches, so 3. There's going to be 3 batches. So let's grab an iterator for our data loader. Grab 1, 2, 3 batches, the x and the y. And let's add up the number of elements, and we get back slightly less than this because there's a little bit left over at the end that doesn't quite make up a full batch. Okay? So this is the kind of stuff you should play around with a lot. Lots of shapes and sizes and stuff and iterators. As you can see it's 95 by 64. I claimed it was going to be 70 by 64. That's because our data loader for language models slightly randomizes BPTT just to give you a bit more kind of shuffling, to get a bit more randomization; it helps the model. And so here you can see the first batch of x.
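The batch-count arithmetic above is easy to reproduce. This sketch just plugs in the figures quoted in the lesson (the 13,000 token count is the rounded figure, so the results are approximate).

```python
import math

n_tokens, bs, bptt = 13_000, 64, 70     # figures quoted above; token count is rounded

tokens_per_segment = n_tokens // bs     # length of each of the 64 contiguous segments
n_batches = n_tokens / bs / bptt        # mini-batches at the nominal sequence length

print(tokens_per_segment)               # 203
print(round(n_batches, 1))              # 2.9
print(math.ceil(n_batches))             # 3
```

Note that the 64 here is the number of segments, not their length: each segment is about 203 tokens long, and each mini-batch takes a slice of about 70 tokens from every segment at once.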
Remember we've numericalized all these, and here's the first batch of y, and you'll see here this is 2, 18, 10, 11, 8. This is 18, 10, 11, 8. So this one is offset by 1 from here, because that's what we want to do with a language model. We want to predict the next word. So after 2 should come 18 and after 18 should come 10. You can grab the vocab for this data set, and a vocab has a textify. So if we look at exactly the same thing but with textify, that will just look it up in the vocab. So here you can see xxbos, eight thousand one, whereas in the y there's no xxbos, it's just eight thousand one. So after xxbos is eight, after eight is thousand, after thousand is one. Okay. And so then after we get to 8,023 comes x2, and look at this, we're always looking at column 0. So this is the first batch, the first mini-batch. It comes to 8,024 and then x3, all the way up to 8,040. Right. And so then we can go right back to the start, but look at batch index 1. So index 1, which is batch number 2, and now we can continue. There's a slight skip from 8,040 to 8,046; that's because the last mini-batch wasn't quite complete. So what this means is that every mini-batch joins up with the previous mini-batch, so you can go straight from x1[0] to x2[0], and it continues 8,023, 8,024, right. And so if you do the same thing for [:,1] you'll also see they join up. So all the mini-batches join up. So that's the data; we can do show batch to see it. And here is our model which is doing this. Right. So this is just the code copied over, right. So it contains one embedding, i.e. the green arrow, one hidden to hidden, brown arrow, layer, and one hidden to output. So each colored arrow has a single matrix. Okay. And so then in the forward pass we take our first input x0 and put it through input to hidden, the green arrow, okay, to create our first set of activations, which we call h. That's assuming there is a second word, because sometimes we might be at the end of a batch where there isn't a second word.
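Here is a toy sketch of how the token stream gets laid out so that the targets are offset by one and consecutive mini-batches join up. It is pure Python on a made-up stream of 40 integers, with hypothetical helper names (batchify, minibatches), and it ignores the BPTT randomization that the real data loader applies. Each row here corresponds to one column of the lesson's tensor.

```python
def batchify(tokens, bs):
    """Split one long token stream into bs contiguous, equal-length segments."""
    seg_len = len(tokens) // bs
    return [tokens[i * seg_len:(i + 1) * seg_len] for i in range(bs)]


def minibatches(rows, bptt):
    """Yield (x, y) pairs; y is x shifted along by one token."""
    seg_len = len(rows[0])
    for start in range(0, seg_len - 1, bptt):
        end = min(start + bptt, seg_len - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y


tokens = list(range(40))            # stand-in for a numericalized document
rows = batchify(tokens, bs=4)       # 4 segments of 10 tokens each
batches = list(minibatches(rows, bptt=3))

x1, y1 = batches[0]
x2, y2 = batches[1]
print(x1[0], y1[0])                 # [0, 1, 2] [1, 2, 3]
print(x2[0])                        # [3, 4, 5]
```

Segment 0 of the second mini-batch starts exactly where segment 0 of the first one stopped, which is the "joins up" property, and the same holds for every other segment.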
Assuming there is a second word, we would then add to h the result of x1 put through the green arrow. Remember that's i_h. And then we would say, okay, our new h is the result of those two added together, put through our hidden to hidden, orange arrow, and then ReLU, then batch norm. And then for the second word, do exactly the same thing, and then finally, blue arrow, put it through h_o. So that's how we convert our diagram to code. So nothing new here at all. So now let's chuck that in a learner and we can train it: 46%. Let's take this code and recognize it's pretty awful. There's a lot of duplicate code, and as coders, when we see duplicate code, we refactor. So we should refactor this into a loop. So here we are. We've refactored it into a loop. So now we're going for each xi in x and doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not anything new. This is now an RNN. Okay, and let's refactor our diagram from this to this. This is the same diagram but I've just replaced it with my loop. It does the same thing, so here it is. It's got exactly the same in it, literally exactly the same, I've just popped a loop here. Before I start, I just have to make sure that I've got a bunch of zeros to add to, and of course I get exactly the same result when I train it. Okay, so one nice thing about the loop, though, is now this will work even if I'm not predicting the fourth word from the previous three but the ninth word from the previous eight. It'll work for any arbitrarily long sequence, which is nice.
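To see that the loop really is just a refactoring, here is a minimal sketch that runs the same three shared layers both ways and compares the results. It is a simplification of what's described above: the batch norm layer is left out, the sizes are made up, and the input is assumed to be laid out as sequence length by batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nv, nh = 40, 16                    # illustrative vocab size and hidden size

i_h = nn.Embedding(nv, nh)         # green arrow: input to hidden
h_h = nn.Linear(nh, nh)            # orange arrow: hidden to hidden
h_o = nn.Linear(nh, nv)            # blue arrow: hidden to output


def forward_unrolled(x):
    # One explicit block per word, written out by hand for three words.
    h = torch.relu(h_h(i_h(x[0])))
    h = torch.relu(h_h(h + i_h(x[1])))
    h = torch.relu(h_h(h + i_h(x[2])))
    return h_o(h)


def forward_loop(x):
    # The same computation as a loop, starting from zeros to add to.
    h = torch.zeros(x.shape[1], nh)
    for xi in x:
        h = torch.relu(h_h(h + i_h(xi)))
    return h_o(h)


x = torch.randint(0, nv, (3, 8))   # 3 words for each of 8 sequences
same = torch.allclose(forward_unrolled(x), forward_loop(x))
print(same)
```

Unlike the unrolled version, forward_loop also accepts an x with any number of rows, which is the "arbitrarily long sequence" benefit.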
So let's up the BPTT to 20, since we can now, and let's now say, okay, instead of just predicting the nth word from the previous n minus one, let's try to predict the second word from the first, and the third from the second, and the fourth from the third, and so forth, right? Because previously, look at our loss function, previously we were comparing the result of our model to just the last word of the sequence. It's very wasteful because there's a lot of words in the sequence, so let's compare every word in x to every word in y. So to do that we need to change this so it's not just one triangle at the end of the loop, but the triangle is inside this, right? So in other words, after every loop, predict; loop, predict; loop, predict. So here's this code. It's the same as the previous code, but now I've created an array, and every time I go through the loop I append h_o(h) to the array. So now for n inputs I create n outputs. So I'm predicting after every word. Previously I had 46%, now I have 40%. Why is it worse? Well, it's worse because now when I'm trying to predict the second word, I only have one word of state to use, right? And when I'm looking at the third word I only have two words of state to use. So it's a much harder problem for it to solve. So the key problem is here, I go h equals torch.zeros, like I reset my state to zero every time I start another BPTT sequence. Well, let's not do that. Let's keep h, right? And we can, because remember, each batch connects to the previous batch. It's not shuffled like happens in image classification. So let's take this exact model and replicate it again, but let's move the creation of h into the constructor. Okay, there it is. So it's now self.h. Okay, and so this is now exactly the same code, but at the end let's put the new h back into self.h. Okay, so it's now doing the same thing but it's not throwing away that state, and so therefore now we actually get above the original.
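Here is a compact sketch combining both changes just described: an output after every step, and a hidden state created in the constructor and carried over between calls. It is a simplified stand-in, not the notebook's class: there is no batch norm, the sizes are made up, the input is assumed to be sequence length by batch size, and the detach() call is my assumption about how to stop gradients reaching back into earlier batches.

```python
import torch
import torch.nn as nn


class StatefulRNNSketch(nn.Module):
    def __init__(self, nv, nh, bs):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)   # input to hidden
        self.h_h = nn.Linear(nh, nh)      # hidden to hidden
        self.h_o = nn.Linear(nh, nv)      # hidden to output
        self.h = torch.zeros(bs, nh)      # created once, not on every forward pass

    def forward(self, x):
        h = self.h
        outs = []
        for xi in x:
            h = torch.relu(self.h_h(h + self.i_h(xi)))
            outs.append(self.h_o(h))      # a prediction after every word
        # Keep the state for the next batch, but cut the gradient history
        # (assumption: without this, backward would reach into earlier batches).
        self.h = h.detach()
        return torch.stack(outs)


torch.manual_seed(0)
model = StatefulRNNSketch(nv=40, nh=16, bs=8)
x = torch.randint(0, 40, (20, 8))         # 20 steps for each of 8 sequences

out1 = model(x)
out1.sum().backward()
out2 = model(x)                           # starts from the state left by the first call
out2.sum().backward()
```

For n inputs the model now produces n outputs, and the second call sees a different starting state from the first, so the same input gives a different result.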
We get all the way up to 54% accuracy. So this is what a real RNN looks like. You know, you always want to keep that state. But just keep remembering, there's nothing different about an RNN. It's a totally normal, fully connected neural net. It's just that you've got a loop you refactored. What you could do, though, is at the end of every loop you could not just spit out an output, but you could spit it out into another RNN. So you could have an RNN going into an RNN, and that's nice because we've now got more layers of computation. You would expect that to work better. Well, to get there, let's do some more refactoring. So let's take this code and replace it with the equivalent built-in PyTorch code, which is, you just say that. So nn.RNN basically says, do the loop for me. Okay, we've still got the same embedding. We've still got the same output. We've still got the same batch norm. We've still got the same initialization of h, but we just got rid of the loop. So this is the same accuracy, of course. And one of the nice things about nn.RNN is that you can now say how many layers you want. So here I'm going to do it with two layers. But here's the thing. When you think about this, think about it without the loop. It looks like this. It's like we've got a BPTT of 20, so there's 20 layers of this, and we know from the Visualizing the Loss Landscape paper that deep networks have awful, bumpy loss surfaces. So when you start creating long time scales and multiple layers, these things get impossible to train. So there's a few tricks you can do. One thing is you can add skip connections, of course.
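As a rough illustration of letting PyTorch do the loop, here is a minimal sketch using nn.RNN with two layers. The sizes are made up, batch norm is omitted, and it is only a shape check, not the notebook's model. One difference worth knowing: nn.RNN uses tanh as its default non-linearity, whereas the hand-written loop described above used ReLU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nv, nh, bs, bptt, n_layers = 40, 16, 8, 20, 2   # illustrative sizes

emb = nn.Embedding(nv, nh)                                  # input to hidden
rnn = nn.RNN(nh, nh, num_layers=n_layers, batch_first=True) # replaces the hand-written loop
h_o = nn.Linear(nh, nv)                                     # hidden to output

x = torch.randint(0, nv, (bs, bptt))
h0 = torch.zeros(n_layers, bs, nh)   # one initial hidden state per layer

res, h = rnn(emb(x), h0)             # res: activations at every step; h: final state per layer
out = h_o(res)                       # a prediction after every word

print(res.shape, h.shape, out.shape)
```

With batch_first=True the input and res are batch by sequence by features, while the hidden state is always layers by batch by features.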
But what people normally do is, instead of just adding these together, they actually use a little mini neural net to decide how much of the green arrow to keep and how much of the orange arrow to keep, and when you do that you get something that's either called a GRU or an LSTM, depending on the details of that little neural net. We'll learn about the details of those neural nets in part 2. They really don't matter though, frankly. So we can now say, let's create a GRU instead. So it's just like what we had before, but it'll handle longer sequences and deeper networks. Let's use two layers. Boom! And we're up to 75%. Okay. So that's RNNs, and the main reason I wanted to show it to you was to remove the last remaining piece of magic, and this is one of the least magical things we have in deep learning. It's just a refactored, fully connected network. So don't let RNNs ever, ever put you off. And with this approach, where you basically have a sequence of n inputs and a sequence of n outputs, which we've been using for language modeling, you can use that for other tasks, right? For example, the sequence of outputs could be, for every word, something saying is this something that is sensitive and I want to anonymize or not, you know, so like, is this private data or not. Or it could be a part of speech tag for that word, or it could be something saying how should that word be formatted, or whatever. These are called sequence labeling tasks, and so you can use this same approach for pretty much any sequence labeling task. Or you can do what I did in the earlier lesson, which is, once you finish building your language model, you can throw away this h_o
bit and instead pop there a standard classifier, and then you can now do NLP classification, which as you saw earlier will give you state of the art results even on long documents. So this is a super valuable technique and not remotely magical. Okay, so that's it, right? That's deep learning, or at least the practical pieces from my point of view. Having watched this one time, you won't get it all, and I don't recommend that you watch this so slowly that you get it all the first time, but go back and look at it again. Take your time, and there'll be bits where you go, oh, now I see what he's saying, and then you'll be able to implement things you couldn't implement before, and you'll be able to dig in more than you could before. So definitely go back and do it again. And as you do, write code, not just for yourself, but put it on GitHub, right? It doesn't matter if you think it's great code or not. The fact that you're writing code and sharing it is impressive, and the feedback you'll get if you tell people on the forum, you know, hey, I wrote this code, it's not great, but it's my first effort, is there anything you see that jumps out at you? People will say, oh, that bit was done well, but hey, did you know for this bit you could have used this library and saved yourself some time? You'll learn a lot by interacting with your peers. As you've noticed, I've started introducing more and more papers now. Part two will be a lot of papers, and so it's a good time to start reading some of the papers that have been introduced in this section. All the bits that say derivation and theorems and lemmas, you can skip them. I do. They add almost nothing to your understanding of practical deep learning, right? But the bits that say, you know, why are we solving this problem, and what are the results, and so forth, are really interesting. And then, you know, try and write English prose. Not English prose that you want to be read by Geoff Hinton and Yann
LeCun, but English prose that you want to be read by you as of six months ago, right? Because there's a lot more people in the audience of you as of six months ago than there is of Geoffrey Hinton and Yann LeCun, right? That's the person you best understand. You know what they need, right? Go and get help, and help others. Tell us about your success stories. But perhaps the most important one is get together with others, right? People's learning works much better if you've got that social experience. So start a book club, get involved in meetups, create study groups, and build things, right? And again, it doesn't have to be amazing. Just build something that you think the world would be a little bit better if that existed, or you think it would be kind of slightly delightful to your two year old to see that thing, or you just want to show it to your brother the next time they come around to see what you're doing, whatever, right? Just finish something. You know, finish something, and then try and make it a bit better. So for example, something I just saw this afternoon is the Elon Musk tweet generator. Okay, so looking at lots of older tweets, creating a language model from Elon Musk, and then creating new tweets such as "humanity will also have an option to publish on its own journey as an alien civilization it will always like all human being", "Mars is no longer possible", "AI will definitely be the central intelligence agency". Okay, so this is great. I love this, and I love that Dave Smith wrote and said, these are my first ever commits, thanks for teaching a finance guy how to build an app in weeks, right? So, you know, I think this is awesome, and I think clearly a lot of care and passion has been put into this project. You know, will it systematically change the future direction of society as a whole? Maybe not, you know, but maybe Elon will look at this and think, oh, maybe I need to rethink my method of prose. I don't know. I
think it's great. And so, yeah, create something, put it out there, put a bit of yourself into it. Or get involved in the Fast AI project. There's a lot going on. You can help with documentation and tests, which might sound boring, but you'd be surprised how incredibly not boring it is to take a piece of code that hasn't been properly documented, research it, understand it, and ask Sylvain and me on the forum what's going on: why did you write it this way? We'll send you off to the papers that we were implementing. Writing a test requires deeply understanding that part of the machine learning world, to really understand how it's meant to work. So that's always interesting. Stas Bekman has created this nice dev projects index, which you can find by going to the forum, in the Fast AI dev section, actually the dev projects section, and it lists stuff going on that you might want to get involved in. Or maybe there's stuff you want to exist; you could add your own. Create a study group. Dean has already created a study group for San Francisco starting in January. This is how easy it is to create a study group: go on the forum, find your little time zone subcategory, and add a post saying, let's create a study group. But make sure you give people a little Google Sheet to sign up, some way to actually do something. A great example is Pierre, who's been doing a fantastic job in Brazil of running study groups for the last couple of parts of the course. He keeps posting these pictures of people having a good time and learning deep learning together, creating wikis together, creating projects together. Great experience. And then come back for part two, where we'll be looking at all of this interesting stuff, in particular going deep into the Fast AI code base to understand exactly how we built it. We'll actually go through it: as we were building it, we
created notebooks showing where we were at each stage, so we're actually going to see the software development process itself. We'll talk about the process of doing research, how to read academic papers, how to turn math into code, and then a whole bunch of additional types of models that we haven't seen yet. So it'll be kind of like going beyond practical deep learning into actual cutting-edge research.
So we've got five minutes to take some questions. We had an AMA going on online, and so we're going to have time for a couple of the highest-ranked AMA questions from the community. The first one is by Jeremy's request, although it's not the highest ranked: what's your typical day like? How do you manage your time across so many things that you do?
Yeah, I hear that all the time, so I thought I should answer it, and I think it got a few votes. People who come to our study group are always shocked at how disorganized and incompetent I am, and so I often hear people saying, "Oh wow, I thought you were this deep learning role model and I'd get to see how to be like you, and now I'm not sure I want to be like you at all." So for me, it's all about just having a good time with it. I never really have many plans. I just try to finish what I start. If you're not having fun with it, it's really, really hard to continue, because there's a lot of frustration in deep learning. It's not like writing a web app, where it's like: authentication, check; back-end service watchdog, check; user credentials, check. You're making progress. Whereas for stuff like this, and the stuff that we've been doing the last couple of weeks, it's just like: it's not working, it's not working, it's not working, no, that also didn't work, that also didn't work, until, oh my god, it's amazing, it's a cat. That's kind of what it is, right? So you don't get that regular
feedback. So yeah, you've got to have fun with it. And so my day is kind of, you know, I mean, the other thing I'd say is I don't do any meetings. I don't do phone calls. I don't do coffees. I don't watch TV. I don't play computer games. I spend a lot of time with my family, a lot of time exercising, and a lot of time reading and coding and doing things I like. I think the main thing is just finish something, like properly finish it. So when you get to that point where you think you're most of the way through, but you haven't quite created a README yet and the install process is still a bit clunky, you know, this is what 99% of GitHub projects look like. You'll see the README says, "TODO: complete baseline experiments, document blah, blah, blah." Don't be that person. Just do something properly and finish it, and maybe get some other people around you to work with you so that you're all doing it together, and get it done.
What are the up-and-coming deep learning and machine learning things that you are most excited about? Also, you mentioned last year that you are not a believer in reinforcement learning. Do you still feel the same way?
Yeah, I still feel exactly the same way as I did three years ago when we started this, which is that it's all about transfer learning. It's underappreciated, it's under-researched. Every time we put transfer learning into anything, we make it much better. Our academic paper on transfer learning for NLP has helped be one piece of changing the direction of NLP this year, and made it all the way to the New York Times, just a stupid, obvious little thing that we threw together. So I remain excited about that. I remain unexcited about reinforcement learning for most things. I don't see it used by normal people for normal things for nearly anything. It's an incredibly inefficient way to solve problems which are often solved more simply and more quickly in other ways. It
probably has maybe a role in the world, but a limited one, and not in most people's day-to-day work.
For someone planning to take part two in 2019, what would you recommend doing, learning, or practicing until the part two course starts?
Just code. Yeah, just code all the time. I know it's perfectly possible, I hear from people who get to this point in the course and they haven't actually written any code yet. And if that's you, it's okay. You just go through and do it again, and this time do code. Look at the shapes of your inputs, look at your outputs, make sure you know how to grab a mini-batch, look at its mean and standard deviation, and plot it. There's so much material that we've covered. If you can get to a point where you can rebuild those notebooks from scratch without too much cheating (when I say from scratch, I mean using the Fast AI library, not from scratch from scratch), you'll be in the top echelon of practitioners, because you'll be able to do all of these things yourself, and that's really, really rare. And that'll put you in a great position for part two.
Should we do one more? Nine o'clock. We always do one more. Where do you see the Fast AI library going in the future, say in five years?
Well, like I said, I don't make plans. I just piss around. So our only plan for Fast AI as an organization is to make deep learning accessible as a tool for normal people to use for normal stuff. So as long as we need to code, we've failed at that, because 99.8% of the world can't code. So the main goal would be to get to a point where it's not a library, but a piece of software that doesn't require code. And it certainly shouldn't require a goddamn lengthy, hard-working course like this one. So I want to get rid of the course, I want to get rid of the code. I want to make it so you can just do useful stuff easily. Maybe five years, maybe longer. I hope to see you all back here for part two. Thank you.