Welcome to lesson 12. Wow, we're moving along. This is an exciting lesson, because it's where we're going to wrap up all the pieces, both for computer vision and for NLP. You might be surprised to hear that we're going to wrap up all the pieces for NLP, because we haven't really done any NLP yet. But actually, everything we've done is equally applicable to NLP, so there's very little left to do to get a state-of-the-art result on IMDb sentiment analysis from scratch. That's what we're going to do. Before we do, let's finally finish off this slide we've been going through for three lessons now. I promised (well, not quite promised) that we would get something state-of-the-art on ImageNet. Turns out we did, so you're going to see that today. We're going to finish off mixup, label smoothing, and ResNets.

OK, so let's do it. Before we look at the new stuff: in the 09b learner notebook I've made a couple of minor changes that I thought you might be interested in, as happens when you refactor things. Remember, last week we refactored the learner to get rid of that awful separate Runner, so there's now just one thing. That made a lot of our code easier, but there was still this leftover concept that when you started fitting, you had to tell each callback what its learner (or runner) was. Because they're all attached to one thing now, I've moved that to the init. So now you can call add_cbs to add a whole bunch of callbacks, or add_cb to add one callback, and the hookup happens there automatically, instead of at the start of training. That's a very minor thing.

More interesting was this little reformatting exercise, where I took all these callback calls that used to be on the line underneath the thing before them, and lined them up in a column over here, and suddenly realized that now I can answer all the questions I had in my head about our callback system: What exactly are the steps in the training loop? What exactly are the callbacks you can use? Which step goes with which callback? Which steps don't have a callback? Are there any callbacks that don't have a step? It's one of these interesting things: I really don't like the idea of automating your formatting and creating rigid rules for it, when something like this can happen. As soon as I did this, I understood my code better. And for me, understanding my code is the only way to make it work, because debugging machine learning code is awful, so you've got to make sure the thing you write makes sense. It's got to be simple. Really simple. And this is really simple.

Then, more interestingly: we used to create the optimizer in init, and you could actually pass in an already-created optimizer. I removed that, and the only thing you can pass in now is an optimization function, something that will create an optimizer, which is what we've always been doing anyway. By doing that, we can now create our optimizer when we start fitting. That turns out to be really important, because when we do things like discriminative learning rates and gradual unfreezing and layer groups, we can change things, and then when we fit, it'll all just work. So it's like one line of code, but it's conceptually a very significant change. OK, so that's some minor changes to 09b, and now let's move on to mixup and label smoothing.
So, I'm really excited about the stuff we saw at the end of the last lesson, where we saw how we can use the GPU to do data augmentation: fully randomized, fully GPU-accelerated data augmentation using just plain PyTorch operations. I think that's a big win. But it's quite possible we don't need that kind of data augmentation anymore, because in our experimentation with this augmentation called mixup, we found we can remove most other data augmentation and get amazingly good results. So it's a simplicity win. And also, when you use mixup, you can train for a really long time and get really good results.

So let me show you mixup. In terms of the results: in the bag of tricks paper, when they turned mixup on, they also started training for 200 epochs instead of 120. So be a bit careful when you interpret their table: when it goes from label smoothing at 94.1 to mixup-without-distillation at 94.6, they're also nearly doubling the number of epochs. But you can get a sense that mixup gives you a big decrease in error. The other thing they mention in the paper is distillation. I'm not going to talk about that, because it's a thing where you pre-train some much bigger model, like a ResNet-152, and then you train something smaller to predict the output of that. To me, the idea of training a really big model just to train a smaller one is interesting, but it's not exactly training in the way I normally think about it, so we're not looking at distillation. It would be an interesting assignment if somebody wanted to try adding it to the notebooks, though; I think you have all the information, or the skills, you need to do that now.

All right. So: mixup. We start by grabbing our Imagenette data. We grab it with make_rgb and resize, and turn it into a float tensor; this is just our quick and dirty resize we're already using for testing purposes. Split it up, create a data bunch, all the normal stuff. And what we're going to do is take an image like this, and an image like this, and combine them: 0.3 times the first image plus 0.7 times the second. This is what it looks like. Unfortunately, Sylvain and I have different orderings of file names on our machines, so I wrote that it's a French horn and a tench, but Sylvain's clearly isn't a French horn or a tench. You get the idea, though: it's a mix of two different images. So we're going to create a data augmentation where every time we predict something, we're predicting a mix of two things like this. We'll take the linear combination, 0.3 and 0.7, of the two images, but then we have to do the same for the labels. There's no point predicting the one-hot-encoded output of this breed of doggy when there's also a bit of a gas pump in there. So we're not going to have a one-hot-encoded target; we're going to have a 0.7-encoded doggy and a 0.3-encoded gas pump. That's the basic idea.

The mixup paper is super cool. (Wow, there are people out there talking about things that aren't deep learning. I guess that's their priorities.) The paper is a pretty nice, easy read by paper standards, and I'd definitely suggest you check it out. So, I've told you what we're going to do. Implementation-wise, we have to decide what number to use here: is it 0.3, or 0.1, or 0.5, or what? And since this is a data augmentation method, the answer is: we'll randomize it.
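The blend itself is just a linear combination of the pixels and of the targets. Here's an illustrative sketch; the function name and the fixed lam value are mine, just for demonstration:

```python
import torch

def mixup_blend(x1, y1, x2, y2, lam=0.3, n_classes=10):
    "Blend two images and their one-hot labels with weight `lam` (illustrative sketch)."
    x = lam * x1 + (1 - lam) * x2                  # pixel-wise linear combination
    y1h, y2h = torch.eye(n_classes)[y1], torch.eye(n_classes)[y2]
    return x, lam * y1h + (1 - lam) * y2h          # e.g. 0.3 French horn + 0.7 tench
```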
But we're not going to randomize it uniformly from 0 to 1, or from 0 to 0.5. Instead, we're going to randomize it using shapes like this. In other words, when we grab a random number, most of the time it'll be really close to 0 or really close to 1, and just occasionally it'll be close to 0.5. That way, most of the time it'll be pretty easy for our model, because it's predicting one and only one thing, and just occasionally it'll be predicting a pretty evenly mixed combination.

The ability to grab random numbers where this is basically the histogram (the smoothed histogram of how often we'll see each value) is called sampling from a probability distribution. In nearly all these cases, you can start with a uniform or a normal random number and put it through some function or process to turn it into something shaped like this. The details don't matter at all, but the paper points out that this particular shape is nicely characterized by something called the beta distribution. So that's what we're going to use.

It was interesting drawing these plots, because it requires a few bits of math which some of you may be less comfortable with, or entirely uncomfortable with. For me, every time I see this function, which is called the gamma function, I kind of break out in sweats (not just because I've got a cold). It's the idea of functions I don't know: how do you even describe this thing? But actually, it turns out that, like most things, once you look at it, it's pretty straightforward. And we're going to be using this function, so I'll quickly explain what's going on.

We start with the factorial function: 1 times 2 times 3 times 4, and so on. These red dots are just the value of the factorial at a few different places. But don't think of factorial as 1 × 2 × 3 × ⋯ × n; instead, divide both sides by n, and now you've got n!/n = 1 × 2 × 3 × ⋯ × (n-1), which is just (n-1)!. When you define it that way, as a recurrence, you suddenly realize there's no reason you can't have a function like this that's defined not just on the integers, but everywhere. And this is the point where I stop with the math, because to me, if I need a sine function, or a log function, or an exp function, or whatever, I type it into my computer and I get it. How you actually compute it is not at all important; what's useful is knowing what these functions are and how they're defined.

PyTorch doesn't have this function. Weirdly enough, it has a log-gamma function, so we can take lgamma and go e to the power of that to get the gamma function. You'll see here I'm breaking my no-Greek-letters rule. The reason is that a function like this doesn't have a domain-specific meaning or a physical analogy, which is how we usually like to name things; it's just a math function, and it's called gamma. And if you're going to call it gamma, you may as well write it Γ. Why this matters is when you start using it: look at the difference between writing it out with the actual Unicode characters and operators, versus what it would look like written long-form in Python. When you're comparing something to a paper, you want something you can look at and straight away say, oh, that looks very familiar.
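For example, here's a sketch of that gamma trick and the beta density we're sampling from, assuming nothing beyond torch.lgamma (Python 3 is perfectly happy with Greek identifiers):

```python
import torch

# PyTorch has log-gamma but not gamma itself, so exponentiate:
Γ = lambda x: x.lgamma().exp()

def beta_pdf(x, α):
    "Density of Beta(α, α): how often each mixing weight λ will be drawn."
    α = torch.tensor(α)
    return x**(α - 1) * (1 - x)**(α - 1) * Γ(2 * α) / Γ(α)**2

x = torch.linspace(0.01, 0.99, 99)
print(beta_pdf(x, 0.4))  # piles up near 0 and 1, dips in the middle
```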
And as long as it's not familiar, you might want to think about how to make it more familiar. I'll just briefly mention that writing these math symbols nowadays is actually pretty easy. On Linux, there's a thing called a compose key, which is probably already set up for you; if you Google it, you can learn how to turn it on. Basically, you press the compose key (you can choose which it is: the right Alt button, or the Caps Lock button) and then a few more characters. All the Greek letters are compose, then star, then the English letter that corresponds with it. So if I want a lambda, I type compose, star, l. It's just as quick as typing non-Unicode characters. Most of the Greek letters are also available on a Mac keyboard with Option; unfortunately, nobody's created a decent compose key for Mac yet. There's a great compose key for Windows called WinCompose. Anybody who's working with Greek letters should definitely install and learn to use these things.

So there's our gamma function: nice and concise, and it looks exactly like the paper. And it turns out that this is how you calculate the beta distribution's density, and here it is plotted. As I said, the details aren't important; they're just the tools. The basic idea is that we now have a parameter, called alpha, where if it's high, an equal mix is much more likely, and if it's low, it's very unlikely. This is really important, because for data augmentation we need a lever we can tune that says: how much regularization am I doing? How much augmentation am I doing? So you can move your alpha up and down. And the reason it's important to be able to print these plots is that when you change your alpha, you want to plot the distribution and make sure it looks sensible.

OK. So it turns out that we don't actually have to 0.7-hot-encode one thing and 0.3-hot-encode the other. Because cross entropy is linear in the one-hot target, it's identical to simply take lambda times the loss on the first image plus (1 - lambda) times the loss on the second. (I guess we're using t for the letter here.) So that's actually all we need to do, and this is our mixup. And again, as you can see, we're using the same letters you'd expect to see in the paper, so everything should look very familiar.

And mixup, remember, is something which is going to change our loss function, so we need to know which loss function to change. So when you begin fitting, we find out what the old loss function on the learner was, and store it away. Then, when we calculate the loss: if we're in validation, there's no mixup involved; if we're training, we calculate the loss on two different sets of images. One is just the regular mini-batch; the second is a randomly shuffled version of that same mini-batch, so each image gets a random partner to mix with. We do the linear combination for the images, and we do it for the losses. And that's basically it.

A couple of minor things to mention. In the last lesson, I created an EWMA function, an exponentially weighted moving average function, which is a really dumb name for it, because actually it was just a linear combination of two things: beta times one value plus (1 - beta) times the other. You create exponentially weighted moving averages with it by applying it multiple times, but the actual function is a linear combination.
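Written out (under the name it's about to get), it's about as small as a function can be; a sketch:

```python
def lin_comb(v1, v2, beta):
    "Linear combination: `beta` of v1 plus (1-beta) of v2."
    return beta * v1 + (1 - beta) * v2
```

The mixed image and the mixed loss are then both just calls to this, something like `lin_comb(x, x[shuffle], λ)` and `lin_comb(loss1, loss2, λ)`.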
So I've renamed it to lin_comb, linear combination, and you'll see it in so many places. This mixup is a linear combination of our actual images and some randomly permuted images in the mini-batch, and our loss is a linear combination of the losses of our two parts: the normal mini-batch and the randomly permuted mini-batch. One of the nice things about this is that, if you think about it, it's all applied on the GPU, so it's pretty much instant: a super powerful augmentation system which isn't going to add any overhead to our code.

One thing to be careful of is that we're actually replacing the loss function, and loss functions have something called a reduction. With most PyTorch loss functions you can say, after calculating the loss for everything in the mini-batch: either return a rank-1 tensor of the per-item losses, or add them all up, or take the average. We pretty much always take the average, but we just have to make sure we do the right thing. So I've got a little function here that does the mean, or sum, or nothing at all, as requested, and we make sure our new loss function reduces at the end in the way that was actually asked for. But we have to turn off the reduction while we actually do mixup, because we need to calculate the loss on every image, for both halves of our mix. This is a good place to use a context manager, which we've seen before: a tiny context manager that finds out what the previous reduction was, saves it away, gets rid of it, and puts it back when it's finished. So there are a lot of minor details there, but with those in place, the actual mixup itself is very little code: a single callback. We can then run it in the usual way; just add mixup. Our default alpha here is 0.4. I've mainly been playing with alpha at 0.2, so this is a bit more than I'm used to, but somewhere around that vicinity is pretty normal.

So that's mixup. And it's really interesting, because you could use this for layers other than the input layer. You could use it on the first layer, maybe with the embeddings, so you could do mixup augmentation in NLP, for instance. It's something people haven't really dug into deeply yet, but it seems to be an opportunity to add augmentation in many places where we don't really see it at the moment, which means we can train better models with less data, which is why we're here.

So here's a problem: how does softmax interact with this? We've drawn some random number; lambda is 0.7, so I've got 0.7 of a dog and 0.3 of a gas station, and the correct answer would be a rank-1 tensor which has 0.7 in one spot, 0.3 in another spot, and 0 everywhere else. Softmax isn't going to want to do that for me, because softmax really wants just one of my values to be high; it's got an e to the something on top, as we've talked about. So to really use mixup well (and not just mixup: any time you're not 100% sure the labels on your data are correct), you don't want to ask your model to predict 1. You don't want it saying, "I'm 100% sure it's this label," because you've got label noise, you've got incorrect labels, or you've got mixup mixing, whatever. So instead we say: don't use one-hot encoding for the dependent variable; use a little bit less than one. Call it 0.9-hot encoding.
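Concretely, here's a sketch of what a less-than-one-hot target looks like; smooth_targets is my illustrative name, shown with five classes and eps = 0.1:

```python
import torch

def smooth_targets(y, n_classes, eps=0.1):
    "Replace one-hot targets with (1-eps) on the label and eps/(n-1) elsewhere."
    t = torch.full((len(y), n_classes), eps / (n_classes - 1))
    t[torch.arange(len(y)), y] = 1 - eps
    return t

print(smooth_targets(torch.tensor([0]), 5))
# tensor([[0.9000, 0.0250, 0.0250, 0.0250, 0.0250]])
```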
So then the correct answer is to say: I'm 90% sure this is the answer. And since all your probabilities have to add to 1, on all the negatives you just put 0.1 divided by (n - 1). That's called label smoothing, and it's a really simple but astonishingly effective way to handle noisy labels. I keep on hearing people say, oh, we can't use deep learning in this medical problem because the diagnostic labels in the reports are not perfect, and we don't have a gold standard, whatever. It actually turns out that, particularly if you use label smoothing, noisy labels are generally not a problem. There are plenty of examples of people using this where they literally randomly permute half the labels, making them 50% wrong, and they still get good results, really good results. So don't listen to people in your organization saying we can't start modeling until we do all this cleanup work. Start modeling right now, see if the results are OK, and if they are, then maybe you can skip all the cleanup work, or do the two simultaneously.

Label smoothing ends up just being, with epsilon at 0.1: 0.9 times the cross entropy loss as before, plus 0.1 times the cross entropy summed over every class, divided by n. And the nice thing is that's another linear combination. Once you create one of these little mathematical refactorings, they tend to pop up everywhere and make your code a little bit easier to read and a little bit harder to stuff up. Every time I have to write a piece of code, there's a very high probability that I'm going to screw it up; the less I have to write, the less debugging I'm going to have to do later. So we can just pop that in as a loss function and away we go. So that's a super powerful technique. These two techniques have been around for a couple of years, but they're not nearly as widely used as they should be.

Then, if you're using a Volta, a 2080, pretty much any current-generation NVIDIA graphics card with tensor cores, you can train using half precision floating point, in theory something like 10 times faster. In practice it doesn't quite work out that way, because there are other things going on, but we certainly often see 3x speedups. So the other thing we've got is some work here to let you train in half precision floating point.

Now, the reason it's not as simple as saying model.half(), which would convert all your weights and biases and everything to half precision, is because of this. This is from NVIDIA's materials, and what they point out is that you can't just use half precision everywhere, because it's not accurate. It's bumpy, so it's hard to get good, useful gradients if you do everything in half precision; in particular, things often round off to zero. So instead, we do the forward pass in FP16 and the backward pass in FP16, so all the hard work is done in half precision floating point, and pretty much everywhere else we convert things to full precision and do everything else there. For example, when we actually apply the gradients by multiplying by the learning rate, we do that in FP32, single precision. If your learning rate is really small, in FP16 it might basically round down to zero, so we do it in FP32. In fastai version 1, we wrote all this by hand; for these lessons, we're experimenting with using a library from NVIDIA called Apex, which basically has some of the functions to do this for you. So we're using it here.
And basically, you can see there's a thing called model_to_half, where we just go model.half(), the batch norm layers go back to float, and so forth. These aren't particularly interesting; they just go through each layer and make sure the right layers have the right types. Once we've got those utility functions in place, the actual callback is really quite small, and you'll be able to map every stage to that picture I showed you before: you can see, for example, that when we start fitting, we convert the network to half precision floating point.

One of the things that's kind of interesting is this thing called loss scale. After the loss is calculated, we multiply it by this number, loss scale, which is generally something around 512. The reason we do that is that losses tend to be pretty small, in a region where half precision floating point isn't very accurate, so we multiply by 512 to put them in a region where it is accurate. Then, later on, after the backward step, we just divide by that same number again. So it's a little tweak, but we find it's generally the difference between things working and not working. The nice thing is that we now have something where you can just add mixed precision and train, and you'll often get a 2x or 3x speedup: certainly on vision models, and also on transformers, quite a few places.

One obvious question is whether 512 is the right number. It turns out getting this number right actually does make quite a difference to your training. So something slightly more recent is called dynamic loss scaling, which literally tries a few different values of loss scale to find out at what point things become infinity, and dynamically figures out the highest loss scale we can use. This version here just has the dynamic loss scaling added. It's interesting that sometimes training with half precision gives you better results than training with FP32; I don't know, there's a bit more randomness, maybe it regularizes a little, but generally it's super, super similar, just faster.

We have a question about mixup. "Great. Is there an intuitive way to understand why mixup is better than other data augmentation techniques?" I think one of the things that's really nice about mixup is that it doesn't require any domain-specific thinking: do we flip horizontally, or also vertically? How much can we rotate? And it doesn't create any kind of lossiness: there's no reflection padding or black padding in the corners. So it's quite nice and clean. It's also almost infinite in the number of different images it can create: you've got the permutations of every image with every other image, which is already giant, and then each pair in different mixes. So there's just a lot of augmentation you can do with it.

And there are other similar things. There's one called cutout, where you just delete a square and replace it with black; there's another where you delete a square and replace it with random pixels. Something I haven't seen, but I'd really like to see people do, is to delete a square and replace it with a patch from a different image. So I'd love somebody to try mixup, but instead of taking the linear combination, pick a lambda-sized proportion of the pixels, as a square, and paste them on top. There's another one which basically finds four different images and puts them in four corners.
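For flavor, here's a minimal cutout sketch; the sizing and the in-place write are the simplest choices I could make, not anyone's reference implementation:

```python
import random

def cutout(x, size=8):
    "Zero out a random `size` x `size` square of a CHW image tensor, in place."
    h, w = x.shape[1], x.shape[2]
    r, c = random.randint(0, h - size), random.randint(0, w - size)
    x[:, r:r+size, c:c+size] = 0.
    return x
```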
So there are a few different variations, and they really get great results; I'm surprised how few people are using them.

So let's put it all together. Here's Imagenette. Let's use our random resized crop (a minimum scale of 0.35 we find works pretty well), and other than flip, we're not going to do any other augmentation. Now we need to create a model. So far all our models have been boring convolutional models, but obviously what we really want to be using is a ResNet. We have the XResNet; there's some debate about whether the X stands for the mutant version of ResNet or the extended version, so you can choose. Basically, XResNet is the bag of tricks ResNet: the paper has a few suggested tweaks to ResNet.

The first tweak is something we've kind of talked about; they call it ResNet-C. It's basically: hey, let's not do a big 7×7 convolution as our first layer, because that's super inefficient, and it's just a single linear model, which doesn't have much richness to it. Instead, let's do three convs in a row, each 3×3. If you think about it, the receptive field of the final one of those is still going to be about 7×7, but it got there through a much richer set of things it can learn, because it's a three-layer neural net. So that's the first thing we do in our XResNet.

So here is XResNet, and when we create it, we set up how many filters there are going to be for each of the first three layers. The first layer will start with c_in inputs, which defaults to 3, because normally we have three-channel images. And the number of outputs we use for the first layer will be (c_in + 1) × 8. Why is that? It's a bit of a long story. One reason is that it gives you 32 at the second layer, which is the same as what the bag of tricks paper recommends, as you can see. The second reason is that I've played around with this quite a lot to try to figure out what makes sense in terms of the receptive field, and I think this gives you about the right amount. The times 8 is there because NVIDIA graphics cards like everything to be a multiple of 8; if this isn't a multiple of 8, it's probably going to be slower. But one of the nice things here is that if you have a one-channel input, like black and white, or a five-channel input, like some kind of hyperspectral imaging or microscopy, then your model changes dynamically to say: I've got more inputs, so my first layer should have more activations. That's not something I've seen anybody do before, but it's a really simple, nice way to adapt your ResNet to different kinds of domains.

So that's the number of filters we have for each layer. The stem is the very start of a CNN, and our stem is just those three conv layers. That's all the paper says. What's a conv layer? A conv layer here is a sequential containing a conv of some stride, followed by a batch norm, optionally followed by an activation function. For the activation function, we're just going to use ReLU for now, because that's what they use in the paper. For the batch norm, we do something interesting; this is another tweak from the bag of tricks paper, although it goes back a couple more years than that.
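Here's roughly what such a conv layer looks like; a sketch in the spirit of the notebook's conv_layer, including a zero_bn flag whose purpose is explained next:

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    "Conv of some stride, then batch norm, then optionally an activation."
    bn = nn.BatchNorm2d(nf)
    # initialize the BN weight to 0 or 1 -- the "why" is coming up next
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks//2, bias=False), bn]
    if act: layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```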
We initialize the batch norm sometimes to have weights of one, and sometimes weights of zero. Why do we do that? Well, have a look here at ResNet-D. This is a standard ResNet block. This path here normally doesn't have the conv and the average pool (we'll talk about why they're sometimes there in a moment), so pretend it's just the identity. And the other path goes 1×1 conv, 3×3 conv, 1×1 conv. Remember, in each case it's conv, batch norm, ReLU; conv, batch norm, ReLU; and then what actually happens is the last one goes conv, batch norm, and the ReLU happens after the plus. (There's another variant where the ReLU happens before the plus, which is called preact, or preactivation, ResNet. It turns out not to work quite as well for smaller models, so we're using the non-preact version.)

Now, see this last conv here? What if we set its batch norm layer's weights to zero? What's going to happen? We get an input; one path is the identity; the other does some conv, some conv, some conv, and then a batch norm whose weights are zero, so everything gets multiplied by zero, and out comes zero. So why is that interesting? Because now we're adding zero to the identity: in other words, the whole block does nothing at all. That's a great way to initialize a model, because we really don't want to be in a position, as we've seen, where in a thousand-layer-deep model any layer is even slightly changing the variance, since that causes the gradients to spiral off to zero or to infinity. This way, literally, the activations stay the same all the way through. So we set the third conv layer to have zero in that batch norm layer, and this lets us train very deep models at very high learning rates.

You'll see nearly all of the academic literature about this talks about large batch sizes, because academics, particularly at big companies like Google and OpenAI and NVIDIA and Facebook, love to show off their giant data centers: if we have a thousand TPUs, how big a batch size can we create? But for us normal people, these tricks are also interesting, because the exact same things tell us how high a learning rate we can go to. The same tricks that let you do a giant batch and then take a giant step let us take a normal-sized batch and a much-bigger-than-usual step. And by using higher learning rates, we train faster and we generalize better, so that's all good. So this is a really good little trick.

OK, so that's conv_layer, and there's our stem. Then we're going to create a bunch of res blocks. A res block is one of these: one path is the identity, and the other is conv, conv, conv; unless we're doing a ResNet-34 or a ResNet-18, in which case one of the convs goes away. So ResNet-34 and ResNet-18 have only two convs here, and ResNet-50 onwards have three. Also, in ResNet-50 and above, the second conv actually squishes the number of channels down by four and then expands it back up again, so it could go, say, 64 channels, to 16 channels, to 64 channels; that's called a bottleneck block. So the bottleneck block is the normal block for larger ResNets, and two 3×3 convs is the normal block for smaller ResNets. You can see in our ResBlock that we pass in this thing called expansion, and it's either one or four.
It's one if it's ResNet-18 or -34, and it's four if it's bigger. If expansion equals one, we just have the two 3×3 convs; if expansion equals four, the first conv is a 1×1, and we add the extra convs to make the bottleneck. So that's what the res blocks are.

Now, I mentioned that there are two other things here. Why? Well, we can't use standard res blocks all the way through our model, can we? A res block can't change the grid size: we can't have a stride 2 anywhere in here, because if we had a stride 2 somewhere, we couldn't add the result back to the identity; they'd be different sizes. Also, we can't change the number of channels, for the same reason. So what do we do? Because, as you know, from time to time we do like to throw in a stride 2, and generally when we do, we like to double the number of channels. So when we do that, we're going to add two extra layers to the identity path: an average pooling layer, which causes the grid size to go down by two in each dimension, and a 1×1 conv, to change the number of filters. So that's what this is, and this particular way of doing it is specific to the XResNet; it gives you a nice little boost over the standard approach. You can see it here: if the number of inputs is different from the number of filters, then we add the extra conv layer; otherwise we just do noop, no operation, which is defined here. And if the stride is something other than one, we add the average pooling; otherwise it's a noop. And here is our final res block calculation. So that's the ResBlock: the ResNet-D tweak is this way of doing what they call the downsampling path.

OK, and then the final tweak is the actual ordering of where the stride 2 goes. In a normal ResNet, the stride 2 is at the start, on the 1×1 conv, and there's a 3×3 after that. But doing a stride 2 on a 1×1 conv is a terrible idea, because you're literally throwing away three quarters of the data; and it's interesting that it took people years to realize they were literally throwing away three quarters of the data. So the bag of tricks folks said: let's just move the stride 2 to the 3×3 conv, and that makes a lot more sense, because with a stride-2 3×3 you're actually hitting every pixel.

The reason I'm mentioning these details is so that you can read that paper and spend time thinking, for each of those ResNet tweaks: do you understand why they did that? It wasn't some brainless neural-architecture-search, try-everything, use-all-our-computers approach. It was: let's sit back and think about how we actually use all the inputs we have, and how we actually take advantage of all the computation we're doing. Most of the tweaks existed before, and they've cited all of that, but put together, it's a nice demonstration of how to think through architecture design.

And that's about it. We create a res block for every layer group, and so here it is creating the ResNet blocks. And now we can create all of our ResNets by simply saying how many blocks we have in each layer group: ResNet-18 is just [2, 2, 2, 2], 34 is [3, 4, 6, 3]; and then the only other thing that changes is the expansion factor, which, as I said, is 1 for 18 and 34, and 4 for the bigger ones.
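So, assuming the XResNet class built above exposes a create method taking the expansion and those per-group block counts, the different sizes come out as one-liners, something like:

```python
# hypothetical constructors, assuming XResNet.create(expansion, layers, **kwargs)
def xresnet18 (**kwargs): return XResNet.create(1, [2, 2, 2, 2],  **kwargs)
def xresnet34 (**kwargs): return XResNet.create(1, [3, 4, 6, 3],  **kwargs)
def xresnet50 (**kwargs): return XResNet.create(4, [3, 4, 6, 3],  **kwargs)
def xresnet101(**kwargs): return XResNet.create(4, [3, 4, 23, 3], **kwargs)
```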
So, that's a lot of information, and if you haven't spent time thinking about architecture before, it might take a few reads and listens for it to sink in. But I think it's a really good idea to spend time thinking about it, and also to experiment, and try to understand what's going on.

The other thing to point out here is that, the way I've written this, this is the whole ResNet. Other than the definition of conv_layer, it fits on a screen. That's really unusual: most ResNets you see, even without the bag of tricks, are 500, 600, 700 lines of code. And if every single line of code has a different arbitrary number, a 16 here, a 32 there, an average pool here, something else there, how are you going to get it right? And how are you going to be able to look at it and say, what if I did this a little bit differently? So for research and for production, you want your architecture code refactored like this, so that you can look at it and say: what exactly is going on? Is it written correctly? OK, I want to change this stride 2 to be in a different layer; how do I do it? It's really important for effective practitioners to be able to write nice, concise architectures, so that you can change them and understand them.

OK, so that's our XResNet. We can train it with or without mixup; it's up to us. Label smoothing cross entropy is probably always a good idea, unless you know your labels are basically perfect. Let's just create a little ResNet-18 and check out what our model is doing. We've already got a model summary, but we're just going to rewrite it to use the new version of Learner that doesn't have Runner anymore, so we can print out and see what happens to our shapes as they go through the model. You can change this print_mod argument to True, and it'll print out the entire blocks and show you what's going on; that's a really useful thing to help you understand what's happening in the model.
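If you wanted to build that shape printout yourself, a minimal sketch using forward hooks (the function name and print_mod are my illustrative choices) might look like this:

```python
import torch

def model_summary(model, xb, print_mod=False):
    "Print each top-level child's output shape (and optionally the module itself)."
    def hook(mod, inp, out):
        if print_mod: print(mod)
        print(out.shape)
    handles = [m.register_forward_hook(hook) for m in model.children()]
    try:
        with torch.no_grad(): model(xb)   # one forward pass fires every hook
    finally:
        for h in handles: h.remove()      # always clean up the hooks
```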
All right, so here's our architecture; it's nice and easy. We can tell it how many channels are coming in and how many are coming out, and it'll adapt automatically to our data that way. So we create our learner, we do our LR find, and now let's create one-cycle learning rate annealing. We've seen all this before; we keep on creating these tuples like 0.3, 0.7 for two phases, or 0.3, 0.2, 0.5 for three phases, so I added a little create_phases that will build those for us automatically. This one we've built before: here's our standard one-cycle annealing, and here's our parameter scheduler. One other thing I did last week: I made it so that you don't have to pass callbacks to the initializer; you can also pass them to the fit function, and it'll just use those callbacks for that fit call. That's a great way to do parameter scheduling.

And there we go: 83.2. I would love to see people beat my benchmarks here. This is the Imagenette leaderboard, and so far the best I've got for 128 pixels, 5 epochs, is 84.6, so yep, we're super close; maybe with some fiddling around you can find something even better. With these kinds of leaderboards, where a lot of the entries train in, this one was two and a half minutes on a standard (I think it was a GTX 1080 Ti), you can quickly try things out. And what I've noticed is that the results I get in 5 epochs on 128-pixel Imagenette carry over a lot to full ImageNet training and bigger models. You can learn a lot without training giant models, so compete on this leaderboard to become a better practitioner and to try things out. And if you have more time, you can go all the way to 400 epochs, which might take a couple of hours. Then of course we've also got ImageWoof, which is just doggy photos and is much harder; actually I find it an even better test case because it's a more difficult dataset. 90% is my best for this one, and I really hope somebody can beat me.

We can refactor all that stuff of adding these different callbacks into a single function called cnn_learner: we just pass in an architecture, our data, our loss function, and our optimization function, plus yes/no for which callbacks we want, and it sets everything up. If you don't pass in c_in and c_out, we'll grab them from your data for you, and then pass everything off to the Learner. So that makes things easier: now, if you want to create a CNN, it's just one line of code, adding in whatever we want, mixup, label smoothing, blah blah blah. And we get the same result when we fit it.

You can see this all put together in the ImageNet training script that's in fastai, in examples/train_imagenet.py. This entire thing will look entirely familiar to you; it's all stuff we've now built from scratch, with one exception, which is the bit that uses multiple GPUs. We're not covering that, but it's just an acceleration tweak, and you can easily use multiple GPUs by doing DataParallel or DistributedDataParallel. Other than that, this is all stuff you've seen: there's label smoothing cross entropy, there's mixup. Here's something we haven't written: save the model after every epoch. Maybe you want to write that one; that'd be a good exercise.

So, what happens if we try to train this for just 60 epochs? This is what happens. Here are benchmark results on ImageNet for all the Keras and PyTorch models. It's very hard to compare them, because they have different input sizes.
So we really should compare against the ones with our input size, which is 224. A standard ResNet-50 is so bad it's actually scrolled off the screen, so let's take ResNet-101, at 93.3% accuracy. That's twice as many layers as we used, and it was also trained for 90 epochs, so trained 50% longer: 93.3. When I trained ours on ImageNet, I got 94.1. So this extremely simple architecture, which fits on a single screen and was built entirely using common sense, trained for just 60 epochs, actually gets us above even ResNet-152: that's 93.8, and we've got 94.1. The only things above it were trained on much, much larger images; and NASNet-large is so big I can't train it (I just keep running out of memory and time), and Inception-ResNet v2 is really, really fiddly and also really, really slow. So we've now got this beautiful, nice XResNet-50 model, built in this very first-principles, common-sense way, that gets astonishingly great results. So, you know, I really don't think we all need to be running to neural architecture search and hyperparameter optimization and blah, blah, blah; we just need to use good common-sense thinking. I'm super excited to see how well that worked out.

So now that we have a nice model, we want to be able to do transfer learning. How do we do transfer learning? I mean, you all know how to do transfer learning, but let's do it from scratch. What I'm going to do is transfer-learn from ImageWoof to the Pets dataset that we used in lesson one; that's our goal. We start by grabbing ImageWoof and doing the standard data block stuff, and let's use label smoothing cross entropy. Notice how we're using all the stuff we've built: this is our Adam optimizer, this is our label smoothing cross entropy, this is the data blocks API we wrote. We're still not using anything from fastai v1. This is all stuff where, if you want to know what's going on, you can go back to the previous lessons, see what we built and how we built it, and step through the code. There's the cnn_learner that we just built in the last notebook. These five lines of code I got sick of typing, so let's dump them into a single function called sched_1cycle: it creates our phases, creates our momentum annealing and learning rate annealing, and creates our schedulers. So now we can just say sched_1cycle with a learning rate and what percentage of the batches are in the first phase, and we can go ahead and fit.

I thought, OK, for transfer learning we should try to fit a decent model, so I did 40 epochs at 11 seconds per epoch on a 1080 Ti. A few minutes later, we've got 79.6% accuracy, which is pretty good, you know, training from scratch for 10 different dog breeds with a ResNet-18. So let's try to use this to create a good Pets model. That's going to be a little bit tricky, because the Pets dataset has cats as well, and this model's never seen cats; also, this model has only been trained on, I think, less than 10,000 images, so it's an unusually small thing we're trying to transfer from. So it's an interesting experiment to see if this works.

The first thing we have to do is save the model, so that we can load it into a Pets model. When we save a model, what we do is grab its state_dict. Now, we actually haven't written this ourselves, but it would be about three lines of code if you wanted to.
All it does is literally create a dictionary. An OrderedDict is just a Python standard library dictionary that remembers its order, where the keys are the names of all the layers (and for a sequential, the index of each one), so you can look up, say, '10.bias', and it just returns those weights. So you can easily turn a module into a dictionary. Then we can create somewhere to save our model, and torch.save will save that dictionary. You could actually just use pickle here; it works fine, and behind the scenes torch.save is using pickle, but they add some header to it, basically a magic number, so that when they read it back, they can make sure it's a PyTorch model file of the right version and so forth. The nice thing is that, since we know the thing we've saved is just a dictionary, you can fiddle with it. If you ever have trouble loading something in the future, just go torch.load, put it into a variable, and look at the keys and the values to see what's going on.

So let's try to use this for Pets. We've seen Pets before, and the nice thing is that, although we've never used Pets in part 2, our data blocks API totally works. In this case, there's one images directory that contains all the images, and there isn't a separate validation set directory, so we can't use that split-by-grandparent thing; we're going to have to split randomly. But remember how we've already created split_by_func: let's just write a function that returns true or false depending on whether some random number is large or small, and now we can pass that to split_by_func and we're done. The nice thing is that when you understand what's going on behind the scenes, it's super easy for you to customize things; fastai v1 is basically identical, with a split-by-function you use the same way. So now that's split into training and validation, and you can see how nice it is that we created that dunder __repr__, so that we can print things out easily and see what's going on. If something doesn't have a nice representation, you should monkey-patch in a __repr__ so you can see what's happening.

Now we have to label it. We can't label by folder, because they're not put into folders; instead, we have to look at the file name. So let's grab one file name. I had to build all this stuff interactively in a Jupyter notebook to see what's going on: grab one name, then try to construct a regular expression that grabs just the breed name from it. Once we've got that, we can turn it into a function, and go ahead and use that category processor we built last week to label everything. And there we go: there are all the kinds of doggies we have. Except they're not just doggies now: doggies and kitties. So now we can train from scratch. Pets from scratch: 37%. Not great. So maybe with transfer learning we can do better.

For transfer learning, we can read in that ImageWoof model and then customize it for Pets. So let's create a cnn_learner for Pets: this uses the Pets data bunch, but let's tell it to create a model with 10 activations at the end, because remember, ImageWoof has 10 breeds; to load in the pre-trained model, we need to ask for a learner with 10 activations. With that, we can grab the state dictionary we saved earlier and load it into our model.
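That whole round trip is only a few lines; a sketch, where the path and the learn variable are illustrative:

```python
import torch

st = learn.model.state_dict()            # an OrderedDict of name -> tensor
print(type(st), list(st.keys())[:3])     # peek at the layer names
torch.save(st, 'models/imagewoof.pth')   # pickles the dict, plus a PyTorch header

st = torch.load('models/imagewoof.pth')  # just a dict again: inspect it if stuck
learn.model.load_state_dict(st)          # copy the weights into a matching model
```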
OK, so this is now an ImageWoof model, but the learner for it is pointing at the Pets data bunch. What we now have to do is remove the final linear layer and replace it with one that has the right number of activations to handle all of these, which is 37 pet breeds. So we look through all the children of the model and try to find the adaptive average pooling layer, because that's that kind of penultimate bit; we grab the index of that, and then create a new model that has everything up to but not including it. So this is everything before the adaptive average pooling: this is the body.

Now we need to attach a new head to this body, one with 37 activations in its linear layer instead of 10. That's a bit tricky, because we need to know how many inputs the new linear layer will require, and the number of inputs will be however many outputs come out of the body. In other words: just before the average pooling happens in the XResNet, how many channels of activations are there? Well, there's an easy way to find out: grab a batch of data, put it through the cut-down model, and look at the shape. The answer is 512: we've got a mini-batch of 128, each with 512 channels of 4×4 activations. So pred.shape[1] is the number of inputs to our head, and we can now create the head. This is basically it here, our linear layer; but remember, we tend not to use just a max pool or just an average pool. We tend to do both and concatenate them together, which is something we've been doing in this course forever, and a couple of years ago somebody finally wrote a paper about it, so I think it's an official thing now; it generally gives a nice little boost. Our linear layer therefore needs twice as many inputs, because we've got the two sets of pooling. So our new model contains the whole body, plus an adaptive concat pooling, a flatten, and our linear layer. So let's replace the learner's model with that new model, fit, and look at that: 71% by fine-tuning, versus 37% training from scratch. That looks good; we have simple transfer learning working.

So what I did then, and I do this in Jupyter all the time, is I grabbed all the cells (I hit C to copy, then V to paste), then grabbed them all and hit Shift-M to merge, and chucked a function header on top. So now I've got a function that does all the lines you saw just before, stuck together into one function. I call it adapt_model: it takes a learner and adapts it for the new data. So now we can just go cnn_learner, load the state dict, adapt the model, and then start training.

But of course, what we'd really like to do is first train only the head. So let's grab all the parameters in the body; remember, from when we did that nn.Sequential, the body is just the first thing, that whole ResNet body. We set all of its parameters to requires_grad = False, so it's frozen, and now we can train just the head: we get 54%, which is a good start. So now, as you know, we unfreeze and train some more. Uh-oh. It's better than not fine-tuning at all, but interestingly, at 56 versus 71, it's worse than the naive fine-tuning where we didn't do any freezing. So what's going on there? Any time something weird happens in your neural net, it's almost certainly because of batch norm, because batch norm makes everything weird. And that's true here too.
What happened was: the frozen part of our model was designed for ImageWoof, so those layers were tuned for some particular set of means and standard deviations, because remember, the batch norm layers are going to subtract the mean and divide by the standard deviation. But the Pets dataset has different means and standard deviations, not at the input, but inside the model. So the final layer was trained with everything else frozen, against one particular set of batch norm statistics, and then when we unfroze, everything had to try to catch up. It would be very interesting to look at the kinds of histograms and plots we did earlier in the course and see what's really going on, because I haven't really seen a paper about this. It's something we've been doing in fastai for a few years now, but I think this is the first course where we've actually drawn attention to it; it's been hidden away in the library before. And as you can see, it's a huge difference: 56 versus 71.

The good news is it's easily fixed. The trick is to not freeze all of the body's parameters, but to freeze all of the body's parameters that aren't in the batch norm layers. That way, when we fine-tune the final layer, we're also fine-tuning all the batch norm layers' weights and biases. So we adapt the model just like before, and let's create something called set_grad, which says: if it's a linear layer (the one at the end) or a batch norm layer (the ones in the middle), return without touching it; otherwise, if it's got a weight, set requires_grad to whatever you asked for, which we'll start as False. And here's a little convenience function that applies any function you pass it, recursively, to all the children of a module. So now that we have apply_mod, we can pass in a module and have the freezing applied throughout: this way, we freeze just the non-batch-norm layers, and of course not the last layer. And indeed, fine-tuning just the head is immediately a bit better: it goes from 54 to 58. But more importantly, when we then unfreeze, we're back into the 70s again.

This is just a super important thing to remember if you're doing fine-tuning, and, weirdly enough, I don't think any library other than fastai does it. If you're using TensorFlow or something, you'll have to write this yourself: any time you're doing partial-layer training, don't ever freeze the weights in the batch norm layers. Oh, and by the way: I only wrote that apply_mod because we're not allowed to use stuff in PyTorch we haven't built, but PyTorch actually has its own; it's called model.apply, so you can use that from now on. It's the same thing.
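If you did have to write it yourself, it's only a few lines; a sketch along the lines of what we just walked through (the learn variable is assumed from the surrounding context):

```python
import torch.nn as nn

def set_grad(m, b):
    "Freeze/unfreeze `m`, but never the final linear layer or any batch norm."
    if isinstance(m, (nn.Linear, nn.BatchNorm2d)): return
    if hasattr(m, 'weight'):
        for p in m.parameters(): p.requires_grad_(b)

def apply_mod(m, f):
    "Apply `f` to `m` and, recursively, to all of its children."
    f(m)
    for c in m.children(): apply_mod(c, f)

# freeze everything except the batch norm layers and the head:
apply_mod(learn.model, lambda m: set_grad(m, False))
```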
OK, so finally for this half of the course, we're going to look at discriminative learning rates. There are a few things we can do with them. One is that they're a simple way to do layer freezing without actually worrying about setting requires_grad: we can just set the learning rate to zero for some layers. So let's start by doing that. What we're going to do is split our parameters into two or more groups with a function. Here's our function; it's called bn_splitter. It creates two groups of parameters: it passes the body to _bn_splitter, which recursively looks for batch norm layers and puts them in the second group, while anything else with a weight goes in the first group; and then the second group also gets everything after the head. So this puts all our parameters into the two groups we want to treat differently. We can check, for example, that when we run bn_splitter on a model, the number of parameters in the two halves adds up to the total number of parameters in the model.

Now I want to check this works during training. We now have a splitter function in the Learner; that's another thing I added this week. When you start training, it's literally just this: when we get our optimizer, it passes the model to self.splitter, which by default does nothing at all, and we're going to use our bn_splitter to split it into multiple parameter groups. So how do we debug that? How do we make sure it's working? This is one of those things where, if I screw it up, I probably won't get an error; instead, it probably won't train the last layer, or it'll train all the layers at the same learning rate. It would be hard to know whether the model was bad because I screwed up my code or not. So we need a way to debug it. We can't just look inside and make sure it's working, because what we're doing is passing it to the splitter parameter when we create the Learner, and hoping that when we start training, it creates these two layer groups. So we need some way to look inside the model at that point, and of course, we're going to use a callback.

And this is something that's super cool. Do you remember how I told you that you can actually override dunder __call__ itself? You don't just have to override a specific callback method. By overriding __call__, we can say which callback event we want to debug, and when that event fires, run this function; if you don't pass in a function, it just jumps into the debugger as soon as that callback is hit, otherwise it calls the function. This is super handy, because now I can create a function called print_details that just prints out how many parameter groups there are and what the hyperparameters are, and then immediately raises the CancelTrainException to stop.

So then I can fit with my discriminative LR scheduler and my debug callback. My discriminative LR scheduler is something that now takes not just a single learning rate but an array of learning rates, and creates a scheduler for each one. I'm going to use 0 and 0.03: in other words, no training for the body, and 0.03 for the head and the batch norm layers. As soon as I fit, it immediately stops, because the CancelTrainException was raised, and it prints out: there are two parameter groups, which is what we want; the first parameter group has a learning rate of zero, which is what we want; and the second is 0.003, which is right, because it's 0.03 and we're using the one-cycle scheduler, so it starts out 10 times smaller.

This is just a way of saying: if you're anything like me, every time you write code, it will always be wrong, and for this kind of code you won't know it's wrong. You could be writing a paper, or doing a project at work, or whatever, in which you're not actually using discriminative learning rates at all, because of some bug you didn't know how to check for.
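Here's a sketch of that kind of check, assuming our Callback base class, CancelTrainException, and an optimizer that keeps its hyperparameters in opt.hypers (all from earlier notebooks; the self.learn attribute and the usage line are my assumptions about the current refactor):

```python
import pdb

class DebugCallback(Callback):
    "When the callback event `cb_name` fires, run `f`; with no `f`, drop into pdb."
    _order = 999                       # run after every other callback
    def __init__(self, cb_name, f=None): self.cb_name, self.f = cb_name, f
    def __call__(self, cb_name):
        if cb_name == self.cb_name:
            if self.f: self.f(self.learn)   # assumes callbacks hold their learner
            else: pdb.set_trace()

def print_details(learn):
    "Print the parameter groups and their hyper-parameters, then stop training."
    print(len(learn.opt.param_groups), learn.opt.hypers)
    raise CancelTrainException()

# usage sketch: learn.fit(1, cbs=DebugCallback('begin_batch', print_details))
```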
So make sure you can check, and always assume that you screw up everything. OK, so now we can train with zero learning rate on the first layer group, and then we can use discriminative learning rates of 1e-3 and 1e-2 and train a little bit more, and that all works. OK, so those are all the tweaks. Do we have any questions, Rachel? "A couple of tangential questions have come up." They're my favorite. "The first is: we heard that you're against cross-validation for deep learning, and wanted to know why that is. And the second question..." Let's do it one at a time. OK, so cross-validation is a very useful technique for getting a reasonably sized validation set if you don't have enough data to otherwise create one. It was particularly popular in the days when most studies were like 50 or 60 rows. If you've got a few thousand rows, it's just pointless: the statistical significance is going to be there regardless. It's not that I'm against it; it's just that most of the time you don't need it. If you've got a thousand things in the validation set and you only care whether it's plus or minus 1%, it's totally pointless. So have a look at how much your validation set accuracy varies from run to run, and if it varies too much for you to make the decisions you need to make, then you can add cross-validation. "And what are your best tips for debugging deep learning?" So Chris Lattner asked me this today as well, actually, so I'll give the same answer I gave him, which is: don't make mistakes in the first place. And the only way to do that is to make your code so simple that it can't possibly have a mistake, and to check every single intermediate result along the way to make sure it doesn't have a mistake. Otherwise your last month might have been like my last month. What happened in my last month? Well, a month ago I got 94.1% accuracy on ImageNet, and I was very happy. Then, a couple of weeks ago, I started trying various tweaks, and none of the tweaks seemed to help, and after a while I got so frustrated that I thought I'd just repeat the previous training to see whether it had been a fluke. And I couldn't repeat it: I was now getting 93.5% instead of 94.1%. And I trained it a bunch of times, and every time I trained it, it was costing me $150 of AWS credits, so I wasn't thrilled about this, and it was six hours of waiting each time. So it was quite a process to even realize it was broken. This is the kind of thing that happens when you've written this kind of code wrong: it gets broken in ways you don't even notice. It was broken for weeks in fastai, and nobody noticed. So the first thing I'll say is: you've got to be a great scientist, which means you need a journal; you need to keep track of your results. I had a good journal: I pasted everything that was going on, all my models, into a file. So I went back and confirmed it really was 94.1, and I could see exactly when that was, so I could revert to the exact commit of fastai at that time, rerun it, and I got 94.1. So now I had to figure out which change in the previous month of the entire fastai code base had caused this to break. The first thing I tried was to find a way to quickly figure out whether something was broken, but after doing a few runs and plotting them in Excel, it was very clear that the training was identical
until epoch 50, epoch 50 out of 60, so there was no shortcut. So I did a bisection search, one module at a time, looking through the 15 modules that had changed in that diff, until eventually I found it was in the mixed precision module. Then I went through each change that had happened in the mixed precision module, and, $5,000 later, I finally found the one line of code where we had forgotten to write the four characters ".opt". By failing to write .opt, we were wrapping an optimizer wrapper in another optimizer wrapper, rather than wrapping the optimizer itself, and that meant that weight decay was being applied twice. That tiny difference was so insignificant that no one using the library even noticed it wasn't working; I didn't notice it wasn't working until I started trying to get state-of-the-art results on ImageNet in 60 epochs with a ResNet-50. So yeah, debugging is hard, and worse still, most of the time you don't even know something is wrong. I mean, honestly, training models sucks, and deep learning is a miserable experience and you shouldn't do it; but on the other hand, it gives you much better results than anything else, and it's taking over the world, so it's either that or get eaten by everybody else, I guess. It's so much easier to write normal code: oh, you have to implement OAuth authentication in your web service, so you go and say, here's the API, we have to take these five steps, and after each one you check that it has happened, and you check off each one, and at the end you're done, you push it, and you have integration tests, and that's it, right? Even testing requires a totally different mindset. You don't want reproducible tests; you want tests with randomness. You want to be able to see if something's changing just occasionally, because if it only tests correctly all the time with a random seed of 42, you can't be sure it's going to work with a random seed of 41. So you want non-reproducible tests, you want randomness, you want tests that aren't guaranteed to always pass, but rather tests like "the accuracy of this integration test should be better than 0.9 nearly all the time". You want to be warned if something looks off. And this means it's a very different software development process, because if you push something to the fastai repo and a test fails, it might not be your fault; it might be that Jeremy screwed something up a month ago and one test fails one out of every thousand times. As soon as that happens, we try to write a test that fails every time: once you realize there's a problem with a thing, you try to find a way to make it fail every time. But yeah, debugging is difficult, and in the end you just have to go through each step, look at your data, make sure it looks sensible, plot it, and try not to make mistakes in the first place. Great. Well, let's have a break, and see you back here at 7:55. So, we've all done ULMFiT in part one, and there's been a lot of stuff happening in... oh, OK, let's do the question first: "What do you mean by a scientific journal?"
Ah yeah, that's a good one. This is something I'm quite passionate about. When you look at the great scientists in history, they all, as far as I can tell, had careful scientific journal practices. In my case, my scientific journal is a file in a piece of software called Windows Notepad. I paste things into it at the bottom, and when I want to find something, I press Ctrl-F. It just needs to be something that has a record of what you're doing and what the results are. Because scientists who make breakthroughs generally make the breakthrough because they look at something that shouldn't be, and they go, "oh, that's odd, I wonder what's going on?" So the discovery of the noble gases came about because the scientists saw one little bubble left in a beaker when they were pretty sure there shouldn't have been a bubble there anymore. Most people would just go "oops, there's a bubble", or wouldn't even notice, but they studied the bubble, and they found the noble gases. Penicillin was discovered because of an "oh, that's odd". And I find in deep learning this is true as well. I spent a lot of time studying batch normalization and transfer learning because, a few years ago in Keras, I was getting terrible transfer learning results for something I thought should be much more accurate, and I thought, "oh, that's odd", and I spent weeks changing everything I could, and then almost randomly tried changing batch norm. The problem is that 90% of all this fiddling around doesn't really go anywhere, but you won't be able to pick out the other 10% unless you can go back and say, OK, that really did happen, I copied and pasted the log here. So that's all I mean. "Are you also linking to your GitHub commits and datasets?"
No, because I've got the date and the time there, so I know the GitHub commit; I do make sure I'm pushing all the time. OK, so yeah, there's been a lot happening in NLP transfer learning recently: the famous GPT-2 from OpenAI, and BERT, and stuff like that, and lots of interest in transformers, which we'll cover in a future lesson. One could think that LSTMs are out of favor and not interesting anymore, but when you look at actual recent competitive machine learning results, you see ULMFiT beating BERT. Now, I should say, this is not just "ULMFiT beating BERT": the folks at n-waves are super smart, amazing people, so it's like two super smart, amazing people using ULMFiT beat some other people using BERT. But it's definitely not true that RNNs are in the past. I think what's happened is, as you'll see, transformers and CNNs for text have a lot of problems. They basically don't have state. So if you're doing speech recognition, for every sample you look at, you have to do an entire analysis of all the samples around it, again and again and again; it's ridiculously wasteful. RNNs, on the other hand, have state. But they're fiddly, and they're hard to deal with when you want to actually do research and change things, as you'll see. Partly, RNNs have state; but also, RNNs are the only thing that has had the level of carefulness around regularization that AWD-LSTM did. Stephen Merity looked at all the ways you can regularize this kind of model and came up with a great set of hyperparameters for it, and there's nothing like that outside of the RNN world. So at the moment, my go-to choice is definitely still ULMFiT for most real-world NLP tasks, and if people find BERT or GPT-2 or whatever better for some real-world tasks, that would be fascinating, I would love that to happen, but I haven't been hearing that from people actually working in industry yet, and I'm not seeing them win competitive machine learning and so forth. So I still think RNNs should be our focus, but we will also learn about transformers later. ULMFiT is just the normal transfer learning path applied to an RNN, which could be on text; but interestingly, there have also been a lot of state-of-the-art results recently on genomics applications, on chemical bonding analysis, and on drug discovery. Lots of things are sequences, and we're still just at the tip of the iceberg, because most people studying drug discovery or chemical bonding or genomics have never heard of ULMFiT. But those who are trying it are consistently getting breakthrough results, so I think it's really interesting not just for NLP but for all kinds of sequence classification tasks. The basic process is going to be: create a language model on some large dataset. Notice that "language model" is a very general term: it means predict the next item in a sequence. So it could be an audio language model that predicts the next sample in a piece of music or speech, or it could predict the next item in a genomic sequence, or whatever. That's what I mean by language model. Then we fine-tune that language model using our in-domain corpus, which in this case is going to be IMDB. And in each case, we first have to pre-process our datasets to get them ready for using an RNN on them; language models require one kind of pre-processing, and classification
models require another. And then finally, we can fine-tune our IMDB language model for classification. So this is the process we're going to go through from scratch. Sylvain has done an amazing thing in the last week, which is basically to recreate the entire AWD-LSTM and ULMFiT process from scratch in the next four notebooks. There's quite a lot in here, but a lot of it is specific to text processing, so some of it I might skip over a little bit quickly, but we'll talk about which bits are interesting. We're going to start with the IMDB dataset, as we have before, and to remind you, it contains a training folder, an unsupervised folder, and a testing folder. The first thing we need to do is create a data blocks ItemList subclass for text. Believe it or not, that's the entire code, because we already have get_files: here's get_files with '.txt', and all you have to do is override get to open a text file, like so, and we're now ready to create an item list. This data blocks API is just so super easy to extend to handle your domain: if you've got genomic sequences or audio or whatever, this is basically all you'll need to do. So now we've got an item list with 100,000 things in it (the train, the test, and the unsupervised), and we can index into it and see a text; here's a movie review. We can use all the same stuff that we've used before: in the previous notebook we built a random splitter, and now we can use it on texts. The nice thing about this decoupled API is that we can mix and match things and they just work, and we can see the representation of them; it just works. OK, so we can't throw this movie review into a model; it needs to be numbers, and as you know, we need to tokenize and numericalize it. So let's look at the details. We use spaCy for tokenizing, and we do a few things as we tokenize. One thing we do is we have a few pre-rules: these are bits of code that get run before tokenization. For example, if we find "br /", we replace it with a new line; if we find a slash or a hash, we put spaces around it; if we find more than two spaces in a row, we make it one space. Then we have these special tokens, and this is what they look like as strings; we use symbolic names for them, mainly. These different tokens have various special meanings. For example, if we see some non-whitespace character repeated more than three times in a row, we replace it. This is really cool: in Python substitution, you can pass a function as the replacement, so re.sub here is going to look for this pattern and replace it with the result of calling this function, which is really nice. What we do is stick in the TK_REP special token, meaning there was a repeated token, then a number, which is how many times it repeated, and then the thing that was actually there. We do the same thing with words. There are also lots of crappy little things we see in texts that we replace, mainly HTML entities, and we call those our default pre-rules, and then this is our default list of special tokens. So, for example, replace_rep on "cccc" would give "xxrep 4 c", and replace_wrep on "word word word word word" would give "xxwrep 5 word".
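As one concrete example, the repeated-character rule can be written with re.sub's function-replacement form. This is a minimal sketch, with the token string as described above, not necessarily the library's exact code:

```python
import re

TK_REP = 'xxrep'

def replace_rep(t):
    "Replace 3+ repetitions of the same character: 'cccc' -> ' xxrep 4 c '"
    def _replace_rep(m):
        c, cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    # \1 back-references the first group, so this matches a char followed
    # by two or more copies of itself
    return re.sub(r'(\S)(\1{2,})', _replace_rep, t)

replace_rep('This was amazing!!!!')   # -> 'This was amazing xxrep 4 ! '
```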
Why? Well, think about the alternatives. What if you read a tweet that said "This was amazing" followed by 28 exclamation marks? You could treat those 28 exclamation marks as one token, but now you have a vocab item that is specifically 28 exclamation marks. You'll probably never see that again, so it probably won't even end up in your vocab, and if it did, it's going to be so rare that you won't be able to learn anything interesting about it. But if instead we replace it with "xxrep 28 !", then this is just three tokens, from which the model can learn that lots of repeated exclamation marks is a general concept with certain semantics. That's what we're trying to do in NLP: make the things in our vocab as meaningful as possible. And the nice thing is, because we're using an LSTM, we can have multi-word sequences and be confident that the LSTM will create some stateful computation that can handle the sequence. Another alternative would be to turn the 28 exclamation marks into 28 tokens in a row, each one a single exclamation mark, but now we're asking our LSTM to hang on to that state for 28 time steps, which is a lot more work for it, and it's not going to do as good a job. We want to make things easy for our models; that's what pre-processing is all about. Same with all caps: if you've got "I AM SHOUTING", then it's pretty likely there are going to be exclamation marks after that, there might be swearing after that; the fact that there are lots of capitalized words is semantic of itself. So we replace all-caps words with a token saying "this is an all-caps word", followed by the lowercase word, so we don't have a separate vocab item for capital AM, capital SHOUTING, capital every-damn-word-in-the-dictionary. Same thing for mixed case. I haven't come across other libraries that do this kind of pre-processing; there are little bits and pieces in various papers, but I think this is a pretty good default set of rules. Notice that these rules have to happen after tokenization, because they work at a word level; we call them our default post-rules.
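Here is a sketch of what those caps post-rules might look like, operating on the token list. The token names follow the description above; the real rules handle more edge cases:

```python
TK_UP, TK_MAJ = 'xxup', 'xxmaj'

def replace_all_caps(tokens):
    "Post rule: mark fully-uppercased words with xxup, then lowercase them"
    res = []
    for t in tokens:
        if t.isupper() and len(t) > 1: res += [TK_UP, t.lower()]
        else: res.append(t)
    return res

def deal_caps(tokens):
    "Post rule: mark words starting with a capital with xxmaj, then lowercase them"
    res = []
    for t in tokens:
        if len(t) > 1 and t[0].isupper() and t[1:].islower(): res.append(TK_MAJ)
        res.append(t.lower())
    return res

replace_all_caps(['I', 'AM', 'SHOUTING'])  # ['I', 'xxup', 'am', 'xxup', 'shouting']
```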
Then there's one more rule that adds a beginning-of-stream and an end-of-stream token on either side of a list of tokens. Why do we do that? These tokens turn out to be very important, because when your language model sees an end-of-stream token, meaning "that's the end of a document", it knows the next document is something new, so it's going to have to learn to reset its state: we're not talking about the old thing anymore; we were doing Wikipedia, we were talking about Melbourne, Australia, and now there's a new token, and we're talking about the Emmys. So when it sees EOS, it has to learn to reset its state somehow, and you need to make sure you have the tokens in place to allow your model to know that these things are happening. Tokenization is kind of slow, because spaCy does it so carefully. I thought it couldn't possibly be necessary to do it that carefully, because it just doesn't seem that important, so last year I tried removing spaCy and replacing it with something much simpler, and my IMDB accuracy went down a lot. So it actually seems like spaCy's sophisticated parser-based tokenization does matter. At least we can try to make it fast. Python comes with something called ProcessPoolExecutor, which runs things in parallel, and I wrap it in this little thing called parallel. And here's my function that runs things: look, compose appears everywhere. Compose the pre-rules on every chunk, run the tokenizer, compose the post-rules on every doc; that processes one chunk, and then we run them all in parallel for all the chunks. So that's that. This is a processor, which we saw last week, and this one is a processor that tokenizes. So we can try it out: we create one, here's a bit of text, and we try tokenizing it. You can see we've got the beginning-of-stream token; "did" and "n't" are separate tokens; the comma is a token; there's an xxmaj token marking what was a capital D; and so forth. All right, so now we need to turn those into numbers, not just have a list of words. We turn them into numbers by numericalizing, which is another processor. Basically, when you call it, it finds out: do we have a vocab yet? Because numericalizing is just asking what all the unique words are, and the list of unique words is the vocab. So if we don't have a vocab, we create it, and after that, it's just a case of calling "object to int" on each one; otoi is just a dictionary. And deprocessing is just grabbing each thing from the vocab, which is just indexing into a list. So we can tokenize and numericalize, run it for two and a half minutes, and x_obj is the thing that returns the object version, as opposed to the numericalized version, so we can put it back together, and this is what we have after it's been turned into numbers and back again. Since that takes a couple of minutes, it's a good idea to dump the labeled list, so we can load it again later without having to rerun it.
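A minimal sketch of such a numericalize processor follows; the vocab-building details here (max_vocab, min_freq, the xxunk fallback at index 0) are my assumptions about reasonable defaults, not the exact notebook values:

```python
from collections import Counter

UNK = 'xxunk'

class NumericalizeProcessor:
    def __init__(self, vocab=None, max_vocab=60000, min_freq=2):
        self.vocab, self.max_vocab, self.min_freq = vocab, max_vocab, min_freq
    def __call__(self, items):
        # Build the vocab the first time we're called (i.e. on the training set)
        if self.vocab is None:
            freq = Counter(t for doc in items for t in doc)
            self.vocab = [UNK] + [t for t, c in freq.most_common(self.max_vocab)
                                  if c >= self.min_freq]
        self.otoi = {t: i for i, t in enumerate(self.vocab)}  # "object to int"
        return [self.proc1(doc) for doc in items]
    def proc1(self, doc):   return [self.otoi.get(t, 0) for t in doc]  # 0 = xxunk
    def deproc1(self, ids): return [self.vocab[i] for i in ids]        # just indexing
```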
All right, this is the bit a lot of people get confused about: how do we batch up language model data? Here's a bit of text; it's very meta, because it's a bit of text from this notebook itself. The first thing we do is pick a batch size, a small one, six, for showing you what's going on. So we go through and split all the tokens into six contiguous groups, one per batch row. So "in this notebook we will go back over the example of" is the first row, "of classifying movie reviews" starts the second, and so on; we just put the text into six groups, right? Then let's say we have a BPTT, our backprop-through-time sequence length, of five. We can split these up into groups of five, and that creates three of them: "in this notebook we will", "go back over the example", "of classifying movie reviews we". These three things are three mini-batches, and this is where people get confused, because it's not that each mini-batch has a different bunch of documents: each mini-batch has the same documents, over consecutive time steps. This is really important. Why? Because this row here in the RNN is going to be building up some state about this document, and when it goes to the next batch, it needs to use that state, and then the next batch needs to use that state again. From batch to batch, the state that's building up needs to be consistent. That's why we do the batches this way. "I wanted to ask if you did any other preprocessing, such as removing stop words, stemming, or lemmatization?" Yeah, great question. In traditional NLP those are considered important things to do. Removing stop words means removing words like "a" and "on" and "the"; stemming is like getting rid of the "ing" suffix, and so on. It's kind of universal in traditional NLP, and it's an absolutely terrible idea. Never, ever do this. The first question is: why would you do it? Why would you remove information from your neural net that might be useful? And the fact is, it is useful. Your use of stop words tells you a lot about what style of language you're using; you'll often have a lot fewer articles and such if you're really angry and speaking really quickly. The tense you're talking about is obviously very important, and stemming gets rid of it. So all that kind of stuff is in the past; you basically never want to do it. And in general, when preprocessing data for neural nets, leave it as raw as you can; that's the rule of thumb. So for a language model, each mini-batch is basically going to look something like this for the independent variable, and the dependent variable will be exactly the same thing, but shifted over by one word. So let's create that. This thing is called LM_PreLoader, but it would actually be better off being called LM_Dataset, so why don't we rename it right now? LM_Dataset: that's really what it is; it's a dataset for a language model. Remember that a dataset is defined as something with a length and a get-item, so this is a dataset: you can index into it, and it grabs an independent variable and a dependent variable. The independent variable is just the text from wherever you asked for, for BPTT tokens, and the dependent variable is the same thing offset by one; you can see it here. We can create a data loader using that dataset (remember, that's how data loaders work: you pass them a dataset), and now we have something we can iterate through, grabbing a mini-batch at a time. You can see here that x is "xxbos well worth watching" and y is "well worth watching ...", and in the second batch, "best performance to date". So make sure you print things out and check that it all makes sense. All of that we can dump into a single function to use again later, and chuck it into a data bunch. So that's all we need for a data bunch for language models.
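Here is a minimal sketch of that dataset, assuming the corpus has already been concatenated into one long tensor of token ids. The index arithmetic is arranged so that, with shuffling off, consecutive mini-batches line up row by row, which is exactly the state-consistency property described above:

```python
import torch

class LM_Dataset:
    "Language model dataset: x is a chunk of bptt tokens, y the same chunk shifted by one"
    def __init__(self, data, bs=64, bptt=70):
        self.bs, self.bptt = bs, bptt
        total = len(data) // bs * bs
        self.data = data[:total].view(bs, -1)   # one continuous token stream per row
    def __len__(self):
        return ((self.data.size(1) - 1) // self.bptt) * self.bs
    def __getitem__(self, idx):
        # idx 0..bs-1 covers every row at time 0, idx bs..2*bs-1 the same rows at
        # time bptt, and so on, so batch n+1 continues exactly where batch n stopped
        row, col = idx % self.bs, (idx // self.bs) * self.bptt
        chunk = self.data[row, col:col + self.bptt + 1]
        return chunk[:-1], chunk[1:]

# dl = torch.utils.data.DataLoader(LM_Dataset(tokens, bs=6, bptt=5),
#                                  batch_size=6, shuffle=False)
```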
We're also going to need a data bunch for classification, and that one's going to be super easy, because we already know how to create data bunches for classification: we've already done it for lots of image models, and for NLP it's going to be exactly the same. We create an item list, we split, we label; that's it. The stuff we did for images is no different; the only thing we've added is the two preprocessors. Question: "What are the trade-offs to consider between batch size and backprop through time? For example, BPTT 10 with BS 100 versus BPTT 100 with BS 10: both would be passing a thousand tokens at a time to the model. What should you consider when tuning the ratio?" It's a great question, and I don't know the answer. I would love to know, so try it; I haven't had time to fiddle with it, and I haven't seen anybody else experiment with it, so it would make a super great experiment. I think the batch size is the thing that lets it parallelize, so if you don't have a large enough batch size, it's just going to be really slow. On the other hand, a large batch size with a short BPTT may, depending on how you use it, end up with less state being back-propagated. How much that matters, I'm not sure, and when we get to our ULMFiT classification model, I'll show you where this comes in. OK, so here are a couple of examples of a document and a dependent variable, and what we're going to do is create data loaders for them. But we do have one trick here. With images, by the time we got to modeling, our images were always all the same size. This is probably not how things should be, and we have started doing some experiments with training on rectangular images of different sizes, but we're not quite ready to show you that work because it's still a little bit fiddly. For text, though, we can't avoid it: we've got different-sized texts coming in, so we have to deal with it, and the way we deal with it is almost identical to how we'll end up dealing with rectangular images when we do that, so if you are interested in rectangular images, try basically copying this approach. Here's the approach: we pad each document by adding a bunch of padding tokens. We just pick some arbitrary token, and we tell PyTorch that this token isn't text; it's just thrown in because we have to put in something to make a rectangular tensor. Now, if we have a mini-batch with a 1,000-word document, a 2,000-word document, and a 20-word document, the 20-word document is going to end up with 1,980 padding tokens on the end, and as we go through the RNN, we'd be totally pointlessly computing on all those padding tokens. We don't want that, so the trick is to sort the data by length first. That way, your first mini-batch will contain your really long documents, your last mini-batch will contain your really short documents, and no mini-batch will contain a very wide variety of lengths, so there won't be much padding, and there won't be much wasted computation. We've already looked at samplers; if you've forgotten, go back to when we created our data loader from scratch, where we created a sampler. Here we're going to create a different type of sampler. It simply goes through our data, looks at how many documents are in it, creates the range from 0 to the number of documents, sorts them by some key in reverse order, and returns that iterator. So we're going to use SortSampler, passing in a key which is a lambda function that grabs the length of the document.
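A sketch of that sampler, which is essentially just sorted() wrapped in PyTorch's Sampler interface:

```python
from torch.utils.data import Sampler

class SortSampler(Sampler):
    "Iterate through the data in order of decreasing `key` (e.g. document length)"
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key
    def __len__(self): return len(self.data_source)
    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))

# For the validation set, sorting by length keeps padding to a minimum:
# val_sampler = SortSampler(valid_ds.x, key=lambda i: len(valid_ds.x[i]))
```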
That way, our sampler is going to cause each mini-batch to be documents of similar lengths. Problem is, we can only do this for validation, not for training, because for training we want to shuffle, and sorting would undo any shuffling, since sorting is deterministic. That's why we create something called SortishSampler, which approximately orders things by length, so every mini-batch has things of similar lengths, but with some randomness. The way we do this (the details don't particularly matter) is that I created this idea of a "mega-batch", something 50 times bigger than a batch, and I sort those. So you end up with these sorted mega-batches, and then there are random permutations within them; you can see the random permutations there and there. You can look at the code if you care, but the details don't matter: in the end, it's a random sort in which things of similar lengths tend to be next to each other, and the biggest ones tend to be at the start. So now we've got a mini-batch of numericalized, tokenized documents of similar lengths, but they're not identical lengths. And you might remember the other thing from when we first created a data loader: we gave it two things, a sampler and a collate function. The collate function we wrote back then simply said torch.stack, because all our images were the same size, so we could literally just stick them together. We can't do that for documents, because they're different sizes, so we've written something called pad_collate. What Sylvain did here was basically say: let's create something big enough to handle the longest document in the mini-batch, and then go through every document and dump it into that big tensor, either at the start or at the end, depending on whether you said pad_first. Now we can pass the sampler and the collate function to our data loader, and that lets us grab mini-batches which, as you can see, contain padding at the end. And here are our normal convenience functions that do all those things for us, and that's that. So that's quite a bit of pre-processing, and I guess the main tricky bit is dealing with the different lengths.
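A sketch of pad_collate along those lines; pad_idx=1 is an assumption here, so use whatever index your vocab reserves for the padding token:

```python
import torch

def pad_collate(samples, pad_idx=1, pad_first=False):
    "Collate (doc, label) pairs into a batch, padding every doc to the longest one"
    max_len = max(len(s[0]) for s in samples)
    res = torch.zeros(len(samples), max_len, dtype=torch.long) + pad_idx
    for i, (x, y) in enumerate(samples):
        if pad_first: res[i, -len(x):] = torch.as_tensor(x, dtype=torch.long)
        else:         res[i, :len(x)]  = torch.as_tensor(x, dtype=torch.long)
    return res, torch.tensor([s[1] for s in samples])

# dl = DataLoader(train_ds, batch_size=64, sampler=train_sampler,
#                 collate_fn=pad_collate)
```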
At that point, we can create our AWD-LSTM. These are just the steps we did to create our data loader, and now we're going to create an RNN. An RNN, remember, is just a multi-layer network, but it's a multi-layer network that can have very, very many layers: if it's a 2,000-word document, that's effectively 2,000 layers. To avoid having to write 2,000 layers, we use a for loop, and between every pair of hidden layers we use the same weight matrix; that's why they're the same color in the diagram, and that's why we can use a for loop. The problem, as we've seen, is that trying to handle 2,000 layers of neural net gives us vanishing or exploding gradients, and it's really, really difficult to get it to work. And it's even worse than that, because often we have layers going into RNNs going into other RNNs, so we actually have stacked RNNs, which, when unstacked, are effectively even more thousands of layers. So the trick is that we create something called an LSTM cell: rather than just doing a matrix multiply as our layer, we do this thing called an LSTM cell. This is it here. This is a sigmoid function, and this is a tanh function. The sigmoid, remember, goes from 0 to 1, nice and smooth between the two, and the tanh function has the same shape except that it goes from -1 to 1 rather than 0 to 1. So: sigmoid is 0 to 1, tanh is -1 to 1. Here's what we do. We take our input, and we have some hidden state, as we've always had in our RNNs; this is just our usual hidden state. We multiply our input by a weight matrix in the usual way, we multiply our hidden state by a weight matrix in the usual way, and we add the two together, as we've done before for RNNs. Then we do something interesting: we split the result into four equal-sized tensors. The first quarter of the activations goes through this path, the next through this path, and so on, so we effectively have four little neural nets. The first path goes through a sigmoid and hits this thing called the cell. This is the new thing: the cell, just like hidden state, is a rank-1 tensor (or, for a mini-batch, a rank-2 tensor); it's just some activations. We multiply it by the output of this sigmoid, and since the sigmoid goes between 0 and 1, this gate has the ability to zero out bits of the cell state. So we can take the state and selectively delete some of it: based on looking at these words, the LSTM can decide to zero out some of its cell state, so the cell state is selectively forgotten. That's the forget gate. We then add on the second chunk, the second little mini neural net, which goes through a sigmoid (this is the input gate), multiplied by the third one, which goes through a tanh. The sigmoid basically lets us say which bits of the input we care about, and the tanh gives us numbers from -1 to 1; multiply them together and add them on, and that's how we update our cell state: we add on some new state. Now we take that cell state, and two things happen: it goes through to the next time step, and it also goes through one more tanh to get multiplied by the fourth little mini neural net, which is the output gate. That creates the output hidden state. So it looks like there's a lot going on, but actually it's just this: you've got one linear layer that goes from input to hidden, and one that goes from hidden to hidden. Each one produces four times the number of hidden activations, because after we compute them and add them together, chunk splits the result into four equal-sized groups: three of them go through a sigmoid, one goes through a tanh, and then it's just the multiplies and adds you saw. Conceptually, there's a lot going on in an LSTM, and it's certainly worth doing some more reading about why this particular architecture. But one thing I will say is that there are lots of other ways to set up a layer that can selectively update and selectively forget things; for example, there's also the GRU, which has one less gate. The key thing seems to be giving it some way to make a decision to forget things, because that gives it the ability to not push state through all thousand time steps or whatever. So that's our LSTM cell.
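Here is that cell in code, a minimal sketch matching the description above: two linear layers each producing four gates' worth of activations, chunked into the four little mini neural nets:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.ih = nn.Linear(ni, 4 * nh)   # input  -> 4 gates' worth of activations
        self.hh = nn.Linear(nh, 4 * nh)   # hidden -> 4 gates' worth of activations
    def forward(self, input, state):
        h, c = state
        # One add, then chunk into the four equal-sized groups
        ingate, forgetgate, cellgate, outgate = (self.ih(input) + self.hh(h)).chunk(4, dim=1)
        ingate, forgetgate, outgate = map(torch.sigmoid, (ingate, forgetgate, outgate))
        cellgate = torch.tanh(cellgate)
        c = forgetgate * c + ingate * cellgate   # selectively forget, then add new state
        h = outgate * torch.tanh(c)              # the output gate makes the new hidden state
        return h, (h, c)
```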
And an LSTM layer (an LSTM, assuming we only have one layer) is just that for loop we've seen before: we call whatever cell we asked for, in this case an LSTM cell, and we just loop through the time steps, passing in the state and updating the state. You can see this is classic deep learning; it's like an nn.Sequential, looping through a bunch of functions that keep updating the state. That's what makes it a deep learning network. So that's an LSTM. That takes 105 milliseconds for a small net on the CPU. We can pop it onto CUDA, and then it's 24 milliseconds on the GPU. It's not that much faster, because this loop, at every time step, has to push another kernel launch off to the GPU, and that's just slow. That's why we use the built-in version, which behind the scenes calls a library from Nvidia called cuDNN, which has a C++ version of this. It's about the same on the CPU (not surprisingly, since it's really not doing anything different), but on the GPU it goes from 24 milliseconds to 8 milliseconds, so it's dramatically faster. The good news is we can create a faster version ourselves by taking advantage of something in PyTorch called JIT. What JIT does is read our Python and convert it into CUDA C++ that does the same thing; it compiles it the first time you use it, and then uses that compiled code, so it can create an on-GPU loop. The result is, again, pretty similar on the CPU, but on the GPU it's 12 milliseconds: not as fast as the cuDNN version, but certainly a lot better than our non-JIT version. So this seems like some magic thing that's going to save our lives and not require us to come to the Swift for TensorFlow lectures, but I've got bad news for you: trying to get JIT working has honestly been a bit of a nightmare. This is the third time we've tried to introduce it in this course, and the other two times we either just couldn't get it working or got worse results. It doesn't work very well that often, and it's got a lot of weird things going on. For example, if you decide to comment out a line and then run it, you'll get this error saying "unexpected indent". Literally, it's not Python, so it doesn't even know how to handle commented-out lines. It's this weird thing where, well, it's heroic; it's amazing that it works at all, because the idea that you could take Python, which is so not C++, and turn it into C++ is really pushing at what's possible. So it's astonishing it works at all, and occasionally it might be useful, but it's very, very hard to use, and when something isn't as fast as you want, it's very hard to find out why: you can't profile it, you can't debug it, not in the normal ways. Obviously it will improve (it's pretty early days), but the idea of trying to parse Python and turn it into C++ (they're literally doing string interpolation behind the scenes) is kind of trying to reinvent all the stuff that compilers already do, converting a language that was very explicitly not designed for this into one that does it, and I just don't think this is the future. So I'd say, for now: be aware it exists, and be very careful. In the short term, I've found places where it literally gives the wrong gradients (it goes down a totally different autograd path), and I've had models that trained incorrectly without any warnings, because it was just wrong.
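For reference, here is roughly the plain-PyTorch loop version being timed above; and, hedging, wrapping an instance with torch.jit.script is approximately what produced the JIT timings, not necessarily the notebook's exact incantation:

```python
import torch
import torch.nn as nn

class LSTMLayer(nn.Module):
    def __init__(self, cell, *cell_args):
        super().__init__()
        self.cell = cell(*cell_args)
    def forward(self, input, state):
        # input: (seq_len, bs, ni); loop over time steps, threading the state through
        outputs = []
        for i in range(input.size(0)):
            out, state = self.cell(input[i], state)
            outputs.append(out)
        return torch.stack(outputs), state

# lstm = LSTMLayer(LSTMCell, 300, 300)
# jit_lstm = torch.jit.script(lstm)   # compiles the Python loop the first time it runs
```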
So be very careful. But sometimes, for a researcher, if you want to play with different types of RNNs, this is your only option, unless you write your own C++, or unless you try Julia or Swift, I guess. Is there a question? "Why do we need torch.cuda.synchronize? Is it a kind of lock to synchronize CUDA threads or something?" Yeah, this is something that, thanks to Tom on the forum for pointing out, matters when we're timing; you can see it in the little timing function I created here. Without the synchronize, CUDA will keep running things in the background but will let your CPU thread keep going, so things can end up looking much faster than they actually are. Synchronize says: don't keep going in my Python world until my CUDA world has finished. So now we need dropout, and this is the bit that really is fantastic about AWD-LSTM: Stephen Merity thought about all the ways you can regularize a model. Basically, dropout is just Bernoulli random noise: create ones and zeros, where each element is a one with probability 1-p, and then divide by 1-p. So with p of 0.5, that makes them randomly zeros and twos. The reason it's zeros and twos is that this way the standard deviation doesn't change, so we can remove dropout at inference time and the activations will still be scaled correctly; we talked about that a little bit in part one. So now we can create our RNN dropout, and one of the nifty things about the way Sylvain wrote this is that you don't just pass in the thing to drop out; you also pass in a size. Normally you would just pass in the size of the thing itself, but what he did here was pass in (size(0), 1, size(2)). If you remember back to broadcasting, this means we create something with a unit axis in the middle, so when we multiply the dropout mask by our tensor, the zeros get broadcast along that axis. This is really important, because that middle axis is the sequence dimension: if you drop out time step 3 but not time steps 2 or 4, you've basically broken that whole sequence's ability to compute anything, because you just killed it mid-stream. So this is called RNN dropout, or variational dropout (there are a couple of different papers that introduce the same idea), and it's simply this: you apply the same dropout mask across the entire sequence at a time. So that's RNN dropout. The second thing Stephen Merity used is something he called weight drop. It actually turns out this already existed in the computer vision world, where it was called DropConnect, so there are now two names for the same thing: weight drop and DropConnect. This is dropout applied not to the activations but to the weights themselves: you can see that in the forward pass we call _setweights, which applies dropout to the actual weights. That's our second type of dropout. The next one is embedding dropout, and this one, as you can see, drops out an entire row (it's actually a coincidence that all these rows are in order). What it does is say: you've got an embedding, and I'm going to drop out the entire embedding vector for whatever word this is. So it's dropping out entire words at a time, and that's embedding dropout.
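Here is a sketch of the mask helper and RNN dropout as described. Note the unit axis on the sequence dimension, so the same mask is broadcast to every time step:

```python
import torch
import torch.nn as nn

def dropout_mask(x, sz, p):
    "Bernoulli zeros and ones of shape `sz`, scaled by 1/(1-p) so the std doesn't change"
    return x.new(*sz).bernoulli_(1 - p).div_(1 - p)

class RNNDropout(nn.Module):
    "Variational/RNN dropout: one mask per sequence, broadcast over all time steps"
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p
    def forward(self, x):                      # x: (bs, seq_len, nh)
        if not self.training or self.p == 0.: return x
        m = dropout_mask(x.data, (x.size(0), 1, x.size(2)), self.p)
        return x * m                           # the unit axis broadcasts over seq_len
```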
With all that in place, we can create an LSTM model. It can have a number of layers, so we create an LSTM for however many layers you want, loop through them, call each layer, and apply all our different dropouts; this code is basically just calling all the different dropouts. That is an AWD-LSTM. Then we can put on top of that a simple linear model with dropout: it's literally just dropout followed by a linear layer. So we create a sequential model that takes the AWD-LSTM and passes the result to that single linear layer with dropout, and that is our language model, because that final linear layer is the thing that figures out what the next word is; its size is the size of the vocab. It's good to look at the little tests we do along the way; these are the things we use to check that everything looks sensible, and we found that, yep, everything does look sensible. Then we added something from AWD-LSTM called gradient clipping, which is a callback that, after the backward pass, checks the gradients, and if the total norm of the gradients (the square root of the sum of squares of the gradients) is bigger than some number, it divides them all so they're not bigger than that number anymore. It just clips those gradients, and that's how easy it is to add gradient clipping. It's a super good idea, not used as much as it should be, because it really lets you train at higher learning rates and avoid gradients blowing up. Then there are two other kinds of regularization. This one here is called activation regularization, and it's actually just an L2 penalty, just like weight decay, except the penalty is not on the weights; it's on the activations. This makes sure our activations are never too high. And this one's really interesting: it's called temporal activation regularization. It checks how much each activation changes from sequence step to sequence step, and takes the square of that. So it regularizes the RNN to say: try not to have things that massively change from time step to time step, because if they do, that's probably not a good sign. OK, so that's our RNNTrainer callback. We set up our loss function, which is just normal cross-entropy loss, plus a metric, which is normal accuracy; we just make sure the batch and sequence dimensions are flattened. So we can create our language model, add our callbacks, and fit.
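In code, the clipping callback might look like the sketch below, again assuming the course's Callback conventions with self.run. The AR and TAR lines are shown as comments because in the notebook they live inside the RNNTrainer callback; here `out` stands for the dropped-out activations and `raw_out` the pre-dropout ones, and alpha and beta are the two regularization strengths:

```python
import torch.nn as nn

class GradientClipping(Callback):
    "After the backward pass, rescale gradients so their total norm is at most `clip`"
    def __init__(self, clip=0.1): self.clip = clip
    def after_backward(self):
        if self.clip:
            nn.utils.clip_grad_norm_(self.run.model.parameters(), self.clip)

# Inside an RNNTrainer-style callback, after the loss is computed:
#   self.run.loss += alpha * out.float().pow(2).mean()                       # AR
#   self.run.loss += beta * (raw_out[:, 1:] - raw_out[:, :-1]).float() \
#                         .pow(2).mean()                                     # TAR
```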
Once we've got all that, we can use it to train a language model on WikiText-103. I'm not going to go through this, because it literally just uses what's in the previous notebook, but it shows you how you can download WikiText-103, split it into articles, create the text lists, split into train and valid, tokenize, numericalize, data-bunch-ify, create the model we just saw, and train it, in this case for about five hours, because it's quite a big model. Because we don't want you to have to train for five hours, you'll find you can download that pre-trained model from this link, so you can use it on IMDB. You grab your IMDB dataset, download the pre-trained model, and load it in. Then we need one more step, which is that the embedding matrix for the pre-trained WikiText-103 model is for a different bunch of words than the IMDB vocab: they have different vocabs, with some overlap. I won't go through the code, but what we do is go through each vocab item in the IMDB vocab, find out whether it's in the WikiText-103 vocab, and if it is, copy WikiText-103's embedding over. That way we end up with an embedding matrix for IMDB that is the same as the WikiText-103 embedding matrix anywhere a word is the same, and anywhere a word is missing, we just use the mean weights and the mean bias. So that's all that is. Once we've done that, we can define a splitter, just like before, to create our layer groups; we set up our callbacks and our learner; we fit; and we train for an hour or so, and at the end we have a fine-tuned IMDB language model. So now we can load up our classifier data bunch, which we created earlier; that's exactly the same lines of code we had before. I'm going to skip over the pack_padded_sequence stuff, but basically there's a neat little trick in PyTorch where you can take data of different lengths, call pack_padded_sequence, pass that to an RNN, and then call pad_packed_sequence, and it takes things of different lengths and handles them close to optimally in an RNN, so we update our AWD-LSTM to use that. You might remember that for ULMFiT we create hidden state in the LSTM for lots of time steps, and we have to decide which bit of state we actually want to use for classification. People used to use the final state. Something I tried that turned out to work really well, and ended up in the paper, is that we actually take an average pool and a max pool along with the final state, and concatenate them all together. It's like the concat pooling we do for images; we do the same kind of thing for text. We put all that together, checking that everything looks sensible, and that gives us something we call the PoolingLinearClassifier, which is just a list of batch-norm, dropout, linear layers plus our concat pooling, and that's about it: we go through our sentence one BPTT at a time, keep calling that thing, and keep appending the results. Once we've done all that, we can train it: here's our normal set of callbacks, we load our fine-tuned encoder, and we train, and we get 92% accuracy, which is pretty close to where the state of the art was a very small number of years ago. It's not the roughly 94.5% or 95% we got for the paper, because that used a bigger model trained for longer. So that was a super fast zip through ULMFiT, with plenty of stuff that's probably worth reading in more detail, and we can answer questions on the forum as well.
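Here is the concat-pooling idea in miniature. This sketch ignores padding, which the real version has to mask out before taking the mean and max:

```python
import torch

def concat_pool(output):
    "output: (bs, seq_len, nh) from the encoder -> (bs, 3*nh)"
    last = output[:, -1]           # hidden state at the final time step
    mx   = output.max(dim=1)[0]    # max pool over the time dimension
    avg  = output.mean(dim=1)      # average pool over the time dimension
    return torch.cat([last, mx, avg], dim=1)
```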
So let's spend the last 10 minutes talking about Swift, because the next two classes are going to be about Swift. I think anybody who's got to lesson 12 of this course should be learning Swift for TensorFlow. The reason why is that I think, basically, Python's days are numbered. That stuff I showed you about JIT: the more I use it, and the more I think about it, the more it looks like failed examples of software development processes I've seen over the last 25 years, where people try to convert one language into a different language, and you end up kind of using a language that you're not really using. It takes brilliant, brilliant people like the PyTorch team years to make it almost kind of work. So I think Julia or Swift will eventually, in the coming years, take over. I just don't think Python can survive, because we can't write CUDA kernels in Python, and we can't write RNN cells in Python and have them work reliably and fast. And your libraries change all the time anyway, so if you're spending all your time studying just one library and one language, you're not going to be ready for that change; you'll need to learn something new anyway, and it'll probably be Swift or Julia, and I think they're both perfectly good things to look at. Regardless: I've spent time using, in real-world scenarios, at least a couple of dozen languages, and every time I learn a new language, I become a better developer. So it's just a good idea to learn a new language. The "for TensorFlow" bit might put you off a bit, because I've complained a lot about TensorFlow, but actually, TensorFlow in the future is going to look almost totally different from TensorFlow in the past; the things happening with Swift for TensorFlow are so exciting. And there's basically almost no data science ecosystem for Swift, which means the whole thing is open for you to contribute to, so you can make serious contributions: look at any little Python library, or even just one function, that doesn't exist in Swift, and write it. The Swift community doesn't have people like us, people who understand deep learning; they're just generally not in the Swift community right now, with some exceptions. So we are valued, and you'll be working on stuff that will look pretty familiar, because we're building something a lot like fastai, but hopefully much better. So with that, I have here Chris Lattner, who started the Swift project and is now running the Swift for TensorFlow team at Google, and we have time for, I think, three questions from the community for Chris. "Assuming someone has zero knowledge of Swift, what would be the most efficient way to learn it and get up to speed with using Swift for TensorFlow?" Sure. So the courses we're teaching will assume you don't have prior Swift experience, but if you're interested, you can go to Swift.org; in the documentation tab there's a whole book online. The thing I recommend is something called "A Swift Tour"; you can just google for that. It gives you a really quick "this is what it looks like" and explains the basic concepts. It's super accessible, and that's where I'd start. The best version of the Swift book is on the iPad: it uses something called Swift Playgrounds, which is one of these amazing things Chris built that basically lets you go through the book in a very interactive way. It'll feel a lot like using a Jupyter notebook, but it's even more fancy in some ways, so you can read the book as you experiment. "As Swift for TensorFlow evolves, what do you think will be the first kind of machine learning work, accessible to people who don't have access to big corporate data centers, where Swift for TensorFlow's particular strengths will make it a better choice than the more traditional Python frameworks?" Sure. I don't know what that first thing will be, but I think you have to look at the goals of the project, and I think there are two goals overall. One is to subtract complexity. One of the things Jeremy's highlighting is that, in practice, being effective in the machine learning field means you end up doing a lot of weird things at different levels: you may be dropping down to C++ or writing CUDA code, depending on what you're doing; you may be playing with these
other systems, or these other C libraries that get wrapped up with Python, and these become leaky abstractions you have to deal with. So we're trying to make it so you don't have to deal with a lot of that complexity, so you can stay in one language; that's one aspect of it. The other piece is that we're thinking about it from the bottom up, including the compiler bits, all the systems integration pieces, the application integration pieces. And I have a theory that once we get past the world of Python, people are going to start doing a lot of really interesting things where you integrate deep learning into applications. Right now the application world and the ML world are different: I mean, people literally export their model into something like ONNX or TF Serving, and dump it into some C++ thing where it's a whole new world, a completely different world. So now you have this barrier between the training, the learning, the ML pieces on one side, and the application pieces on the other, and often these are different teams, different people, thinking about things in different ways. Breaking down those kinds of barriers, I think, is a really big opportunity that enables new kinds of work to be done. And that leads well into the next pair of questions. "Does it make sense to spend your effort learning and writing only in Swift, or is it worth having some understanding of C++ as well, to be good at numerical computation?" And secondly: "After going through some of the Swift documentation, it seems like a very versatile language. If I understand correctly, deep learning, robotics, web development, and systems programming all seem well within its purview. Do you foresee Swift's influence flourishing in all these separate areas, allowing for tighter and more fluid development between disciplines?" Sure. I think these are two sides of the same coin. I totally agree with Jeremy: learning a new programming language is good, just because you often learn to think about things in a new way, or it opens up new kinds of approaches, and having more different kinds of mental frameworks gives you the ability to solve problems you otherwise might not be able to. Learning C++ in the abstract is a good thing; having to use C++ is a little bit of a different thing, in my opinion. C++ has lots of drawbacks, and this is coming from somebody who's written a C++ compiler. I've written way too much C++ myself, and maybe I'm a little bit damaged here, but C++ is a super complicated language. It's also full of memory safety problems and security vulnerabilities and a lot of other things that are pretty well known. It's a great language, and it supports tons of really important work, but one of the goals of Swift is to be a full-stack language: to really span from scripting all the way down to the things C++ is good at. And getting C++-level performance in the same language you can build high-level machine learning frameworks in is pretty cool. I think that's one of the really unique aspects of Swift: it was designed for accessibility, and I'm not aware of a system that's similar in that way. Great, I'm really looking forward to it. Can we ask one more question? I think we're out of time; one more question next time. Thanks, everybody. Thank you, Chris Lattner, and we'll see you next week.