Okay, so this is the last class of part one. The theme of part one is classification and regression with deep learning, and specifically it's about identifying and learning the best practices for classification and regression. We started out with "here are three lines of code to do image classification", and the first four lessons then went through NLP, structured data and collaborative filtering, understanding some of the key pieces and, most importantly, how to actually make these things work well in practice. The last three lessons then go back over all of those topics in reverse order, to understand in more detail what was going on, what the code looks like behind the scenes, and how to write it from scratch.

Part two of the course will move from a focus on classification and regression, which is predicting a thing (a number, or at most a small number of things like a small number of labels), to generative modeling. Generative modeling means predicting lots of things: for example, creating a sentence, such as in neural translation, image captioning or question answering; or creating an image, such as in style transfer, super resolution, segmentation and so forth. Part two will also move away from established best practices, either from people who have written papers or from research that fastai has done and become convinced of, to material which is a little more speculative: recent papers that haven't been fully tested yet. Sometimes in part two a paper will come out in the middle of the course and we'll change direction and study that paper, just because it's interesting. So if you're interested in learning a bit more about how to read a paper and how to implement it from scratch, that's another good reason to do part two. It still doesn't assume any particular math background beyond high school, but it does assume that you're prepared to spend time digging through the notation, understanding it, and converting it to code.

All right, so where we're up to is RNNs. I think one of the issues I find most with teaching RNNs is trying to ensure that people understand they're not in any way different or unusual or magical; they're just a standard fully connected network. So let's go back to the standard fully connected network, which looks like this. To remind you, the arrows represent one or more layer operations, generally speaking a linear followed by a nonlinear function; in this case they're matrix multiplications followed by ReLU or tanh. Arrows of the same color represent exactly the same weight matrix being used. One thing which was slightly different from previous fully connected networks we've seen is that we have an input coming in not just at the first layer, but also at the second layer and at the third layer. And we tried a couple of approaches: one was concatenating the inputs and one was adding the inputs. But there was nothing at all conceptually different about this. So that code looked like this.
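Here's a minimal sketch of the kind of model being described; the names and sizes are illustrative rather than the exact notebook code, and it assumes an embedding size n_fac and hidden size n_hidden:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Char3Model(nn.Module):
    """Predict character 4 from characters 1-3: one weight matrix per arrow colour."""
    def __init__(self, vocab_size, n_fac, n_hidden):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)        # input -> embedding
        self.l_in = nn.Linear(n_fac, n_hidden)          # green arrow: embedding -> hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)   # orange arrow: hidden -> hidden
        self.l_out = nn.Linear(n_hidden, vocab_size)    # blue arrow: hidden -> output

    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        h = torch.zeros_like(in1)                 # "invent an empty matrix" for the first step
        h = torch.tanh(self.l_hidden(h + in1))    # here the inputs are added, not concatenated
        h = torch.tanh(self.l_hidden(h + in2))
        h = torch.tanh(self.l_hidden(h + in3))
        return F.log_softmax(self.l_out(h), dim=-1)
```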
We had a model where we basically defined the three arrow colors as three different weight matrices, and by using the Linear class we got both the weight matrix and the bias vector wrapped up for us for free. Then we went through and did each of our embeddings, put it through our first linear layer, and then did each of our hiddens (I think they were the orange arrows). In order to avoid the fact that there's no orange arrow coming into the first one, we decided to invent an empty matrix, and that way every one of these rows looked the same.

So then we did exactly the same thing except we used a loop, just to refactor the code. It was just a code refactoring; there was no conceptual change. And since we were doing a refactoring, we took advantage of that to increase the number of characters to eight, because I was too lazy to type eight linear layers, but I'm quite happy to change a loop index to eight. So this now looped through the exact same thing, but we had eight of these rather than three. Then we refactored that again by taking advantage of nn.RNN, which basically puts that loop together for us and keeps track of this h as it goes along for us. By using that we were able to replace the loop with a single call. So again, that's just a refactoring, doing exactly the same thing.

Okay, so then we looked at something which was mainly designed to save some training time. Previously, if we had a big piece of text, like a movie review, we were basically splitting it up into eight-character segments, and we'd grab segment number one and use it to predict the next character. But in order to make sure we used all of the data, we didn't just split it up like that; we actually said, okay, here's our whole thing: the first segment will be this section, the second will be that section, then that section, then that section, and each time we're predicting the next character one along. And I was a bit concerned that that seems pretty wasteful, because as we calculate each section, nearly all of it overlaps with the previous section. So instead, what we did was we said: what if we actually split it into non-overlapping pieces? Let's grab this section here and use it to predict every one of the characters one along, and then grab the next section and use it to predict every one of its characters one along. So after we look at the first character, we try to predict the second character; after we look at the second character, we try to predict the third character; and so forth. So that's where we got to.

And then one of you perceptive folks asked a really interesting question, or expressed a concern, which was: hey, after we got through the first non-overlapping section, we threw away our h activations and started a new one. That meant that when it was trying to use character one to predict character two, it had nothing to go on; it had only done one linear layer. And that seems like a problem, which indeed it is. Okay, so we're going to do the obvious thing, which is: let's not throw away h. Let's not throw away that matrix at all. So in code, the big problem is here.
Every time we call forward, in other words every time we do a new mini-batch, we're creating our hidden state (which, remember, is the orange circles) and resetting it back to a bunch of zeros. So as we go to the next non-overlapping section, we're saying: forget everything that's come before. But in fact the whole point is that we know exactly where we are: we're at the end of the previous section and we're about to start the next contiguous section. So let's not throw it away. Instead, the idea is to cut this out, move it up to here, store it away in self, and then keep updating it. So we're going to do that, and there are going to be some minor details to get right.

So let's start by looking at the model. Here's the model; it's nearly identical, but, as expected, I've got one more line in my constructor where I call something called init_hidden, and, as expected, init_hidden sets self.h to be a bunch of zeros. So that's entirely unsurprising. Then, as you can see, our RNN now takes in self.h and, as before, spits out our new hidden activations. And the trick is now to store those away inside self.h.

So here's wrinkle number one. If you think about it, if I were to simply do it like that and then train this on a document that's a million characters long, then the size of this unrolled RNN has a million circles in it. That's fine going forwards. But then when I finally get to the end and I say "here's my character" (and remember, we're doing multi-output now; multi-output looks like this, or if we were to draw the unrolled version of multi-output, we would have a triangle coming off at every point), the problem is that when we do backpropagation, we're calculating how much the error at character one impacts the final answer, how much the error at character two impacts the final answer, and so forth; and we need to go back through and say how we have to update our weights based on all of those errors. And if there are a million characters, my unrolled RNN is a million layers long: I have a one-million-layer fully connected network. I didn't have to write the million layers, because I have a for loop and the for loop is hidden inside self.rnn, but it's still there. So this is actually a one-million-layer fully connected network.

The problem with that is that it's going to be very memory intensive, because in order to do the chain rule I have to be able to multiply at every step, like f′(u) times g′(x), so I have to remember the values at every one of those layers. So I'm going to have to remember all those million layers, and I'm going to have to do a million multiplications, and I'm going to have to do that every batch. So to avoid that, we basically say: all right, from time to time I want you to forget your history. We can still remember the state, which is to say the actual values in our hidden matrix, but we can remember the state without remembering everything about how we got there. So there's a little function called repackage_var, which is literally just this: it simply grabs the tensor out of the variable (because remember, the tensor itself doesn't have any concept of history) and creates a new variable out of that. And so this variable is going to have the same value but no history of operations, and therefore when it tries to backpropagate, it'll stop there.
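Here's a minimal sketch of that function and of the stateful model it slots into. This uses modern PyTorch spelling, where detach() plays the role that wrapping h.data in a Variable played in the lesson-era code, and the names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repackage_var(h):
    """Keep the values of the hidden state, but drop its history of operations."""
    return h.detach() if isinstance(h, torch.Tensor) else tuple(repackage_var(v) for v in h)

class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, n_hidden, bs):
        super().__init__()
        self.vocab_size, self.n_hidden = vocab_size, n_hidden
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)                      # the hidden state now lives on self

    def forward(self, cs):                        # cs: sequence length x batch size
        bs = cs.size(1)
        if self.h.size(1) != bs:                  # the last batch of an epoch can be smaller
            self.init_hidden(bs)
        outp, h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)                 # keep the state, not its history
        # flatten to rank 2 so the loss function is happy (explained below)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)

    def init_hidden(self, bs):
        self.h = torch.zeros(1, bs, self.n_hidden)
```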
So basically what we're going to do then is we're going to call this in our forward. That means it's going to do eight characters, it's going to backpropagate through eight layers, it's going to keep track of the actual values in our hidden state, but at the end of those eight it's going to throw away its history of operations. This approach is called backprop through time, and when you read about it online, people make it sound like a different algorithm or some big insight or something, but it's not at all; it's just saying: hey, after our for loop, throw away your history of operations and start afresh. So we're keeping our hidden state, but we're not keeping our hidden state's history. Okay? So that's wrinkle number one, and that's what this repackage_var is doing. And so when you see BPTT, that's referring to backprop through time, and you might remember we saw that in our original RNN lesson: we had a variable called bptt equal to 70, and when we set that, we're just saying how many layers to backprop through. Another good reason not to backprop through too many layers is that if you have any kind of gradient instability, like gradient explosion or gradient vanishing, the more layers you have, the harder the network gets to train; so it's slower and less resilient. On the other hand, a longer value for bptt means that you're able to explicitly capture a longer memory, more state. So that's something you get to tune when you create your RNN.

All right, wrinkle number two is: how are we going to put the data into this? It's all very well the way I described it just now, where we said we could first of all look at this section, then this section, then this section; but we want to do a mini-batch at a time, a bunch at a time. So in other words, we want to say: let's do it like this. Mini-batch number one would say, let's look at this section and predict that section; and at the same time, in parallel, let's look at this totally different section and predict this; and at the same time, in parallel, let's look at this other totally different section and predict this. And then, because remember, in our hidden state we have a vector of hidden state for everything in our mini-batch, it's going to keep track of a vector here, a vector here, a vector here at the end of each of these, and then we can move across to the next one and say: okay, for this part of the mini-batch, use this to predict that, and use this to predict that, and use this to predict that. So you can see that we've got a number of totally separate bits of our text that we're moving through in parallel.

So hopefully this is going to ring a few bells for you, because what happened was, back when we started looking at torchtext for the first time, we started talking about how it creates these mini-batches. What I said was: we took our whole big, long document, consisting of the entire works of Nietzsche, or all of the IMDb reviews concatenated together, or whatever, and (not surprisingly, because this is really weird at first) a lot of you didn't quite hear what I said correctly. What I said was: we split this into 64 equal-sized chunks. And a lot of your brains went, "Jeremy just said we split this into chunks of size 64." But that's not what Jeremy said. Jeremy said we split it into 64 equal-sized chunks, right?
So if this whole thing was length 64 million, which would be a reasonable-sized corpus, then each of our 64 chunks would have been of length 1 million. And so what we did was: we took the first chunk of 1 million and put it here, and the second chunk of 1 million and put it here, and the third chunk of 1 million here, and so forth, to create 64 chunks. And then each mini-batch consisted of us saying: let's split this down here and here and here, and each of these is of size bptt, which I think we had at something like 70. So what happened was, our first mini-batch was all of these: we do all of those at once and predict everything offset by one. And then at the end of that first mini-batch, we move to the second chunk of rows and use each one of those to predict the next one, offset by one. So that's why we did that slightly weird thing: we wanted to have a bunch of things we can move through in parallel, each of which, hopefully, is far enough away from the others that we don't have to worry about the fact that the start of this million characters was actually in the middle of a sentence. Who cares, right? Because that only happens once every million characters.

I was wondering if you could talk a little bit more about augmentation for this kind of dataset. Data augmentation for this kind of dataset? No, I can't, because I don't really know a good way. It's one of the things I'm going to be studying between now and part two. There have been some recent developments, particularly something we talked about in the machine learning course, and I think we briefly mentioned here, which was that somebody won a recent Kaggle competition by doing data augmentation by randomly inserting parts of different rows, basically. Something like that may be useful here, and I've seen some papers that do something like that, but I haven't seen any recent-ish state-of-the-art NLP papers doing this kind of data augmentation, so it's something we're planning to work on. Thanks.

So Jeremy, how do you choose a bptt? There are a couple of things to think about when you pick your bptt. The first is that you'll note that the matrix for a mini-batch is of size bptt by batch size, so one issue is that your GPU RAM needs to be able to fit that times your embedding matrix (every one of these is going to be of length equal to the embedding size), plus all of the hidden state. So one thing is, if you get a CUDA out-of-memory error, you need to reduce one of those. If your training is very unstable, like your loss is shooting off to NaN suddenly, then you could try decreasing your bptt, because you've got fewer layers to gradient-explode through. If it's too slow, you could try decreasing your bptt, because it's got to do one of those steps at a time; that for loop can't be parallelized. Well, I say that: there's a recent thing called QRNN, which we'll hopefully talk about in part two, which kind of does parallelize it, but the versions we're looking at don't. So those would be the main issues, I think: look at performance, look at memory, and look at stability, and try to find a number that's as high as you can make it while all of those things work for you.

Okay, so trying to get all that chunking and lining up and everything to work is more code than I want to write.
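Just to make the layout concrete, here's a rough, conceptual sketch of that chunking (this is not the library's actual code, just the idea torchtext implements for us):

```python
import torch

def batchify(ids, bs=64):
    """Lay a long sequence of token ids out as bs parallel contiguous chunks.
    Each column of the result is one 1/bs slice of the corpus."""
    n = len(ids) // bs
    return torch.tensor(ids[:n * bs]).view(bs, -1).t().contiguous()   # shape: n x bs

def get_batch(data, i, bptt=70):
    """One mini-batch: bptt rows of inputs, and the same rows shifted by one as targets."""
    seq_len = min(bptt, len(data) - 1 - i)
    x = data[i:i + seq_len]
    y = data[i + 1:i + 1 + seq_len].contiguous().view(-1)   # targets arrive pre-flattened
    return x, y
```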
So for this section, we're going to go back and use torchtext again. When you're using APIs like fastai and torchtext, which in this case are designed (at least from the fastai side) to work together, you often have a choice: this API has a number of methods that expect the data in a certain format, and you can either change your data to fit that format, or you can write your own dataset subclass to handle the format your data is already in. I've noticed on the forum a lot of you are spending a lot of time writing your own dataset classes, whereas I am way lazier than you, and I spend my time instead changing my data to fit the dataset classes I have. Either is fine, and if you realize there's a format of data that you and other people are likely to be seeing quite often and it's not in the fastai library, then by all means write the dataset subclass, submit it as a PR, and then everybody can benefit. But in this case I just thought: I want to have some Nietzsche data fed into torchtext, so I'm just going to put it in a format that torchtext already supports.

Torchtext already has (or at least the fastai wrapper around torchtext already has) something where you can have a training path and a validation path, and one or more text files in each path containing a bunch of stuff that's concatenated together for your language model. So in this case, all I did was make a copy of my Nietzsche file, copy it into the training folder, make another copy, stick it into the validation folder, and then in the training set delete the last 20% of rows, and in the validation set delete all except the last 20% of rows. And I was done. So in this case I found that easier than writing a custom dataset class. The other benefit of doing it that way was that I felt it was more realistic to have a validation set that wasn't a random shuffled set of rows of text, but was a totally separate part of the corpus; because I feel like in practice you're very often going to be saying, I've got these books or these authors I'm learning from, and then I want to apply it to these different books and these different authors. So for a more realistic validation of my Nietzsche model, I should use a whole separate piece of the text; in this case, the last 20% of the rows of the corpus.

I haven't created this for you, intentionally, because this is the kind of thing where you should be making sure you're familiar enough and comfortable enough with bash (or whatever) that you can create these yourself, and that you understand what they need to look like and so forth. So in this case, you can see I've now got a train and a validation directory here, and I can look inside. You can see I've literally just got one file in it, because when you're doing a language model, i.e. predicting the next character or predicting the next word, you don't really need separate files. It's fine if you do have separate files, but they just get concatenated together anyway.

All right, so that's my source data, and here are the same lines of code that we've seen before; let's go over them again, since it was a couple of lessons ago. In torchtext, we create this thing called a field, and the field initially is just a description of how to go about pre-processing the text.
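A sketch of what that field definition looks like, using the torchtext API from the era of the course (in newer torchtext releases this class lives under torchtext.legacy):

```python
from torchtext import data

# lowercase everything, and treat each character as its own token
TEXT = data.Field(lower=True, tokenize=list)
```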
Now in this case I'm going to say, hey, lowercase it (though now I think about it, there's no particular reason to have done this; uppercase would work fine too), and then: how do I tokenize it? You might remember last time we used a tokenization function which largely split on whitespace and tried to do clever things with punctuation, and that gave us the word model. In this case I want a character model, so I actually want every character put into a separate token, and for that I can just use the built-in function list, because list in Python does exactly that. So this is where you can see that understanding how libraries like torchtext and fastai are designed to be extended can make your life a lot easier: very often both of these libraries expect you to pass in a function that does something, and then you realize, oh, I can write any function I like. Okay, so this now means that each mini-batch is going to contain a list of characters.

And here's where we get to define all our different parameters. To make it the same as previous sections of this notebook, I'm going to use the same batch size and the same number of characters (which I'll rename to bptt, since we now know what that means), plus the size of the embedding and the size of our hidden state. Remembering that the size of our hidden state simply means, going all the way back to the start, the size of the state that's created by each of those orange arrows; it's the size of each of those circles.

Having done that, we can then create a little dictionary saying what our training, validation and test sets are; in this case I don't have a separate test set, so I'll just use the same thing. And then I can say: I want a language-model-data subclass of model data, I'm going to grab it from text files, and this is my path, this is my field which I defined earlier, these are my files, and these are my hyperparameters. min_freq isn't going to do anything in this case, because I don't think there's any character that appears fewer than three times, so that's probably redundant. Okay, so at the end of that it says there are going to be 963 batches to go through, and if you think about it, that should be equal to the number of tokens divided by the batch size divided by bptt, because that's the size of each of those rectangles. You'll find that in practice it's not exactly that, and the reason is that the authors of torchtext did something pretty smart, which I think we briefly mentioned before: they said, we can't shuffle the data (the way we like to shuffle images so we see them in a different order each time, for a bit more randomness), because we need to be contiguous; but what we can do is randomize the length, basically randomize bptt a little bit each time. And so that's what torchtext does. It's not always going to give us exactly eight characters: five percent of the time it'll actually cut it in half, and then it adds on a small random amount (a little standard deviation) to make it slightly bigger or smaller than four or eight. So it's going to be slightly different from eight on average. Yes? Just to make sure: is it going to be constant per mini-batch? Yeah, exactly, that's right.
So a mini-batch needs to do a matrix multiplication, and the mini-batch size has to remain constant, because we've got this h weight matrix that has to line up in size with the size of the mini-batch. But the sequence length can change, no problem.

Okay, so that's why we have 963: the length of a data loader is how many mini-batches there are, and in this case it's a bit approximate. The number of tokens is how many unique things are in the vocabulary. And remember, after we run this line, TEXT now contains not just a description of what we want but also an extra attribute called vocab, which contains things like a list of all the unique items in the vocabulary and a reverse mapping from each item to its number. So that TEXT object is now an important thing to keep track of.

So let's now try this. We started out by looking at the class, and the class is exactly the same as the class we've had before. The one key difference is that the constructor calls init_hidden, which sets self.h; so h isn't a local variable anymore, it's now an attribute, self.h, containing a bunch of zeros. Now, I mentioned that the batch size stays constant each time, but unfortunately when I said that I lied to you, and the way I lied to you is that the very last mini-batch will be shorter: the very last mini-batch is actually going to have fewer than 64 items. Well, it might be exactly the right size, if it so happens that the dataset is exactly divisible by bptt times batch size, but it probably isn't, so the last batch will probably have less. So that's why I do a little check here that says: is the batch size inside self.h (self.h is, if you like, a matrix whose height is the number of activations and whose width is the mini-batch size) equal to the actual batch size we've just received? And if they're not the same, set it back to zeros again. So this is just a minor wrinkle: at the end of each epoch it's going to get one of these short mini-batches, and then, as soon as it starts the next epoch, it's going to see that the sizes don't match again and reinitialize to the correct full batch size. So that's why, if you're wondering, there's an init_hidden not just in the constructor but also inside forward: it's to handle this end-of-epoch / start-of-epoch difference. Not an important point by any means, but potentially confusing when you see it.

Okay, so the last wrinkle is something which I think slightly sucks about PyTorch, and maybe somebody can be nice enough to try and fix it with a PR if anybody feels like it, which is that the loss functions such as softmax are not happy receiving a rank-3 tensor. Remember, a rank-3 tensor is just another way of saying a three-dimensional array. There's no particular reason they ought not to be happy receiving a rank-3 tensor; somebody could write some code to say, hey, a rank-3 tensor is probably sequence length by batch size by results, so just do it for each of the two initial axes. But no one's done that, so it expects a rank-2 tensor. Funnily enough, it can handle rank 2 or rank 4, but not rank 3.
So we've got a rank-2 tensor containing, for each time period (I can't remember which way around the axes are, but whatever), for each batch, our predictions; and then we've got our actuals in the same form: for each time period, for each batch, the actual next character. And we just want to check whether they're the same. In an ideal world the loss function would check item (1,1), then item (1,2), then item (1,3), and so on; but since that hasn't been written, we just have to flatten them both out, rows to rows. And so that's why here I have to use .view. And .view says the number of columns will be equal to the size of the vocab (because remember, we're going to end up with a prediction, a probability, for each letter), and the number of rows is however big is necessary, which will be equal to batch size times bptt. And then you may be wondering where I do that for the target, and the answer is: torchtext knows that the target needs to look like that, so torchtext has already done it for us; torchtext automatically changes the target to be flattened out. And you might actually remember, if you go back to lesson 4, when we looked at a mini-batch that torchtext spat out, we noticed it was flattened, and I said we'd learn about why later; and later has now arrived.

Okay, so there are the three wrinkles; well, I guess four wrinkles: get rid of the history, recreate the hidden state if the batch size changes, flatten out, and use torchtext to create mini-batches that line up nicely. Once we do those things, we can then create our model, create our optimizer with that model's parameters, and fit it. One thing to be careful of here is that softmax now, as of PyTorch 0.3, requires that we pass in a number saying which axis we want to do the softmax over. At this point this is a three-dimensional tensor, and we want to do the softmax over the final axis. Remember what softmax does: softmax(x)_i = e^(x_i) divided by the sum over j of e^(x_j), so it's saying which axis we sum over; in other words, which axis we want to sum to one. And in this case, clearly, we want to do it over the last axis, because the last axis is the one that contains the probability per letter of the alphabet, and we want all of those probabilities to sum to one.

So therefore, to run this notebook, you're going to need PyTorch 0.3, which just came out this week. If you're doing this on the MOOC you're fine, I'm sure you've got at least 0.3 or later; for the students here, if you just update with conda, it'll automatically update you to 0.3. The really great news is that 0.3, although it does not yet officially support Windows, does in practice: I successfully installed 0.3 from conda yesterday by typing conda install pytorch on Windows, I then attempted to run the entirety of lesson one, and every single part worked. I actually ran it on this very laptop. So for those who are interested in doing deep learning on their laptop, I can definitely recommend the new Surface Book: the new Surface Book 15-inch has a GTX 1060 6GB GPU in it, and it was running about three times slower than my 1080 Ti, which I think means
it's about the same speed as an AWS P2 instance. And as you can see, it's also a nice convertible tablet that you can write on, and it's thin and light, so I've never seen such a good deep learning laptop. Also, I successfully installed Linux on it and all of the fastai stuff worked on Linux as well, so it's a really good option if you're interested in a laptop that can run deep learning stuff.

All right, so that's the thing to be aware of with this dim=-1. So then we can go ahead and construct this and call fit, and we're basically going to get pretty similar results to what we got before. So then we can go a bit further with our RNN by unpacking it a bit more. And so this is now, again, exactly the same thing, and gives exactly the same answers, but I have removed the call to RNN; I've got rid of this self.rnn. This is just something I won't spend time on, but you can check it out: instead, I've now defined my RNN using RNNCell, and I've copied and pasted the code above (don't run it, this is just for your reference) from PyTorch. This is the definition of RNNCell in PyTorch, and I want you to see that you can now read PyTorch source code and understand it; not only that, but recognize it as something we've done before. It's a matrix multiplication of the weights by the inputs, plus biases; F.linear simply does a matrix product followed by an addition. And interestingly, you'll see they do not concatenate the input bit and the hidden bit: they sum them together, which was our first approach. As I said, you can do either, neither one's right or wrong, but it's interesting to see that this is the definition here.

Yes, Yannette? Can you give us some insight about why they're using that particular activation function, tanh? Yeah, I think we might have briefly covered this last week, but very happy to do it again. Basically, tanh is bounded between minus one and plus one; tanh looks like a sigmoid function, doubled in height and shifted down by one, essentially. So it's a nice function in that it forces the values to be no smaller than minus one and no bigger than plus one, and since we're multiplying by this weight matrix again and again and again, we might worry that an unbounded value would have more of a gradient explosion problem. That's basically the theory. Having said that, you can actually ask PyTorch for an RNNCell which uses a different nonlinearity: you can see by default it uses tanh, but you can ask for ReLU as well. But yeah, pretty much everybody still seems to use tanh, as far as I can tell.

So you can basically see here, this is all the same, except now I've got an RNNCell, which means now I need to put my for loop back; and you can see every time I call my little RNN cell, I just append the result onto my list, and at the end the result is all of those stacked up together. So I'm just trying to show you how nothing inside PyTorch is mysterious, right?
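A sketch of that unpacked version; again the names are illustrative, and repackage_var is the same helper as before:

```python
class CharSeqStatefulRnn2(nn.Module):
    """Same model as before, but with nn.RNN replaced by nn.RNNCell and an explicit loop."""
    def __init__(self, vocab_size, n_fac, n_hidden, bs):
        super().__init__()
        self.vocab_size, self.n_hidden = vocab_size, n_hidden
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)

    def forward(self, cs):
        bs = cs.size(1)
        if self.h.size(0) != bs:
            self.init_hidden(bs)
        outp, o = [], self.h
        for c in cs:                      # this is the loop nn.RNN was hiding from us
            o = self.rnn(self.e(c), o)
            outp.append(o)
        self.h = repackage_var(o)
        outp = torch.stack(outp)          # stack the per-step outputs back together
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)

    def init_hidden(self, bs):
        self.h = torch.zeros(bs, self.n_hidden)
```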
You should find you get basically the same answer; in fact, I found I got exactly the same answer from this as from the previous one. In practice you would never write it like this, but what you may well find in practice is that somebody will come up with a new kind of RNN cell, or a different way of keeping track of things over time, or a different way of doing regularization; and inside fastai's code you'll find that we do exactly this: we write this loop by hand, because we use some regularization approaches that aren't supported by PyTorch.

All right, so another thing I'm not going to spend much time on, but I'll mention briefly, is that nobody really uses this RNNCell in practice. The reason we don't is that even though the tanh is there, you do tend to find gradient explosions are still a problem, so we have to use pretty low learning rates and pretty small values of bptt to get these to train. So what we do instead is replace the RNNCell with something like this: it's called a GRU cell. Here's a picture of it, and there are the equations for it. I'll go through both quickly, but we'll talk about it much more in part two.

We've got our input, and normally our input would go straight in, get multiplied by a weight matrix to create our new activations, and get added to the existing activations. That's not what happens here. In this case our input goes into this h-tilde temporary thing, and it doesn't just get added to our previous activations: our previous activations first get multiplied by this value r, and r stands for reset; it's a reset gate. And how do we calculate this value, which goes between 0 and 1, in our reset gate? Well, the answer is that it's simply a matrix product between some weight matrix and the concatenation of our previous hidden state and our new input, which we then put through the sigmoid function. In other words, this is a little mini neural net with no hidden layers; or, put another way, it's a little logistic regression. (One of the things I hate about mathematical notation is that symbols are overloaded a lot: sometimes when you see sigma it means standard deviation, but when you see it next to a parenthesis like this it means the sigmoid function, which looks like that.) I mention this briefly because it's going to come up a lot in part two, so it's a good idea to start getting used to this idea that in deep learning you can have little mini neural nets inside your neural nets. And this little mini neural net is going to be used to decide how much of my hidden state to remember. It might learn that, for example, when you see a full stop you should throw away nearly all of your hidden state: that is probably something it would learn, and it's very easy for it to learn that using this little mini neural net. So that goes through to create my new hidden state, along with the input.

And then there's a second thing that happens, which is that there's this gate here called z. What z says is: you've got some amount of your previous hidden state, plus your new input, that are going to go through to create your new state, and I'm going to let you decide to what degree you use this new version of your hidden state, and to what
degree you just leave the hidden state the same as before. So this thing here is called the update gate. So the cell has two decisions it can make: first, how much of the hidden state to throw away when building the candidate (that's the reset gate), and second, how much to update the hidden state with that candidate versus just leaving it exactly the same (that's the update gate). And the equation hopefully is going to look pretty familiar; check this out here. Remember how I said you want to start to recognize some common ways of writing things? Here I have one-minus-something times one thing, plus that same something times another thing, which, remember, is a linear interpolation. So in other words, the value of z decides to what degree we keep the previous hidden state and to what degree we use the new hidden state. That's why they draw it here as this kind of switch; it's not actually a switch, because you can put it in any position: it could be here, or here, or here, deciding how much to update.

So those are basically the equations: a little mini neural net with its own weight matrix to decide how much to update, a little mini neural net with its own weight matrix to decide how much to reset, and then that's used to do an interpolation between the two hidden states. That's called a GRU, a gated recurrent unit. There's the definition from the PyTorch source code; they have some slight optimizations in there which, if you're interested, we can talk about on the forum, but it's exactly the same formula we just saw. And so if you go nn.GRU, that uses this same code but replaces the RNNCell with this cell, and as a result, rather than getting a loss of about 1.54, we're now getting down to 1.40, and we can keep training even more and get right down to 1.36. So in practice a GRU, or, very nearly equivalently (we'll see in a moment), an LSTM, is what pretty much everybody always uses.
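For reference, here are the GRU equations in one place, written the way PyTorch's documentation writes them (n_t is the h-tilde candidate from the picture, σ is the sigmoid, and ⊙ is element-wise multiplication):

    r_t = σ(W_ir x_t + b_ir + W_hr h_{t-1} + b_hr)
    z_t = σ(W_iz x_t + b_iz + W_hz h_{t-1} + b_hz)
    n_t = tanh(W_in x_t + b_in + r_t ⊙ (W_hn h_{t-1} + b_hn))
    h_t = (1 − z_t) ⊙ n_t + z_t ⊙ h_{t-1}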
So, a question: the r and z are ultimately scalars after they go through the sigmoid, but they're applied element-wise, is that correct, to both of the h's? Yeah; although of course there's one for each item in the mini-batch, but yeah, it's applied element-wise. Okay, great, thanks.

On Chris Olah's excellent blog there's "Understanding LSTM Networks", where you can read about all of this in much more detail if you're interested; and the other one I was drawing from here, WildML, also has a good blog post on this. If somebody wants to be helpful, feel free to put them in the lesson wiki if you can find them.

Okay, so then, putting it all together, I'm now going to replace my GRU with an LSTM. I'm not going to bother going through all of it; it's very similar to a GRU, but the LSTM has one more piece of state in it called the cell state, not just the hidden state. So if you do use an LSTM, inside your init_hidden you now have to return a tuple of matrices; they're exactly the same size as the hidden state, but you just have to return the tuple. The details don't matter too much, but we can talk about it during the week if you're interested. When you pass it in, you still pass in self.h, it still returns a new value of h, and you can repackage it in the usual way. So this code is identical to the code before. One thing I've done, though, is I've added dropout inside my RNN, which you can do with the PyTorch RNN function; that's going to do dropout after each time step. And I've doubled the size of my hidden layer, since I've now added 0.5 dropout; my hope was that this would let it learn more while being more resilient as it does so.

So then I wanted to show you how to take advantage of a little bit more fastai magic without using the Learner class, and so I'm going to show you how to use callbacks; specifically, we're going to do SGDR without using the Learner class. To do that, we create our model again, just a standard PyTorch model, and this time, rather than the usual PyTorch approach of opt = optim.Adam and passing in the parameters and the learning rate, I'm going to use the fastai LayerOptimizer class, which takes my optimizer class constructor from PyTorch, my model, my learning rate, and optionally weight decay. This class is tiny; it doesn't do very much at all. The key reason it exists is to do differential learning rates and differential weight decay; but the reason we need to use it is that all of the mechanics inside fastai assume that you have one of these. So if you want to use callbacks or SGDR or whatever, in code where you're not using the Learner class, then rather than saying opt = optim.Adam with these parameters, you instead say LayerOptimizer. That gives us a LayerOptimizer object, and if you're interested, behind the scenes you can grab a .opt property which gives you the actual optimizer; you don't have to worry about that yourself, but that's basically what happens behind the scenes.

The key thing we can now do is that when we call fit, we can pass in that optimizer, and we can also pass in some callbacks; and specifically we're going to use the cosine annealing callback. The cosine annealing callback requires a LayerOptimizer object, and what it's going to do is do cosine annealing by changing the learning rate inside that object. The details aren't terribly important, we can talk about them on the forum; it's really the concept I wanted to get across here. So we create a cosine annealing callback, which is going to update the learning rates in this LayerOptimizer. The length of an epoch is equal to this here: how many mini-batches are there in an epoch? Well, it's whatever the length of this data loader is. Because it's doing the cosine annealing, it needs to know how often to reset, and then you can pass in the cycle multiplier in the usual way. And then we can even save our model automatically: remember how there was that cycle_save_name parameter that we can pass to learn.fit? This is what it does behind the scenes: it sets an on-cycle-end callback, and so here I define that callback as being something that saves my model.
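Roughly, the pieces just described look like this. The names follow the fastai library of the course era (with the usual fastai imports, PATH, the model m and the LanguageModelData object md already in place); treat the exact arguments as approximate:

```python
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)     # optimizer constructor, model, lr, weight decay

# save the model at the end of each SGDR cycle (what cycle_save_name does behind the scenes)
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]

fit(m, md, n_epochs, lo.opt, F.nll_loss, callbacks=cb)
```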
So there's quite a lot of cool stuff you can do with callbacks. Callbacks are basically things where you can say: at the start of training, or at the start of an epoch, or at the end of a batch, or at the end of an epoch, please call this code. And we've written some for you, including SGDR, which is the cosine annealing callback; Sahar recently wrote a new callback to implement the new approach to decoupled weight decay; and we use callbacks to draw those little graphs of the loss over time. So there's lots of cool stuff you can do with callbacks. In this case, by passing in that callback, we're getting SGDR, and that's able to get us down to 1.31 here, and then we can train a little bit more and eventually get down to 1.25.

So we can now test it out. If we pass in a few characters of text, we get, not surprisingly, an 'e' as the next character; then let's ask for 400 characters, and now we have our own Nietzsche. Nietzsche tends to start his sections with a number and a dot, so: "2.03. Perhaps that every life have values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse love..." So it's slightly less clear than Nietzsche normally is, but it gets the tone right. And it's actually quite interesting to play around with training these character-based language models and running this at different levels of loss, to get a sense of what the output looks like. You really notice that this is at 1.25, and at a slightly worse loss, like 1.3, the output looks like total junk: there's punctuation in random places and nothing makes sense. You start to realize that the difference between Nietzsche and random junk is not that far in language-model terms, and if you train this for a little bit longer, you'll suddenly find it's making more and more sense. So if you are playing around with NLP stuff, particularly generative stuff like this, and the results are kind of okay but not great, don't be disheartened, because that means you're actually very, very nearly there. The difference between something which almost vaguely looks like English if you squint and something that's actually a very good generation is not far in loss-function terms. Okay, great. So let's take a five-minute break, we'll come back at 7:45, and we're going to go back to computer vision.

Okay, so now we come full circle back to vision, and we're looking at the lesson 7 CIFAR-10 notebook. You might have heard of CIFAR-10; it's a really well-known dataset in academia, and it's actually pretty old by computer vision standards: well before ImageNet was around, there was CIFAR-10. You might wonder why we're going to be looking at such an old dataset, and actually I think small datasets are much more interesting than ImageNet, because most of the time you're likely to be working with a small number of thousands of images
rather than one and a half million images. Some of you will work with one and a half million images; most of you won't. So learning how to use these kinds of datasets is, I think, much more interesting. Often, also, a lot of the stuff we look at, for example in medical imaging where we're looking at the specific area where there's a lung nodule, you're probably looking at 32 by 32 pixels at most as the area where that lung nodule actually exists. So CIFAR-10 is small both in the sense that it doesn't have many images and in the sense that the images are very small, and therefore in a lot of ways it's much more challenging than something like ImageNet, and in some ways it's much more interesting. And also, most importantly, you can run stuff much more quickly on it, so it's much better to test out your algorithms with something you can run quickly and that's still challenging.

I hear a lot of researchers complain about how they can't afford to study all the different versions of their algorithm properly because it's too expensive, and they're doing it on ImageNet; so it's literally a week of expensive GPU work for every study they do, and I don't understand why you would do that kind of study on ImageNet. It doesn't make sense. There's been a particularly large amount of debate about this this week, because a really interesting researcher named Ali Rahimi gave a really great talk at NIPS this week about the need for rigor in experiments in deep learning. He felt there's a lack of rigor, and I've talked to him about it quite a bit since then; I'm not sure we yet quite understand each other as to where we're coming from, but we have very similar concerns, which is basically that people aren't doing carefully tuned, carefully thought-about experiments, but instead they throw lots of GPUs and lots of data at it and call that a day. And so this idea of saying: well, is my algorithm meant to be good at small images and small datasets? If so, let's study it on CIFAR-10 rather than on ImageNet, and then do more studies of different versions of the algorithm, turning different bits on and off, and understand which parts are actually important.

People also complain a lot about MNIST, which we've looked at before, and I would say the same thing about MNIST: if you're actually trying to understand which parts of your algorithm make a difference and why, using MNIST for that kind of study is a very good idea. All these people who complain about MNIST, I think they're just showing off. They're saying: I work at Google and I have a pod of TPUs and I have a hundred thousand dollars a week of compute to spend on it. I think that's all it is; it's signalling, rather than actually academically rigorous.

Okay, so CIFAR-10. You can download it from here; this person has very kindly made it available in image form (if you google for CIFAR-10 you'll find a much less convenient form, so please use this one; it's already in the exact form you need). Once you download it, you can use it in the usual way, and here's a list of the classes that are in it. Now, you'll see here I've created this thing called stats. Normally, when we've been using pre-trained models, we have been saying transforms from model, and that actually created the necessary transforms to convert our dataset into a normalized dataset based on the means and standard deviations of
each channel in the original model that was trained. In our case, though, this time we're training a model from scratch, so we have no such thing, and we actually need to tell it the mean and standard deviation of our data in order to normalize it. In this case I haven't included the code here to do that; you should try it yourself, to confirm that you can do it and that you understand where it comes from. But these are just the mean and standard deviation, per channel, of all of the images.

All right, so we're going to try to create a model from scratch, and the first thing we need is some transformations. For CIFAR-10, people generally do data augmentation by simply flipping randomly, horizontally, so here's how we can create a specific list of augmentations to use; and then they also tend to add a little bit of black padding around the edge and then randomly pick a 32 by 32 spot from within that padded image. So if you add the pad parameter to any of the fastai transform creators, it'll do that for you, and in this case I'm just adding a few pixels of padding around each side. So now that I've got my transforms, I can go ahead and create my ImageClassifierData.from_paths in the usual way. I'm going to use a batch size of 256, because these images are pretty small, so it lets me do a bit more at a time.

So here's what the data looks like. For example, here's a boat, and just to show you how tough this is: what's that? It's not a chicken, it's a frog, I guess; it's this big thing, whatever the thing is called. There's your frog. So these are the kinds of things that we want to be able to recognize.

So I'm going to start with a model from our student Karim; we saw one of his posts earlier in this course. He made this really cool notebook, I think maybe last week, which shows how different optimizers work: he showed how to create various different optimizers from scratch. It's kind of like the Excel thing I had, but this is the Python version of momentum and Adam and Nesterov and Adagrad, all written from scratch, which is very cool. One of the nice things he did was he wrote a tiny little general-purpose fully connected network generator, so we're going to start with his; he called it SimpleNet, and so will we. So here's a simple class which has a list of fully connected layers. Whenever you create a list of layers in PyTorch, you have to wrap it in nn.ModuleList, just to tell PyTorch to register these as attributes. Then we just flatten the data that comes in (because it's fully connected layers), go through each layer, call that linear layer, do the ReLU, and at the end do a softmax. So it's a really simple approach.

And so we can now take that model, and now I'm going to show you how to step up one level of the API: rather than calling the fit function directly, we're going to create a Learner object from a custom model. We can do that by saying we want a convolutional learner, and we want to create it from a model and from some data; the model is this one, just a general PyTorch model, and this is a model data object of the usual kind, and that will return a learner. So this is a bit easier than dealing with LayerOptimizers and cosine annealing callbacks and whatever by hand: this is now a learner that we can do all the usual stuff with, but we can do it with any model that we create. So if we just type learn, it'll go ahead and print the model out.
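Here's a sketch of that class and of wrapping it in a learner; it follows the shape of the code described in the lesson, with illustrative sizes, so treat the exact fastai call as approximate:

```python
class SimpleNet(nn.Module):
    """A general fully connected net: pass in a list of layer sizes, e.g. [32*32*3, 40, 10]."""
    def __init__(self, layers):
        super().__init__()
        # nn.ModuleList registers each layer as an attribute so PyTorch tracks its parameters
        self.layers = nn.ModuleList([nn.Linear(layers[i], layers[i + 1])
                                     for i in range(len(layers) - 1)])

    def forward(self, x):
        x = x.view(x.size(0), -1)            # flatten, since these are fully connected layers
        for l in self.layers:
            l_x = l(x)
            x = F.relu(l_x)
        return F.log_softmax(l_x, dim=-1)    # softmax on the final layer's raw output

# wrap a plain PyTorch model in a fastai learner:
# learn = ConvLearner.from_model_data(SimpleNet([32 * 32 * 3, 40, 10]), data)
```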
So you can see we've got 3,072 features coming in, because we've got 32 by 32 pixels by 3 channels, and then we've got 40 features coming out of the first layer, which go into the second layer, and 10 features coming out, because we've got the 10 CIFAR-10 categories. You can call .summary() to see that in a little more detail, we can do lr_find, we can plot that, and we can then go fit and use a cycle length and so forth. So with this simple model, with one hidden layer, and, as we can see here, a bit over 120,000 parameters, we get a 47% accuracy. Not great, so let's try to improve it.

The goal here is that we're going to try to eventually replicate the basic architecture of a ResNet; that's what we're going to gradually build up to. The first step is to replace our fully connected model with a convolutional model. To remind you, a fully connected layer is simply doing a dot product: if we had all of these data points and all of these weights, we basically do a sum-product of all of those together; in other words, it's a matrix multiply. That's a fully connected layer. And the weight matrix is going to contain an item for every element of the input, for every element of the output; that's why we have a pretty big weight matrix here, and that's why, despite the fact that we have such a crappy accuracy, we have a lot of parameters: in this very first layer we've got 3,072 coming in and 40 coming out, so that gives us 3,072 by 40 parameters. And we end up not using them very efficiently, because we're basically saying every single pixel in the input has a different weight. What we want to do instead is find groups of 3 by 3 pixels that have particular patterns to them, and remember, we call that a convolution.

So a convolution looks like so: we have a little 3 by 3 section of our image, and a corresponding filter with a 3 by 3 kernel, and we just do a sum-product of that 3 by 3 by that 3 by 3, and then we do that for every single part of our image. When we do that across the whole image, that's called a convolution. And remember, in this case we actually had multiple filters, so the result of that convolution was effectively a tensor with an additional third dimension to it.

So let's take exactly the same code that we had before, but we're going to replace nn.Linear with nn.Conv2d. What I want to do in this case, though, is make each layer's output smaller than its input. The way I did that in my Excel example was I used max pooling: max pooling took every 2 by 2 section and replaced it with its maximum value. Nowadays we don't use that kind of max pooling much at all; instead, what we tend to do is what's called a stride-2 convolution. A stride-2 convolution, rather than saying let's go through every single 3 by 3, says let's go through every second 3 by 3: rather than moving the 3 by 3 one to the right, we move it two to the right, and when we get to the end of the row, rather than moving one row down, we move two rows down. So that's called a stride-2 convolution, and it has the same kind of effect as max pooling: you end up halving the resolution in each dimension. So by saying stride equals 2 we get that, by saying kernel_size we say we want it to be 3 by 3, and then the first two parameters are exactly the same as nn.Linear: the number of features coming in and the number of features coming out.
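Here's roughly what that convolutional version looks like once it's assembled; the adaptive max pool at the end is explained next, and the sizes and names are illustrative:

```python
class ConvNet(nn.Module):
    """A fully convolutional net: stride-2 convs, then an adaptive max pool and a linear layer."""
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)
            for i in range(len(layers) - 1)])
        self.pool = nn.AdaptiveMaxPool2d(1)   # pool whatever grid is left down to 1x1
        self.out = nn.Linear(layers[-1], c)   # c = number of classes

    def forward(self, x):
        for l in self.layers:
            x = F.relu(l(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)              # drop the trailing 1x1 axes
        return F.log_softmax(self.out(x), dim=-1)

# e.g. learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data)
```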
So we create a ModuleList of those layers, and in this case I'm going to say I've got 3 channels coming in, the first layer will come out with 20, then 40, and then 80. If we look at the summary, we start with 32x32, then spit out 15x15, then 7x7, then 3x3. So what do we do now to get that down to a prediction of one of 10 classes? What we do is something called adaptive max pooling, and this is pretty standard now for state-of-the-art architectures: at the very last pooling step we do a max pool, but rather than doing, say, a 2x2 max pool (it doesn't have to be 2x2; it could be 3x3, which replaces every 3x3 block with its maximum, or 4x4), adaptive max pooling is where you say: I'm not going to tell you how big an area to pool, I'm going to tell you how big a resolution to create. For example, if my input here is 28x28 and I ask for a 14x14 adaptive max pool, that would be the same as a 2x2 max pool, because it's saying "please create a 14x14 output". If I ask for a 2x2 adaptive max pool, that would be the same as a 14x14 max pool. What we pretty much always do in modern CNNs is make our penultimate layer a 1x1 adaptive max pool: in other words, find the single largest cell and use that as our new activation. Once we've got that, we've got a 1x1 (by number of features) tensor, so we can then go x.view(x.size(0), -1); there are essentially no other dimensions to this, so it returns a matrix of mini-batch by number of features, and we can feed that into a linear layer with however many classes we need. You can see the last thing I pass in is how many classes I'm trying to predict, and that's what's used to create that final layer.

So it goes through every convolutional layer, does a convolution, does a ReLU, then the adaptive max pool; the .view just gets rid of the trailing 1x1 axes, which aren't necessary, and that lets us feed it into the final linear layer, which spits out something of size c, which here is 10. You can now see how it works: it goes 32, to 15, to 7x7, to 3x3; the adaptive max pool makes it 80x1x1; the .view makes it just mini-batch size by 80; and finally a linear layer takes it from 80 to 10, which is what we wanted. That's our most basic version; you'd call this a fully convolutional network. A fully convolutional network is something where every layer is convolutional except the very last.

So again we can go lr_find. In this case, when I did lr_find it went through the entire dataset and was still getting better: the default final learning rate it tries is 10, and even at that point it was still pretty much improving. You can always override the final learning rate by saying end_lr=, and that'll get it to try more things. So here is the learning rate finder; I picked 1e-1, trained that for a while, and that's looking pretty good. Then I tried it with a cycle length of 1, and it's starting to flatten out at about 60%. You can see the number of parameters I have here is about 30,000, roughly a quarter of the number of parameters of the fully connected model, but my accuracy has gone up from 47% to 60%, and the time per epoch here is under 30 seconds.
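Here is a minimal sketch of that fully convolutional network for CIFAR-10-sized input. The class name and the log-softmax output are my reconstruction of what the lesson's notebook does, not a copy of it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvNet(nn.Module):
    """Every layer is a stride-2 convolution except the final linear classifier."""
    def __init__(self, layers, c):
        super().__init__()
        # e.g. layers=[20, 40, 80]: channels go 3 -> 20 -> 40 -> 80
        self.layers = nn.ModuleList([
            nn.Conv2d(ni, nf, kernel_size=3, stride=2)
            for ni, nf in zip([3] + layers[:-1], layers)])
        self.pool = nn.AdaptiveMaxPool2d(1)   # pool whatever grid is left down to 1x1
        self.out = nn.Linear(layers[-1], c)   # c = number of classes

    def forward(self, x):
        for l in self.layers:
            x = F.relu(l(x))
        x = self.pool(x)               # (bs, 80, 1, 1)
        x = x.view(x.size(0), -1)      # (bs, 80): drop the trailing 1x1 axes
        return F.log_softmax(self.out(x), dim=-1)

net = SimpleConvNet([20, 40, 80], 10)
print(net(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```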
And the time per epoch here is about the same as before, which is not surprising: when you use small, simple architectures, most of the time is memory transfer, and the actual compute time is trivial. So I'm going to refactor this slightly, because I want to put less stuff inside my forward, and calling relu every time doesn't seem ideal. I'm going to create a new class called ConvLayer, and the ConvLayer class is going to contain a convolution with a kernel size of 3 and a stride of 2. One thing I'm going to do now is add padding. Did you notice the first layer went from 32x32 to 15x15, not 16x16? The reason is that at the very edge of the image there isn't a convolution whose centre is the top-left point, because there's nothing outside it; whereas if we put a row of zeros along the top and a column of zeros along each edge, we can go all the way to the edge. padding=1 adds that little layer of zeros around the edge for us, and that way we make sure we go 32x32 to 16x16 to 8x8. It doesn't matter too much when you've got these bigger layers, but by the time you get down to, say, 4x4, you really don't want to throw away a whole piece, so padding becomes important.

By refactoring it to put the convolution here with its defaults, and putting the relu into the forward as well, it makes my ConvNet a little bit smaller, and more to the point it's going to be easier for me to make sure everything's correct in the future, by always using this ConvLayer class. So now you know not only how to create your own neural network model but how to create your own neural network layer. This is such a cool thing about PyTorch: a layer definition and a neural network definition are literally identical. They both have a constructor and a forward, so any time you've got a layer you can use it as a neural net, and any time you have a neural net you can use it as a layer. This is now the exact same thing as we had before; one difference is I now have padding. Another thing, just to show you that you can do things differently: back here I did my max pool as an object, using the class nn.AdaptiveMaxPool2d, sticking it in an attribute and then calling it. But this doesn't have any state (there are no weights inside max pooling), so I can do it with a little less code by calling it as a function. Everything you can do as a class you can also do as a function; it's inside this capital F, which is torch.nn.functional. This should be a tiny bit better because this time I've got the padding; I didn't train it for as long to actually check, so let's skip over that.

All right, so one issue here is that when I tried to add more layers, I had trouble training it. If I used larger learning rates it would go off to NaN, and if I used smaller learning rates it took forever and didn't really have a chance to explore properly, so it wasn't resilient. To make my model more resilient I'm going to use something called batch normalization, which literally everybody calls batch norm. Batch norm is a couple of years old now, and it's been pretty transformative since it came along, because it suddenly makes it really easy to train deeper networks.
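Here is a sketch of that refactoring: a reusable ConvLayer module with padding, and a network that calls adaptive max pooling as a function from torch.nn.functional. Names approximate the lesson's; treat it as an illustration rather than the exact notebook code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLayer(nn.Module):
    """3x3 convolution, stride 2, padding 1, followed by a ReLU."""
    def __init__(self, ni, nf):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))

class ConvNet2(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([ConvLayer(ni, nf)
                                     for ni, nf in zip([3] + layers[:-1], layers)])
        self.out = nn.Linear(layers[-1], c)

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        # max pooling has no weights, so call it as a function rather than a module
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

net = ConvNet2([20, 40, 80], 10)
print(net(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```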
The network I'm going to create is going to have more layers: 1, 2, 3, 4, 5 convolutional layers plus a fully connected layer. Back in the old days that would be considered a pretty deep network and pretty hard to train; nowadays it's super simple thanks to batch norm. To use batch norm you can just write nn.BatchNorm2d, but to learn about it we're going to write it from scratch.

The basic idea of batch norm is this. We've got some vector of activations (any time I draw a vector of activations, obviously you can repeat it for the mini-batch, so pretend it's a mini-batch of one), and it's coming into some layer, probably a convolution or matrix multiplication, and something comes out the other side. Imagine this is just a matrix multiply and the matrix is, say, the identity: then every time I multiply by it, across lots and lots of layers, my activations are not getting bigger, they're not getting smaller, they're not changing at all. That's fine. But imagine if it were actually 2, 2, 2 down the diagonal: if every one of my weight matrices or filters was like that, then my activations are doubling each time, and suddenly I've got exponential growth, which in deep models is going to be a disaster because my gradients are exploding. The challenge is that, unless you deal with it carefully, it's very unlikely that your weight matrices on average will keep your activations from getting smaller and smaller, or bigger and bigger; you have to carefully control things to keep them at a reasonable scale. We start things off at zero mean, standard deviation one, by normalizing the inputs, but what we'd really like is to normalize every layer, not just the inputs. So okay, fine, let's do that.

Here I've created a BnLayer, which is exactly like my ConvLayer: a Conv2d with my stride and my padding. I do my conv, then my relu, and then I calculate the mean of each channel (or of each filter) and the standard deviation of each channel, and I subtract the means and divide by the standard deviations. So now I don't actually need to normalize my input at all, because this is going to do it automatically: it normalizes per channel, and for later layers it normalizes per filter. It turns out that's not enough, though, because SGD is bloody-minded: if SGD decides it wants the weight matrix to be something that increases the values overall, repeatedly, then subtracting the means and dividing by the standard deviations just means the next mini-batch it's going to try to do it again, and again. So this on its own literally does nothing, because SGD will just undo it the next mini-batch. What we do instead is create a new multiplier for each channel and a new added value for each channel, and we just start them out so that the addition is a bunch of zeros (for the first layer, three zeros) and the multiplier is a bunch of ones (for the first layer, three ones, since the number of filters for the first layer is just three). So we then potentially undo exactly what we just did, and by saying these are an nn.Parameter, we tell PyTorch it's allowed to learn them as weights.
So initially it subtracts the means, divides by the standard deviations, multiplies by one and adds zero; fine, nothing much happens. But what it turns out is that now, if SGD wants to scale the layer up, it doesn't have to scale up every single value in the matrix: it can just scale up this single trio of numbers, self.m. If it wants to shift everything up or down a bit, it doesn't have to shift the entire weight matrix: it can just shift this trio of numbers, self.a. I will say this: in that talk I mentioned at NIPS, Ali Rahimi's talk about rigor, he actually pointed to this batch norm paper as a particularly interesting example of a paper where a lot of people don't necessarily quite know why it works. So if you're thinking "we're subtracting out the means and then adding back some learned weights of exactly the same rank and size; that sounds like a weird thing to do", there are a lot of people who feel the same way. At the moment, the best I can say intuitively is: we're normalizing the data, and then we're saying you can shift it and scale it using far fewer parameters than would have been necessary if you had to shift and scale the entire set of convolutional filters. More importantly, in practice, what this does is allow us to increase our learning rates, and it increases the resilience of training and allows us to add more layers. Once I used a BnLayer rather than a ConvLayer, I found I was able to add more layers to my model and it still trained effectively.

Question: are we worried that maybe we're dividing by something very small? Yeah, probably; I think in the PyTorch version it would be divided by self.stds plus an epsilon, or something like that. This worked fine for me, but that is definitely something to think about if you were trying to make this more reliable. Question: so self.m and self.a are getting updated through backpropagation as well? Yes; by saying it's an nn.Parameter, that's how we flag to PyTorch to learn it through backprop, exactly right.

The other interesting thing is that batch norm also regularizes; in other words, you can often decrease or remove dropout, or decrease or remove weight decay, when you use batch norm. The reason, when you think about it, is that each mini-batch has a different mean and a different standard deviation to the previous mini-batch, so these things keep changing, and because they keep changing it's subtly changing the meaning of the filters. It's adding noise, and when you add noise of any kind, it regularizes your model. I'm actually cheating a little bit here: in the real version of batch norm you don't just use this batch's mean and standard deviation, you take an exponentially weighted moving average of them, so if you want an exercise to try during the week, that would be a good one. But I will point out something very important here, which is this check of self.training: when we are doing our training loop, this will be true when it's being applied to the training set, and it will be false when it's being applied to the validation set. That is really important, because when you're going through the validation set you do not want to be changing the meaning of the model.
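Pulling the pieces together, here is a hand-rolled BnLayer along the lines described above. It's a simplified sketch of the idea, not nn.BatchNorm2d: the statistics are stored as plain tensors rather than proper buffers, there is no running average, and I've added a small epsilon as discussed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BnLayer(nn.Module):
    """Conv layer with a from-scratch batch norm (simplified sketch)."""
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size, stride=stride,
                              padding=kernel_size // 2)
        # learnable per-channel scale and shift, initialised to ones and zeros
        self.m = nn.Parameter(torch.ones(nf, 1, 1))
        self.a = nn.Parameter(torch.zeros(nf, 1, 1))
        self.means = torch.zeros(nf, 1, 1)
        self.stds = torch.ones(nf, 1, 1)

    def forward(self, x):
        x = F.relu(self.conv(x))
        if self.training:
            # per-channel statistics over batch, height and width;
            # only updated in training mode, so validation uses the stored values
            x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)
            self.means = x_chan.mean(1)[:, None, None]
            self.stds = x_chan.std(1)[:, None, None]
        # normalise, then let SGD rescale/shift with just these few parameters
        return (x - self.means) / (self.stds + 1e-5) * self.m + self.a

layer = BnLayer(3, 20)
print(layer(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 20, 16, 16])
```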
This is a really important idea: there are some types of layer that are sensitive to what mode the network is in, training mode or, as PyTorch calls it, evaluation mode (we might also say test mode). We actually had a bug a couple of weeks ago when we did our mini-net for MovieLens, the collaborative filtering model: we had F.dropout in our forward pass without protecting it with an "if self.training", as a result of which we were doing dropout on the validation set as well as the training set, which obviously isn't what you want. I've since gone back and fixed it by changing it to nn.Dropout, which has already been written to check whether it's being used in training mode or not; alternatively I could have added an "if self.training" before using the dropout. So it's important to think about that, and the main two (pretty much the only two) things built into PyTorch where this happens are dropout and batch norm.

Interestingly, there is also a key difference here in fast.ai which no other library does: in every other library, these means and standard deviations get updated in training mode as soon as you say "I'm training", regardless of whether that layer is set to trainable or not. It turns out that with a pre-trained network, that's a terrible idea: the specific values of those means and standard deviations in batch norm matter, and if you change them, you change the meaning of those pre-trained layers. Fast.ai, by default, won't touch those means and standard deviations if your layer is frozen; as soon as you unfreeze it, it'll start updating them, unless you've set learn.bn_freeze(True). If you set learn.bn_freeze(True), it says: never touch these means and standard deviations. And I've found in practice that often works a lot better for pre-trained models, particularly if you're working with data that's quite similar to what the pre-trained model was trained with.

Question: it looks like you're doing a lot of extra work calculating these aggregates as you go through each layer; wouldn't this increase your epoch time? No, this is super fast: think about what a conv has to do. It has to go through every 3x3 with a stride and do a multiplication and addition; that is a lot more work than simply calculating the per-channel mean. It adds a little bit of time, but it's much less time-intensive than the convolution. Question: where would you position the batch norm, right after the convolutional layer or after the ReLU?
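As a tiny illustration of the training/evaluation-mode point (a toy module, not the MovieLens mini-net itself): nn.Dropout respects module.train() and module.eval() automatically, whereas the functional form needs the flag passed explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)
        self.drop = nn.Dropout(0.5)   # nn.Dropout checks self.training for us

    def forward(self, x):
        # The equivalent functional form would need the flag passed explicitly:
        #   x = F.dropout(x, 0.5, training=self.training)
        return self.fc(self.drop(x))

m = Head()
x = torch.ones(1, 10)
m.train()
print(m.drop(x))   # roughly half the activations zeroed (and the rest rescaled)
m.eval()
print(m.drop(x))   # identity: no dropout at validation/test time
```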
Yeah, we'll talk about that in a moment. At the moment we have it after the ReLU, and in the original batch norm paper I believe that's where they put it. There's this idea of an ablation study, which is where you try turning different pieces of your model on and off to see which bits make which impacts, and one of the things that wasn't done in the original batch norm paper was any kind of really effective ablation study. One of the things that was therefore missing was exactly the question you just asked: where do you put the batch norm, before the ReLU, after the ReLU, wherever? Since then, that oversight has caused a lot of problems, because it turned out the original paper didn't actually put it in the best spot, and other people have since figured that out. So now, every time I show people code where it's in the spot that turns out to be better, people say "your batch norm is in the wrong spot", and I have to say: no, I know that's what the paper said, but it turned out that's not actually the right spot. So there's been a lot of question about that.

Question (a higher-level one): we started out with CIFAR-10 data; is the basic reasoning that you use a smaller dataset to quickly train a new model, and then you take the same model and use a much bigger dataset to get a higher accuracy? Maybe. If you had a large dataset, or you were interested in the question of how good a technique is on a large dataset, then yes, what you just said is what I would do: I would do lots of testing on a small dataset which I had already discovered had the same kinds of properties as my larger dataset, so my conclusions would likely carry forward, and then I would test them at the end. Having said that, personally I'm actually more interested in studying small datasets for their own sake, because I find most people I speak to in the real world don't have a million images; they have somewhere between about 2,000 and 20,000 images, which seems to be much more common. So I'm very interested in having fewer rows, because I think it's more valuable in practice. I'm also pretty interested in small images, not just because they let me test things out more quickly, but also because, as I mentioned before, often a small part of an image turns out to be what you're interested in; that's certainly true in medicine.

Question: two questions. The first is on what you mentioned about small datasets, particularly medical imaging: have you heard of Vicarious, a startup specializing in one-shot learning, and what's your opinion of it? And the second relates to Ali's talk at NIPS: I don't want to say it was controversial, but there was a thread from Yann LeCun pushing back on it, about theory just not keeping up with practice (although Ali actually tweeted at me quite a bit saying he wasn't attacking Yann at all; in fact he was trying to support him). I just feel like a lot of theory is out of date as soon as it's written, and it's hard to keep up, other than following someone like Andrej Karpathy.
But if the theory isn't keeping up and industry is the one actually setting the standard, doesn't that mean the people who are actual practitioners, like Yann LeCun, publishing the theory are the ones keeping up to date, and that academic research institutions are actually behind? I don't have any comments on the Vicarious papers because I haven't read them. I'm not aware of any of them showing better results than other papers, but they've come a long way in the last 12 months, so that might be wrong. I think the discussion between Yann LeCun and Ali Rahimi is very interesting, because they're both smart people with interesting things to say. Unfortunately, a lot of people took Ali's talk as meaning something which he says it didn't mean (and when I listen to his talk I'm not sure he didn't actually mean it at the time, but he clearly doesn't mean it now): he has now said many times he was not talking about theory, he was not saying we need more theory at all; actually, he thinks we need more experiments. Specifically, he's also now saying he wishes he hadn't used the word "rigor", which I also wish, because rigor is kind of meaningless and everybody can say "when he says rigor, he means the specific thing I study". Lots of people have taken his talk as meaning that nobody should work on neural networks unless they are experts at the one thing that person is an expert in. So I'm going to catch up with him and talk more about this in January, and hopefully we'll figure some more stuff out together. But basically, what we can clearly agree on, and I think Yann LeCun also agrees on, is that careful experiments are important: just doing things on massive amounts of data using massive numbers of TPUs or GPUs is not interesting in itself, and we should instead try to design experiments that give us the maximum amount of insight into what's going on.

Question: Jeremy, is it a fair statement that dropout and batch norm are very different things, in that dropout is a regularization technique, while batch norm maybe has some regularization effect but is really about convergence of the optimization? Yeah, and I would further say I can't see any reason not to use batch norm. There are versions of batch norm that in certain situations turned out not to work so well, but people have figured out ways around that for nearly every one of those situations now, so I would always seek to find a way to use batch norm. It may be a little harder in RNNs, but even there, there are ways of doing batch norm in RNNs as well; so try to use batch norm on every layer if you can. And the question somebody asked: does it mean I can stop normalizing my data? Yeah, it does, although do it anyway, because it's not at all hard to do, and at least that way the people using your data know how you've normalized it. Also, given these issues around a lot of libraries (in my opinion, at least in my experiments) not dealing with batch norm correctly for pre-trained models, just remember that when somebody starts retraining, those averages and standard deviations are going to change for your dataset, and if your new dataset has very different input averages it could really cause a lot of problems. So yeah, I went through a period where I actually stopped normalizing my data and things kind of worked, but it's probably not worth it.
So the rest of this is identical; all I've done is change ConvLayer to BnLayer. But I've done one more thing, which is to try to get closer to modern approaches: I've added a single convolutional layer at the start, with a bigger kernel size and a stride of one. Why have I done that? The basic idea is that I want my first layer to have a richer input. Before, my first layer had an input of just three, because it was just three channels; but if I start with my image and take a bigger area (in this case 5x5) and do a convolution using that bigger area, that allows me to find more interesting, richer features in that 5x5 area, and so I spit out a bigger output, in this case 10 of those 5x5 filters. Pretty much every state-of-the-art convolutional architecture now starts out with a single conv layer with a 5x5 or 7x7, or sometimes even an 11x11, convolution, with quite a few filters coming out, something like 32. Because I used a stride of one and a padding of (kernel size minus one) over two, my output is going to be exactly the same size as my input, just with more filters. So this is just a good way of creating a richer starting point for my sequence of convolutional layers. That's the basic theory of why I've added this convolution, which I do just once at the start; then I go through all my layers, then my adaptive max pooling and my final classifier layer. It's a minor tweak, but it helps: where before a couple of epochs got me to 45%, now after a couple it's 57%, and after a few more I'm up to 68%. So the batch norm, and a tiny bit the conv layer at the start, is helping, and what's more, you can see this is still increasing, so that's looking pretty encouraging.
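Here is what that richer stem looks like in isolation, a small sketch using the numbers from the discussion (5x5 kernel, stride 1, 10 filters), where padding = (5 - 1) // 2 = 2 keeps the spatial size unchanged:

```python
import torch
import torch.nn as nn

# Stem convolution: 5x5 kernel, stride 1, padding 2, so a 32x32 input stays 32x32
# but comes out with 10 filters rather than 3 raw channels.
stem = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
print(stem(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10, 32, 32])
```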
So given that this is looking pretty good, an obvious thing to try is to increase the depth of the model. Now, I can't just add more of my stride-2 layers, because remember how each one halves the size of the image: I'm basically down to 2x2 at the end, so I can't add much more. What I did instead was to say: for every stride-2 layer, also create a stride-1 layer (a stride-1 layer doesn't change the size), and then zip the stride-2 layers and the stride-1 layers together, so we first do the stride 2 and then the stride 1. This is now twice as deep, but I end up with the exact same 2x2 at the end as before. And if I try this, after a few epochs I'm at 65%, and a few epochs later I'm still at 65%: it hasn't helped. The reason it hasn't helped is that I'm now too deep even for batch norm to handle: my depth is now five stride-2 layers plus five stride-1 layers, which is ten, plus the conv at the start is eleven, plus the final layer makes twelve. It's possible to train a standard conv net twelve layers deep, but it starts to get difficult to do properly, and it certainly doesn't seem to be helping much, if at all. So instead, I'm going to replace this with a ResNet.

The ResNet is our final stage, and what I'm going to do is replace our BnLayer: I'm going to inherit from BnLayer and replace our forward, and that's it. Everything else is going to be identical, except now I'm going to do lots more layers (I'm going to make it four times deeper), and it's going to train beautifully, just because of that. So why does that help so much? This is called a ResNet block: I'm saying my prediction y equals my input x plus some function, in this case a convolution, of my input. That's what I've written here. Now I'm going to shuffle that around a little bit and say f(x) = y - x. That's the same thing rearranged: the function is fitting the difference between the target and the thing I've most recently calculated, and that difference is the residual. If y is what I'm actually trying to calculate, and x is what I've calculated so far, then the difference between the two is basically the error in what I've calculated so far. So this is saying: try to find a set of convolutional weights that attempts to fill in the amount I was off by. In other words, we have some input coming in, and then we have this function which is basically trying to predict the error: how much are we off by? Then we add that on. We add on this additional prediction of how much we were wrong by, and then another prediction of how much we were wrong by that time, and another after that, so we keep zooming in, getting closer and closer to our correct answer. Each time we're saying: we've got to a certain point, but we've still got an error, a residual, so let's try to create a model that just predicts that residual and add it on to our previous model; then let's build another model that predicts the new residual and add that on; and if we keep doing that again and again, we should get closer and closer to our answer. This is based on a theory called boosting, which people who have done some machine learning will certainly have come across. The trick here is that by specifying x plus conv(x) as the thing we're trying to calculate, we get boosting for free: we can just juggle it around to show that it's calculating a model on the residual. That's kind of amazing, and it totally works. As you can see here, I've got my standard batch norm layer, which reduces my size by two because it's got the stride 2, and then a ResNet layer of stride 1 and another ResNet layer of stride 1 (sorry, I think I said four of these; I mean three of these), so this is now three times deeper. I zip through all of those, and so I've now got a function of a function of a function: three layers per group, plus my conv layer at the start and my linear layer at the end, so this is now three times bigger than my original. And if I fit it, you can see it just keeps going up and up and up; I keep fitting it more and it keeps going up and up, and it was still going up when I got bored. So the ResNet has been a really important development, and it's allowed us to create these really deep networks.
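Here is a minimal, self-contained residual layer in the spirit of the one described above. The lesson's version inherits from its hand-rolled BnLayer; here I use nn.BatchNorm2d so the sketch stands alone, and the exact ReLU/batch-norm ordering is one of the conventions discussed elsewhere in the lesson:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResnetLayer(nn.Module):
    """y = x + f(x): the conv learns the residual. Stride 1 keeps shapes equal,
    so the addition is valid and these blocks can be stacked freely."""
    def __init__(self, nf):
        super().__init__()
        self.conv = nn.Conv2d(nf, nf, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(nf)

    def forward(self, x):
        return x + F.relu(self.bn(self.conv(x)))   # ordering conventions vary

block = ResnetLayer(40)
x = torch.randn(2, 40, 8, 8)
print(block(x).shape)   # torch.Size([2, 40, 8, 8]), same shape as the input
```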
Now, the full ResNet doesn't quite look the way I've described it here: a full ResNet block doesn't just have one convolution, it actually has two. The way people normally draw ResNet blocks is: you've got some input coming into the layer, it goes through one convolution, then a second convolution, and then gets added back onto the original input. That's the full version of a ResNet block (there's a small sketch of that standard two-convolution block at the end of this section); in my case I've just done one convolution. You'll also see that in every group of blocks, one of them (the first one) is not a ResNet block but a standard convolution with a stride of two. The idea is that from time to time we change the geometry, because we're doing the stride two; in a real ResNet we don't actually use just a standard convolutional layer for that, there's a different form of block, called a bottleneck block, that I'm not going to teach you in this course; I'll teach it in part two. But as you can see, even this somewhat simplified version of a ResNet still works pretty well.

So we can make it a little bit bigger: here I've increased all of my sizes, I've still got my three blocks per group, and I've also added dropout. At this point I'd say this is, other than the minor simplifications to ResNet, a reasonable approximation of a good starting point for a modern architecture. So now I've added in my 0.2 dropout, I've increased the size here, and if I train it for a while it goes pretty well; I can then add TTA at the end, and eventually I get 85%. This is at a point where I wrote this whole notebook in about three hours (we can create this thing in three hours), and it's an accuracy that in 2012 or 2013 was considered pretty much state of the art for CIFAR-10, so this is actually pretty good. Nowadays the most recent results are more like 97%, so there's plenty of room to improve, but they're all based on these techniques: when we start looking in part two at how to get this right up to the state of the art, you'll see it's basically better approaches to data augmentation, better approaches to regularization, and some tweaks on ResNet, but it's all basically this idea.

Question: is this residual training method generic, something that can be applied to non-image problems? Great question. Yes, it is, but it's been ignored pretty much everywhere else. In NLP, the transformer architecture recently appeared and was shown to be state of the art for translation, and it's got a simple ResNet-like structure in it, which is the first time I've seen it in NLP; I haven't really seen anybody else take advantage of it. This general approach, which we call skip connections (this idea of skipping over a layer and doing an identity), has been appearing a lot in computer vision, and nobody else much seems to be using it, even though there's nothing computer-vision-specific about it. So I think it's a big opportunity.
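And here, as promised above, is a sketch of the standard two-convolution ResNet block: the way the block is usually drawn, with the addition back onto the identity path at the end. Again, this is a self-contained illustration using nn.BatchNorm2d, not the torchvision implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoConvResBlock(nn.Module):
    """Standard-style block: the input passes through two conv/bn stages
    before being added back onto the identity path."""
    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, nf, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(nf)
        self.conv2 = nn.Conv2d(nf, nf, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(nf)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(x + out)    # add the residual back onto the original input

print(TwoConvResBlock(16)(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])
```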
Okay, so the final stage I want to show you is how to use an extra feature of PyTorch to do something cool. It's going to be a bit of a segue into part two, our first little hint at what else we can build on these neural nets, and it's also going to take us all the way back to lesson one, because we're going to do dogs and cats. Going back to dogs and cats, we're going to create a ResNet34. These different ResNets (34, 50, 101) are basically just different numbers of different-sized blocks: how many of these pieces you have before a bottleneck block, and how many of these sets of super-blocks you have. That's all the different numbers mean, and if you look at the torchvision source code you can see the definitions of these different ResNets; they're all just different parameters.

So we're going to use ResNet34, and we're going to do this a little bit more by hand. resnet34 here is just the name of a function, so I can call it to get that model, and if we look at the definition, the True says: do I want it pre-trained? In other words, is it going to load in the pre-trained ImageNet weights? So m now contains a model, and I can take a look at it like so. You can see what's going on: inside here I've got my initial convolution, and the kernel size is 7x7; interestingly, in this case it actually starts out with a 7x7, stride 2 convolution. There's the padding we talked about, to make sure we don't lose the edges; there's our batch norm; there's our ReLU; and you get the idea. Then you can see there's a layer that contains a bunch of blocks. Here's a block which contains conv, batch norm, ReLU, conv, batch norm (you can't see it printed, but after this is where it does the addition), so that's a whole ResNet block, and then another ResNet block, and another. And sometimes you'll see one with a stride of 2: that's one of those bottleneck layers. So you can see how this is structured. Sorry, I skipped over this a little bit earlier: the approach we ended up using was to put the ReLU before our batch norm, whereas if we look at what they do here, it's batch norm, ReLU, conv, batch norm, ReLU, so you can see the order they're using. You'll find there are two or three different versions of ResNet floating around; the one that actually turns out to be best is called the pre-activation ResNet, which has a different ordering again, but you can look it up: it's basically a different order of where the ReLU and the batch norm sit.

So we're going to start with a standard ResNet34, and normally what we'd do now is turn this into something that can predict dogs versus cats. Currently the final layer has a thousand outputs, because ImageNet has a thousand categories, so we need to get rid of it. When you use ConvLearner.pretrained in fast.ai, it actually deletes this layer for you, and it also deletes the layer before it: you can see this is an average pooling layer of size 7x7, which is basically doing the job of an adaptive pooling layer, except whoever wrote it didn't use adaptive pooling; they manually said "I know it's meant to be 7x7". So in fast.ai we replace this with adaptive pooling, and we actually do both adaptive average pooling and adaptive max pooling and concatenate the two together, which is something we invented, although at about the same time we invented it somebody wrote a paper about it, so we don't get any credit. But I think we're the only library that provides it, and certainly the only one that does it by default.
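Here is roughly what that inspection looks like with torchvision. pretrained=True downloads the ImageNet weights (newer torchvision versions use a weights= argument instead), and slicing off the children is how we'll remove the existing head:

```python
import torch.nn as nn
from torchvision.models import resnet34

m = resnet34(pretrained=True)   # load the ImageNet weights (newer API: weights=...)

print(m.conv1)   # Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
print(m.fc)      # Linear(in_features=512, out_features=1000, bias=True): the ImageNet head

# Drop the final average pool and the 1000-way classifier so we can attach our own head
body = nn.Sequential(*list(m.children())[:-2])
```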
For the purpose of this exercise, though, we're going to do a simpler version where we delete the last two layers: we grab all the children of the model, delete the last two layers, and then instead add a convolution which has just two outputs (I'll show you why in a moment), then our average pooling, then our softmax. So that's a model. You'll see the original has a fully connected layer at the end and this one does not; but if you think about it, this convolutional layer has two filters only, so its output is 2 by 7 by 7, and once we do the average pooling it ends up being just two numbers that it produces. So this is a different way of producing just two numbers; I'm not going to say it's better, just different, but there's a reason we do it, which I'll show you.

We can now train this model in the usual way: tfms_from_model, ImageClassifierData.from_paths, and then ConvLearner.from_model_data. I'm going to freeze every single layer except for that new one; it's the fourth-last layer, so we say freeze_to(-4), and so this is just training the last layer. We get 99.1% accuracy, so this approach is working fine. But here's what we can do now: something called class activation maps (CAM). We're going to take this particular cat and ask our model which parts of the image turned out to be important, and when we do this it produces a picture, and as you can see, it found the cat. So how did it do that? Working backwards, the way it did that is to produce this matrix, and you'll see there are some pretty big numbers in the matrix right around where the cat is. What is this matrix? It's simply the value of this feature tensor, feat, times this py vector. The py vector is simply the predictions, which in this case said "I'm essentially 100% confident it's a cat"; it's just the value we get if we call the model passing in our cat image. And what is feat? It's the values coming out of the final convolutional layer, that 2 by 7 by 7 tensor; you can see the shape of the features is 2 filters by 7 by 7. The idea is: if we multiply that vector by that tensor, it's going to grab all of the first channel (because that's a 1) and none of the second channel (because that's a 0), and therefore return the value of the last convolutional layer for the part which lines up with being a cat; the first channel lines up with being a cat, the second with being a dog. So if we multiply that tensor by that vector, we end up with this matrix, and this matrix is "which parts are most like a cat". To put it another way, in our model the only thing that happened after the convolutional layer was an average pooling layer, and the average pooling layer took that 7x7 grid and averaged out how much each part is cat-like.
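Here is a hedged sketch of both pieces: the two-output head described above (the layer choices approximate the lesson's; nn.Flatten is the modern replacement for the custom Flatten used at the time), and the class-activation-map arithmetic with stand-in values for feat and py:

```python
import torch
import torch.nn as nn

# The replacement head: 2 filters ("cat-ness" / "dog-ness"), average-pooled to 2 numbers
head = nn.Sequential(
    nn.Conv2d(512, 2, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.LogSoftmax(dim=1),
)
print(head(torch.randn(1, 512, 7, 7)).shape)   # torch.Size([1, 2])

# The CAM itself: weight each 7x7 feature map by its class probability and sum.
feat = torch.randn(2, 7, 7)           # stand-in for the saved final-conv activations
py = torch.tensor([0.995, 0.005])     # stand-in predictions: "almost certainly a cat"
cam = (feat * py[:, None, None]).sum(0)
print(cam.shape)                      # torch.Size([7, 7]): a heat map to overlay on the image
```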
So my final prediction was the average cattiness of the whole thing, and because it had to average out those parts to get that average cattiness, I can take this matrix, resize it to be the same size as my original cat image, and overlay it on top to get the heat map. The way you can use this technique at home is: if you've got some really big picture, you can calculate this matrix with a quick, small conv net, zoom in to the bit that has the highest value, and then rerun the model just on that part; it's like saying "this is the area that seems to be the most like a cat, or most like a dog, so zoom in to that bit". I skipped over that pretty quickly because we ran out of time, and we'll be learning more about these kinds of approaches in part two (and we can talk about it more on the forum), but hopefully you get the idea. The one thing I totally skipped over was how we actually ask for that particular layer, and I'll let you read about this during the week, but basically there's a thing called a hook. We called save_features, which is a little class we wrote that calls register_forward_hook, and a forward hook is a special PyTorch thing: every time the network calculates a layer, it runs this function. It's basically a callback that happens every time a layer is calculated, and in this case it just saved the value of the particular layer I was interested in. That way I was able to go in and grab those features out after I was done: I called save_features, that gives me my hook, and then I can just grab the value that I saved. So I'll skip over that pretty quickly, but if you look in the PyTorch docs they have some more information and help about that.
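Here is a small sketch of such a forward hook, in the spirit of the save_features class mentioned above (the details differ from the lesson's exact code):

```python
import torch
import torch.nn as nn

class SaveFeatures:
    """Register a forward hook that stashes a layer's output every time it runs."""
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)

    def hook_fn(self, module, inp, outp):
        self.features = outp.detach()

    def remove(self):
        self.hook.remove()

# Toy usage: grab the activations of the first layer of a tiny model
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
sf = SaveFeatures(net[0])
net(torch.randn(1, 3, 16, 16))
print(sf.features.shape)    # torch.Size([1, 8, 16, 16])
sf.remove()
```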
Yes, Yannet? Question: Jeremy, can you spend five minutes talking about your journey into deep learning, and how we can keep up with research that's important to practitioners? I think I'll close with the latter bit, which is: what now? For those of you who are interested, you should aim to come back for part two. How many people would like to come back for part two? Okay, that's not bad; I think almost everybody. If you want to come back for part two, be aware of this: by that time you're expected to have understood all of the techniques we've learned in part one. There's plenty of time between now and then, even if you haven't done much or any ML before, but it does assume that you're going to keep working at the same level of intensity from now until then, which means practicing. Generally speaking, the people who did well in part two last year had watched each of the videos about three times, and some of them discovered they had learned parts of them off by heart by mistake. So watching the videos again is helpful; make sure you get to the point where you can recreate the notebooks without watching the videos; and then, more interestingly, try to recreate the notebooks using different datasets. Definitely keep up with the forum, where you'll see people keep posting more stuff about recent papers and recent advances, and over the next couple of months you'll find that less and less of it seems weird and mysterious and more and more of it makes perfect sense. So it's a bit of a case of staying tenacious: there's always going to be stuff you don't understand yet, but you'll be surprised; if you go back to lessons one and two now, you'll find it all seems trivial. That's hopefully a bit of your learning journey, and the main thing I've noticed is that the people who succeed are the ones who just keep working at it. Now that you're not coming back here every Monday, you won't have that forcing function (I've noticed the forum suddenly gets busy at 5pm on a Monday: "oh, the course is about to start", and the questions start coming in), so try to use some other technique to give yourself that little kick. Maybe tell your partner at home that you're going to produce something every Saturday for the next four weeks, or that you're going to finish reading a particular paper. Anyway, I hope to see you all back in March, and regardless of whether I do or don't, it's been a really great pleasure to get to know you all, and I hope to keep seeing you on the forum. Thanks very much.