Hi everybody and welcome to lesson 8, the last lesson of part 1 of this course. Thanks so much for sticking with us. We've got a very interesting lesson today where we're going to dive into natural language processing. And to remind you, we did see natural language processing in lesson 1. This was it here. We looked at a dataset where we could pass in movie reviews like so and get back probabilities that it's a positive or negative sentiment. And we trained it with a very standard-looking classifier training approach. But we haven't really talked about what's going on behind the scenes there. So let's do that. And we'll also learn about how to make it better. So we were getting about 93 percent. So 93 percent accuracy for sentiment analysis, which is actually extremely good. And it only took a bit over 10 minutes. But let's see if we can do better. So we're going to go to notebook number 10. And in notebook number 10, we're going to start by talking about what we are going to do to train an NLP classifier. So sentiment analysis, "is this movie review positive or negative sentiment", is just a classifier. The dependent variable is binary. And the independent variable is the interesting bit. So we're going to talk about that. But before we do, we're going to talk about what the pre-trained model that got used here was. Because the reason we got such a good result so quickly is that we're doing fine-tuning of a pre-trained model. So what is this pre-trained model exactly? Well, the pre-trained model is actually a pre-trained language model. So what is a language model? A language model is a special kind of model, where we try to predict the next word of a sentence. So for example, if our language model received "even if our language model knows the", then its job would be to predict the next word, "basics", say. Now, the language model that we use as our pre-trained model was actually trained on Wikipedia. So we took all the non-trivially-sized articles in Wikipedia, and we built a language model which attempted to predict the next word of every sequence of words in every one of those articles. And it was a neural network, of course. And we then take those pre-trained weights, and those are the pre-trained weights that were automatically loaded in when we said text_classifier_learner. So conceptually, why would it be useful to pre-train a language model? How does that help us to do sentiment analysis, for example? Well, just like an ImageNet model has a lot of information about what pictures look like and what they consist of, a language model tells us a lot about what sentences look like, and it knows a lot about the world. For example, for a language model to correctly predict the end of the sentence "In 1998, this law was passed by President ___", it would have to know a whole lot of stuff. It would have to know how the English language works in general and what kinds of words go in what places: that after the word "president" there would usually be the surname of somebody. It would need to know what country that law was passed in. And it would need to know who was president of that country in, what did I say, 1998. So it'd have to know a lot about the world, and it'd have to know a lot about language. Creating a really good language model is really hard. And in fact, this is something that people spend many, many millions of dollars on: training language models on huge datasets.
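To make that next-word objective concrete, here's a tiny plain-Python illustration (just a sketch to show the setup, not anything from the fastai library):

```python
# A language model's training data: at each position, the target is simply
# the following token. Using the example sentence from above:
tokens = "in 1998 this law was passed by president clinton".split()

x = tokens[:-1]  # input:  in 1998 this law was passed by president
y = tokens[1:]   # target: 1998 this law was passed by president clinton

for inp, target in zip(x, y):
    print(f"given ...{inp!r}, predict {target!r}")
```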
Our particular one doesn't take particularly long to pre-train. But there's no particular reason for you to pre-train one of these language models yourself, because you can download them through fastai or from other places. So what happened in lesson 1 is we downloaded this pre-trained Wikipedia model, and then we fine-tuned it. So as per usual, we threw away the last layer, which was specific to predicting the next word of Wikipedia, and fine-tuned the model, initially just the last layer, to learn to predict the sentiment of movie reviews. And then, as per usual, we fine-tuned the rest of the model. And that got us 93%. Now, there's a trick we can use, though. We start with this Wikipedia language model (the particular subset we use is called WikiText-103), and rather than jumping straight to a classifier, which is what we did in lesson 1, we can do even better if we first create an IMDb language model, that is to say, a language model that learns to predict the next word of a movie review. The reason we do that is that this will help it learn about IMDb-specific kinds of words. It'll learn a lot more about the names of actors and directors; it'll learn about the kinds of words that people use in movie reviews. And so if we do that first, then we would hope we'll end up with a better classifier. So that's what we're going to do in the first part of today's lesson. And we're going to do it from scratch. We're going to show you how to do a lot of the things from scratch, even though later we'll show you how fastai does it all for you. So how do we build a language model? As we point out here, sentences can be different lengths, and documents like movie reviews can be very long. So how do we go about this? Well, a word is basically a categorical variable. And we already know how to use categorical variables as an independent variable in a neural net: we make a list of all of the possible levels of the categorical variable, which we call the vocab, and then we replace each of those categories with its index, so they all become numbers. We create an initially random embedding matrix, where each row is for one element of the vocab, and then we make that the first layer of a neural net. That's what we've done a few times now, and we've even created our own embedding layer from scratch, remember? So we can do the same thing with text. We can make a list of all the possible words in the whole corpus, the whole dataset, and we can replace each word with its index in the vocab, and create an embedding matrix. So in order to create that list of all levels, in this case a list of all possible words, let's first of all concatenate all the documents, all the movie reviews, together into one big long string and split it into words. And then our independent variable will basically be that sequence starting with the first word in the long list and ending with the second last. And our dependent variable will be the sequence of words starting with the second word and ending with the last. So they're offset by one. So as you move through the first sequence, you're trying to predict the next word in the second sequence. That's what we're doing; we'll see more detail in a moment. Now, when we create our vocab by finding all the unique words in this concatenated corpus, a lot of the words we see will already be in the embedding matrix, already in the vocab, of the pre-trained Wikipedia model.
But there's also going to be some new ones, right? There might be some particular actors that don't appear in Wikipedia, or maybe some informal slang words and so forth. So when we build our vocab, and then our embedding matrix, for the IMDb language model, any words that are in the vocab of the pre-trained model will just use their existing embeddings as is. But for new words, we'll create a new random vector. So here's the process we're going to have to go through. First, we're going to have to take our big concatenated corpus and turn it into a list of tokens: could be words, could be characters, could be substrings. That's called tokenization. And then we'll do numericalization, which is basically these two steps: replacing each word with its index in a vocab, which means we have to create that vocab. So create the vocab, and then convert. Then we're going to need to create a data loader that has lots of substrings, lots of sequences of tokens, from our IMDb corpus as an independent variable, and the same thing offset by one as a dependent variable. And then we're going to have to create a language model. Now, a language model is going to have to be able to handle input lists that can be arbitrarily big or small. And we're going to be using something called a recurrent neural network to do this, which we'll learn about later. Basically, so far we've always assumed that everything is a fixed size, a fixed input. So we're going to have to mix things up a little bit here and deal with architectures that can handle different sizes. For this notebook, notebook 10, we're going to treat it as a black box; it's just going to be a neural net. And then later in the lesson, we'll delve inside what's happening in that architecture. Okay, so let's start with the first of these, which is tokenization. So converting a text into a list of words, or a list of tokens: what does that mean? Is a full stop a token? What about "don't"? Is that a single word, or is it two words, or do you convert it to "do not"? What about long medical words that are made up of lots of pieces of medical jargon all stuck together? What about hyphenated words? And, really interestingly, what about a language like Polish or Turkish, where you can create really long words all the time, words that are actually lots of separate parts all concatenated together? Or languages like Japanese and Chinese that don't use spaces at all? They don't really have a well-defined idea of a word. Well, there's no right answer, but there are basically three approaches. We can use a word-based approach, which is what we use by default at the moment for English (although that might change), which is: we split a sentence on spaces, and then there are some language-specific rules, for example turning "don't" into "do" and "n't", and treating punctuation marks as separate tokens most of the time. Really interestingly, there are tokenizers that are subword based. And this is where we split words into smaller parts based on the most commonly occurring substrings. We'll see that in a moment. Or, the simplest, character based: split a sentence into its characters. We're going to look at word and subword tokenization in this notebook. And then if you look at the questionnaire at the end, you'll be asked to create your own character-based tokenizer. So please make sure you do that if you can; it'll be a great exercise. So fastai doesn't invent its own tokenizers.
We just provide a consistent interface to a range of external tokenizers, because there are a lot of great tokenizers out there. So you can switch between different tokenizers pretty easily. So let's start. Let's grab our IMDb dataset like we did in lesson 1. And in order to try out a tokenizer, let's grab all the text files. So instead of calling get_image_files, we'll call get_text_files. And we can have a look at what that's doing. Don't forget, we can always look at the source code. And you can see it's actually calling a more general thing called get_files and saying what extensions it wants. So if anything in fastai doesn't work quite the way you want, and there isn't an option which works the way you want, you can often (always, really) look underneath to see what we're calling, and you can call the lower-level stuff yourself. So files is now a list of files. So we can grab the first one, we can open it, we can read it, and have a look at the start of this review. And here it is. Okay. So at the moment, the default English word tokenizer we use is called spaCy, which uses a pretty sophisticated set of rules, with special rules for particular words and URLs and so forth. But we're just going to go ahead and say WordTokenizer, which will automatically use fastai's default word tokenizer, currently spaCy. And so if we pass a list of documents (we'll just make it a list of one document here) to the tokenizer we just created, and just grab the first, since we just created a list, that's going to show us, as you can see, the tokenized version. So you can see here: "this movie, which I just discovered at the video store, has", etc. It's changed "it's" into "it" and "'s", and it's put the comma as a separate punctuation token, and so forth. Okay. So you can see how it has tokenized this review. Let's look at a more interesting one: "the U.S.", blah, blah, blah. And you can see here it actually knows that "U.S." is special, so it doesn't split the full stop off as a separate token for "U.S.". It knows "1.00" is special. So you can see there's a lot of tricky stuff going on in spaCy to try and be as thoughtful about this as possible. fastai then provides this Tokenizer wrapper, which adds some additional functionality to any tokenizer, as you can see here. For example, the word "it" here, which previously was a capital "It", has been turned into lowercase "it", and then a special token, xxmaj, has appeared in front of it. Everything starting with xx is a special fastai token. And xxmaj means that the next word previously started with a capital letter. So here's another one: this used to be a capital "T", so we make it lowercase and then add xxmaj. xxbos means this is the start of a document. So there are a few special rules going on there. So why do we do that? Well, if you think about it, if we didn't lowercase "It", for instance, or "This", then the capitalized version and the lowercase version are going to be two different words in the embedding matrix, which probably doesn't make sense: regardless of the capitalization, they basically mean the same thing. Having said that, sometimes the capitalization might matter. So we want to say: all right, use the same embedding every time you see the word "this", but add some kind of marker that says that this one was originally capitalized. Okay, so that's why we do it like this. So there are quite a few rules. You can see them in defaults.text_proc_rules, and you can see the source code. Here's a summary of what they do.
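Pulling the steps we just walked through together, here's a rough sketch in code, assuming the fastai v2 text API from the notebook (txt here is the review we read above):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=['train', 'test', 'unsup'])
txt = files[0].open().read()

# The raw word tokenizer (currently spaCy under the hood)...
spacy = WordTokenizer()
toks = first(spacy([txt]))

# ...and fastai's Tokenizer wrapper, which adds the special xx* rules
# (xxbos, xxmaj, lowercasing, repeated-character handling, etc.) on top.
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
```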
But let's look at a few examples. So if we use that tokenizer we created and pass in, for example, this text, you can see the way it's tokenized. We get the xxbos: beginning of stream, or beginning of string; beginning of document, effectively. This HTML entity has become a real Unicode character. We've got the xxmaj we discussed. Now here, "www" has been replaced by "xxrep 3 w". That means the letter "w" is repeated three times. So for things where you've got, you know, 100 exclamation marks in a row, or the word "so" with like 50 "o"s, this is a much better representation. And then you can see all-uppercase words have been replaced with xxup followed by the word. So there are some of those rules in action. Oh, you can also see multiple spaces have been collapsed, just making everything standard tokens. So that's the word tokenizer. The really interesting one is the subword tokenizer. So why would you need a subword tokenizer? Well, consider, for example, this sentence here. This says "my name is Jeremy", but the interesting thing about it is there are no spaces here, right? And that's because there are no spaces in Chinese. And there isn't really a great sense of what a word is in Chinese. In this particular sentence, it's fairly clear what the words are, but it's not always obvious. Sometimes the parts of a word are actually split up, so some of it's at the start of a sentence and some of it's at the end. So you can't really do word tokenization for something like Chinese. So instead, we use subword tokenization, which is where we look at a corpus of documents and we find the most commonly occurring groups of characters, and those commonly occurring groups of characters become the vocab. So for example, we would probably find that the characters meaning "my" would appear often, and likewise the characters meaning "name". Whereas this westernized rendering of my name, "Jeremy", wouldn't be very common at all, so those characters would probably appear separately. So let's look at an example. Let's grab the first 2000 movie reviews, and let's create the default subword tokenizer, which currently uses something called SentencePiece; that might change. And now we're going to use something special, something very important, which is called setup. Transforms in fastai: you can always call this special method called setup. It often doesn't do anything, but it's always there. But some transforms, like a subword tokenizer, actually need to be set up before you can use them. In other words, you can't tokenize into subwords until you know what the most commonly occurring groups of characters are. So passing a list of texts to setup will train the subword tokenizer, that is, it will find those commonly occurring groups of characters. So having done that (this is just for experimenting), we're going to pass in some size, to say what vocab size we want for our subword tokenizer, set it up with our texts, and then have a look at a particular sentence. So for example, if we create a subword tokenizer with a 1000-token vocab, it returns this tokenized string. Now, this long-underscore thing is what we replace space with, because now that we're using subword tokens, we want to know where the words actually start and stop. And you can see here, a lot of these words are common enough sequences of characters that they get their own vocab item, whereas "discovered" wasn't common enough, so that became "dis c over ed". "Video" appears enough, whereas "store" didn't, so that became "st or e".
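Here's roughly what that experiment looks like in code, a sketch following the notebook and assuming fastai's SubwordTokenizer (which wraps SentencePiece):

```python
# Train a subword tokenizer on the first 2000 reviews, and show how a
# sample review gets split for a given vocab size.
txts = L(o.open().read() for o in files[:2000])

def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)  # "training": find the most common character groups
    return ' '.join(first(sp([txt]))[:40])

subword(1000)  # larger vocab: most common words become single tokens
```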
So you get the idea. So if we wanted a smaller vocab size then, as you see, even fewer things become their own word. "Movie" is so common that it is its own word, but "just" becomes "ju st", for example. We have a question. Okay: "How can we determine if the given pre-trained model, in this case WikiText-103, is suitable enough for our downstream task? If we have limited vocab overlap, should we need to add an additional dataset to create a language model from scratch?" If it's in the same language, so if you're doing English, it's almost always sufficient to use Wikipedia. We've played around with this a lot, and it was one of the key things that Sebastian Ruder and I found when we created the ULMFiT paper. Before that time, people really thought you needed corpus-specific pre-trained models. But we discovered you don't, just like you don't that often need corpus-specific pre-trained vision models: ImageNet works surprisingly well across a lot of different domains. Wikipedia has a lot of words in it. It would be really rare; I haven't come across an English corpus that didn't have a very high level of overlap with Wikipedia. On the other hand, if you're doing ULMFiT with, like, genomic sequences, or Greek, or whatever, then obviously you're going to need a different pre-trained model. So once we get to a 10,000-word vocab, as you can see, basically every at-all-common word becomes its own vocab item in the subword vocab, except, say, "discovered", which becomes "discover ed". So my guess is that subword approaches are going to become the most common; maybe they will be by the time you watch this. We've got some fiddling to do to get this working super well for fine-tuning, but I think I know what we have to do, so hopefully we'll get it done pretty soon. All right. So after we split it into tokens, the next thing to do is numericalization. So let's go back to our word-tokenized text, which looks like this. And in order to numericalize, we will first need to call setup. So to save a bit of time, let's create a subset of our texts: just a couple of hundred of the reviews. So here's an example of one. And we'll create our Numericalize object, and we'll call setup. And that's the thing that's going to create the vocab for us. And so after that, we can now take a look at the vocab. This coll_repr is showing us a representation of a collection; it's what the L class uses underneath. And you can see when we do this that the vocab starts with the special tokens, and then we start getting the English tokens in order of frequency. The default is a vocab size of 60,000, so that'll be the size of your embedding matrix by default. And if there are more than 60,000 unique words in your corpus, then the least common ones will be replaced with a special xxunk, unknown, token. So that'll help us avoid having a too-big embedding matrix. All right, so now we can treat the Numericalize object which we created as if it was a function, as we so often do in both fastai and PyTorch. And when we do, it'll replace each of our words with numbers. So 2, for example (zero, one, two) is the beginning-of-stream token, and 8 (count along: zero, one, two, three, four, five, six, seven, eight) is the capitalization token. There they are: xxbos, xxmaj, etc. And then we can convert them back by indexing into the vocab, and get back what we started with. Okay. Right. So now we have done the tokenization.
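A sketch of those numericalization steps, following the notebook (tkn is the Tokenizer from earlier):

```python
# Tokenize a couple of hundred reviews, build the vocab, then numericalize.
txts200 = L(o.open().read() for o in files[:200])
toks200 = txts200.map(tkn)

num = Numericalize()
num.setup(toks200)  # builds the vocab: special tokens first, then by frequency
print(coll_repr(num.vocab, 20))

nums = num(toks200[0])[:20]  # Numericalize acts like a function: words -> indices
' '.join(num.vocab[o] for o in nums)  # ...and the vocab maps indices back to words
```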
We've done the numericalization. And so the next thing we need to do is to create batches. So let's say this is the text that we want to create batches from. And so if we tokenize that text, it'll convert it into this. And so let's take that and write it out here. So: xxbos xxmaj in this chapter, we will go back over the example of classifying... and then the next row starts here: movie reviews we studied in chapter one and dig deeper under the surface, full stop, xxmaj first we will look at the... etc. Okay. So we've taken these 90 tokens and, to create a batch size of six, we've broken the text up into six contiguous parts, each of length 15. So: one, two, three, four, five, six rows, and then we have 15 columns. So six by 15. Now, ideally, we would just provide that to our model as a batch. And if indeed that was all the data we had, we could just pass it in as a batch. But that's not going to work for IMDb, because once we concatenate all the reviews together, and then say we want to use a batch size of 64, we're going to have 64 rows, and, you know, there are probably a few million tokens in IMDb. So a few million divided by 64 across: it's going to be way too wide to fit in our GPU. So what we're going to do is take that big wide array and split it up horizontally. So we'll start with "xxbos xxmaj in this chapter", and then down here, "we will go back over the example of classifying movie reviews we studied in chapter one and dig deeper under the surface", etc. So this would become our first mini-batch. And so you can see what's happened: the second row of the first mini-batch is actually continuing something from way over here; we've basically treated each row as totally independent. And the second mini-batch follows on from the first, in that row one of the second mini-batch joins up to row one of the first, and row two of the second mini-batch joins up to row two of the first. So please look at this example super carefully, because we found that this is something that every year a lot of students get confused about, because it's just not what they expected to see happen. So go back over this and make sure you understand what's happening in this little example. So that's what our mini-batches are going to be. The good news is that all these fiddly steps, you don't have to do yourself. You can just use the language model data loader, LMDataLoader. So if we take all the tokens from the first 200 movie reviews and map them through our Numericalize object (so now we've got numericalized versions of all those tokens), and then pass them into LMDataLoader, and then grab the first item from the data loader, we get 64 by 72. Why is that? Well, 64 is the default batch size, and 72 is the default sequence length. You see here we've got one, two, three, four, five: here we used a sequence length of five. What we do in practice is use a default sequence length of 72. So if we grab the first of our independent variables, take the first 20 tokens, and look them up in the vocab, here it is: "this movie, which I just xxunk at the video store". So that's interesting: that word was not common enough to be in the vocab. "...has apparently sit around for a".
And then if we look at the exact same thing, but for the dependent variable, rather than beginning "xxbos xxmaj this movie", it's "xxmaj this movie". So you can see it's offset by one, which means the end, rather than being "...around for a", is "...for a couple". So this is exactly what we want: this is offset by one from here. So that's looking good. So we can now go ahead and use these ideas to try and build our even better IMDb sentiment analysis. And the first step will be, as we discussed, to create the language model. But let's just go ahead and use the fastai built-in stuff to do it for us, rather than doing all that messing around manually. So we can just create a DataBlock. Our blocks: it's going to be a text block from folder. And the items are going to be text files from these folders. And we're going to split things randomly. And then we turn that into data loaders with a batch size of 128 and a sequence length of 80. In this case, for our blocks we're not just passing in a class directly, but we're actually passing in a class method, TextBlock.from_folder. And that's so that we can allow the tokenization to be saved, to be cached, in some path. So the next time we run this, it won't have to do it all from scratch. So that's why we have a slightly different syntax here. So once we've run this, we can call show_batch. And so you can see here, we've got, for example, "what xxmaj i 've read xxmaj death" blah, blah, blah. So that's the independent variable. And the dependent variable is the same thing offset by one: we don't have the "what" anymore, it just goes straight to "xxmaj i 've read". And then at the end, this has "was also this", and of course the dependent variable has "also this is". So this is that, offset by one, just like we were hoping for. show_batch is automatically de-numericalizing it for us, turning it back into strings. But you should look at the actual x and y to confirm that you actually see numbers there. That'll be a good exercise for you: make sure that you can actually grab a mini-batch from dls_lm. So now that we've got the data loaders, we can fine-tune our language model. So fine-tuning the language model means we're going to create a learner which is going to learn to predict the next word of a movie review. So that's our data, the data loaders for the language model. This is the pre-trained model: it's something called AWD-LSTM, and we'll see how to create something similar to it from scratch in a moment. Dropout, which we'll learn about later, is where we say how much dropout to use; this is how much regularization we want. And then: what metrics do we want. We know about accuracy. Perplexity is not particularly interesting, so I won't discuss it, but feel free to look it up if you're interested. And let's train with fp16 to use less memory on the GPU; on any modern GPU it'll also run two or three times faster. So this grey bit here has been done for us: the pre-training of the language model on WikiText-103. And now we're up to this bit, which is fine-tuning the language model on IMDb. So let's do one epoch. And as per usual, using a pre-trained model automatically calls freeze, so we don't have to freeze. So this is initially going to train only the new embeddings. And we get an accuracy, after 10 minutes or so, of 30%. So that's pretty cool: a bit under a third of the time, our model is correctly predicting the next word of a string. So I think that's pretty cool.
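Putting those pieces together, here's a sketch of the data block and learner along the lines the notebook uses (the exact arguments, like drop_mult=0.3 and the learning rate, follow the fastbook chapter; treat them as assumptions):

```python
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),  # caches tokenization under path
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()

learn.fit_one_cycle(1, 2e-2)  # frozen, so this trains just the new parts
```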
Now, since this takes quite a while for each epoch, we may as well save it. And you can save it under any name you want. That's going to put it into your learner's path, into a models subfolder, and it'll give it a .pth extension, for PyTorch. And then later on, you can load that with learn.load after you create the learner. And so then we can unfreeze, and we can train a few more epochs, and we eventually get up to an accuracy of 34%. So that's pretty great. So once we've done all that, we can save the model. But actually, all we really need to do is to save the encoder. What's the encoder? The encoder is all of the model except for the final layer. Oh, and we're getting a thunderstorm here. That could be interesting. We've never done a lesson with a thunderstorm before. But that's the joy of teaching during COVID-19: you get all the sound effects. So yeah, the final layer of our language model is the bit that actually picks a particular word out, which we don't need. So when we say save_encoder, it saves everything except for that final layer. And that's the pre-trained model we're going to use: a language model that was fine-tuned from Wikipedia, fine-tuned using IMDb, and doesn't contain the very last layer. Rachel, any questions at this point? "Do any language models attempt to provide meaning? For instance, 'I'm going to the store' is the opposite of 'I'm not going to the store'. Or: 'I barely understand this stuff' and 'that ball came so close to my ear I heard it whistle' both contain the idea of something almost happening, being right on the border. Is there a way to indicate this kind of subtlety in a language model?" Yeah, absolutely. Our language model will have all of that in it, or hopefully it will learn about it. We don't have to program that; the whole point of machine learning is that it learns it for itself. But when it sees a sentence like "hey, careful, that ball nearly hit me", the expectation of what word is going to happen next is going to be different from the sentence "hey, that ball hit me". So yeah, the language models you see in practice tend to get really good at understanding all of these nuances of English, or whatever language it's learning about. Okay, so we have a fine-tuned language model. So the next thing we're going to do is try fine-tuning a classifier. But before we do, just for fun, let's look at text generation. We can write ourselves some words, like "I liked this movie because", and then we can create, say, two sentences, each containing, say, 40 words. And so we can just go through those two sentences and call learn.predict, passing in this text and asking it to predict this number of words with this amount of randomization, and see what it comes up with. "I liked this movie because of its story and characters. The storyline was very strong, very good for a sci-fi. The main character, Alucard, was very well developed and brought the whole story." Second attempt: "I liked this movie because I like the idea of the premise of the movie, the very convenient virus, which, well, when you have to kill a few people, the evil machine has to be used to protect", blah, blah, blah. So as you can see, it's done a good job of inventing language. There are much... I shouldn't say more sophisticated; there are more careful ways to do generation from a language model. This learn.predict uses the most basic possible one.
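For reference, the saving and generation calls look something like this, a sketch following the notebook (the encoder name 'finetuned' and the temperature value follow the fastbook chapter; treat them as assumptions):

```python
learn.save('1epoch')            # saved to <learner path>/models/1epoch.pth
learn = learn.load('1epoch')    # ...and reloaded later
learn.save_encoder('finetuned') # everything except the final word-picking layer

TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

# learn.predict samples the next N_WORDS tokens, with randomization
# controlled by temperature.
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
```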
But even with a very simple approach, you can see we can get some pretty authentic-looking text from a fine-tuned model. And in practice, this is really interesting, because we can now, you know, by using the prompt, get it to generate context-appropriate text, particularly if you fine-tune on a particular corpus. Anyway, that was really just a little demonstration of something we accidentally created on the way. Of course, the whole purpose of this is actually just to be a pre-trained model for classification. So to do that, we're going to need to create another DataBlock. And this time we've got two blocks, not one. We've got a text block again, just like before. But this time we're going to ask fastai not to create a vocab from the unique words, but to use the vocab that we already have from the language model. Because otherwise, obviously, there's no point reusing a pre-trained model: if the vocab's different, the numbers would mean totally different things. So that's the independent variable. And the dependent variable, just like we've used before, is a category. So a CategoryBlock is for that. As we've done many times, we're going to use parent_label to create our dependent variable; that's a function. For get_items, we'll use get_text_files just like before. And we'll split using GrandparentSplitter, as we've used before for vision. And then we'll create our data loaders with a batch size of 128 and a sequence length of 72. And now with show_batch, we can see an example of a subset of a movie review, and a category. Yes, question: "Do the tokenizers use any tokenization techniques like stemming or lemmatization, or is that an outdated approach?" That would not be a tokenization approach. So stemming is something that actually throws away the ends of words, and we absolutely don't want to do that. That is certainly an outdated approach. In English, those word endings are there for a reason: they tell us something. So we don't like to remove anything that can give us some kind of information. We used to use that kind of thing quite a bit in pre-deep-learning NLP, because we didn't really have good ways, like embedding matrices, of handling big vocabs that just differed in the ends of words. But nowadays, we definitely don't want to do that. One other difference here is that previously we had is_lm=True when we said TextBlock.from_folder, to say it was a language model. We don't have that anymore, because it's not a language model. Okay. Now, one thing with a language model that was a bit easier was that we could concatenate all the documents together and then split them into a number of substrings based on the batch size. And that way we could ensure that every mini-batch was the same size: it would be batch size by sequence length. But for classification, we can't do that. We actually need each dependent variable, each label, to be associated with each complete movie review. And we're not showing the whole movie review here. We've truncated it just for display purposes, but we're going to use the whole movie review to make our prediction. Now, the problem is that if we're using a batch size of 128, and our movie reviews are often, like, 3000 words long, we could end up with something that's way too big to fit into GPU memory. So how are we going to deal with that? Well, again, we can split them up.
So first of all, let's grab a few of the movie reviews, just for a demo here, and numericalize them. And if we have a look at the lengths (so, map length over each), you can see that they vary a lot in length. Now, we can split them into sequences, and indeed we have asked for that: sequence length 72. But when we do so, we won't even have the same number of subsequences: when we split each of these into 72-long sections, they're going to be all different lengths. So how do we deal with that? Well, just like in vision, we can handle different-sized sequences by adding padding. So we're going to add a special xxpad token to every sequence in a mini-batch as needed. So, like, in this case it looks like 581 is the longest, so we would add enough padding tokens to make this one 581, and this one 581, and this one 581, and so forth. And then we can split them into 72-long pieces in the mini-batches, and we'll be ready to go. Now, obviously, if your lengths are very different like this, adding a whole lot of padding is going to be super wasteful. So another thing that fastai does internally is it tries to shuffle the documents around so that similar-length documents are in the same mini-batch. It also randomizes them, but it approximately sorts them, so it wastes a lot less time on padding. Okay, so we don't have to do any of that manually: when we call TextBlock.from_folder without is_lm, it does all that for us. And then we can now go ahead and create a learner. This time it's going to be a text_classifier_learner. Again, we're going to base it off AWD-LSTM, pass in the data loaders we just created, for the metric we'll just use accuracy, and make it fp16 again. And now we don't want to use a pre-trained Wikipedia model. In fact, there is no pre-trained Wikipedia classifier, because what you classify matters a lot. So instead, we load the encoder (remember: everything except the last layer) which we saved just before. So we're going to load, as a pre-trained model, a language model for predicting the next word of a movie review. So let's go ahead and fit one cycle. And again, by default it will be frozen, so it's only the final layer, the randomly added classifier layer, that's going to be trained. It took 30 seconds. And look at this: we already have 93%. So that's pretty similar to what we got back in lesson 1, but rather than taking about 12 minutes, once all the pre-training has been done, it takes about 30 seconds. This is quite cool. You can create a language model for your general area of interest, and then you can create all kinds of different classifiers pretty quickly. And that's just from fine-tuning the final, randomly added layer. So now we could just unfreeze and keep learning. But something we found is that for NLP, it's actually better to only unfreeze one layer group at a time, not to unfreeze the whole model. So in this case, we've automatically unfrozen the last layer. Then, to unfreeze the last couple of layer groups, we can say freeze_to(-2) and train a little bit more. And look at this: after a bit over a minute, we're already easily beating what we got in lesson 1. And then freeze_to(-3) to unfreeze a few more layers. Now we're up to 94. And then finally, unfreeze the whole model, and we're up to about 94.3% accuracy. And that was literally the state of the art for this very heavily studied dataset just three years ago.
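Here's a sketch of that whole classifier stage, following the fastbook chapter (the learning rates and the slice(...) discriminative-learning-rate pattern are from there; treat the exact numbers as assumptions):

```python
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')  # the saved language-model encoder

learn.fit_one_cycle(1, 2e-2)             # train just the new classifier head
learn.freeze_to(-2)                      # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
learn.freeze_to(-3)                      # ...and one more
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
learn.unfreeze()                         # finally, the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```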
If you also reverse all of the reviews, to make them go backwards, and train a second model on the backwards version, and then average the predictions of those two models as an ensemble, you get to 95.1% accuracy. And that was the state of the art that we actually got in the ULMFiT paper. And it was only beaten for the first time a few months ago, using a way, way bigger model, way more data, way more compute, and way more data augmentation. I should mention, actually, with the data augmentation, one of the cool things they did do was they also figured out a way to even beat our 95.1 with less data. So I should mention that data augmentation has become a really, really important approach since we created the ULMFiT paper. Any questions, Rachel? "Can someone explain how a model trained to predict the last word in a sentence can generalize to classify sentiment? They seem like different domains." Yeah, that's a great question. They're very different domains, and it's really amazing. And basically the trick is that to be able to predict the next word of a sentence, you just have to know a lot of stuff, not only about the language, but about the world. So let's say we wanted to finish the next word of this sentence: "by training a model on all the texts read backwards and averaging the predictions of these two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ___". So to be able to fill in the word "ULMFiT", you would have to know a whole lot of stuff: the fact that there's a thing called pre-trained language models, and which one gets which results, and that ULMFiT got this particular result. I mean, that would be an amazing language model that could fill that in correctly. I'm not sure that any language model can, but that gives you a sense of what you have to be able to do to be good at language modeling. So if you're going to be able to predict the next word of a sentence like "wow, I really love this movie, I love every movie containing Meg ___" (maybe it's Ryan), you would have to know the fact that Meg Ryan is an actress, and that actresses are in movies, and so forth. So when you know that much about English and about the world, to then turn that into something which recognizes that "I really love this movie" is a good thing rather than a bad thing is just not a very big step. And as we saw, you can actually get that far by fine-tuning just the very last layer or two. So it is amazing, and I think that's super, super cool. All right, another question: "How would you do data augmentation on text?" Well, you would probably Google for "unsupervised data augmentation" and read this paper and things that have cited it. So this is the one that easily beat our IMDb result with only 20 labeled examples, which is amazing, right? And they did things like, if I remember correctly, translate every sentence into a different language and then translate it back again, so you get different rewordings of the sentence that way. Yeah, so tricks like that. Now let's go back to the generation thing. So remember, we saw that we can generate context-appropriate sentences. And it's important to think about what that means in practice. Have a look, for example, at what happened even before this technology existed: in 2017, the FCC asked for comments about a proposal to repeal net neutrality.
And it turned out that less than 800,000 of the 22 million comments actually appeared to be unique. And this particular person, Jeff Kao, discovered that a lot of the submissions were slightly different from each other in a template-like way: the green bit would either be "citizens", or "people like me", or "Americans", and the red bit would be "as opposed to", or "rather than", and so forth. And that made a big difference, I believe, to American policy. Here's an example of a Reddit conversation. "You're wrong. The defense budget is a good example of how badly the US spends money on the military." Somebody else: "Yeah, but that's already happening. There's a huge increase in the military budget." "I didn't mean to sound like stop paying for the military. I'm not saying that we cannot pay the bills." All of these were actually created by a language model, GPT-2. And this is a very concerning thing around disinformation. Never mind fake news, never mind deepfakes: think about what would happen if somebody invested a few million dollars in creating a million Twitter bots and Facebook group bots, and way more bots, and made it so that 99% of the content on social networks was written by deep learning bots. And furthermore, they were trained not just to optimize the next word of a sentence, but were trained to optimize the level of disharmony created, or the level of agreeableness for half of them and disagreeableness for the other half. You know, you could create a whole lot of just awful, toxic discussion, which is actually the goal of a lot of propaganda outfits. It's not so much to push a particular point of view, but to make people feel like there's no point engaging, because the truth is too hard to understand, or whatever. So Rachel and I are both super worried about what could happen to discourse now that we have this incredibly powerful tool. And I'm not sure we have a great sense of what to do about it. Algorithms are unlikely to save us here. If you could create a classifier which could do a good job of figuring out whether something was generated by an algorithm or not, then I could just use your classifier as part of my training loop to train an algorithm that learns to trick your classifier. So this is a real worry, and the only solutions I've seen are those based on cryptographic signatures, which is another whole can of worms that has never really been properly sorted out, at least not in the Western world, in a privacy-centric way. All right, so, yes? "Just on that note, I'll add, and I'll link to this on the forums: I gave a keynote at the SciPy conference last summer, which is the scientific Python conference, and went into a lot more detail about the threat that Jeremy's describing, about using advanced language models to manipulate public opinion. And so if you want to learn more about the dangers there and exactly what that threat is, you can find that in my SciPy keynote." Great. Thanks so much, Rachel. So let's have a five-minute break, and see you back here in five minutes. So, we're going to finish with a kind of segue into what will eventually be part two of the course, which is to go right underneath the hood and see exactly how a more complex architecture works. And specifically, we're going to see how a recurrent neural network works. Do we have a question first?
"In the previous lesson's MNIST example, you showed us that under the hood the model was learning parts of the image, like curves of a three or angles of a seven. Is there a way to look under the hood of language models to see if they are learning rules of grammar and syntax? Would it be a good idea to fine-tune models with examples of domain-specific syntax, like technical manuals? Or does that miss the point of having the models learn for themselves?" Yeah, there are tools that allow you to see what's going on inside an NLP model. We're not going to look at them in this part of the course; maybe we will in part two. But it's certainly worth doing some research to see what you can find, and there are certainly PyTorch libraries you can download and play with. And yeah, I think it's a perfectly good idea to incorporate some technical manuals and stuff into your training corpus. There have actually been some recent papers on this general idea of adding some carefully curated sentences to your training corpus. It's unlikely to hurt, and it could well help. All right, so let's have a look at RNNs. Now, when Sylvain and I started creating the RNN stuff for fastai, the first thing I did was to create a new dataset. And the reason for that is I didn't find any datasets that would allow for quick prototyping and really easy debugging. So I made one, which we call human numbers, and it contains the first 10,000 numbers written out in English. And I am surprised at how few people create datasets. I create datasets frequently. I specifically look for things that can be small, easy to prototype with, good for debugging, and quick to try things out on. And very, very few people do this, even though this human numbers dataset has been so useful for us and took me, I don't know, an hour or two to create. So this is definitely an under-appreciated, under-utilized technique. So we can grab the human numbers dataset, and we can see that there's a training and a validation text file. We can open each of them, and for now we're just going to concatenate the two together into a variable called lines. And you can see that the contents are "one", "two", "three", etc., and there's a newline at the end of each. We can concatenate those all together and put a full stop between them, like so. And then we can tokenize that by splitting on spaces. And so, for example, here are tokens 100 and onwards: "forty two", full stop, "forty three", full stop, "forty four", and so forth. So you can see I'm just using plain Python here. There's not even any PyTorch, certainly not any fastai. To create a vocab, we can just find all the unique tokens, of which there are 30. That gives us a lookup from an ID to a word; to go from a word to an ID, we can enumerate that and create a dictionary from word to ID. So then we can numericalize our tokens by looking each one up in that word-to-index dictionary. And so here are our tokens, and here's the equivalent numericalized version. So you can see, on fairly small datasets, when we don't have to worry about scale and speed and the details of tokenization in English, you can do the whole thing in plain Python. The only other thing we did, to save a little bit of time, was use L, but you could easily do that with the Python standard library in about the same amount of code. So hopefully that gives you a good sense of what's really going on with tokenization and numericalization, all done by hand. So let's create a language model.
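Here's a compact sketch of those plain-Python steps, following the notebook (L is fastai's list class, but as noted, a plain list would work just as well):

```python
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text = ' . '.join([l.strip() for l in lines])   # "one . two . three . ..."
tokens = text.split(' ')                        # tokenize: just split on spaces

vocab = L(*tokens).unique()                     # the 30 unique tokens (ID -> word)
word2idx = {w: i for i, w in enumerate(vocab)}  # word -> ID lookup
nums = L(word2idx[i] for i in tokens)           # the numericalized tokens
```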
So one way to create a language model would be to go through all of our tokens and create a range from zero to the length of our tokens minus four, stepping by three. And that's going to allow us to grab three tokens at a time: tokens one, two, three; then four, five, six; then seven, eight, nine; and so forth. So here's the first three tokens, and then here's the fourth token. And here's the second three tokens, and here's the seventh token, and so forth. So these are going to be our independent variables, and this will be our dependent variable. So here's a super, super naive, simple language model dataset for our human numbers question. So we can do exactly the same thing as before, but use the numericalized version and create tensors. This is exactly the same thing as before, but now numericalized and as tensors. And we can create a DataLoaders object from datasets. And remember, these are datasets because they have a length and we can index into them, right? And so we can just grab the first 80% of the tokens as the training set and the last 20% as the validation set, like so, batch size 64, and we're ready to go. So we really used very, very little: the only PyTorch we used was to create these tensors, and the only fastai we used was to create the DataLoaders. And it's just grabbing directly from the dataset. So it's really not doing anything clever at all. So let's see if we can now create a neural network architecture which takes three numericalized words at a time as input and tries to predict the fourth as the dependent variable. So here is just such a language model. It's a three-layer neural network. So we've got a linear layer here, which we're going to use once, twice, three times, and after each of them we call ReLU, as per usual. But there's a little bit more going on. The first interesting thing is that rather than each of these being a different linear layer, we've just created one linear layer here, which we've reused, as you can see, one, two, three times. So that's the first thing that's a bit tricky. And so there are a few things going on that are a little bit different from usual. But the basic idea is: here we've created an embedding, an nn.Linear, and another nn.Linear. And then here we've used the linear layers and ReLU. So it's very nearly a totally standard three-layer neural net. I guess four, really, because there's an output layer. Yes, Rachel. We have a question. Sure. "Is there a way to speed up fine-tuning the NLP model? 10+ minutes per epoch slows down the iterative process quite a bit. Any best practices or tips?" I can't think of any, obviously, other than to say you don't normally need to fine-tune it that often. You know, the work is often more at the classifier stage. So, yeah, I tend to just leave it running overnight, or while I have lunch, or something like that. Just make sure you don't sit there watching it; go and do something else. This is where it can be quite handy to have a second GPU, or fire up a second AWS instance or whatever, so you can keep moving while something's training in the background. All right. So what's going on in this model? To describe it, we're actually going to develop a little pictorial representation. And the pictorial representation is going to work like this. Let's start with a simple linear model to define this pictorial representation. A simple linear model has an input of size batch size by number of inputs.
And so we're going to use a rectangle to represent an input. We're going to use an arrow to represent a layer computation. So in this case, there's going to be a matrix product... actually, sorry, this is a single-hidden-layer model, so there'll be a matrix product followed by a ReLU. So that's what this arrow represents. And out of that, we're going to get some activations. And so circles represent computed activations. We call this a hidden layer, and it'll be of size batch size by number of activations. That's its size. And then to create a neural net, we're going to do a second matrix product, and this time a softmax. So the computation is again represented by the arrow. And then output activations are a triangle. So the output would be batch size by number of classes. So let me show you the pictorial version of this. So this is going to be the legend: triangle is output, circle is hidden, rectangle is input. And here it is. We're going to take the first word as an input, and it's going to go through a linear layer and a ReLU. And you'll notice here, I've deleted the details of what the operations are at this point, and I've also deleted the sizes. So every arrow is basically just a linear layer followed by a nonlinearity. So we take the word one input, and we put it through the linear layer and the nonlinearity to give us some activations. So there's our first set of activations. And then we put that through another linear layer and nonlinearity to get some more activations. And at this point, we get word two. And word two now goes through a linear layer and a nonlinearity. And when two arrows together come into a circle, it means that we add (or concatenate; either is fine) the two sets of activations. So we'll add the set of activations from this input to the set of activations from here, to create a new set of activations. And then we'll put that through another linear layer and a ReLU. And again, word three is now going to come in and go through a linear layer and a ReLU, and get added to create another set of activations. And then they'll finally go through a final linear layer and ReLU and a softmax to create our output activations. So this is our model. It's basically a standard one, two, three, four layer model, but a couple of interesting things are going on. The first is that we have inputs coming in to later layers and getting added. So that's something we haven't seen before. And the second is that all of the arrows that are the same color use the same weight matrix. So every time we get an input, we're going to put it through one particular weight matrix. And every time we go from one set of activations to the next, we'll put it through a different weight matrix. And then to go from the activations to the output, we'll use yet another weight matrix. So if we now go back to the code: to go from input to hidden, not surprisingly, we always use an embedding. So in other words, an embedding is the green arrow. Okay. And you'll see we just create one embedding. And here is the first use: here's x, which is the three words. So here's the first word, x[:,0], and it goes through that embedding. And word two goes through the same embedding. And word three, index number two, goes through the same embedding. And then each time, as you see, we add it to the current set of activations. And so having got the embedding, we then put it through this linear layer.
And again, we get the embedding, add it to the activations, and put it through that linear layer. And again, the same thing here: put it through the same linear layer. So h is the orange: this set of activations, which we call the hidden state. Okay. And so the hidden state is why it's called h. And so if you follow through these steps, you'll see how each of them corresponds to a step in this diagram. And then finally, at the end, we go from the hidden state to the output, which is this linear layer: hidden state to output. Okay. And then we don't have the actual softmax there, because, as you'll remember, we can incorporate that directly into the loss function, the cross entropy loss function, using PyTorch. So one nice thing about this is that everything we're using, we have previously created from scratch. So there's nothing magic here: we've created our own embedding layer from scratch, we've created our own linear layer from scratch, we've created our own ReLU from scratch, we've created our own cross entropy loss from scratch. So you could actually try building this whole thing yourself from scratch. In terms of the nomenclature, i_h: h refers to hidden, so this is the layer that goes from input to hidden. h_h is the one that goes from hidden to hidden. h_o is the one that goes from hidden to output. So if any of this is feeling confusing at any point, go back to where we actually created each one of these things from scratch, and create it from scratch again. Make sure you actually write the code so that nothing here is mysterious. So why do we use the same embedding matrix each time we have a new input word, for input words index zero, one, and two? Well, because conceptually they all represent English words, you know, for human numbers. So why would you expect them to have a different embedding? They should all have the same representation; they all have the same meaning. Same for this hidden-to-hidden: each time, we're basically describing how to go from one token to the next in our language model, so we would expect it to be the same computation. So that's basically what's going on here. So having created that model, we can go ahead and instantiate it. So we're going to have to pass in the vocab size for the embedding and the number of hidden, right? That's the number of activations. So here we create the model, and then we create a Learner by passing in the model and our data loaders and a loss function and, optionally, metrics. And we can fit. Now, of course, this is not pre-trained, right? This is not an application-specific learner, so it wouldn't know what pre-trained model to use. So this is all random. And we're getting somewhere around 45 to 50% accuracy. Is that any good? Well, you should always compare to a baseline; not random exactly, but the simplest model, where the simplest model is some average or something like that. So what I did is I grabbed the validation set, so all the tokens, and put them into a Python standard library Counter, which simply counts how many times each thing appears. I found that the word "thousand" is the most common. And then I said, okay: 7,104, that's how many times "thousand" appears here; divide that by the length of the tokens, and we get 15%. In other words, if we always just predicted "I think the next word will be thousand", we would get 15% accuracy. But with this model we got around 45 to 50% accuracy. So in other words, our model is a lot better than the simplest possible baseline.
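Here's a sketch of that dataset and model, following the notebook (this is essentially fastbook's LMModel1; bs=64, n_hidden=64, and the training schedule are the values the chapter uses, so treat them as assumptions):

```python
# Independent variable: three consecutive tokens; dependent: the fourth.
seqs = L((tensor(nums[i:i+3]), nums[i+3])
         for i in range(0, len(nums)-4, 3))

bs = 64
cut = int(len(seqs) * 0.8)  # first 80% for training, last 20% for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input  -> hidden (green)
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden -> hidden (orange)
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden -> output (blue)

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:, 0])))  # word 1
        h = h + self.i_h(x[:, 1])                 # add word 2's embedding...
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:, 2])                 # ...and word 3's
        h = F.relu(self.h_h(h))
        return self.h_o(h)  # no softmax: it lives in the cross entropy loss

learn = Learner(dls, LMModel1(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
```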
So we've learned something useful. That's great. So the first thing we're going to do is refactor this code, because you can see we've got: x going into i_h, into h_h, into ReLU; x going into i_h, into h_h, into ReLU; x going into i_h, into h_h, into ReLU. How would you refactor that in Python? You would, of course, use a for loop. So let's write it again. These lines of code are identical, as is this one. So instead of doing all that stuff manually, we create a loop that goes around three times, and each time through it does i_h, adds that to our hidden state, then h_h, then ReLU. And then at the end, hidden to output. So this is exactly the same thing as before, just refactored with a for loop. And we can train it again, and we get basically the same 45 to 50%, as you would expect, because it's exactly the same model; it's just been refactored.

So here's something crazy: this is a recurrent neural network. Even though it's exactly the same as what we had (it's just been refactored into a loop), believe it or not, that's actually all an RNN is. An RNN is a simple refactoring of that deep learning model of simple linear layers with ReLUs that we just saw.

So let's draw our pictorial representation again. This was our previous picture; we can refactor the picture as well. Instead of showing the middle steps separately, we can take this arrow and represent it as a loop, because that's all that's happening: word one goes through an embedding into these activations, and then the middle part just gets repeated for words two through n, where n here is three, with each new word coming in as well. So we've just refactored our diagram, and eventually it goes through our blue arrow to create the output. So this diagram is exactly the same as the previous one, just with the middle replaced by that loop. That's a recurrent neural net.

And h, remember, was something that we just kept track of (h, h, h) as we added each layer to it; here we just have it inside the loop. We initialize it as zero, which is a bit of a trick: the reason we can do that is that zero plus a tensor will broadcast the zero. That's a neat little feature, and it's why we don't have to make h a tensor of a particular size to start with. Okay, so we're going to be seeing the term "hidden state" a lot, and it's important to remember that the hidden state simply means these activations that are computed inside our recurrent neural net. And our recurrent neural net is just a refactoring of a particular kind of fully connected deep model. So that's it. That's what an RNN is. No questions at this point, Rachel?

Something that's a bit weird about it, though, is that for every batch we're setting our hidden state back to zero, even though we're going through the entire human numbers data set in order. You would think that by the time you've gone "one, two, three", you shouldn't then forget everything you've learned when you get to "four, five, six". It would be great to actually remember where we're up to, and not reset the hidden state back to zero every time. So we can absolutely do that: we can maintain the state of our RNN.
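Before we do, here's the loop-refactored version as a sketch (again following the notebook's naming as I recall it):

```python
import torch
from torch import nn
import torch.nn.functional as F

class LMModel2(nn.Module):
    "The same model as LMModel1, refactored with a for loop; this is all an RNN is."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0  # zero broadcasts against the first embedding, so no explicit shape needed
        for i in range(3):  # one iteration per input word
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
```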
And here's how we do that: rather than having something called h, we'll call it self.h, and we'll set it to zero at the start, when we first create our model. Everything else here is the same, except there's just one extra line of code. What's going on there? Well, here's the thing. If h is something which persists from batch to batch, then effectively this loop is becoming extremely long. Our deep learning model is therefore effectively (not infinitely deep, but) as deep as the entire size of our data set, because with every batch we're stacking new layers on top of the previous layers. The reason this matters is that when we then do backpropagation, when we calculate the gradients, we would have to calculate the gradients all the way back through every one of those layers. So by the time we get to the end of the data set, we'd be backpropagating not just through this loop, but, remember, self.h was also created by the previous call to forward, and the one before that, and the one before that, all the way back to the start. So we'd have an incredibly slow calculation of the gradients. It's also going to use up a whole lot of memory, because it has to store all those intermediate activations in order to calculate the gradients.

So that's the problem. And the problem is easily solved by calling detach. What detach does is basically say: throw away my gradient history; forget that I was calculated from other tensors. The activations are still stored, but the gradient history is no longer kept, so this cuts off the gradient computation. This is called truncated backpropagation. So these are exactly the same lines of code as the other two models: h = 0 has just moved into self.h = 0, these lines of code are identical, and we've added one more line of code. The only other thing is that, from time to time, we might need to reset self.h to zero, so I've created a method for that, and we'll see how that works shortly. Okay, and sorry, I was using the wrong jargon: backpropagation through time is what we call it when we calculate the backprop going back through this loop.

All right. Now, we do need to make sure that the samples are seen in the correct order, given that every batch needs to connect up to the previous batch. Go back to notebook 10 to remind yourself of what that needs to look like. But basically, the length of our list of sequences divided by the batch size (call it m) is 328. So the first batch will be the sequences at index zero, m, two times m, and so forth; the second batch will be one, m plus one, two times m plus one, and so forth. The details don't matter too much, but here's how we do that indexing. So now we can call that group_chunks function to create our training set and our validation set, and we certainly don't shuffle, because that would break the ordering entirely. And then there's one more thing we need to do, which is to make sure that at the start of each epoch we call reset. Because at the start of an epoch, we're going back to the start of our human numbers, so we need to set self.h back to zero. So, something that we'll learn about in part two is that fastai has something called callbacks.
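We'll come to callbacks in a moment. First, here's the stateful model, plus a plain-Python version of the group_chunks reordering idea, as sketches (the notebook's group_chunks uses fastai's L class; treat both as illustrations rather than the exact notebook code):

```python
import torch
from torch import nn
import torch.nn.functional as F

class LMModel3(nn.Module):
    "Like LMModel2, but the hidden state persists from batch to batch."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0  # the hidden state now lives on the model

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()  # truncated backprop: keep the values, drop the gradient history
        return out

    def reset(self): self.h = 0  # called at the start of each epoch

def group_chunks(ds, bs):
    "Reorder samples so that item i of one batch is continued by item i of the next batch."
    m = len(ds) // bs  # 328 in the notebook's case
    return [ds[i + m * j] for i in range(m) for j in range(bs)]
```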
And callbacks are classes which allow you to basically say: during the training loop, I want you to call some particular code. And in this case, it's going to call this code. You can see callbacks can be very small; they normally are. When we start training, it'll call reset; when we start validation, it'll call reset; and when we're all finished fitting, it'll call reset. And what does reset do? It does whatever you tell it to do, and we told it to do self.h = 0. So if you want to use a callback, you simply add it to the callbacks list, cbs, when you create your Learner. And so now when we train, that's way better. Okay, so we've now got a stateful RNN: it's actually keeping the hidden state from batch to batch.

Now, we've still got a bit of an obvious problem, which is that if you look back to the data that we created, we used the first three tokens to predict the fourth, and then the next three tokens to predict the seventh, and then the next three to predict the one after, and so forth. What you'd rather do, you would think, is predict every word, not just every fourth word. It seems like we're throwing away a lot of signal here, which is pretty wasteful. So we want to create more signal. And the way to do that is: rather than putting this output stage outside the loop (this dotted area is the bit that's looped), what if we put the output inside the loop? In other words, after every hidden state is created, we immediately make a prediction. And so that way, we can predict after every time step, and our dependent variable can be the entire sequence of words offset by one. So that gives us a lot more signal.

So we have to change our data, so that the dependent variable has each of the next three words after each of the three inputs. Instead of the input being the tokens from i to i plus sl, with just the single token after that as output, we're going to have the entire sequence offset by one as our dependent variable. We can then do exactly the same as before to create our data loaders, and you can now see that each dependent variable is exactly the same as the corresponding independent variable, just offset by one.

Okay. And then we need to modify our model very slightly. This code is all exactly the same as before, but rather than returning one output, we create a list of outputs, append to it after every iteration of the loop, and at the end stack them all up. And then this is the same. So it's nearly exactly the same; just a very minor change. We do need to create our own loss function, though, which is just cross entropy loss, but with everything flattened out: the target gets flattened, and the input gets flattened. So then we can pass that as our loss function, everything else here is the same, and we can fit. And we've gone from 58% to 64%, so it's improved a bit. That's good. We did find this a little flaky: sometimes it would train really well, sometimes it wouldn't train so great; but we often got this reasonably good answer. Now, there's one problem here that's easier to see if we step back: this version with the loop in it is the normal way to think about an RNN.
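Here's a sketch of that predict-at-every-step change, together with the flattened loss function (my reconstruction rather than the exact notebook code; F.cross_entropy wants predictions of shape (N, classes) and targets of shape (N,), hence the flattening):

```python
import torch
from torch import nn
import torch.nn.functional as F

class LMModel4(nn.Module):
    "Like LMModel3, but outputs a prediction after every time step, not just the last one."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(x.shape[1]):  # loop over the whole sequence length
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))  # predict inside the loop
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)  # shape: (batch, seq_len, vocab_sz)

    def reset(self): self.h = 0

def loss_func(inp, targ):
    "Cross entropy over every time step: flatten predictions and targets first."
    return F.cross_entropy(inp.view(-1, inp.shape[-1]), targ.view(-1))
```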
But perhaps an easier way to think about that looped version is what we call the unrolled version, which is when you look at it like this. Now, if you unroll this stateful neural net, it is quite deep. But every single one of the hidden-to-hidden layers uses exactly the same weight matrix. So really, it's not that deep at all, in the sense that it can't do very sophisticated computation, because it has to use the same weight matrix every time. In some ways, it's not much smarter than a plain linear model. So it would be nice to create a truly deep model, with multiple different layers that it can go through. We can do that easily enough by creating something called a multilayer RNN. And all we do is we basically take the diagram we just saw, and we repeat it. But, and this is actually a bit unclear in the picture, the dotted arrows here use different weight matrices from the non-dotted arrows. So we can have a different hidden-to-hidden weight matrix in the second set of RNN layers, and a different weight matrix feeding into that second set. And so this is called a stacked RNN, or a multilayer RNN. And here's the same thing in the unrolled form: exactly the same thing, just unrolled.

Writing this out by hand would be quite a good exercise (particularly this one), but it's kind of tedious, so we're not going to bother. Instead, we're going to use PyTorch's RNN class. PyTorch's RNN class is basically doing exactly what we saw here, specifically this part here and this part here. But it's nice that it also has an extra number-of-layers parameter that lets you tell it how many to stack on top of each other. So it's important, when you start using PyTorch's RNN, to realize there's nothing magic going on: you're just using the refactored for loop that we've already seen. So we still need the input-to-hidden embedding. This is now the hidden-to-hidden part, with the loop all done for us. And then this is the hidden-to-output, just as before. And this is our hidden state, just like before. So now we don't need the loop; we can just call self.rnn and it does the whole loop for us. We can also do all of the input-to-hidden at once, to save a little bit of time, thanks to the wonder of embedding matrices. And as per usual, we have to call detach to avoid an effectively super-deep network, and then pass the result through our output linear layer. So this is exactly the same as the previous model, except that we've refactored it using nn.RNN and said we want more than one layer. So let's request, say, two layers. We still need the model resetter, just like before, because remember, nothing's changed there. And let's go ahead and fit. And, oh, it's terrible.

So why is it terrible? Well, the reason it's terrible is that now we really do have a very deep model, and very deep models are really hard to train, because we can get exploding or disappearing activations. So what that means is: we start out with some initial state, and we're gradually putting it through all of these layers and all of these layers. And each time, we're doing a matrix multiplication, which, remember, is just doing a whole bunch of multiplies and adds. And then we multiply and add, and we multiply and add, and we multiply and add, over and over.
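Here's what that nn.RNN refactoring looks like as a sketch (I'm passing the batch size in explicitly so the hidden state can be allocated up front; the notebook uses a global for this, as I recall):

```python
import torch
from torch import nn

class LMModel5(nn.Module):
    "Multilayer RNN: nn.RNN runs the loop for us, stacked n_layers deep."
    def __init__(self, vocab_sz, n_hidden, n_layers, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        # one hidden state per stacked layer; move to the right device if training on GPU
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        # embed the whole sequence at once, then let nn.RNN do the loop
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()   # truncated backprop, as before
        return self.h_o(res)  # a prediction at every time step

    def reset(self): self.h.zero_()
```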
And if you multiply and add like that enough times, you can end up with very, very big results (if the numbers you're multiplying by are even a bit big) or very, very small results, particularly because we're putting things through the same layer again and again. And why is that a problem? Well, if you multiply by two a few times, you get one, two, four, eight, et cetera, and after 32 steps, you're already past four billion. Or if you start at one and multiply by a half a few times, after 32 steps you're down to a tiny number. So a number even slightly higher or lower than one can cause an explosion or disappearance of a number. And matrix multiplication is just multiplying numbers and adding them up, so exactly the same thing happens with matrix multiplication: you get matrices whose values grow really big or really small. And when that happens, exactly the same thing happens to the gradients: they get really big or really small too.

One of the problems here is that numbers are not stored precisely in a computer; they're stored using something called floating point. We stole this nice diagram from an article called "What you never wanted to know about floating point but will be forced to find out", and here we are at the point where we're forced to find out. It's basically showing us the granularity with which numbers are stored. The numbers that are further away from zero are stored much less precisely than the numbers that are close to zero. And if you think about it, that means the gradients for very big numbers could actually become zero themselves, because you can end up with two numbers that both fall between these little gradations. And you end up with the same thing for really small numbers: although they're stored closer together, the numbers they represent are also very close together, so in both cases, the relative accuracy gets worse and worse. So you really want to avoid this happening.

There are a number of ways to avoid it, and this is the same for really deep convolutional neural nets, or really deep standard tabular networks: any time you have too many layers, it can become difficult to train, and you generally have to use either really small learning rates, or special techniques that avoid exploding or disappearing activations or gradients. For RNNs, one of the most popular approaches is an architecture called an LSTM. And I am not going to go into the details of an LSTM from scratch today (it's in the book and in the notebook), but the key thing to know about an LSTM is that, rather than just being a matrix multiplication, it is this: there are a number of linear layers that the input goes through, and those linear layers are combined in particular ways. And the way they're combined, which is shown in this diagram here, is designed so that there are little mini neural networks inside the layer, which decide how much of the previous state is kept, how much is thrown away, and how much of the new state is added. And by letting it have little neural nets to calculate each of these things, the LSTM layer (which, again, is shown here) can decide how much of an update to do at each time step.
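To make the explosion and disappearance point concrete, here's a quick sanity check you can run (just plain arithmetic; nothing notebook-specific about it):

```python
import torch

x = torch.ones(1)
for _ in range(32): x = x * 2.0
print(x)  # tensor([4.2950e+09]): multiplying by 2 for 32 steps explodes

y = torch.ones(1)
for _ in range(32): y = y * 0.5
print(y)  # tensor([2.3283e-10]): multiplying by 0.5 for 32 steps vanishes
```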
And then with that gating capability, it can basically avoid updating too much or updating too little. And by the way, this code can be refactored (which Sylvain did here) into a much smaller amount of code; these two versions are exactly the same thing. So as I said, I'm not going to worry too much about the details of how this works for now. The important thing to know is that you can replace the matrix multiplication in an RNN with this sequence of matrix multiplications, sigmoids, multiplies, and adds. And when you do so, you will very significantly decrease the amount of exploding or disappearing gradients and activations. So that's called an LSTM cell, and an RNN which uses this instead of a plain matrix multiplication is called an LSTM.

And so you can replace nn.RNN with nn.LSTM. Other than that, we haven't really changed anything, except that LSTMs carry more state (a cell state as well as a hidden state), so our stored hidden state has to hold more pieces as well. But we can call it in just the same way as we did before. We can detach just like before, except that the state is now a list, so we have to detach each element of it, and pop the result through our output layer, which is exactly as before. Reset is just as before, except it has to loop through each piece of state. And we can fit it in exactly the same way as before, and as you can see, we end up with a much better result, which is great.

We have two questions. Okay, perfect. "Could we somehow use regularization to try to make the RNN parameters close to the identity matrix? Or would that cause bad results, because the hidden layers want to deviate from the identity during training?" So we're actually about to look at regularization, so we will take a look. The identity matrix, for those that don't know or don't remember, is the matrix where, if you multiply by it, you get back exactly what you started with, just like multiplying by one gives you back the same number you started with. For linear algebra, if you multiply a matrix by the identity matrix, you get the same matrix you started with. And actually, one quite popular approach to initializing the hidden-to-hidden weights is to initialize them with an identity matrix, which ensures that you start with something that doesn't have gradient explosions or activation explosions. And we're about to have a look at some more regularization approaches, so let's wait until we do that.

All right, next question: "Is there a way to quickly check if the activations are disappearing or exploding?" Absolutely. Just go ahead and calculate them. We'll be looking at that in a lot more detail in part two, but a really great exercise would be to try to figure out how you can actually output the activations of each layer. And it would certainly be very easy to do that in the RNNs that we built ourselves from scratch, because we can actually see the linear layers. So you could just print them out, or print out some statistics, or store them away, or something like that. fastai has a class called ActivationStats, which you can check out if you're interested; that's a really good way to do specifically this. Okay. So, yes, regularization is important: we have potentially a lot of parameters and a lot of layers.
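Here's the LSTM version of our model as a sketch, before we add any regularization (note the two pieces of state that need detaching and resetting):

```python
import torch
from torch import nn

class LMModel6(nn.Module):
    "Same as LMModel5, but with nn.LSTM, and two pieces of state instead of one."
    def __init__(self, vocab_sz, n_hidden, n_layers, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        # an LSTM keeps both a hidden state and a cell state
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), tuple(self.h))
        self.h = [h_.detach() for h_ in h]  # the state is a pair now, so detach each piece
        return self.h_o(res)

    def reset(self):
        for h in self.h: h.zero_()
```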
It would be really nice if we could do the same kind of thing that we've done with our CNNs and so forth, which is to use more parameters, but then use regularization to ensure that we don't overfit. And we can certainly do that with an LSTM as well. And perhaps the best way to do that is to use something called dropout. Dropout is not just used for RNNs; dropout is used all over the place, but it works particularly well in RNNs. This is a picture from the dropout paper, and what happens in dropout is: here's a picture of some fully connected layers (two, or I guess three, of them), and in the first couple of layers, we delete some of the activations at random. That's what the crosses mean: deleting those activations at random. And if we do so, you can see we end up with a lot less computation going on. And what dropout does is, for each mini-batch, it randomly deletes a different set of activations from whichever layers you ask for. That's what dropout does.

So basically, the idea is that dropout helps the model generalize: if a particular activation was effectively memorizing some particular piece of input, then sometimes it gets randomly deleted, and so suddenly it's not going to do anything useful at all. So by randomly deleting activations, dropout ensures that activations can't become over-specialized at doing just one thing. Because if they did, then the times they're randomly deleted, the model wouldn't work.

So here is the entire implementation of a dropout layer. You pass it some value p, which is the probability that an activation gets deleted, and we store that away. And so then, in forward: if we're not training (that is, if we're doing validation), then we don't do dropout at all. But if we are training, then we create a mask, which is a Bernoulli random variable. So what does that mean? It means it's a bunch of ones and zeros, where 1 - p is the probability of getting a one (keeping the activation), and p is the probability of getting a zero (deleting it). And so then we just multiply that mask by our input, which converts some of the inputs into zeros, which is basically deleting them. You should check out some of the details, for example, about why we divide by 1 - p, which is described here. And we do point out here that normally, in the lesson, I would show you an example of what bernoulli_ does; but of course, now that we're getting to the advanced classes, you're expected to do it yourself. So be sure to create a little cell here, make a tensor, run bernoulli_ on it, and make sure you see exactly what it's doing, so that you can understand this class.

Now, of course, we don't have to use this class we made ourselves; we can just use nn.Dropout. But you could use this class yourself, because it does the same thing. So again, you know, we're trying to make sure that we know how to build stuff from scratch. This special self.training attribute is set for every module automatically by fastai, based on whether you're in the validation part or the training part of your training loop. It's also part of PyTorch.
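Here's that dropout layer written out: essentially the implementation just described, as a plain nn.Module (note the self.training check we were just talking about):

```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p  # probability that an activation gets deleted

    def forward(self, x):
        if not self.training: return x  # no dropout during validation/inference
        # ones (keep) with probability 1-p, zeros (delete) with probability p
        mask = x.new(*x.shape).bernoulli_(1 - self.p)
        # dividing by 1-p keeps the expected value of the activations unchanged
        return x * mask.div_(1 - self.p)
```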
And in PyTorch, if you're not using fastai, you have to call the train method on a module to set training to true, and the eval method to set it to false, for every module inside some other module. So that's one great approach to regularization. Another approach, which I've only seen used in recurrent neural nets, is activation regularization and temporal activation regularization, which is very, very similar to the question that we were just asked. Activation regularization looks very similar to weight decay. But rather than adding some multiplier times the sum of squares of the weights, we add some multiplier times the sum of squares of the activations. So in other words, we're basically saying: we're not just trying to decrease the weights, but to decrease the total activations. And then, similarly, we can also look at the difference between the activations from the previous time step to this time step: take the difference, square it, and multiply by some value. So these are two hyperparameters, alpha and beta, and the higher they are, the more regularized your model. And so with TAR, we're saying that no layer of the LSTM should too dramatically change the activations from one time step to the next. And then for alpha (that's AR), we're saying that no layer of the LSTM should create too-large activations. And so the model won't actually create those large activations or large changes unless the loss improves by enough to make it worth it.

Okay, so there's then, I think, just one more thing we need to know about, which is called weight tying. And weight tying is a very minor change. Let's have a look at it here. So this is the embedding we had before. This is the LSTM we had before. This is where we're going to introduce dropout. This is the hidden-to-output linear layer we had before. But we're going to add one more line of code, which sets the hidden-to-output weights to actually be the input-to-hidden weights. Now, this is not just setting them once: it makes them a reference to the exact same object in memory, the exact same tensor in memory. So the weights of the hidden-to-output layer will always be identical to the weights of the input-to-hidden layer. And this is called weight tying. And the reason we do this is that, conceptually, in a language model, predicting the next word is about converting activations into English words, whereas an embedding is about converting English words into activations. And there's a reasonable hypothesis that these are basically exactly the same computation, or at least the reverse of it. So why shouldn't they use the same weights? And it turns out, lo and behold, that yes, if you use the same weights, it actually does work a little bit better.

So then here's our forward: do the input-to-hidden, do the RNN, apply the dropout, do the detach, and then apply the hidden-to-output, which is using exactly the same weights as the input-to-hidden. And reset is the same. We haven't created the RNN regularizer from scratch here, but you can add it as a callback, passing in your alpha and your beta. If you call TextLearner instead of Learner, it will add the ModelResetter and the RNNRegularizer for you. So that's one of the things TextLearner does. So this code is the same as this code. And so we can then train the model again, and let's also add weight decay. And look at this: we're getting up close to 90% accuracy.
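Here's a sketch of what that final model looks like, with dropout, weight tying, and the extra outputs that let a callback apply the AR and TAR penalties (this is my reconstruction, not the exact notebook code; the penalty terms a regularizer callback would add are shown as comments):

```python
import torch
from torch import nn

class LMModel7(nn.Module):
    "LSTM language model with dropout and weight tying."
    def __init__(self, vocab_sz, n_hidden, n_layers, p, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight  # weight tying: the same tensor, not a copy
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw, h = self.rnn(self.i_h(x), tuple(self.h))
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        # raw and dropped-out activations are returned too, so a callback can add:
        #   loss += alpha * out.pow(2).mean()                        # AR: keep activations small
        #   loss += beta * (raw[:, 1:] - raw[:, :-1]).pow(2).mean()  # TAR: keep them smooth in time
        return self.h_o(out), raw, out

    def reset(self):
        for h in self.h: h.zero_()
```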
So we've covered a lot in this lesson, but the amazing thing is that we've just replicated all of the pieces of an AWD-LSTM: all of the pieces of this state-of-the-art recurrent neural net, which we showed in the previous notebook could get what was, until very recently, state-of-the-art results for text classification, and far more quickly, with far less compute and memory, than the more modern approaches of the last year or so that have beaten that benchmark. So this is a really efficient, really accurate approach. It's still the state of the art in many, many academic situations, and it's still very widely used in industry. And so it's pretty cool that we've actually seen how to write it from scratch. The main thing to mention from the further research section is to have a look at the source code for AWD-LSTM in fastai, and see if you can work out how its lines of code map to the concepts that we've seen in this chapter. Rachel, do we have any questions?

So here we have come to the conclusion of what was originally going to be seven lessons and turned into eight. I hope that you've got a lot out of this. Thank you for staying with us. What a lot of people do when they finish, at least people who finished previous courses, is go back to lesson one and try to repeat it, but doing a lot less looking at the notebooks, a lot more building things from scratch yourself, and going deeper into the assignments. So that's one thing you could do next. Another thing you could do next would be to pick out a Kaggle competition to enter, or pick a book or a paper about deep learning that you want to read, and team up with some friends to do a paper reading group or a book reading group. You know, one of the most important things for keeping the learning going is to get together with other people on the learning journey.

Another great way to do that, of course, is through the forums. So if you haven't been using the forums much so far, no problem; now might be a great time to get involved and find some projects that are going on that look interesting. And it's fine if you're not an expert. Obviously, on any of those projects, the people that are already doing them are going to know more about them than you do at this point, because they're already doing them. But if you drop into a thread and say, "Hey, I would love to learn more about this, how do I get started?", or have a look at the wiki posts to find out, and try things out, you can start getting involved in other people's projects and helping them out.

And of course, don't forget about writing. So if you haven't tried writing a blog post yet, maybe now's a great time to do that. Pick something that's interesting to you, especially if it's something in your area of expertise at work, or a hobby, or something specific to where you live. Maybe you could try to build some kind of text classifier or text generator for the particular kinds of text that you know about. That would be a super interesting thing to try out, and be sure to share it with the folks on the forum. So there's a few ideas. Don't let this be the end of your learning journey. Keep going, and then come back and try part two. If it's not out yet, obviously, you'll have to wait until it is out.
But if it is out, you might want to spend a couple of months really experimenting with all of this before you move on to part two, to make sure that everything in part one feels pretty solid to you. Well, thank you very much, everybody, for your time. We've really enjoyed doing this course. It's been a tough course for us to teach, with all this COVID-19 stuff going on at the same time, and I'm really glad we've got through it. I'm particularly grateful to Sylvain, who has been extraordinary in really making so much of this happen, particularly since I've been so busy with COVID-19 work, around masks in particular. It's in large part thanks to Sylvain that everything has come together. And of course, to Rachel, who's been here with me on every one of these lessons. Thank you so much. And I'm looking forward to seeing you again in a future course. Thanks, everybody.