So welcome to lesson 10, or as somebody on the forum described it, lesson 10 mod 7, which is probably a clearer way to think about this. We're going to be talking about NLP. Before we do, let's do a quick review of last week. There are quite a few people who have flown here to San Francisco for this in-person course; I'm seeing them pretty much every day, they're working full-time on this, and quite a few of them are still struggling to understand the material from last week. So if you're finding it difficult, that's fine. One of the reasons I put it up front is so that we've got something to cogitate about and gradually work towards, so that by lesson 14 mod 7 you'll get a second crack at it. It's well worth it: there are so many pieces, and hopefully you can keep developing a better understanding. To understand the pieces you'll need to understand the shapes of convolutional layer outputs, receptive fields, loss functions, and so on, all of which you're going to need for all of your deep learning studies anyway. So everything you do to develop an understanding of last week's lesson is going to help you with everything else.

One key thing I wanted to mention: we started out with something which is really pretty simple, a single object classifier, a single object bounding box without a classifier, and then a single object classifier-and-bounding-box. Anybody who has spent some time studying since lesson 8 mod 7 has hopefully got to the point where they understand that bit. The reason I mention it is that the part where we go to multiple objects is actually almost identical, except we first have to solve the matching problem. We end up creating far more activations than we need for the number of ground truth bounding boxes, so we match each ground truth object to a subset of those activations. Once we've done that, the loss function we then apply to each matched pair is almost identical to the single-object loss function. So if you're feeling stuck, go back to lesson 8 and make sure you understand the dataset, the data loader, and most importantly the loss function from the end of lesson 8 and the start of lesson 9.

So once we've got this thing which can predict the class and bounding box for one object, we went to multiple objects by just creating more activations. We then had to deal with the matching problem, and having dealt with it, we basically moved each of those anchor boxes in and out a little bit and around a little bit, so they tried to line up with particular ground truth objects. We talked about how we took advantage of the convolutional nature of the network to try to have activations whose receptive field was similar to the ground truth object we were predicting. And Chloe Sultan provided this fantastic picture, I guess for her own notes, but she shared it with everybody, which is lovely: a walk-through of what SSD_MultiHead.forward does, line by line.
I partly wanted to show this to help you with your revision, but I also wanted to show it to say: doing this kind of thing is very useful. Walk through the code in whatever way helps you make sure you understand it. You can see what Chloe's done here is focus particularly on the dimensions of the tensor at each point along the path, as we gradually downsample using these stride-2 convolutions, making sure she understands why those grid sizes happen, and then understanding how the outputs come out of those.

One thing you might be wondering is: how did Chloe calculate those numbers? I don't know the answer, I haven't spoken to her, but obviously one approach would be from first principles, just thinking it through. But then you want to know: am I right? And this is where you've got to remember the pdb.set_trace() idea. Just before class, I went into SSD_MultiHead.forward, entered pdb.set_trace(), and ran a single batch. I put the trace at the end, and then I could just print out the size of all of these tensors. (Which, by the way, reminds me: last week there may have been a point where I said 21 plus 4 equals 26, which is not true in most universes.) When I code, I do stuff like that all the time; that's why we have debuggers and know how to check things in small little bits along the way. So this idea of putting a debugger inside your forward function and printing out the sizes is super helpful, or you could just put a print statement there as well. I don't actually know if that's how Chloe figured it out, but that's how I would have.
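To make that concrete, here's a minimal sketch of the trick. This isn't the real SSD_MultiHead; it's a toy module with made-up layer sizes, but the pattern of dropping pdb.set_trace() at the end of forward and printing sizes is exactly the same:

```python
import pdb
import torch
import torch.nn as nn

class TinyHead(nn.Module):
    """A stand-in for something like SSD_MultiHead, just to show the trick."""
    def __init__(self):
        super().__init__()
        self.sconv = nn.Conv2d(256, 256, 3, stride=2, padding=1)
        self.oconv_clas = nn.Conv2d(256, 21, 3, padding=1)
        self.oconv_bbox = nn.Conv2d(256, 4, 3, padding=1)

    def forward(self, x):
        x = self.sconv(x)
        clas, bbox = self.oconv_clas(x), self.oconv_bbox(x)
        # Drop into the debugger just before returning, so you can
        # interactively print x.size(), clas.size(), bbox.size().
        pdb.set_trace()
        return clas, bbox

TinyHead()(torch.randn(2, 256, 8, 8))   # run a single "batch" through it
```

At the (Pdb) prompt you can then type things like `clas.size()` and check each shape against your from-first-principles calculation.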
Then we talked about increasing k, which is the number of anchor boxes for each convolutional grid cell, which we can do with different zooms and different aspect ratios. That gives us a plethora of activations, and therefore predicted bounding boxes, which we then whittle down to a small number using non-maximum suppression. And I'll try to remember to put up a link: there's a really interesting paper that one of our students told me about that I hadn't heard of. I've mentioned that non-maximum suppression is kind of hacky, kind of ugly, totally heuristic; I didn't even talk about the code because it seems hideous. So somebody actually came up with a paper recently which attempts to do an end-to-end conv net to replace that NMS piece. Nobody's created a PyTorch implementation yet, so it would be an interesting project if anybody wanted to try it.

One thing I've noticed in our study groups during the week is that not enough people are reading papers. What we are doing in class now is implementing papers; the papers are the real ground truth. And from talking to people, a lot of the reason people aren't reading papers is that a lot of people don't think they're capable of reading papers; they don't think they're the kind of people that read papers. But you are: you're here. We started looking at a paper last week, and we read the words that were in English, and we largely understood them.

So if you actually look through this picture from SSD carefully, you'll realize SSD_MultiHead.forward is not doing the same thing as the paper, and then you might think: I wonder if the paper's way is better? And my answer is: probably, because SSD_MultiHead.forward was the first thing I tried, just to get something out there. Between this and the YOLO v3 paper and so on, there are probably much better ways. One thing you'll notice in particular is they use a smaller k, but they have a lot more sets of grids: 1x1, 3x3, 5x5, 10x10, 19x19, and 38x38, which comes to about 8,700 per class, a lot more than we had. So that's an interesting thing to experiment with. Another thing I noticed is that whereas we had 4x4, 2x2, 1x1, which means there's a lot of overlap (every set fits within every other set), in their case, where you've got 1, 3, 5, you don't have that overlap, so it might actually make it easier to learn. So there are lots of interesting things you can play with, based on either trying to make it closer to the paper, or thinking about other things you could try that aren't in the paper.

Perhaps the most important thing I would recommend is to put the code and the equations next to each other. Yes, Rachel?

"There was a question of whether you could speak about the use_clr argument in the fit function."

We will get there. So: put the code and the equations from the paper next to each other. You're in one of two groups. You're either a code person like me, who's not that happy about math, in which case you start with the code, then you look at the math, learn how the math maps to the code, and end up eventually understanding the math. Or you did your PhD in stochastic differential equations, like Rachel, whatever that means, in which case you can look at the math and then learn how the code maps to it. Either way, unless you're one of those rare people who is equally comfortable in both worlds, you'll learn about one from the other. Now, learning about code is pretty easy, because there's documentation and we know how to look it up. Sometimes learning the math is hard, because the notation might seem hard to look up, but there are actually a lot of resources. For example, the "List of mathematical symbols" page on Wikipedia is amazingly great: it has examples of the symbols, explanations of what they mean, and tells you what to search for to find out more. Really terrific. And if you Google for "math notation cheat sheet", you'll find more of these kinds of resources. So over time you do need to learn the notation, but as you'll see from the Wikipedia page, there's not actually that much of it. Obviously there are a lot of concepts behind it, but once you know the notation, you can quickly look up the concept as it pertains to the particular thing you're studying. Nobody learns all of math and then starts learning machine learning. Everybody, even the top researchers I know, when they're reading a new paper, will very often come to bits of math they haven't seen before, and they'll have to go away and learn that bit of math.

Another thing you should try doing is to recreate things that you see in the papers.
So here was the key figure, figure 1 from the focal loss paper (the RetinaNet paper). Recreate it. Very often I put these challenges up on the forums, so keep an eye on the lesson threads. I put this challenge up there, and within about three minutes Sarada had said "I've done it", in Microsoft Excel, naturally, along with actually a lot more information than the original paper. A nice thing here is that she was able to draw a line showing, at a 0.5 ground truth probability, what the loss is for different amounts of gamma, which is kind of cool. And if you want to cheat, she's also provided Python code on the forum.

I did discover a minor bug in my code last week: the way I was flattening out the convolutional activations did not line up with how I was using them in the loss function, and fixing that actually made it quite a bit better. My motorbikes and cows and stuff are actually in the right place now, so when you go back to the notebook, you'll see it's a little less bad than it was. Okay, so that's some quick coverage of what's gone before. Yes?

"Quick question: are you going to put the PowerPoint on GitHub?"

I'll put a subset of it up, okay.

"And then secondly: usually when we downsample, we increase the number of filters or depth. When we're downsampling from 7x7 to 4x4, why are we decreasing the number from 512 to 256? Why not decrease dimensions in the SSD head? Is it performance related?"

It's largely because that's what the papers tend to do. We have a number of output paths, and we kind of want each one to be the same; we don't want each one to have a different number of filters. And also this is what the papers did, so I was trying to match up with that, with these 256. It's a different concept because we're taking advantage of not just the last layer, but the layers before that as well, and life's easier if we make them all consistent.

Okay, so we're now going to move to NLP, and let me lay out where we're going. We've seen a couple of times now, in fact in every lesson, this idea of taking a pre-trained model, whipping some stuff off the top, replacing it with some new stuff, and getting it to do something similar. And we've dived in a little bit deeper to that: with ConvLearner.pretrained, it had a standard way of sticking stuff on the top which does a particular thing, which was classification, and then we learned we can actually stick any PyTorch module we like on the end, with a custom head, and have it do anything we like. And so suddenly you discover there are some really interesting things we can do.

In fact, that reminds me: Yang Lu said, what if we did a different kind of custom head? And the different custom head was: let's take the original pictures and rotate them, and then make our dependent variable the opposite of that rotation, and see if it can learn to un-rotate the image. And this is a super useful thing; in fact, I think Google Photos nowadays has this option where it will automatically rotate your photos for you. The cool thing is, as Yang Lu shows here, you can build that network right now by doing exactly the same as our previous lesson: your custom head is one that spits out a single number, which is how much to rotate by, and your dataset has a dependent variable, which is how much you rotated by.
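As a rough sketch of what that could look like (this is not Yang Lu's actual code, just the backbone-plus-regression-head pattern under assumed names and sizes):

```python
import torch.nn as nn
from torchvision import models

# A pre-trained backbone with its classifier chopped off...
res = models.resnet34(pretrained=True)
backbone = nn.Sequential(*list(res.children())[:-2])

# ...plus a custom head that spits out a single number: how much to rotate.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 1),   # predicted rotation angle, e.g. in degrees
)
model = nn.Sequential(backbone, head)
# Train it as a regression problem (e.g. with L1 loss) against how much
# each image was actually rotated before being fed in.
```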
So you suddenly realize that with this idea of a backbone plus a custom head, you can do almost anything you can think of. So today we're going to look at the same idea and ask: how does that apply to NLP? Then in the next lesson, we're going to go further and say: if NLP and computer vision let you do the same basic ideas, how do we combine the two? And we're going to learn about a model that can actually learn to find word structures from images, or images from word structures, or images from images. That will form the basis, if you wanted to go further, of doing things like going from an image to a sentence (which is called image captioning) or going from a sentence to an image, which we'll kind of make a start on.

From there, we're going to go deeper into computer vision, to think about what other kinds of things we can do with this idea of a pre-trained network plus a custom head. We'll look at various kinds of image enhancement, like increasing the resolution of a low-res photo to guess what was missing, or adding artistic filters on top of photos, or changing photos of horses into photos of zebras, and stuff like that. And then finally, that's going to bring us all the way back to bounding boxes again. To get there, we're going to first of all learn about segmentation, which is not just figuring out where a bounding box is, but figuring out what every single pixel in an image is part of: this pixel is part of a person, this pixel is part of a car. And we're going to use an idea called U-Net, and it turns out that this idea from U-Net can be applied to bounding boxes, where it's called feature pyramids (everything has to have a different name in every slightly different area). We'll use that to hopefully get even better results with bounding boxes. So that's our path from here; it's all going to build on itself, and take us into lots of different areas.

Now, for NLP: in the last part we relied on a pretty great library called torchtext, but as pretty great as it was, I've since found its limitations too problematic to keep using it. As a lot of you complained on the forums, it's pretty damn slow, partly because it's not doing parallel processing, and partly because it doesn't remember what you did last time and does it all over again from scratch. And it's hard to do fairly simple things: a lot of you were trying to get into the Toxic Comment competition on Kaggle, which was a multi-label problem, and trying to do that with torchtext... I eventually got it working, but it took me like a week of hacking away, which is kind of ridiculous. So to fix all these problems, we've written a new library called fastai.text. fastai.text is a replacement for the combination of torchtext and fastai.nlp. So don't use fastai.nlp anymore; that's obsolete. It's slower, it's more confusing, it's less good in every way. But there are a lot of overlaps: intentionally, a lot of the classes and functions have the same names. This is just the non-torchtext version.

Okay, so we're going to work with IMDB again.
For those of you who have forgotten, go back and check out lesson 4. Basically, this is a dataset of movie reviews, and you'll remember we used it to find out whether we might enjoy Zombiegeddon or not (we thought: probably my kind of thing). So we're going to use the same dataset. When you download it, it calls itself aclImdb; this is just the raw dataset. And as you can see, I'm doing `from fastai.text import *`: there's no torchtext, and I'm not using fastai.nlp. I'm using pathlib as per usual; we're going to learn about what these tags are later.

You might remember the basic path for NLP is that we have to take sentences and turn them into numbers, and there are a couple of steps to get there. At the moment, somewhat intentionally, fastai.text doesn't provide that many helper functions; it's really designed more to let you handle things in a fairly flexible way. As you can see here, I wrote something called get_texts, which goes through each thing in classes, and these are the three classes they have in IMDB: negative, positive, and then another folder, unsupervised, for the stuff they haven't gotten around to labeling yet, which I'm just going to treat as a class as well. So I go through each one of those classes, find every file in the folder with that name, open it up, read it, and chuck it onto the end of this array. As you can see, with pathlib it's super easy to grab stuff and pull it in, and the label is just whatever class I'm up to so far. So I do that for the train bit, and I do that for the test bit. There are 75,000 in train and 25,000 in test, and 50,000 of the train ones are unsupervised; we won't actually be able to use them when we get to the classification piece. I actually find this much easier than the torchtext approach of having lots of layers and wrappers and stuff, because in the end, reading text files is not that hard.

One thing that's always a good idea is to sort things randomly, and it's useful to know this simple trick for doing so, particularly when you've got multiple things you have to sort the same way; in this case, I've got labels and texts. np.random.permutation: if you give it an integer, it gives you a random ordering of the numbers from zero up to (and not including) the number you gave it. You can then pass that in as an indexer, to give you a list sorted in that random order. So in this case, it's going to sort the train texts and train labels in the same random way. It's a useful little idiom.
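Here's the idiom in isolation:

```python
import numpy as np

trn_texts  = np.array(['good movie', 'bad movie', 'great', 'awful'])
trn_labels = np.array([1, 0, 1, 0])

# One permutation, used as an indexer into both arrays, so texts and
# labels end up shuffled the same way and stay lined up with each other.
idx = np.random.permutation(len(trn_texts))
trn_texts, trn_labels = trn_texts[idx], trn_labels[idx]
```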
So now that I've got my texts and my labels sorted, I can go ahead and create a DataFrame from them. Why am I doing this? Well, the reason is that there is a somewhat standard approach starting to appear for text classification datasets, which is to have your training set as a CSV file with the labels first and the text of the NLP documents second, in a train.csv and a test.csv. So it basically looks like this: you've got your labels and your texts, plus a file called classes.txt which just lists the classes. I say "somewhat standard" because in a reasonably recent academic paper, Yann LeCun and a team of researchers looked at quite a few datasets and used this format for all of them, so that's what I've started using as well, in my recent paper. You'll find that if you put your data into this format, the whole notebook will work, every time. So rather than having a thousand different formats and readers and writers and whatever, I've just said: let's pick a standard format, and your job (you're all coders, you can do it perfectly well) is to put your data in that format, which is the CSV file. The CSV files have no header, by default.

Now, you'll notice at the start here that I had two different paths: one was the classification path, one was the language model path. In NLP you'll see "LM" all the time; LM means language model. The classification path is going to contain the information we're going to use to create a sentiment analysis model; the language model path is going to contain the information we need to create a language model. So they're a little bit different. One difference is that when we create the train.csv in the classification path, we remove everything that has a label of 2, because a label of 2 means "unsupervised", and we can't use unsupervised data for the classifier. So that one is going to have 25,000 positive and 25,000 negative. The second difference is the labels: for the classification path, the labels are the actual labels, but for the language model, there are no labels, so we just use a bunch of zeros; that just makes things a little easier, because we can use a consistent DataFrame/CSV format.

Now, for the language model, we can create our own validation set, and you've probably come across by now sklearn.model_selection.train_test_split, which is a really simple little function that grabs a dataset and randomly splits it into a training set and a validation set, according to whatever proportion you specify. In this case, I concatenate my classification training and validation sets together, so it's going to be 100,000 altogether, and split it by 10%; so now I've got 90,000 training and 10,000 validation for my language model. So go ahead and save that. That's my basic "get the data into a standard format" step, for both my language model and my classifier.
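Roughly what that looks like in code; variable names like trn_texts, trn_labels, and CLAS_PATH are the notebook's, so treat them as assumptions if you're adapting this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels first, text second, no header: the format the rest of
# the notebook expects.
df_trn = pd.DataFrame({'labels': trn_labels, 'text': trn_texts},
                      columns=['labels', 'text'])
df_trn.to_csv(CLAS_PATH / 'train.csv', header=False, index=False)

# For the language model there are no labels, and we carve out our
# own 10% validation split from everything we've got.
trn_texts_lm, val_texts_lm = train_test_split(
    np.concatenate([trn_texts, val_texts]), test_size=0.1)
```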
The next thing we need to do is tokenization. Tokenization means that at this stage, for each document (each review) we've got a big long string, and we want to turn it into a list of tokens, which are kind of like a list of words, but not quite. For example, we want "don't" to become "do n't", and we probably want a full stop to be a token, and so forth. Tokenization is something we pass off to a terrific library called spaCy: partly terrific because an Australian wrote it, and partly terrific because it's good at what it does. We've put a bit of stuff on top of spaCy, but the vast majority of the work is being done by spaCy.

Before we pass the text to spaCy, I've written this simple fixup function. Each time I've looked at a different dataset, and I've looked at about a dozen in building this, every one had different weird things that needed to be replaced. So here are all the ones I've come up with so far: I unescape all the HTML entities, and then there's a bunch more things I replace. Hopefully this will help you out as well. Have a look at the result of running this on text that you put in, and make sure there aren't more weird tokens in there; it's amazing how many weird things people do to text.

So I've got this function called get_all, which is going to call get_texts, and get_texts is going to do a few things, one of which is to apply that fixup we just mentioned. Let's look through this, because there are some interesting things to point out. I use pandas to open our train.csv from the language model path, but I'm passing in an extra parameter you may not have seen before, called chunksize. Python and pandas can both be pretty inefficient when it comes to storing and using text data, and you'll see that very few people in NLP are working with large corpora; I think part of the reason is that traditional tools have just made it really difficult: you run out of memory all the time. So this process I'm showing you today, I have used on corpora of over a billion words, successfully, using this exact code. One of the simple tricks is to use this chunksize thing with pandas. What it means is that pandas does not return a DataFrame, but an iterator that we can iterate through to get chunks of a DataFrame. That's why I don't say tok_trn = get_texts(...); instead I call get_all, which loops through the DataFrame, but what it's really doing is looping through chunks of the DataFrame, and each of those chunks is basically a DataFrame representing a subset of the data.

"When I'm working with NLP data, many times I come across data with foreign text or characters. Is it better to discard them or keep them?"

No, definitely keep them. This whole process is Unicode, and I've actually used this on Chinese text; it's designed to work on pretty much anything. In general, most of the time it's not a good idea to remove anything. Old-fashioned NLP approaches tend to do all this lemmatization and all these normalization steps (lowercase everything, blah blah blah), but that's throwing away information, and you don't know ahead of time whether it's useful or not. So don't throw it away.
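A simplified sketch of that chunked-reading pattern (the real get_all also applies the fixup and tokenizes each chunk):

```python
import pandas as pd

chunksize = 24000

def get_all(path):
    texts, labels = [], []
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so we never hold the whole corpus in memory at once.
    for df in pd.read_csv(path, header=None, chunksize=chunksize):
        labels += df[0].tolist()              # first column: the label
        texts += df[1].astype(str).tolist()   # second column: the text
    return texts, labels
```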
So we go through each chunk, each of which is a DataFrame, and we call get_texts. get_texts is going to grab the labels and make them into ints, and grab the texts, and I'll point out a couple of things. The first is that before we include the text, we have this "beginning of stream" (BOS) token, which you might remember we used way back up here. There's nothing special about these particular strings of letters; they're just ones I figured don't appear in normal text very often. So every text is going to start with "xbos". Why? Because it's often really useful for your model to know when a new text is starting. For example, in a language model we're going to concatenate all the texts together, so it's really helpful for it to know "this article is finished and a new one has started, so I should probably forget some of that context now". Similarly, texts quite often have multiple fields, like a title, an abstract, and then the main document, so by the same token I've got this thing here which lets us have multiple fields in our CSV. This process is designed to be very flexible, and at the start of each one we put a special "field starts here" token, followed by the number of the field that's starting, for as many fields as we have. Then we apply our fixup to it, and then, most importantly, we tokenize it, and we tokenize it with multiprocessing.

Tokenizing tends to be pretty slow, but we've all got multiple cores in our machines now, and some of the better machines on AWS and the like can have dozens of cores; here on our university computer, we've got 56. spaCy is not very amenable to multiprocessing, but I finally figured out how to get it to work, and the good news is it's all wrapped up in one function now. All you need to pass to that function is a list of things to tokenize, and each part of that list will be tokenized on a different core. I've also created this function called partition_by_cores, which takes a list and splits it into sublists, where the number of sublists is the number of cores in your computer. On my machine, without multiprocessing, this takes about an hour and a half; with multiprocessing, it takes about two minutes. So it's a really handy thing to have, and now that this code's here, feel free to look inside it and take advantage of it in your own stuff. Remember, we all have multiple cores even in our laptops, and very few things in Python take advantage of them unless you make a bit of an effort to make it work.
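Here's a simplified, self-contained version of the idea. This is not fastai's actual implementation (which wraps spaCy), but it's the same split-per-core-then-flatten pattern, using Python's standard multiprocessing:

```python
from multiprocessing import Pool, cpu_count

def partition_by_cores(a):
    """Split a list into roughly equal sublists, one per CPU core."""
    n = cpu_count()
    sz = (len(a) + n - 1) // n
    return [a[i:i + sz] for i in range(0, len(a), sz)]

def tokenize_chunk(texts):
    # Stand-in for the real spaCy-based tokenizer.
    return [t.split() for t in texts]

def tokenize_all(texts):
    # Each sublist gets tokenized on a different core, then we flatten.
    with Pool(cpu_count()) as pool:
        results = pool.map(tokenize_chunk, partition_by_cores(texts))
    return [tok for chunk in results for tok in chunk]
```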
So there are a couple of tricks to get things working quickly and reliably; as it runs, it prints out how it's going. And here's the result at the end: the beginning-of-stream token, the beginning-of-field-number-1 token, then the tokenized text. You'll see that the punctuation is, on the whole, now a separate token. And you'll see a few interesting little things. One is this: what's this "t_up"? Well, "t_up mgm": MGM, obviously, was originally capitalized. The interesting thing is that normally people either lowercase everything or leave the case as is. But if you leave the case as is, then "SCREW YOU" in all caps and "screw you" in lowercase are two totally different sets of tokens that have to be learned from scratch; and if you lowercase them all, then there's no difference at all between the two. So how do you fix this, so that you get the semantic impact of "I'm shouting now", but without every single word having to learn a shouted version as well as a normal version? The idea I came up with (and I'm sure other people have done this too) is to introduce a unique token meaning "the next thing is all uppercase", and then lowercase the word. So whatever used to be uppercase is now lowercase, just one token, and then we can learn the semantic meaning of "all uppercase". And I've done a similar thing for repetition: if you've got 29 exclamation marks in a row, we don't learn a separate token for 29 exclamation marks; instead I put in a special token for "the next thing repeats lots of times", then the number 29, then the exclamation mark. So there are a few little tricks like that, and if you're interested in NLP, have a look at the code for Tokenizer for these little tricks I've added in, because some of them are kind of fun.
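As a sketch of those two tricks: the "t_up" token name is the one from the lesson, while "tk_rep" here is just an illustrative name for the repetition marker:

```python
import re

TOK_UP, TOK_REP = 't_up', 'tk_rep'

def mark_caps(tokens):
    # "MGM" -> ["t_up", "mgm"]: one lowercase token plus a marker, so
    # the shouted form shares its semantics with the normal word.
    out = []
    for t in tokens:
        if t.isupper() and len(t) > 1:
            out.append(TOK_UP)
        out.append(t.lower())
    return out

def mark_repeats(text):
    # "!!!!" -> " tk_rep 4 ! ": a "repeats" marker, the count, the char.
    return re.sub(
        r'(\S)(\1{2,})',
        lambda m: f' {TOK_REP} {len(m.group(2)) + 1} {m.group(1)} ',
        text)
```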
Okay, so the nice thing with doing things this way is we can now just np.save that and load it back up later; we don't have to recalculate all this stuff each time, like we tend to have to do with torchtext and a lot of other libraries.

So we've now got it tokenized. The next thing we need to do is turn it into numbers, which we call numericalizing. The way we numericalize is very simple: we make a list of all the words that appear, in some order, and then we replace every word with its index into that list. The list of all the words (or rather, all the tokens) that appear, we call the vocabulary. Here's an example of some of the vocabulary; the Counter class in Python is very handy for this, as it basically gives us a list of unique items and their counts. So here are the 25 most common tokens in the vocabulary; you can see there are things like apostrophe-s, double quote, end-of-paragraph, and stuff like that.

Now, generally speaking, we don't want every unique token in our vocabulary. If a token doesn't appear at least two times, it might just be a spelling mistake, or a word we can't learn anything about anyway, since it doesn't appear often enough. Also, the stuff we're going to be learning about, at least so far in this part, gets a bit clunky once you've got a vocabulary bigger than 60,000. (Time permitting, we may look at some work I've been doing recently on handling larger vocabularies; otherwise that might have to come in a future course.) But actually, for classification, I've discovered that more than about 60,000 words doesn't seem to help anyway, so we're going to limit our vocabulary to 60,000 words: things that appear at least twice. Here's a simple way to do that: use that .most_common, passing in the maximum vocab size (that'll sort by frequency, by the way), and if a token appears less often than the minimum frequency, don't bother with it at all. So that gives us itos (the same name that torchtext used; it means "int to string"), which is just the list of the unique tokens in the vocab. And I'm going to insert two more tokens: a vocab item for unknown ("_unk_") and a vocab item for padding ("_pad_").

Then we can create the dictionary which goes in the opposite direction, string to int (stoi). That won't cover everything, because we intentionally truncated it down to 60,000 words; so if we come across something that's not in the dictionary, we want to replace it with 0, for unknown, and we can use a defaultdict for that, with a lambda function that always returns zero. (You can see how all these little idioms keep coming back up.) So now that we've got our stoi dictionary defined, we can just apply it to every word, in every sentence, and there's our numericalized version. And of course the nice thing is, again, we can save that step as well; each time we get to another step, we can save it. These are not very big files: compared to what you're used to with images, text is generally pretty small. It's also very important to save the vocabulary, because this list of numbers means nothing unless you know what each number refers to, and that's what itos tells you. So you save those three things, and then later on you can load them back up.

So now our vocab size is 60,002, and our training language model has 90,000 documents in it. That's the pre-processing. We can probably wrap a little bit more of that in little utility functions if we want to, but it's all pretty straightforward, and basically that exact code will work for any dataset you have, once you've got it in that CSV format.
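Pulling those numericalization steps together, a minimal sketch (tok_trn is assumed to be the tokenized training set from the previous step):

```python
import collections

max_vocab, min_freq = 60000, 2

# tok_trn: a list of tokenized documents (lists of strings).
freq = collections.Counter(t for sent in tok_trn for t in sent)

# Keep the most common tokens that appear at least twice...
itos = [t for t, c in freq.most_common(max_vocab) if c >= min_freq]
# ...plus special vocab items for padding and unknown.
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

# The reverse mapping; anything out-of-vocab maps to 0, i.e. '_unk_'.
stoi = collections.defaultdict(lambda: 0, {t: i for i, t in enumerate(itos)})

# Numericalize: every token becomes its index into the vocabulary.
trn_lm = [[stoi[t] for t in sent] for sent in tok_trn]
```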
So here is a kind of new insight that's not new at all, which is: we'd like to pre-train something. We know from lesson 4 that if we pre-train our classifier by first creating a language model and then fine-tuning that as a classifier, that was helpful. Remember, it actually got us a new state-of-the-art result: we got the best IMDB classifier result that had ever been published, by quite a bit. But we're not going far enough, because IMDB movie reviews are not that different from any other English documents, compared to how different they are from a random string, or even from a Chinese document. So just like ImageNet allowed us to train things that recognize stuff that kind of looks like pictures, and we could use it on stuff that had nothing to do with ImageNet (like satellite images), why don't we train a language model that's just good at English, and then fine-tune it to be good at movie reviews?

This basic insight led me to try building a language model on Wikipedia. My friend Stephen Merity has already processed Wikipedia and found a subset of nearly all of it, throwing away the little stub articles and keeping most of the bigger ones, and he calls that wikitext-103. So I grabbed wikitext-103 and trained a language model on it, using exactly the same approach I'm about to show you for training an IMDB language model, except that I trained a wikitext-103 language model instead. Then I saved it, and I've made it available for anybody who wants to use it at this URL. So this is not a URL for the wikitext-103 documents; this is the wikitext-103 language model. And the idea now is: let's train an IMDB language model which starts with those weights.

Now, hopefully to you folks this is an extremely obvious, extremely non-controversial idea, because it's basically what we've done in nearly every class so far. But when I first mentioned this to people in the NLP community, I guess around June or July of last year, they couldn't have been less interested. I asked on Twitter, where a lot of the top NLP researchers are people I follow and who follow me back: hey, what if we pre-trained a general language model? And they were like, no, language is different, you can't do that; or, I don't know why you would bother anyway. I talked to people at conferences, and they were like, I'm pretty sure people have tried that and it's stupid. There was just this weird brush-off. And I guess because I'm arrogant, I ignored them, even though they know much more about NLP than I do, and just tried it anyway. Let me show you what happened.

So here's how we do it. Grab the wikitext model; if you use wget -r, it'll recursively grab the whole directory, which has a few things in it. We need to make sure that our language model has exactly the same embedding size, number of hidden units, and number of layers as my wikitext one did, otherwise you can't load the weights in; those are these numbers here. So here's our pre-trained path, here's our pre-trained language model path, and we go ahead and torch.load in the weights from the forward wikitext-103 model. We don't normally use torch.load, but that's the PyTorch way of grabbing a file, and it basically gives you a dictionary containing the name of each layer and a tensor (an array) of its weights.

Now, here's the problem: that wikitext language model was built with a certain vocabulary, which was not the same as ours. My word number 40 was not the same as wikitext-103's word number 40. So we need to map one to the other. That's very, very simple, because luckily I saved the itos for the wikitext vocab.
So here's the list of what each word is, in order, from when I trained the wikitext-103 model, and we can do the same defaultdict trick to map it in reverse; I'm going to use -1 to mean "not in the wikitext dictionary". So now I can say: my new set of weights is a whole bunch of zeros, of vocab size by embedding size; we're going to create an embedding matrix. Then I go through every one of the words in my IMDB vocabulary, look it up in stoi2 (the string-to-int for the wikitext-103 vocabulary), and see if that word is there. If it is, then I won't get that -1, so r will be greater than or equal to zero, in which case I just set that row of the embedding matrix to the weight stored inside this named element. (You can look at this dictionary of names, and it's pretty obvious what each one corresponds to, because they look very similar to the names you gave things when you set up your module; so here are the encoder weights.) If I don't find the word, then I use the row mean; in other words, the average embedding weight across all of the wikitext-103 rows.

So it's pretty simple: I'm going to end up with an embedding matrix for every word in my IMDB vocab. For words that are also in the wikitext-103 vocab, I'll use wikitext-103's embedding weights; for anything else, I'll just use the average weight from the wikitext-103 embedding matrix. Then I go ahead and replace the encoder weights with that, turned into a tensor. We haven't talked much about weight tying (we might do so later), but basically the decoder, the thing that turns the final prediction back into a word, uses exactly the same weights, so I pop the matrix there as well. And then there's a bit of a weird thing with how we do embedding dropout that ends up with a whole separate copy of them, for a reason that doesn't matter much here, so we just pop the weights back wherever they need to go. So this is now a set of torch state which we can load in.
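Here's roughly what that remapping looks like. wgts, itos, stoi2, vs, and em_sz are the notebook's variables, and the exact layer-name keys may differ in your version, so treat them as assumptions:

```python
import numpy as np
import torch

em_sz = 400                               # embedding size of the wikitext model
enc_wgts = wgts['0.encoder.weight'].numpy()
row_m = enc_wgts.mean(0)                  # mean embedding, for unseen words

new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i, w in enumerate(itos):              # our IMDB vocabulary
    r = stoi2[w]                          # wikitext-103 index, or -1 if absent
    new_w[i] = enc_wgts[r] if r >= 0 else row_m

t = torch.from_numpy(new_w)
wgts['0.encoder.weight'] = t              # the embedding itself
wgts['0.encoder_with_dropout.embed.weight'] = t.clone()  # the dropout copy
wgts['1.decoder.weight'] = t.clone()      # the tied decoder weights
```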
So let's go ahead and create our language model. The basic approach we're going to use (and I'll look at this in more detail in a moment) is to concatenate all of the documents together into a single list of tokens, of length 24.998 million. That's what I pass in as my training set: for a language model, we basically just take all our documents and concatenate them back to back, and we're going to be continuously trying to predict what the next word is after these words. I also set up a whole bunch of dropout; we'll look at that in detail in a moment. Once we've got a model data object, we can then grab the model from it, and that's going to give us a learner. And then, as per usual, we can call learner.fit. We first of all, as per usual, do a single epoch on just the last layer, to get that okay; and the way I've set it up, the "last layer" is actually the embedding weights, because that's obviously the thing that's going to be most wrong, since a lot of those embedding weights didn't even exist in the pre-trained vocab. So we train a single epoch of just the embedding weights, and then we start doing a few epochs of the full model.

And how does that look? Well, in lesson 4, which was our academic world's-best-ever result, after 14 epochs we had a 4.23 loss. Here, after one epoch, we have a 4.12 loss. In fact, let's have a look: we kept training and training at a different rate, and eventually got to 4.16. So by pre-training on wikitext-103, we have a better loss after one epoch than the best loss we got for the language model otherwise. Yes, Rachel?

"What is the wikitext-103 model? Is it AWD LSTM again?"

Yeah, and we're about to dig into that. The way I trained it was literally the same lines of code you see here, just without the pre-trained starting point. Okay, so let's take a 10-minute break, come back at 7:40, and we'll dig in and have a look at these models.

Okay, welcome back. Before we go back into language models and NLP classifiers, a quick discussion about something pretty new at the moment: the fastai doc project. The goal of the fastai doc project is to create documentation that makes readers say, "Wow, that's the most fantastic documentation I've ever read." And we have some specific ideas about how to do that, but it's the same kind of idea: top-down, thoughtful, take full advantage of the medium; the interactive, experimental, code-first approach we're all familiar with. If you're interested in getting involved, you can see the basic approach in the docs directory; this is the README in the docs directory. In there, amongst other things, is a transforms template .adoc file. What the hell is an adoc? adoc is AsciiDoc. How many people here have come across AsciiDoc? (People are laughing, because exactly one hand went up, and it belongs to somebody who was in our study group today and talked to me about AsciiDoc.) AsciiDoc is the most amazing project. It's like Markdown, but it's what Markdown needs to be to create actual books; and a lot of actual books are written in AsciiDoc. It's as easy to use as Markdown, but there's way more cool stuff you can do with it. In fact, here is an AsciiDoc file, and as you'll see, it looks very normal: there are headings, this is pre-formatted text, there are lists and whatever else; pretty standard. But you can also do things like say "put a table of contents here, please"; a double colon means "put a definition list here, please"; a plus means "this is a continuation of the previous list item". So there are little things you can do which are super handy, like "make this thing slightly smaller than everything else". It's like turbocharged Markdown. And this AsciiDoc creates this HTML, and I didn't add any CSS or do anything myself; we literally started this project like four hours ago. So this is just an example, basically.
And you can see we've got a table of contents we can jump straight from, cross-references we can click on to jump straight across, each method comes along with its details, and so on and so forth. And to make things even easier, rather than having to know that the argument list is meant to be smaller than the main part, or how you create a cross-reference, or how you're meant to format the arguments to the method name and list out each one, we've created a special template where you can just write various stuff in curly brackets: "please put the arguments here", "here is an example of one argument", "here is a cross-reference", "here is a method", and so forth. We're in the process of documenting the documentation template, but there are basically five or six of these little curly-bracket things you'll need to learn, and to create the documentation for a class or a method, you can just copy one that's already there. The idea is that it'll almost be like a book: there'll be tables and pictures and little video segments and hyperlinks throughout, and all that stuff.

You might be wondering: what about docstrings? But actually, I don't know if you've noticed, if you look at the Python standard library and look at the docstring for, say, re.compile, it's a single line. Nearly every docstring in Python is a single line, and Python then does exactly this: they have a website containing the documentation that says, "Hey, this is what regular expressions are and this is what you need to know about them, and if you want them to go faster, you need to use compile, and here's lots of information about compile, and here are the examples." It's not in the docstring. And that's how we're doing it as well: our docstrings will be one line, except when you need two, sometimes, and it's going to be very similar to Python, but even better. So everybody is welcome to help contribute to the documentation, and hopefully, by the time you're watching this on the MOOC, it'll be reasonably fleshed out, and we'll try to keep a list of things to do.

All right, one question that came up in the break was: how does this compare to word2vec? This is actually a great thing for you to spend time thinking about during the week. I'll give you the summary now, but it's a very important conceptual difference. What is word2vec? word2vec is a single embedding matrix: each word has a vector, and that's it. In other words, it's a single layer from a pre-trained model, and specifically, that layer is the input layer. And also specifically, that pre-trained model is a linear model, pre-trained on something called a co-occurrence matrix. So we have no particular reason to believe this model has learned anything much about the English language, or that it has any particular capabilities, because it's just a single linear layer, and that's it. Whereas what's this wikitext-103 model? It's a language model. It has a 400-dimensional embedding matrix, three hidden layers with 1,150 activations per layer, regularization and all of that stuff, and tied input-output matrices. It's basically a state-of-the-art AWD LSTM. So what's the difference between a single layer of a single linear model and a three-layer recurrent neural network?
Everything. They're very different levels of capability, and you'll see, when you try using a pre-trained language model versus a word2vec layer, you'll get very, very much better results for the vast majority of tasks.

"What if the NumPy array does not fit in memory? Is it possible to write a PyTorch data loader directly from a large CSV file?"

It almost certainly won't come up, so I'm not going to spend time on it. These things are tiny; they're just ints. Think about how many ints you would need to run out of memory; it's not going to happen. They don't have to fit in GPU memory, just in your RAM. I've actually done another Wikipedia model, which I called gigawiki, on all of Wikipedia, and even that fits in memory. The reason I'm not using it is that it turned out not to help much versus wikitext-103, but I've built a bigger model than pretty much anything I've found in the academic literature, and it fits in memory on a single machine.

"What is the idea behind averaging the weights of the embeddings?"

Well, they're going to have to be set to something; these are words that weren't in the pre-trained vocab. The other options are: we could leave them at zero, but that seems like a very extreme thing to do; zero is a very extreme number, and why would it be zero? Or we could set them to some random numbers, but if so, what would be the mean and standard deviation of those random numbers, or should they be uniform? If we just average the rest of the embeddings, then we have something with a reasonable scale.

"Just to clarify: this is how you're initializing words that didn't appear in the training set?"

Yeah, thanks Rachel, that's right.

"And then, I think you've pretty much just answered this one, but someone had asked if there's a specific advantage to creating our own pre-trained embedding over using GloVe or word2vec."

Yeah, I think I have: we're not creating a pre-trained embedding; we're creating a pre-trained model.

Okay, so let's talk a little bit more about this. It's kind of stuff we've seen before, but it's changed a little bit; it's actually a lot easier than it was in part 1, but I want to go a little deeper into the language model loader. So this is LanguageModelLoader, and I really hope that by now you've learned how to jump to symbols in your editor or IDE. I don't want it to be a burden for you to find the source code of LanguageModelLoader, and if it's still a burden, please go back and try to learn those keyboard shortcuts in VS Code. If your editor doesn't make it easy, don't use that editor anymore; there are lots of good free editors that make this easy.

So here's the source code for LanguageModelLoader, and it's interesting to notice that it's not doing anything particularly tricky; it's not deriving from anything at all. What makes it something that's capable of being a data loader is that it's something you can iterate over. Specifically, here's the fit function inside fastai.model; this is where everything ends up eventually. It goes through each epoch, creates an iterator from the data loader, and then just does a for-loop through it. So anything you can do a for-loop through can be a data loader.
Specifically, it needs to return tuples of mini-batches: independent and dependent variables. Anything with a dunder-iter method is something that can act as an iterator, and yield is a neat little Python keyword you should probably learn about if you don't already know it; it basically spits out a thing, and then waits for you to ask for the next one, normally in a for-loop or something.

So in this case, we start by initializing the language model, passing in the numbers: this is the numericalized, big long list of all of our documents concatenated together. And the first thing we do is "batchify" it, and this is the thing which quite a few of you got confused about last time. If our batch size is 64, and we have about 25 million numbers in our list, we are not creating items of length 64. We're creating 64 items in total, so each of them is of size t divided by 64, which is about 390,000. That's what we do here when we reshape it, so that this axis here is of length 64, and this -1 is everything else (that's the 390,000-ish), and then we transpose it. That means we now have 64 columns and 390,000 rows. Then what we do each time we iterate is grab one batch of some sequence length; we'll look at the details in a moment, but it's approximately equal to bptt, which we set to 70 (it stands for "backprop through time"). We just grab that many rows, from i to i+70, and then we try to predict that plus one; remember, we're trying to predict one step past where we're up to. So we've got 64 columns, each of which is one sixty-fourth of our roughly 25 million tokens, hundreds of thousands long, and we just grab 70 at a time; so each of those columns, each time we grab it, kind of hooks up to the previous batch. And that's why we get this consistency: this language model is stateful, which is really important.
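Here's the batchify idea on its own:

```python
import numpy as np

def batchify(nums, bs=64):
    # Trim to a multiple of bs, make bs *columns* (not rows!), transpose.
    nb = len(nums) // bs
    return np.array(nums[:nb * bs]).reshape(bs, -1).T   # shape: (nb, bs)

# Each minibatch is then ~bptt consecutive rows of this matrix as x,
# and the same rows shifted down one token as y.
```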
Pretty much all the cool stuff in the language model is stolen from Stephen Merity's AWD LSTM, including this little trick. If we always grab 70 rows at a time, and then go back and do a new epoch, we're going to grab exactly the same batches every time; there's no randomness. Now, normally we shuffle our data every epoch, or grab it at random, but you can't do that with a language model, because each batch has to join up to the previous one: it's trying to learn the sentence, and if you suddenly jump somewhere else, that doesn't make any sense as a sentence.

So Stephen's idea is: okay, since we can't shuffle the order, let's instead randomly change the sequence length. Basically, 95% of the time we'll use bptt, i.e. 70, but 5% of the time we'll use half that. And then he says: you know what, I'm not even going to make that the sequence length; I'm going to create a normally distributed random number with that as the average and a standard deviation of 5, and make that the sequence length. So the sequence length is 70-ish, and that means every time we go through, we're getting slightly different batches; we've got that little bit of extra randomness. I asked Stephen Merity where he came up with this idea: did he think of it? And he was like, "I think I thought of it, but it seemed so obvious that I bet I didn't think of it", which is true of every time I come up with an idea in deep learning: it always seems so obvious that you assume somebody else has thought of it. But I think he thought of it.

So this is a nice thing to look at if you're trying to do something a bit unusual with a data loader: it's a simple role model you can use for creating a data loader from scratch, something that spits out batches of data.
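And here's a sketch of that randomized sequence-length loop. The exact constants and edge cases in fastai's LanguageModelLoader may differ, but this is the shape of it:

```python
import numpy as np

bptt = 70

def lm_batches(data):                  # data: the (n, bs) matrix from batchify
    i = 0
    while i < len(data) - 1:
        # 95% of the time centre the length on bptt, 5% on half of it...
        base = bptt if np.random.random() < 0.95 else bptt / 2
        # ...then draw the actual length from a normal with std dev 5.
        seq_len = max(5, int(np.random.normal(base, 5)))
        x = data[i:i + seq_len]
        y = data[i + 1:i + 1 + seq_len]   # the same rows, one token ahead
        yield x, y
        i += seq_len
```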
Generally speaking, we want to create a learner, and the way we normally do that is by getting a model data object and calling some kind of method, which has various names, but often we call that method get_model. The idea is that the model data object has enough information to know what kind of model to give you. So we have to create that model data object, which means we need that class, and that's very easy to do. Here are all of the pieces: we're going to create a custom learner, a custom model data class, and a custom model class. The model data class doesn't inherit from anything, so you can really see there's almost nothing to do. You need to tell it, most importantly, what's your training set (give it a data loader), what's your validation set (give it a data loader), and optionally give it a test set (a data loader), plus anything else it needs to know. It might need to know the bptt; it needs to know the number of tokens, that's the vocab size; it needs to know the padding index; and, so that it can save temporary files and models, model datas always need to know the path. So we just grab all that stuff and dump it. That's the entire initialiser; there's no logic there at all. Then all of the work happens inside get_model, and get_model calls something we'll look at later which just grabs a normal PyTorch nn.Module architecture and chucks it on the GPU. Note that with plain PyTorch we would normally say .cuda(); with fastai it's better to say to_gpu, because if you don't have a GPU it will leave the model on the CPU, and it also provides a global variable you can set to choose whether it goes on the GPU or not. So it's a better approach. We wrap the model in a LanguageModel, and a LanguageModel is a subclass of BasicModel which does almost nothing except define layer groups. Remember how, when we do discriminative learning rates, where different layers get different learning rates, or when we freeze different amounts, we don't provide a different learning rate for every layer, because there can be, like, a thousand layers; we provide a different learning rate for every layer group. So when you create a custom model, you just have to override this one thing, which returns a list of all of your layer groups. In this case, my last layer group contains the last part of the model plus one bit of dropout, and the rest of it (this star here means pull the list apart) is basically one layer group per RNN layer. So that's all that is. Then, finally, you turn that into a learner: you just pass in the model, and it turns it into a learner. In this case we have overridden Learner, and the only thing we've done is to say: I want the default loss function to be cross-entropy. So this entire set, custom model, custom model data, custom learner, all fits on a single screen, and they always basically look like this. That's a little dig inside the pretty boring part of the code base.
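A sketch of what that layer-group override might look like. It is patterned on fastai's LanguageModel; attribute names like rnns, dropouths, and dropouti follow the fastai RNN encoder, but treat the whole thing as illustrative rather than the exact library code.

```python
class MyLanguageModel:
    """Sketch of a model wrapper that defines layer groups (illustrative)."""
    def __init__(self, model):
        self.model = model   # a Sequential: [RNN encoder, linear decoder]

    def get_layer_groups(self):
        enc, dec = self.model[0], self.model[1]
        # one group per RNN layer (paired with its dropout), then the head
        # plus one bit of dropout as the final group
        return [*zip(enc.rnns, enc.dropouths), (dec, enc.dropouti)]
```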
The interesting part of this code base is get_language_model, because get_language_model is the thing that actually gives us our AWD-LSTM, and it contains the big, incredibly simple idea that everybody here thinks is really obvious, but that everybody in the NLP community I spoke to thought was insane: basically, every model can be thought of as a backbone plus a head, and if you pre-train the backbone and stick on a random head, you can do fine-tuning, and that's a good idea. These two bits of the code are literally right next to each other; this is kind of all there is inside this bit of fastai.lm_rnn. Here's get_language_model; here's get_rnn_classifier. get_language_model creates an RNN encoder and then a sequential model that sticks a linear decoder on top of it; the classifier creates an RNN encoder and then a sequential model that sticks a pooling linear classifier on top of it. We'll see what these differences are in a moment, but you get the basic idea: they're doing pretty much the same thing, the same backbone, and then they stick a simple linear head on top. Yes, Rachel? "There was a question earlier about whether any of this translates to other languages." Yeah, this whole thing works in any language you like. "Would you have to retrain your language model on a corpus from that language?" Absolutely. The wikitext-103 pre-trained language model knows English. You could maybe use it as a pre-trained start for a French or German model; starting by retraining the embedding layer from scratch might be helpful. For Chinese, maybe not so much. But given that a language model can be trained from any unlabelled documents at all, you'd never have to do that, because almost every language in the world has plenty of documents you can grab: newspapers, web pages, parliamentary records, whatever. As long as you've got a few thousand documents showing somewhat normal usage of that language, you can create a language model. One of our students (I'll have to look up the details during the week) tried this approach for Thai, and the first model he built easily beat the previous state-of-the-art Thai classifier. For those of you who are international fellows, this is an easy way to whip out a paper in which you either create the first-ever classifier in your language or beat everybody else's classifier in your language, and then you can tell them that you've been a student of deep learning for six months and piss off all the academics.

Okay, so here's our RNN encoder. It's just a standard nn.Module. Most of the text in it is actually just documentation, as you can see; it looks like there's more going on than there actually is. Really, all there is: we create an embedding layer, we create an LSTM for each layer that's been asked for, and that's it. Everything else in it is dropout. Basically all of the interesting stuff in the AWD-LSTM paper is all of the places you can put dropout. The forward is basically the same story: call the embedding layer, add some dropout, go through each layer, call that RNN layer, append it to our list of outputs, add dropout. That's about it, so it's really pretty straightforward. The paper you want to be reading, as I've mentioned, is the AWD-LSTM paper, "Regularizing and Optimizing LSTM Language Models". It's well written, pretty accessible, and entirely implemented inside fastai as well, so you can see all of the code for that paper. A lot of the code is shamelessly plagiarised, with Stephen's permission, from his excellent GitHub repo, awd-lstm-lm, and in the process I fixed a few of his bugs as well; I even told him about them. So yeah, I'm talking increasingly about "please read the papers": here's the paper, please read this paper, and it refers to other papers, for things like: why is it that the encoder weight and the decoder weight are the same? Well, it's because there's this thing called weight tying. Inside get_language_model there's a parameter called tie_weights which defaults to True, and if it's True, we literally use the same weight matrix for the encoder and the decoder; they're literally pointing at the same block of memory. Why do that? What's the result of it? That's one of the citations in Stephen's paper, which is also a well-written paper you can go and look up to learn about weight tying.
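The weight-tying trick is tiny in code. A minimal sketch of the standard PyTorch pattern (the class name and parameters are illustrative):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch of encoder/decoder weight tying (illustrative)."""
    def __init__(self, vocab_size, emb_size, tie_weights=True):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_size)
        self.decoder = nn.Linear(emb_size, vocab_size, bias=False)
        if tie_weights:
            # the two layers now share one block of memory:
            # updating one updates the other
            self.decoder.weight = self.encoder.weight
```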
So there's a lot of cool stuff in there. Okay, so we have a basically standard RNN (the only reason it's not standard is that it's got lots more types of dropout in it), and in a sequential model on top of that we stick a linear decoder, which is literally half a screen of code. It's got a single linear layer; we initialise the weights to some range; we add some dropout; and that's it. A linear layer plus dropout. So: an RNN, and on top of that a linear layer with dropout, and we're finished. That's the language model.

What dropout you choose matters a lot. Through a lot of experimentation I found a bunch of dropouts (you can see here, each of these corresponds to a particular argument) that tend to work pretty well for language models. But if you have less data for your language model, you'll need more dropout; if you have more data, you can benefit from less dropout. You don't want to regularise more than you have to. Rather than making you tune every one of these five numbers, my claim is that they're already pretty good ratios relative to each other, so just tune this one number: I just multiply them all by something. There's really just one number you have to choose: if you're overfitting, increase it; if you're underfitting, decrease it. Other than that, these ratios seem pretty good.

One important idea, which may seem pretty minor but again is incredibly controversial, is that we should measure accuracy when we look at a language model. Normally with language models we look at this loss value, which is just cross-entropy loss, but specifically we nearly always take e to the power of that, which the NLP community calls perplexity. Perplexity is just e raised to the cross-entropy. There are a lot of problems with comparing things based on cross-entropy loss. I'm not sure I've got time to go into them in detail now, but the basic problem is that it's kind of like the thing we learned about with focal loss: when you're right, cross-entropy loss wants you to be really confident that you're right, so it really penalises a model that doesn't confidently commit. Whereas accuracy doesn't care at all about how confident you are; it only cares about whether you're right, and that's much more often the thing you care about in real life. This accuracy is simply: how often do we guess the next word correctly? I just find that a much more stable number to keep track of. So that's a simple little thing that I do.

So we train for a while, and we get down to a cross-entropy loss of 3.9, and you can take e to the power of that. To give you a sense of what's happened with language models: if you look at academic papers from about 18 months ago, you'll see them talking about state-of-the-art perplexities of over a hundred. The rate at which our ability to understand language is improving is remarkable, and I think measuring language model accuracy or perplexity is not a terrible proxy for understanding language: if I can guess what you're going to say next, I pretty much need to understand the language pretty well, and also the kind of things you might talk about. This number has just come down so much. It's been amazing, NLP in the last 12 to 18 months, and it's going to come down a lot more.
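Two of those little ideas in code form. The perplexity line is exact; the dropout ratio values are placeholders for illustration, not the exact numbers from the lesson notebook.

```python
import math
import numpy as np

# Perplexity is just e raised to the (per-token) cross-entropy loss:
perplexity = math.exp(3.9)   # ~49.4 for the 3.9 loss mentioned above

# One knob instead of five: keep the dropout ratios fixed and scale them all
# together. These ratio values are placeholders, not the notebook's exact ones.
base_drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])
drops = base_drops * 0.7     # raise the multiplier if overfitting, lower if underfitting
```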
It really feels like 2011–2012 computer vision: we're just starting to understand transfer learning and fine-tuning, and these basic models are getting so much better. So everything you thought about what NLP can and can't do is very rapidly going out of date. There's still lots of stuff NLP is not good at, to be clear, just as in 2012 there was lots of stuff computer vision wasn't good at, but it's changing incredibly rapidly, and now is a very, very good time to be getting very good at NLP, or starting start-ups based on NLP, because there's a whole bunch of stuff computers were absolutely shit at two years ago that they're now not quite as good as people at, and next year they'll be much better than people at.

"Two questions. One: what is your ratio of paper reading versus coding in a week?" What do you think, Rachel? You see me. I mean, it's a lot more coding, right? "It's a lot more coding. I feel like it also really varies from week to week." Like with that bounding-box stuff: there were all these papers and no map through them, so I didn't even know which one to read first, and then I'd read the citations and didn't understand any of them. So there were a few weeks of just reading papers before I even knew what to start coding. That's unusual though. Most of the time, I don't know; any time I start reading a paper, I'm always convinced that I'm not smart enough to understand it, always, regardless of the paper, and somehow eventually I do. But yeah, I try to spend as much time as I can coding.

"And the second question: is your dropout rate the same through the training, or do you adjust it and the weights accordingly?" Let me just say one more thing about the last question first, which is that very often, the vast majority of the time, nearly always, after I've read the bit of a paper that says "this is the problem I'm trying to solve", I'll stop there and try to implement something that I think might solve that problem. Then I'll go back and read the paper, and I'll read the little bits about how they solved each of these problems, and I'll be like: oh, that's a good idea, and then I'll try to implement those. That's why, for example, I didn't actually implement SSD; my custom head is not the same as their head. It's because I read the gist of it, then tried to create something as best I could, and then went back to the paper to try to see why. So by the time I got to the focal loss paper, Rachel will tell you, I was driving myself crazy with "how come I can't find small objects? how come it's always predicting background?", and then I read the focal loss paper and I was like: that's why! It's so much better when you deeply understand the problem they're trying to solve. And I do find, the vast majority of the time, that by the time I read the bit of the paper that solves the problem, I'm like: yeah, but these three ideas I came up with, they didn't try. You suddenly realise that you've got new ideas. Whereas if you just implement the paper mindlessly, you tend not to have those insights about better ways to do it.

Varying dropout is really interesting, and there are some recent papers that actually suggest gradually changing dropout; it was either a good idea to gradually make it smaller or to gradually make it bigger. I'm not sure which.
I'm not sure which Let's try maybe one of us can try and find it during the week. I haven't seen it widely used I tried it a little bit with the most recent paper. I wrote and It I had some good results. I think I was Gradually making a smaller The next question is am I correct in thinking that this language model is built on word embeddings Would it be valuable to try this with phrase or sentence and beddings? I Asked that I Asked this because I saw from Google the other day universal sentence encoder Yeah, no, this is like this is much better than that like just you know I mean like this is this is not just an embedding of a sentence. This is an entire model Right. So an embedding by definition is like a fixed thing Oh, I think they're asking They're saying that this language. Well, I the first question is is this language model built on word embedding, right? but So I'm saying it's a sentence or a phrase embedding is Always a model that creates that right and we've got a model That's like trying to understand language. It's not just as phrase. It's not just a sentence You know, it's a it's a document in the end And it's not just an embedding that we're training through the whole thing. So like this has been a huge problem with NLP For years now is this attachment they have to embeddings And so even the paper that the community has been most excited about recently from the from AI to the Allen Institute called Elmo ELMO And they found much better results across lots of models, but again, it was an embedding They took a fixed model and created a fixed set of numbers, which they then fed into a model But in in computer vision, we've known for years that that approach of having a fixed Fixed set of features. They're called hypercolons in in computer vision people stopped using them like three or four years ago because Fine-tuning the entire model works much better Right, so for those of you that have spent quite a lot of time with NLP and not much time with computer vision You're going to have to start Reloading right all that stuff you have been told about this idea that there are these things called embeddings and that you learn them ahead of time and then you apply these fixed things Whether it be word level or phrase level or whatever level Don't do that. All right, you want to actually create a pre-trained model and fine-tune it and to it And you'll see some you'll see some specific results All right, so We're coming up as you answer the existing ones For using accuracy instead of perplexity as a metric for the model Could we work that into the loss function rather than just use it as a metric? No, you never want to do that whether it be computer vision or NLP or whatever. It's too bumpy, right? so Cross-entropy spine is a loss function and I'm not saying instead of I use it in addition to you know I think it's good to look at the accuracy and to look at the Cross-entropy, but for your loss function, you need something Nice and smooth accuracy doesn't work very well You'll see there's two different versions of save there's save and saving coder Save saves the whole model as per usual saving coder saves Just That bit right in other words in the sequential model it saves just that bit and not that bit In other words, you know this bit Which is the bit that actually makes it into a language model. We don't care about in the classifier We just care about That bit. 
Okay, so that's why we save two different models here. So let's now create the classifier. I'm going to go through this bit pretty quickly, because it's the same, but when you go back during the week and look at the code, convince yourself that it's the same. We do pd.read_csv with a chunksize again, get_all again, save those tokens again. We don't create a new itos vocabulary; we obviously want to use the same vocabulary we had in the language model, because we're about to reload the same encoder. Same defaultdict, same way of creating our numericalised list, which, as before, we can save. So that's all the same, and later on we can reload those rather than having to rebuild them. All of our hyperparameters are the same; well, not all of them: the model-construction hyperparameters are the same, but we can change the dropouts. Pick an optimiser function, and pick a batch size that is as big as you can manage without running out of memory.

This next bit is a bit interesting; there's some fun stuff going on here. The basic idea is that for the classifier we really do want to look at a whole document. We need to say whether this document is positive or negative, and we do want to shuffle the documents, because we like to shuffle things. But those documents are different lengths. Now, this is a handy thing fastai does for you: you can stick things of different lengths into a batch and it will automatically pad them, so you don't have to worry about that. But if they're wildly different lengths, you're going to waste a lot of computation time: there might be one thing that's 2,000 words long while everything else is 50 words long, and that means you end up with a 2,000-wide tensor. That's pretty annoying. So James Bradbury, who's actually one of Stephen Merity's colleagues and the guy who came up with torchtext, came up with a neat idea, which was: let's sort the data set by length, ish. Make it so the first things in the list are, on the whole, shorter than the things at the end, but a little bit random as well. I'll show you how I implemented that.

The first thing we need is a dataset, and we have a dataset taking in the documents and their labels. So here's TextDataset, and it inherits from Dataset. Here is Dataset, from PyTorch, and actually Dataset doesn't do anything at all: it says you need a __getitem__ (if you don't have one, you'll get an error) and you need a __len__ (if you don't have one, you'll get an error). So it's an abstract class. We pass in our x, we pass in our y, and __getitem__ grabs the x and the y and returns them as a tuple; it couldn't be much simpler. Optionally it could reverse the tokens, optionally it could stick an end-of-stream token at the end, optionally it could stick a beginning-of-stream token at the beginning; we're not doing any of those things. Literally all we're doing is putting in an x and a y, and to grab an item we return the x and the y as a tuple, and the length is however long the x is. Something with a length that you can index: that's all a dataset is. To turn it into a data loader, you simply pass the dataset to the DataLoader constructor, and it will give you a batch of that at a time.
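The whole dataset really is that small. A minimal sketch, close to but not exactly the fastai version:

```python
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Minimal sketch of a text dataset: something with a length that you
    can index (close to, but not exactly, the fastai version)."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]   # one (document, label) tuple

    def __len__(self):
        return len(self.x)
```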
Normally you can say shuffle=True or shuffle=False, and it will decide whether to randomise it for you. In this case, though, we're actually going to pass in a sampler parameter, and a sampler is a class we're going to define that tells the data loader how to shuffle. For the validation set, we're going to define something that just sorts: it deterministically sorts the data so that the shortest documents are at one end and the longest at the other, and that minimises the amount of padding. For the training sampler, we're going to create this thing I call a sortish sampler, which also sorts, ish. This is one of the places where I really like PyTorch: they came up with an API for their data loader where we can hook in new classes to make it behave in different ways. Here's the sort sampler: it's simply something which, again, has a length, which is the length of the data source, and has an iterator, which is simply an iterator that goes through the data source sorted by key, where I pass in as the key a lambda function which returns the length. And for the sortish sampler, I won't go through the details, but it basically does the same thing with a little bit of randomness. It's just another of these beautiful little design things in PyTorch that I discovered: I could take James Bradbury's ideas, which he had written a whole new set of classes around, and instead just use the built-in hooks inside PyTorch.
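Here is a sketch of both samplers. SortSampler is close to the fastai one; the SortishSampler details (chunk size, exact shuffling scheme) are simplified for illustration.

```python
import numpy as np
from torch.utils.data import Sampler

class SortSampler(Sampler):
    """Deterministic sort by length (sketch; close to the fastai version)."""
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        # longest documents first, to minimise padding within batches
        return iter(sorted(range(len(self.data_source)),
                           key=self.key, reverse=True))

class SortishSampler(Sampler):
    """Sort-ish: shuffle, then sort within largish chunks, so batches are
    roughly length-matched but order still varies per epoch (simplified)."""
    def __init__(self, data_source, key, bs):
        self.data_source, self.key, self.bs = data_source, key, bs

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        idxs = np.random.permutation(len(self.data_source))
        sz = self.bs * 50   # chunk size: a heuristic, not the library's exact value
        chunks = [idxs[i:i + sz] for i in range(0, len(idxs), sz)]
        out = np.concatenate([sorted(c, key=self.key, reverse=True)
                              for c in chunks])
        return iter(out.tolist())

# usage: trn_sampler = SortishSampler(trn_texts, key=lambda i: len(trn_texts[i]), bs=48)
```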
You will notice that the DataLoader here is not actually PyTorch's DataLoader; it's fastai's DataLoader, but it's basically almost entirely plagiarised from PyTorch, customised in some ways to make it faster, mainly by using multi-threading instead of multi-processing. Yes, Rachel? "Does the pre-trained LSTM depth and bptt need to match with the new one we are training?" No, the bptt doesn't need to match at all. That's just how many things we look at at a time; it's got nothing to do with the architecture.

So now we can call that function we saw before, get_rnn_classifier. It's going to create exactly the same encoder, more or less, and we pass in the same architectural details as before. But this time, with the head that we add on, there are a few more things you can do. One is that you can add more than one hidden layer, so this layers parameter says: this is the input size of my classifier section (my head), this is the output of the first layer, this is the output of the second layer, and you can add as many as you like, so you can basically create a good old multi-layer neural net classifier at the end. Ditto, these are the dropouts to go after each of those layers. And then here are all of the AWD-LSTM dropouts, because we're going to basically plagiarise that idea for our classifier. We're going to use the RNN_Learner just as before, and we're going to use discriminative learning rates for different layers; this is the set that I used here. You can try using weight decay or not; I've been fiddling around a bit with that to see what happens.

So we start out just training the last layer, and we get 92.9% accuracy. Then we unfreeze one more layer and get 93.3% accuracy, and then we fine-tune the whole thing for three epochs. So here is the famous James Bradbury we were talking about: this was pretty much the main attempt, before our paper came along, at using a pre-trained model. They used a pre-trained translation model, but they didn't fine-tune the whole thing; they just took the activations of the translation model, and when they tried IMDB they got 91.8%, which we beat easily after fine-tuning only one layer. The state of the art was 94.1%, which we beat after fine-tuning the whole thing for three epochs, and by the end we're at 94.8%, which is obviously a huge difference: in terms of error rate, that's gone down from 5.9. And then I'll tell you a simple little trick: go back to the start of this notebook, reverse the order of all of the documents, and rerun the whole thing. When you get to the bit that says wt103, replace the fwd (forward) with bwd (backward): that's a backward English language model that learns to read English backwards. So if you redo the whole thing with all the documents reversed, and change that to backward, you now have a second classifier which classifies documents as positive or negative based on the reversed text. If you then take the two sets of predictions and average them, you basically have a bidirectional model where you've trained each direction separately, and that gets you to 95.4% accuracy. So we basically went from an error rate of 5.9 down to 4.6: this kind of twenty-plus percent reduction in the state-of-the-art error rate is almost unheard of. You have to go back to Geoffrey Hinton's ImageNet computer vision result, where they dropped something like 30% off the state of the art; it doesn't happen very often. So you can see that this idea of "just use transfer learning" is ridiculously powerful, but every new field thinks its field is too special and you can't do it there. So it's a big opportunity for all of us.
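The averaging step in that backward-model trick is one line. A tiny self-contained sketch (the function name is mine; it assumes both classifiers output per-class probabilities in the same document order):

```python
import numpy as np

def ensemble_accuracy(preds_fwd, preds_bwd, y_true):
    """Average the forward- and backward-trained classifiers' per-class
    probabilities (same document order assumed) and score the ensemble."""
    preds = (np.asarray(preds_fwd) + np.asarray(preds_bwd)) / 2
    return (preds.argmax(axis=1) == np.asarray(y_true)).mean()
```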
So we turned this into a paper, and when I say "we", I did it with this guy, Sebastian Ruder. You might remember his name, because in lesson 5 I told you that I'd actually shared lesson 4 with Sebastian, because I think he's an awesome researcher who I thought might like it. I didn't know him personally at all, and much to my surprise he actually watched the damn video. I was like: what? An NLP researcher is going to watch some beginner's video? But he watched the whole video, and he said: that's actually quite fantastic. I was like: well, thank you very much; that's awesome coming from you. And he said: hey, we should turn this into a paper. And I said: I don't write papers, I don't care about papers, I'm not interested in papers; that sounds really boring. And he said: okay, how about I write the paper for you? And I said: you can't really write a paper about this yet, because you'd have to do studies to compare it to other things (they're called ablation studies) to see which bits actually work. There's no rigour here; I just put in everything that came into my head and chucked it all together, and it happened to work. And he said: okay, what if I write the paper and do all the ablation studies; then can we write the paper? And I said: well, it's a whole library that I haven't documented, and I'm not going to yet, and you don't know how it all works. He said: okay, if I write the paper, and do the ablation studies, and figure out from scratch how the code works without bothering you, then can we write the paper? I was like: yeah, if you did all of those things, we can write the paper. And he was like: okay. And two days later he came back and said: okay, I've done a draft of the paper.

So I share this story to say: if you're some student in Ireland (he's a student in Ireland) and you want to do good work, don't let anybody stop you. I did not encourage him, to say the least, but in the end he was like: look, I want to do this work, I think it's going to be good, and I'll figure it out. And he wrote a fantastic paper, and he did the ablation studies, and he figured out how fastai works, and now we're planning to write another paper together. You've got to be a bit careful, though, because sometimes I get messages from random people saying: I've got lots of good ideas, can we have coffee? I can have coffee at my office any time, thank you. But it's very different to say: hey, I took your ideas and I wrote a paper, and I did a bunch of experiments, and I figured out how your code works and added documentation to it; should we submit this to a conference? Do you see what I mean? There's nothing to stop you doing amazing work, and if you do amazing work that helps somebody else, then, okay, I'm happy that we have a paper. I don't really care about papers, but I think it's cool that these ideas now have this rigorous study behind them.

Let me show you what he did. He took all my code; I'd already done all the fastai.text stuff, and as you've seen, it lets us work with large corpora. Sebastian is fantastically well read, and he said: here's a paper that just came out where some guys tried lots of different classification data sets, so I'm going to try running your code on all of those. These are the data sets, and some of them had many hundreds of thousands of documents, far bigger than anything I had tried, but I thought it should work. He had a few good little ideas as we went along, so you should totally make sure you read the paper. He said: this thing that you called "differential learning rates" in the lessons, well, "differential" kind of means something else; maybe we should rename it. So we renamed it; it's now called discriminative learning rates. So this idea that we had from part one, where we use different learning rates for different layers: after doing some literature research, it does seem that hadn't been done before, so it's now officially a thing, discriminative learning rates.
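In fastai-0.7-era usage, discriminative learning rates are just an array, one value per layer group. The specific values below are illustrative, not the lesson's exact ones:

```python
import numpy as np

# Illustrative values only: one learning rate per layer group, smaller for
# the early (more general) groups and larger for the head.
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])
# learner.fit(lrs, 1, cycle_len=1)   # sketched fastai-0.7-style call
```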
It now has an equation, with Greek letters and everything; but when you see an equation with Greek letters, that doesn't necessarily mean it's more complex than anything we did in lesson one, because this one isn't. Again, that idea of unfreezing a layer at a time also seems never to have been done before, so it's now a thing too, and it's got the very clever name "gradual unfreezing".

So then, as long promised, let's look at slanted triangular learning rates. This actually was not my idea. Leslie Smith, one of my favourite researchers, who you all now know about, emailed me a while ago and said: I'm so over cyclical learning rates; I don't do that anymore. I now do a slightly different version where I have one cycle which goes up quickly at the start and then slowly down afterwards, and I often find it works better; I've tried going back over all of my old data sets, and it works better for every one I tried. So this is what the learning rate looks like, and you can use it in fastai just by adding use_clr= to your fit. The first number is the ratio between the highest learning rate and the lowest learning rate, so here the lowest is one thirty-second of the highest. The second number is the ratio between the length of the whole cycle and the length of the upward part: if you're doing a cycle length of 10 epochs and you want the first epoch to be the upward bit and the other nine epochs to be the downward bit, you'd use 10. I find that works pretty well, and that was also Leslie's suggestion: make about a tenth of it the upward bit and about nine tenths the downward bit. Since he told me about it (actually just maybe two days ago) he published this amazing paper, "A Disciplined Approach to Neural Network Hyper-Parameters", in which he describes something very slightly different to this again, but with the same basic idea. This is a must-read paper. It's got all the kinds of ideas that fastai talks about, in great depth, and nobody else is talking about this stuff. Unfortunately Leslie had to go away on a trip before he really had time to edit it properly, so it's a little bit of a slog to read, but don't let that stop you; it's amazing. And this triangle is the equation from my paper with Sebastian. Sebastian was like: Jeremy, can you send me the math equation behind that code you wrote? And I was like: no, I just wrote the code; I could not turn it into math. So he figured out the math for it.

So you might have noticed that the input to the first layer of our classifier head was equal to the embedding size times three. Why times three? Because (and again, this seems to be something people haven't done before, so it's a new idea: concat pooling) we take the average pooling of the activations over the sequence, the max pooling of the activations over the sequence, and the final set of activations, and just concatenate them all together. Again, this is something we talked about in part one that doesn't seem to be in the literature, so it's now called concat pooling, and it's now got an equation and everything; but this is the entirety of the implementation: pool with average, pool with max, concatenate those two along with the final step of the sequence. So you can go through this paper and see how the fastai code implements each piece.
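Concat pooling really is that small. A minimal sketch (the function name is mine; it assumes the usual seq_len x batch x hidden activation layout):

```python
import torch

def concat_pooling(outputs):
    """Concat pooling sketch. outputs: seq_len x batch x hidden RNN
    activations; returns batch x (3 * hidden)."""
    avg_pool = outputs.mean(dim=0)     # average over the sequence
    max_pool = outputs.max(dim=0)[0]   # max over the sequence
    last = outputs[-1]                 # final time step's activations
    return torch.cat([last, max_pool, avg_pool], dim=1)
```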
To me, one of the interesting pieces is the difference between the RNN_Encoder, which you've already seen, and the MultiBatchRNN encoder. So what's the difference? The key difference is that the normal RNN_Encoder is fine for the language model: we could just do one bptt chunk at a time and predict the next word, no problem. But for the classifier, we need the whole document. We need to read the whole movie review before we decide whether it's positive or negative, and the whole movie review can easily be 2,000 words long, and I can't fit 2,000 words' worth of gradients in my GPU memory for every single one of my weights. So what do I do? The idea was very simple: I go through my whole sequence one batch of bptt at a time, and I call super().forward, in other words the RNN_Encoder, to grab its outputs. Then I've got this maximum sequence length parameter, max_seq, which says: only once you're within the last max_seq tokens, start appending the outputs to my list. In other words, the thing that gets sent back to the pooling contains only as many activations as we've asked it to keep, and that way you can figure out how large a max_seq your particular GPU can handle. It's still using the whole document: say max_seq is 1,000 words and your longest document is 2,000 words long. It still goes through the RNN creating state for those first 1,000 words, but it's not going to store the activations for backprop for the first 1,000; it only keeps the last 1,000. That means it can't backpropagate the loss to any state that was created in the first 1,000 words; basically, that's now gone. It's a really simple piece of code, and honestly, when I wrote it I didn't spend much time thinking about it; it seemed so obviously the only way this could possibly work. But again, it seems to be a new thing, so backprop through time for text classification is now a thing. So you can see there are lots of little pieces in this paper.
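A sketch of the multi-batch idea. This is illustrative rather than the fastai class: it assumes an encoder whose hidden state carries over between calls (as the stateful encoder above does), and the chunk-keeping condition mirrors the description, not the library's exact code.

```python
import torch
import torch.nn as nn

class MultiBatchEncoder(nn.Module):
    """Sketch: run an encoder over a long document one bptt chunk at a time,
    but only keep (and hence only backprop through) the activations of
    roughly the last max_seq tokens. Illustrative, not the fastai class."""
    def __init__(self, encoder, bptt=70, max_seq=1000):
        super().__init__()
        self.encoder, self.bptt, self.max_seq = encoder, bptt, max_seq

    def forward(self, inp):                            # inp: seq_len x batch
        seq_len = inp.size(0)
        outputs = []
        for i in range(0, seq_len, self.bptt):
            # hidden state carries over between chunks inside the encoder
            out = self.encoder(inp[i : i + self.bptt])
            if i > seq_len - self.max_seq:
                outputs.append(out)                    # keep only the tail chunks
        return torch.cat(outputs, dim=0)               # activations for pooling
```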
So what was the result? The result was that on every single data set we tried, we got a better result than any previous academic result for text classification: IMDB, TREC-6, AG News, DBpedia, Yelp, all different types. And honestly, IMDB was the only one I spent any time trying to optimise the model on; most of the others we just ran with whatever came out first, so if we'd actually spent time on them, I think these results would be a lot better. The things these are being compared to are different on each table, because they're customised algorithms on the whole, so this is saying that one simple fine-tuning algorithm can beat these really customised algorithms. And here's one of the really cool things Sebastian did with the ablation studies. I was really keen that if we were going to publish a paper, we had to say why it works. So Sebastian went through and tried removing each of those different contributions I mentioned. What if we don't use gradual unfreezing? What if we don't use discriminative learning rates? What if, instead of discriminative learning rates, we use cosine annealing? What if we don't do any pre-training with Wikipedia? What if we don't do any fine-tuning? And then the really interesting one to me was: what's the validation error rate on IMDB if we only use 100 training examples, versus 200, versus 500? Very interestingly, you can see that the full version of this approach is nearly as accurate on just 100 training examples as on all 20,000 training examples; it's still very accurate. Whereas if you're training from scratch on 100, it's almost random. It's what I expected; I'd said to Sebastian, I really think this is most beneficial when you don't have much data, and that's where fastai is most interested in contributing: small-data regimes, small-compute regimes, and so forth. So he did these studies to check.

Now I want to show you a couple of tricks for running these kinds of studies. The first trick is something I know you're all going to find really handy. I know you've all been annoyed when you're running something in a Jupyter notebook and you lose your internet connection for long enough that it decides you've gone away, and then your session disappears and you have to start again from scratch. So what do you do? There's a very simple, cool thing called VNC, where basically you install on your AWS instance (or Paperspace, or whatever) X Windows, a lightweight window manager, a VNC server, Firefox, a terminal, and some fonts; chuck these lines at the end of your VNC xstartup configuration file; and then run this command. It's now running a server, and you can then run a VNC viewer (TightVNC viewer, or any VNC viewer) on your own computer and point it at your server. Specifically, you use SSH port forwarding to forward port 5913 to localhost:5913, and then when you connect to port 5913 on localhost, it sends it off to port 5913 on your server, which is the VNC port, because you said :13 here, and it will display an X Windows desktop. Then you can click on the Linux start-menu-style button, click on Firefox, and you now have Firefox. You'll see here in Firefox it says localhost, because this Firefox is running on my AWS server. So you run Firefox, you start your thing running, and then you close your VNC viewer, remembering that Firefox is displaying on this virtual VNC display, not on a real display. Then later that day you log back in with your VNC viewer, and it pops up again: it's a persistent desktop, and it's shockingly fast; it works really well. So that's trick number one. There are lots of different VNC servers and clients and whatever, but this one worked fine for me; you can see here I connect to localhost:5913.

Trick number two is to create Python scripts, and this is what we ended up doing. I created a little Python script for Sebastian to say: these are the basic steps you need to do, and now you need to create different versions for everything else. And I suggested he try using this thing called Google Fire. What Google Fire does is: you create a function with shedloads of parameters, and these are all the things Sebastian wanted to try: different dropout amounts, different learning rates, do I use pre-training or not, do I use CLR or not, do I use discriminative learning rates or not, do I go backwards or not, blah blah blah. So you create a function, and then you add something saying: if __name__ == '__main__', fire.Fire with the function name. You do nothing else at all; you don't have to add any metadata, any docstrings, anything at all. You then call that script, and automatically you now have a command-line interface, and that's it.
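A sketch of that pattern. fire.Fire is the real python-fire entry point; the function name and parameters below are illustrative, not the actual imdb_scripts ones.

```python
import fire

def train_clas(dir_path='data/imdb', lr=0.01, dropmult=1.0,
               pretrain=True, use_clr=True, backwards=False):
    """Hypothetical training entry point; parameter names are illustrative."""
    print(f'training with lr={lr}, dropmult={dropmult}, '
          f'pretrain={pretrain}, use_clr={use_clr}, backwards={backwards}')
    # ... training code would go here ...

if __name__ == '__main__':
    fire.Fire(train_clas)

# Then from the terminal, every parameter becomes a flag:
#   python train_clas.py --lr=0.001 --dropmult=0.5 --backwards=True
```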
It's a super fantastic, easy way to run lots of different variations from a terminal, and it ends up being easier than a notebook if you want to do lots of variations, because you can just have a bash script that tries all of them and spits out the results. You'll find that inside the dl2 course directory there's now something called imdb_scripts, and I've put there all of the scripts that Sebastian and I used. Because we needed to tokenise every data set, numericalise every data set, train a language model on every data set, and train a classifier on every data set, and do all of those things in a variety of different ways to compare them, we had a script for each of those things, so you can check out and see all of the scripts that we used.

When you're doing a lot of scripts and stuff, and they've got different code all over the place, eventually it might get frustrating that you have to symlink your fastai library again and again; but you probably don't want to pip install it, because that version tends to be a little bit old, and we move so fast that you want to use the current version in git. If you say pip install -e . from the fastai repo base, it does something quite neat: it basically creates a symlink to the fastai library, the one in your git checkout, inside your site-packages directory. Your site-packages directory is your main Python library location, so if you do this, you can then access fastai from anywhere, but every time you do a git pull you've got the most recent version. One downside of this is that it installs any updated versions of dependency packages from pip, which can confuse conda a little bit. So another alternative is just to symlink the fastai library into your site-packages directory yourself; that works just as well, and it's quite handy when you want to run scripts that use fastai from different directories on your system.

Okay, so one more thing before we go, which is something you can try if you like. You don't have to tokenise words. Instead of tokenising words, you can tokenise what are called sub-word units. For example, "unsupervised" could be tokenised as "un supervised", and "tokenizer" could be tokenised as "token izer". Then you can do the same thing: a language model that works on sub-word units, a classifier that works on sub-word units, and so on. So how well does that work?
I started playing with it, and without too much playing I was getting classification results that were nearly as good as using word-level tokenisation; not quite as good, but nearly. I suspect that with more careful thinking and playing around, maybe I could have got as good or better. But even if I couldn't: if you create a sub-word-unit wikitext model, then an IMDB language model, and then a classifier, forwards and backwards, all on sub-word units, and then ensemble that with the forwards and backwards word-level ones, you should be able to beat us. So here's an approach by which you may be able to beat our state-of-the-art result. Google (Sebastian told me about this particular project) has a project called SentencePiece, which figures out the optimal splitting of words into sub-word units, so you end up with a vocabulary of sub-word units. In my playing around, I found that creating a vocabulary of about 30,000 sub-word units seems to be about optimal. So if you're interested, there's something you can try. It's a bit of a pain to install (it's C++, and it doesn't have great error messages), but it will work, and there is a Python library for it; there's a little SentencePiece sketch below if you want a starting point, and if anybody tries this, I'm happy to help them get it working. There have been little if any experiments with ensembling sub-word and word-level classification, and I do think it should be the best approach.

All right. Thanks, everybody. Have a great week, and see you next Monday.
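The promised sketch of the SentencePiece Python wrapper. The corpus path, model prefix, and example output are illustrative; the 30,000 vocab size is the one mentioned above.

```python
import sentencepiece as spm

# Train a sub-word model on a raw text corpus (file path and model prefix
# are illustrative; ~30k sub-word units worked well in my experiments).
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=imdb_sp --vocab_size=30000')

sp = spm.SentencePieceProcessor()
sp.Load('imdb_sp.model')
print(sp.EncodeAsPieces('unsupervised tokenizer'))
# e.g. ['▁un', 'supervised', '▁token', 'izer'] (actual split depends on the corpus)
```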