I want to start by pointing out a couple of the many cool things that happened this week. One thing that I'm really excited about: we briefly talked about how Leslie Smith has a new paper out, and basically the paper takes his previous two key papers, cyclical learning rates and super-convergence, and builds on them with a number of experiments to show how you can achieve super-convergence. Super-convergence lets you train models five times faster than previous, kind of stepwise approaches. It's not five times faster than CLR, but it's faster than CLR as well. And the key is that super-convergence lets you get up to massively high learning rates, somewhere between one and three, which is quite amazing. The interesting thing about super-convergence is that you actually train at those very high learning rates for quite a large percentage of your epochs, and during that time the loss doesn't really improve very much. But the trick is that it's doing a lot of searching through the space to find really generalizable areas, it seems. We had a lot of what we needed in fastai to achieve this, but we were missing a couple of bits. And so Sylvain Gugger has done an amazing job of fleshing out the pieces that we were missing, and then confirming that he has actually achieved super-convergence training on CIFAR-10. I think this is the first time this has been done that I've heard of, outside of Leslie Smith himself. He's got a great blog post up now on 1cycle, which is what Leslie Smith called this approach. And this is actually, it turns out, what 1cycle looks like. It's a single cyclical learning rate, but the key difference here is that the going-up bit is the same length as the going-down bit, so you go up really slowly, and then at the end, for about a tenth of the time, you have this little bit where you go down even further. And it's interesting: obviously this is a very easy thing to show and a very easy thing to explain. Sylvain has added it to fastai; temporarily it's called use_clr_beta, and by the time you watch this on the video it'll probably be called one_cycle or something like that, but you can use it right now. So that's one key piece to getting these massively high learning rates, and he shows a number of experiments when you do that. A second key piece is that as you do this to the learning rate, you do this to the momentum: when the learning rate is low, it's fine to have a high momentum, but when the learning rate gets really high, your momentum needs to be quite a bit lower. So this is also part of what he's added to the library, this cyclical momentum. And with these two things, you can train in about a fifth of the number of epochs of a stepwise learning rate schedule. Then you can drop your weight decay down by about two orders of magnitude, and you can often remove most or all of your dropout. So you end up with something that's trained faster and generalizes better. It actually turns out that Sylvain got quite a bit better accuracy than in Leslie Smith's paper. His guess, which I was pleased to see, is that it's because our data augmentation defaults are better than Leslie's. I hope that's true. So check that out.
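To make the shape of that schedule concrete, here is a minimal sketch of a 1cycle-style schedule with cyclical momentum. The linear ramps, the default values, and the function name are all illustrative assumptions on my part, not fastai's actual implementation.

```python
import numpy as np

def one_cycle_sketch(n_iter, lr_max=1.0, lr_div=25, moms=(0.95, 0.85), tail=0.1):
    # go up really slowly, come down for the same length...
    n_tail = int(n_iter * tail)             # ...then the short "go down even further" bit
    half = (n_iter - n_tail) // 2
    lrs = np.concatenate([
        np.linspace(lr_max / lr_div, lr_max, half),             # warm up
        np.linspace(lr_max, lr_max / lr_div, half),             # cool down
        np.linspace(lr_max / lr_div, lr_max / (lr_div * 100),   # final descent
                    n_iter - 2 * half),
    ])
    # momentum mirrors the learning rate: low momentum when the LR is high
    moms_sched = np.concatenate([
        np.linspace(moms[0], moms[1], half),
        np.linspace(moms[1], moms[0], half),
        np.full(n_iter - 2 * half, moms[0]),
    ])
    return lrs, moms_sched
```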
Another cool thing, and as I say there have been so many cool things this week that I'm just going to pick two, comes from Hamel Husain, who works at GitHub. I just really like this. There's a fairly new project called Kubeflow, which is basically TensorFlow for Kubernetes. Hamel wrote a very nice article about the magic of sequence-to-sequence models, building data products on that, and using Kubernetes to put it in production and so forth. He said that the Google Kubeflow team created a demo based on what he wrote earlier this year, directly based on the skills he learned in this class, and that he will be presenting this technique at KDD. KDD is one of the top academic conferences. So I wanted to share this as a motivation for folks to blog, which I think is a great point. Probably none of us who go out and write a blog really think our blog is actually going to be very good, or that anybody is going to read it. And then when people actually do like it and read it, it's with great surprise; you just go, oh, that's actually something people were interested to read. So here is the tool where you can summarize GitHub issues, which is now hosted by Google on the kubeflow.org domain. I think that's a great story: if Hamel hadn't put his work out there, none of this would have happened. And you can check out his post that made it all happen as well.

So, talking of the magic of sequence-to-sequence models, let's build one. We're going to be working specifically on machine translation. Machine translation is something that's been around for a long time, but we're going to look at an approach called neural translation, which uses neural networks for translation, and that wasn't really a thing in any kind of meaningful way until a couple of years ago. Thanks to Chris Manning from Stanford for the next three slides. In 2015, Chris pointed out, neural machine translation first appeared properly, and it was pretty crappy compared to the statistical machine translation approaches, which use kind of classic feature engineering and standard NLP approaches, with lots of stemming and fiddling around with word frequencies and n-grams and lots of stuff. By a year later, it was better than everything else. This is on a metric called BLEU; we're not going to discuss the metric, because it's not a very good metric and it's not very interesting, but it's what everybody uses. So that was BLEU as of when Chris made this slide; as of now, it's up to about 30. So we're kind of seeing machine translation starting down the path that we saw computer vision object classification start down in 2012, I guess: we've just surpassed the state of the art, and now we're zipping past it at a great rate.

It's very unlikely that anybody watching this is actually going to build a machine translation model, because you can go to translate.google.com and use theirs, and it works quite well. So why are we learning about machine translation? The reason is that the general idea of taking some kind of input, like a sentence in French, and transforming it into some other kind of output of arbitrary length, such as a sentence in English, is a really useful thing to do. For example, the thing we just saw that Hamel at GitHub did takes GitHub issues and turns them into summaries. Other examples include taking videos and turning them into descriptions, or taking a, well, I don't know.
I mean, basically anything where you're spitting out an arbitrary-sized output, which is very often a sentence. So maybe taking a CT scan and spitting out a radiology report: this is where you can use sequence-to-sequence learning. The important thing about neural machine translation (these are more slides from Chris), and generally about sequence-to-sequence models, is that there's no fussing around with heuristics and hacky feature engineering or whatever. It's end-to-end training. We're able to build these distributed representations, which are shared by lots of concepts within a single network. We're able to use long-term state in the RNN, so we use a lot more context than n-gram-type approaches. And in the end, the text we're generating uses an RNN as well, so we can build something that's more fluid.

We're going to use a bidirectional LSTM with attention. Well, actually, we're going to use a bidirectional GRU with attention, but it's basically the same thing. You already know about bidirectional recurrent neural networks, and attention we're going to add on top today. These general ideas you can use for lots of other things as well, as Chris points out on this slide.

So let's jump into the code, which is in the translate notebook, funnily enough. We're going to try to translate French into English, and the basic idea is that we're going to try to make this look as much like a standard neural network approach as possible. So we're going to need three things, and you all remember the three things: data, a suitable architecture, and a suitable loss function. Once you've got those three things, you run fit, and all things going well, you end up with something that solves your problem.

For data, we generally need (x, y) pairs, because we need something we can feed into the loss function. So I take my x value, which is my French sentence, and the loss function says it was meant to generate this English sentence; then you have your predictions, which you compare against that to see how good they are. So we need lots of these tuples of French sentences with their equivalent English sentences. That's called a parallel corpus. Obviously, this is harder to find than a corpus for a language model, because for a language model we just need text in some language, and for basically any living language whose speakers use computers, there will be at least a few gigabytes of text floating around the internet for you to grab. So building a language model is only challenging corpus-wise for ancient languages; one of our students is trying to do a Sanskrit one at the moment, for example. But that's very rarely a problem. For translation, there are actually some pretty good parallel corpora available for European languages. The European Parliament basically has every sentence in every European language. Anything that goes through the UN is translated into lots of languages. For French and English, we have a particularly nice thing: pretty much any semi-official Canadian website will have a French version and an English version. So this chap, Chris Callison-Burch, did a cool thing, which was basically to transform French URLs into English URLs by replacing "-fr" with "-en" and hoping that this retrieves the equivalent document. He did that for lots and lots of websites and ended up creating a huge corpus based on millions of web pages. So for French to English, we have this particularly nice resource.
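Just to make the URL trick concrete, here is a toy sketch; the URL is made up, and the real project crawled millions of pages with far more care than this.

```python
# A hypothetical Canadian government page and its English twin.
fr_url = "https://www.example.gc.ca/rapport-fr.html"
en_url = fr_url.replace("-fr", "-en")   # hope this fetches the equivalent English document
print(en_url)                           # https://www.example.gc.ca/rapport-en.html
```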
So we're going to start out by talking about how to create the data, then we'll look at the architecture, and then we'll look at the loss function. For bounding boxes, all of the interesting stuff was in the loss function, but for neural translation, all of the interesting stuff is going to be in the architecture. So let's zip through this pretty quickly. One of the things I want you to think about particularly is: what are the relationships or similarities, in terms of the task we're doing and how we do it, between language modeling and neural translation?

The basic approach here is that we're going to take a sentence. In this case, the example is English to German, and this slide is from Stephen Merity; we steal everything we can from Stephen. We start with some sentence in English, and the first step is to do basically the exact same thing we do in a language model, which is to chuck it through an RNN. Actually, with our language model... let's not even think about the language model; let's start even easier, with the classification model, something that turns a sentence into positive or negative sentiment. We had a decoder, something which basically took the RNN output, and in our paper we grabbed three things: we took a max pool over all of the time steps, we took a mean pool over all of the time steps, and we took the value of the RNN at the last time step, stuck all of those together, and put it through a linear layer. Most people don't do that in most NLP stuff; I think it's something we invented. People pretty much always use the last time step, so all of the stuff we'll be talking about today uses the last time step.
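For reference, here is a rough sketch of that concat-pooling head, not the exact fastai code:

```python
import torch

def concat_pooling(rnn_out):
    # rnn_out: (seq_len, batch, hidden), the RNN output at every time step
    avg  = rnn_out.mean(0)       # mean pool over the time steps
    mx   = rnn_out.max(0)[0]     # max pool over the time steps
    last = rnn_out[-1]           # the value of the RNN at the last time step
    return torch.cat([last, mx, avg], dim=1)   # (batch, 3*hidden), ready for a linear layer
```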
So we start out by chucking this sentence through an RNN, and out of it comes some state, meaning some hidden state: some vector that represents the output of an RNN that has encoded that sentence. You'll see the word Stephen used here was "encoder". We've tended to use the word "backbone": when we've talked about adding a custom head to an existing model, like an existing pre-trained ImageNet model, for example, we say that's our backbone, and then we stick some head on top of it that does the task we want. In sequence-to-sequence learning, they use the word encoder, but it's basically the same thing: some piece of a neural network architecture that takes the input and turns it into some representation, which we can then stick a few more layers on top of to grab something out of it, such as we did for the classifier, where we stuck a linear layer on top to turn it into a sentiment, positive or negative. This time, though, we have something that's a little bit harder than just getting sentiment, which is that I want to turn this state not into a positive or negative sentiment, but into a sequence of tokens, where that sequence of tokens is, in this case, the German sentence that we want. So this is sounding more like the language model than the classifier, because the language model had multiple tokens: for every input word, there was an output word. But the language model was also much easier, because the number of tokens in the language model output was the same as the number of tokens in the language model input. And not only were they the same length, they exactly matched up: after word one comes word two, after word two comes word three, and so forth.

But for translating language, you don't necessarily know that the word "he" will be translated as the first word in the output, or that "loved" will be the second word in the output. In this particular case, unfortunately, they are the same, but very often the subject-object order will be different, or there will be some extra words inserted, or some pronouns will need a gendered article added, or whatever. So this is the key issue we're going to have to deal with: the fact that we have an arbitrary-length output where the tokens in the output do not correspond, in the same order, to specific tokens in the input. But the general idea is the same: use an RNN to encode the input, turn it into some hidden state, and then, and this is the new thing we're going to learn, generate a sequence as output. So we already know sequence-to-class: that's the IMDB classifier. We already know sequence to equal-length sequence, where each item corresponds: that's the language model, for example. But we don't know yet how to do a general-purpose sequence-to-sequence, so that's the new thing today. Very little of this will make sense unless you really understand lesson six and how an RNN works. So if some of this lesson doesn't make sense to you and you find yourself wondering, what does he mean by hidden state exactly, how is that working, go back and re-watch lesson six.

To give you a very quick review: we learned that an RNN, at its heart, is a standard fully connected network. So here's one with one, two, three, four layers. It takes an input and puts it through four layers, but then at the second layer it can just concatenate in a second input, and at the third layer a third input. We actually wrote this in Python as literally a four-layer neural network; there was nothing we used other than linear layers and ReLUs. We used the same weight matrix every time an input came in, and we used the same matrix every time we went from one of these states to the next, and that's why these arrows are the same color. And so we can redraw that previous thing like this. And not only did we redraw it, but we took the four lines of linear, linear, linear, linear code in PyTorch and replaced them with a for loop. So remember, we had something that did exactly the same thing as this, but it just had four lines of code saying linear, linear, linear, linear, and we literally replaced it with a for loop, because that's nice to refactor. So, literally, that refactoring, which doesn't change any of the math, any of the ideas, any of the outputs, that refactoring is an RNN. It's turning a bunch of separate lines of code into a Python for loop. And so that's how we can draw it. We could take the output so that it's not outside the loop and put it inside the loop, like so. And if we do that, we're now going to generate a separate output for every input. In this particular one here, the hidden state gets replaced each time, and we end up just spitting out the final hidden state, so this one is that example. But if instead we had something that said hs.append(h) and returned hs at the end, that would be this picture. So go back and re-look at that notebook if this is unclear. I think the main thing to remember is that when we say hidden state, we're referring to a vector. See here, here's the vector: h = torch.zeros(n_hidden). Now, of course, it's a vector for each thing in the mini-batch, so it's a matrix.
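Here is that idea written out as runnable code: a minimal from-scratch RNN loop in PyTorch. The sizes and names (n_vocab, n_embed, n_hidden) are illustrative, but the structure, one weight matrix reused for every input and one for every state-to-state step, inside a plain Python for loop, is exactly the refactoring described above.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, n_vocab, n_embed=64, n_hidden=256):
        super().__init__()
        self.emb   = nn.Embedding(n_vocab, n_embed)
        self.w_in  = nn.Linear(n_embed, n_hidden)    # same weights for every input
        self.w_hid = nn.Linear(n_hidden, n_hidden)   # same weights for every state-to-state step
        self.n_hidden = n_hidden

    def forward(self, seq):                           # seq: (seq_len, batch) of token ids
        h = torch.zeros(seq.size(1), self.n_hidden)   # the hidden state: a vector per batch item
        hs = []
        for tok in seq:                               # this for loop *is* the RNN
            h = torch.relu(self.w_in(self.emb(tok)) + self.w_hid(h))
            hs.append(h)                              # keep an output for every input
        return torch.stack(hs), h                     # all outputs, plus the final hidden state
```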
But generally, when I speak about these things, I ignore the mini-batch piece and treat it as just a single item, so it's just a vector of this length. We also learned that you can stack these layers on top of each other: rather than the first RNN spitting out output, it can just spit out inputs into a second RNN. And if you're thinking at this point, "I think I understand this, but I'm not quite sure", then if you're anything like me, that means you don't understand it. The only way you know that you actually understand it is to go and write it from scratch in PyTorch or NumPy. And if you can't do that, then you know you don't understand it, and you can go back and re-watch lesson six, check out the notebook, and copy some of the ideas until you can. It's really important that you can write that from scratch; it's less than a screen of code. So you want to make sure you can create a two-layer RNN. And this is what it looks like if you unroll it.

So the goal is to get to a point where we have these (x, y) pairs of sentences, and we're going to do French to English. So we're going to start by downloading this data set. Training a translation model takes a long time. Google's translation model has eight layers of RNN stacked on top of each other. There's no conceptual difference between eight layers and two layers; it's just that if you're Google and you have more GPUs than you know what to do with, then you're fine doing that. Whereas in our case, it's pretty likely that the kind of sequence-to-sequence models we're building are not going to require that level of computation. So to keep things simple, let's do a cut-down thing: rather than learning how to translate French into English for any sentence, let's learn to translate French questions into English questions, and specifically questions that start with what, where, which, when. So you can see here I've got a regex that looks for things that start with "wh" and end with a question mark. I just go through the corpus, open up each of the two files, where each line is one parallel text, zip them together, grab the English question and the French question, and check whether they match the regular expressions. Then I dump that out as a pickle so that I don't have to do it again. And so we now have 52,000 sentence pairs, and here are some examples of them. One nice thing about this is that what, who, where type questions tend to be fairly short, which is nice. But I would say the idea that we could learn from scratch, with no previous understanding of the idea of language, let alone of English or of French, something that can translate one to the other for any arbitrary question, with only 50,000 sentences, sounds like a ludicrously difficult thing to ask this to do. So I will be impressed if we can make any progress whatsoever. This is very little data to do a very complex exercise. All right, so this contains the tuples of French and English, and you can use this handy idiom to split them apart into a list of English questions and a list of French questions.
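Here is a sketch of that filtering step. The file names and the exact regexes are assumptions based on the description above, not the notebook's code verbatim.

```python
import re
import pickle

re_eq = re.compile(r'^(Wh[^?.!]+\?)')   # English: starts with "Wh", ends with "?"
re_fq = re.compile(r'^([^?.!]+\?)')     # French: just has to end with a question mark

qs = []
with open('corpus.en') as en_f, open('corpus.fr') as fr_f:
    for eq, fq in zip(en_f, fr_f):      # each line is one parallel text
        m_e, m_f = re_eq.search(eq), re_fq.search(fq)
        if m_e and m_f:
            qs.append((m_e.group(), m_f.group()))

pickle.dump(qs, open('fr-en-qs.pkl', 'wb'))   # so we don't have to do it again

en_qs, fr_qs = zip(*qs)                 # the handy idiom: one list of English, one of French
```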
And then we tokenize the English questions and we tokenize the French questions. So remember, that just means splitting them up into separate words or word-like things. By default, the tokenizer that we have here, which remember is a wrapper around the spaCy tokenizer, a fantastic tokenizer, assumes English. So to ask for French, you just add an extra parameter. The first time you do this, you'll get an error saying that you don't have the spaCy French model installed, and you can Google to get the command, something like python -m spacy download fr, to grab the French model.

I don't think any of you are going to have RAM problems here, because this is not a particularly big corpus, but I know that some of you were trying to train new language models during the week and were having RAM problems. If you do, it's worth knowing what these functions are actually doing. So, for example, this one here processes every sentence across multiple processes; that's what the "mp" means. And remember, fastai code is designed to be pretty easy to read, so here are the three or four lines of code for proc_all_mp: find out how many CPUs you have; divide by two, because normally, with hyperthreading, they don't actually all work in parallel; then, in parallel, run this process function. So that's going to spin up a whole separate Python process for every CPU you have. If you have a lot of cores, that's a lot of Python processes, and every one of them is going to load all this data in, which can potentially use up all your RAM. So you could replace that with just proc_all, rather than proc_all_mp, to use less RAM. Or you could just use fewer cores. At the moment, we're calling this function partition_by_cores, which calls partition on a list and asks to split it into a number of equal-length things according to how many CPUs you have. So you could replace that by splitting it into a smaller number of things and running it on fewer cores.

"Was an attention layer tried in the language model? Do you think it would be a good idea to try adding one?" We haven't learned about attention yet, so let's ask about things that we have got to, not things we haven't. The short answer is no, I haven't tried it properly, and yes, you should try it, because it might help. In general, there are going to be a lot of points today where, if you've done some sequence-to-sequence stuff before, you'll be wanting to know about something we haven't covered yet. I'm going to cover all the sequence-to-sequence things, so at the end of this, if I haven't covered the thing you wanted to know about, please ask me then. If you ask me before, I'll be answering based on something I'm about to teach you.

Okay, so having tokenized the English and French, you can see how it gets split out. And you can see that tokenization for French looks quite different, because French loves its apostrophes and its hyphens and stuff. So if you try to use an English tokenizer for a French sentence, you're going to get a pretty crappy outcome. You don't need to know heaps of NLP ideas to use deep learning for NLP, but just some basic stuff, like using the right tokenizer for your language, is important. Some of the students in our study group this week have been trying to build language models for Chinese, for instance, which of course doesn't really have the concept of a tokenizer in the same way. So we've been starting to look at, as I briefly mentioned last week, this Google thing called SentencePiece, which basically splits things into arbitrary subword units. So when I say tokenize, if you're using a language that doesn't have spaces in it, you should probably be checking out SentencePiece or some other similar subword-unit thing instead. And hopefully, in the next week or two, we'll be able to report back with some early results of these experiments with Chinese.
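Coming back to the RAM point for a moment, here is a sketch of the proc_all_mp pattern just described. partition, partition_by_cores, and the tokenizer call are simplified stand-ins, not fastai's exact code.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def partition(a, sz):
    # split a list into chunks of (at most) sz items
    return [a[i:i + sz] for i in range(0, len(a), sz)]

def partition_by_cores(a, n_cpus):
    return partition(a, (len(a) + n_cpus - 1) // n_cpus)

def tokenize_chunk(sentences):
    return [s.split() for s in sentences]     # stand-in for the real spaCy tokenizer

def proc_all_mp(sentences):
    n_cpus = (os.cpu_count() or 2) // 2       # halve it: hyperthreads rarely help here
    with ProcessPoolExecutor(n_cpus) as ex:   # one separate Python process per worker
        chunks = ex.map(tokenize_chunk, partition_by_cores(sentences, n_cpus))
    return [tok for chunk in chunks for tok in chunk]
```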
So, having tokenized it, we'll save that to disk. And then remember, the next step after we create tokens is to turn them into numbers, and to turn them into numbers we have two steps. The first is to get a list of all of the words that appear, and then we turn every word into its index into that list. If there are more than 40,000 words that appear, let's cut it off there so it doesn't get too crazy. And we insert a few extra tokens for beginning of stream, padding, end of stream, and unknown. So if we try to look up something that wasn't in the 40,000 most common, then we use a defaultdict to return 3, which is unknown. So now we can go ahead and turn every token into an ID by putting it through the string-to-integer dictionary we just created, and at the end of each sentence, let's add the number 2, which is end of stream. The code you see here is the code I write when I'm iterating and experimenting, because, like, 99% of the code I write when I'm iterating and experimenting turns out to be totally wrong or stupid or embarrassing, and you don't get to see it. But there's no point refactoring that and making it beautiful while I'm writing it, so I wanted you to see all the little shortcuts I use. So rather than doing this properly and having some constant for the end-of-stream marker and using it, when I'm prototyping I just do the easy stuff. Not so much that I end up with broken code, but I try to find some middle ground between beautiful code and code that works.

"We just heard you mention that we divide the number of CPUs by two because, with hyperthreading, we don't get a speed-up using all the hyperthreaded cores. Is this based on practical experience, or is there some underlying reason why we wouldn't get additional speed-up?" Yeah, it's just practical experience, and it's not like everything seems to work this way, but I definitely noticed with tokenization that hyperthreading seemed to slow things down a little bit. Also, if I use all the cores, often I want to do something else at the same time, like run some interactive notebook, and I don't have any spare room to do that. It's a minor issue.

So now, for our English and our French, we can grab our list of IDs. And when we do that, of course, we need to make sure that we also store the vocabulary. There's no point having IDs if we don't know what the number 5 represents. So that's our vocabulary, the list of strings, and the reverse mapping, string to int, which we can use to numericalize more text in the future. Okay, so just to confirm it's working, we can go through each ID, convert the int to a string, and spit that out, and there we have our sentence back, now with an end-of-stream marker at the end. Our English vocab is 17,000, and our French vocab is 25,000. So these aren't too big or too complex vocabs that we're dealing with, which is nice to know.
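Here is a sketch of that numericalization step. The special-token names and indexes follow the description above (end of stream is 2, unknown is 3); the function names are mine.

```python
import collections

def build_vocab(tokenized, max_vocab=40000):
    freq = collections.Counter(tok for sent in tokenized for tok in sent)
    itos = [w for w, _ in freq.most_common(max_vocab)]        # int -> string
    for i, tok in enumerate(['_bos_', '_pad_', '_eos_', '_unk_']):
        itos.insert(i, tok)
    # string -> int; anything not in the vocab comes back as 3, i.e. unknown
    stoi = collections.defaultdict(lambda: 3, {w: i for i, w in enumerate(itos)})
    return itos, stoi

def numericalize(tokenized, stoi):
    return [[stoi[tok] for tok in sent] + [2] for sent in tokenized]  # append end of stream
```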
Okay. So, we spent a lot of time on the forums during the week discussing how pointless word vectors are and how you should stop getting so excited about them, and we're now going to use them. Why is that? Basically, all the stuff we've been learning about using language models, and pre-trained proper models rather than pre-trained single linear layers (which is what word vectors are), applies equally well, I think, to sequence-to-sequence. But I haven't tried it yet; I haven't built it yet. So Sebastian and I are starting to look at that. I'm slightly distracted by preparing this class at the moment, but after this class is done... So there's a whole thing here: for anybody interested in creating some genuinely new, highly publishable results, the entire area of sequence-to-sequence with pre-trained language models hasn't been touched yet, and I strongly believe it's going to be just as good as the classification stuff. If you work on this and you get to the point where you have something that's looking exciting, and you want help publishing it, I'm very happy to help co-author papers on stuff that's looking good. So feel free to reach out if and when you have some interesting results.

So at this stage, we don't have any of that, which means we're going to use very little fastai, actually, and very little in terms of fastai ideas. All we've got is word vectors. Anyway, let's at least use decent word vectors. word2vec is a very old set of word vectors; there are better word vectors now, and fastText is a pretty good source of them. There are hundreds of languages available; your language is likely to be represented. So to grab them, you can click on this link and download the word vectors for the language you're interested in, and install the fastText Python library. It's not available on PyPI, but here's a handy trick: if there is a GitHub repo that has a setup.py in it and a requirements.txt in it, you can just chuck "git+" at the start, stick that in your pip install, and it works. Hardly anybody seems to know this. If you go to the fastText repo, they won't tell you this; they tell you to download it and cd into it and blah, blah, blah. But you don't have to. You can just run that. You can also use this for the fastai library, by the way: if you want to pip install the latest version of fastai, you can totally do this.

So you grab the library, import it, and load the model. So here's my English model, and here's my French model. You'll see there's a text version and a binary version. The binary version is a bit faster, and we're going to use that; the text version is also a bit buggy. Then I'm going to convert it into a standard Python dictionary to make it a bit easier to work with. This just goes through each word with a dictionary comprehension, and we save it as a pickled dictionary. So now we've got our pickled dictionary, and we can go ahead and look up a word, for example a comma, and it returns a vector. The length of that vector is the dimensionality of this set of word vectors, so in this case we've got 300-dimensional English and French word vectors. For reasons that you'll see in a moment, I also want to find out what the mean of my vectors is, and the standard deviation: the mean is about 0 and the standard deviation is about 0.3. So we'll remember that. Often, corpora have a pretty long-tailed distribution of sequence lengths, and it's the longest sequences that tend to overwhelm how long things take, how much memory is used, and stuff like that. So I'm going to grab, in this case, the 99th percentile of the English and French lengths and truncate everything to that amount. Originally I was using the 90th percentile, so these are poorly named variables; apologies for that. So that's just truncating them.
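Here is a sketch of those fastText steps. It assumes the 2018-era fastText Python API (the module has since been renamed fasttext in newer releases), and the file names are illustrative.

```python
import pickle
import numpy as np
import fastText   # pip install git+https://github.com/facebookresearch/fastText.git

en_model = fastText.load_model('wiki.en.bin')   # the binary version: a bit faster
en_vecd = {w: en_model.get_word_vector(w) for w in en_model.get_words()}
pickle.dump(en_vecd, open('wiki.en.pkl', 'wb'))

dim_en_vec = len(en_vecd[','])                  # 300 for these vectors
all_vecs = np.stack(list(en_vecd.values()))
vec_mean, vec_std = all_vecs.mean(), all_vecs.std()   # roughly 0 and 0.3
```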
So we're nearly there. We've got our tokenized, numericalized English and French data set, and we've got some word vectors. So now we need to get it ready for PyTorch. PyTorch expects a Dataset object, and hopefully by now you can all tell me that a Dataset object requires two things: a length and an indexer. So I started out writing this, and I was like, okay, I need a seq2seq dataset, and I started out writing it, and I thought, okay, we're going to have to pass it our x's and our y's and store them away, and then my indexer is going to need to return a numpy array of the x's at that point and a numpy array of the y's at that point. And, oh, that's it. So then, after I wrote this, I realized I hadn't really written a seq2seq dataset; I'd just written a totally generic dataset. So here's the simplest possible dataset that works for any pair of arrays. It's now poorly named; it's much more general than a seq2seq dataset, but that's what I needed it for. That A function: remember, we've got V for variables, T for tensors, A for arrays. It basically goes through each of the things you pass it, and if it's not already a numpy array, it converts it into a numpy array, and returns back a tuple of all of the things that you passed it, which are now guaranteed to be numpy arrays. So that's A, V, T: three very handy little functions. Okay, so that's it. That's our dataset.
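Here is a sketch of that generic dataset, along with a minimal version of the A helper; fastai's real A, V, T helpers do a bit more than this.

```python
import numpy as np
from torch.utils.data import Dataset

def A(*a):
    # convert everything passed in to a numpy array, return them as a tuple
    return tuple(np.array(o) for o in a)

class Seq2SeqDataset(Dataset):
    """Poorly named, as discussed: it works for any pair of arrays."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])   # the indexer
    def __len__(self):
        return len(self.x)                   # the length
```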
So now we need to grab our English and French IDs and get a training set and a validation set. And one of the things that's pretty disappointing about a lot of code out there on the internet is that it doesn't follow some simple best practices. For example, if you go to the PyTorch website, they have an example section for sequence-to-sequence translation. Their example does not have a separate validation set. I tried training according to their settings, tested it with a validation set, and it turned out that it overfit massively. So this is not just a theoretical problem: the actual PyTorch repo has the actual official sequence-to-sequence translation example, which does not check for overfitting and overfits horribly. Also, it fails to use mini-batches, so it actually fails to utilize any of the efficiency of PyTorch whatsoever. So even if you find code in the official PyTorch repo, don't assume it's any good at all. The other thing you'll notice is that pretty much every other sequence-to-sequence model I found in PyTorch, anywhere on the internet, has clearly been copied from that shitty PyTorch repo, because it all has the same variable names, it has the same problems, and it has the same mistakes. Another example: nearly every PyTorch convolutional neural network I found does not use an adaptive pooling layer. In other words, the final layer is always something like average-pool (7, 7): they assume that the previous layer is 7 by 7, and if you use any other size input, you get an exception. And therefore, nearly everybody I've spoken to that uses PyTorch thinks there is a fundamental limitation of CNNs, that they are tied to the input size, and that has not been true since VGG. So every time we grab a new model and stick it in the fastai repo, I have to go in, search for "pool", add "adaptive" to the start, replace the 7 with a 1, and now it works on any size input.

So just be careful, you know. It's still early days, and believe it or not, even though most of you have only started your deep learning journey in the last year, you know quite a lot more about a lot of the more important practical aspects than the vast majority of people that are publishing and writing stuff in official repos. So you kind of need to have a little more self-confidence than you might expect when it comes to reading other people's code. If you find yourself thinking "that looks odd", it's not necessarily you; it might well be them. I would say at least 90% of the deep learning code that I start looking at turns out to have deathly serious problems that make it completely unusable for anything. And so I've been telling people that I've been working with recently: if the repo you're looking at doesn't have a section saying "here's the test we did where we got the same results as the paper this is meant to be implementing", that almost certainly means they haven't got the same results as the paper they're implementing, and they probably haven't even checked. And if you run it, it definitely won't get those results, because it's hard to get things right the first time. It takes me 12 goes; it probably takes normal people, smarter than me, 6 goes; but if they haven't tested it once, it almost certainly won't work.

Okay, so there's our sequence dataset. Let's get the training and validation sets. Here's an easy way to do that: grab a bunch of random numbers, one for each row of your data, and see if they're bigger than 0.1 or not. That gets you a list of booleans. Index into your array with that list of booleans to grab a training set, and index into that array with the opposite of that list of booleans to get your validation set. It's a nice, easy way; there are lots of ways of doing it, and I just like to do it different ways so you can see a few approaches. So now we can create our datasets with our x's and our y's, French and English. If you want to translate English to French instead, switch these two around and you're done.
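Here is a sketch of that split (fr_ids and en_ids are assumed to be numpy object arrays of the numericalized sentences from earlier):

```python
import numpy as np

np.random.seed(42)
keep = np.random.rand(len(en_ids)) > 0.1        # a list of booleans, ~90% True
fr_trn, en_trn = fr_ids[keep], en_ids[keep]     # training set
fr_val, en_val = fr_ids[~keep], en_ids[~keep]   # the opposite booleans: validation set

trn_ds = Seq2SeqDataset(fr_trn, en_trn)         # French -> English; swap to reverse
val_ds = Seq2SeqDataset(fr_val, en_val)
```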
Okay, now we need to create data loaders. We can just grab our data loader and pass in our dataset and batch size. We actually have to transpose the arrays; I'm not going to go into the details about why, and you can talk about it during the week if you're interested, but have a think about why we might need to transpose their orientation. There are a few more things I want to do, though. One is that, since we've already done all the pre-processing, there's no point spawning off multiple workers to do augmentation or whatever, because there's no work to do; so setting num_workers=1 will save you some time. We have to tell it what our padding index is; that's actually pretty important, because what's going to happen is that we've got different-length sentences, and fastai, which I think is pretty much the only library that does this, will just automatically stick them together and pad the shorter ones so that they all end up equal length. Remember, a tensor has to be rectangular. In the decoder, in particular, I actually want my padding to be at the end, not at the start. For a classifier, I want the padding at the start, because I want that final token to represent the last word of the movie review. But in the decoder, as you'll see, it's actually going to work out a bit better to have the padding at the end. So I say pre_pad=False.

And then finally, since we've got sentences of different lengths coming in, and they all have to be put together in a mini-batch of the same size by padding, we would much prefer that the sentences in a mini-batch already be of similar sizes, because otherwise each mini-batch is going to be as long as its longest sentence, and that's going to end up wasting time and memory. So I'm going to use the sampler trick that we learned last time: for the validation set, we're going to ask it to sort everything by length first, and for the training set, we're going to ask it to randomize the order of things, but roughly make it so that things of similar length are in about the same spot. So we've got our SortSampler and our SortishSampler. And then, at that point, we can create a model data object. Remember, a model data object really does one thing: it says, I have a training set, a validation set, and an optional test set, and it sticks them into a single object. We also have a path, so that it has somewhere to store temporary files, models, stuff like that. So we're not using fastai for very much at all in this example, just a minimal set of things to show you how to get your model data object, because in the end, once you've got a model data object, you can create a learner, and you can then call fit. So there's a minimal amount of fastai stuff here. This is a standard PyTorch-compatible dataset, and this is a standard PyTorch-compatible data loader. Behind the scenes, it's actually using the fastai version, because I do need it to do this automatic padding for convenience, so there are a few tweaks in our version that are a bit faster and a bit more convenient, and we're using the fastai samplers, but there's not too much going on here. So now we've got our model data object, and we can basically tick off number one.
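Putting the last few paragraphs together, here is a sketch of the wiring, using the 0.7-era fastai API as described; the batch size and exact argument values are assumptions on my part rather than verbatim notebook code.

```python
from fastai.text import *   # 0.7-era fastai, as used in this lesson

bs = 125                    # illustrative batch size
trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)  # roughly sorted
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))            # fully sorted

trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=val_samp)

md = ModelData(PATH, trn_dl, val_dl)   # training set + validation set in one object
```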
So, as I said, most of the work is in the architecture. The architecture is going to take our sequence of tokens and spit them into an encoder, or, in computer vision terms, what we've been calling a backbone: something that's going to try to turn these into some kind of representation. That's just going to be an RNN, and it's going to spit out the final hidden state, which for each sentence is just a vector. None of that is going to be new; it's all going to use very direct, simple techniques that we've already learned. Then we're going to take that and spit it into a different RNN, which is a decoder, and that's going to have some new stuff, because we need something that can go through one word at a time, and it's going to keep going until it thinks it's finished the sentence. It doesn't know how long the sentence is going to be ahead of time; it keeps going until it thinks it's finished the sentence, and then it stops and returns the sentence.

So let's start with the encoder. In terms of variable naming here, there are basically identical attributes for the encoder and the decoder: the encoder versions have enc, the decoder versions have dec. I always try to mention what the mnemonics are, rather than writing things out in too long a hand. So just remember: enc is encoder, dec is decoder, emb is embedding, and the final thing that comes out is out. The RNN in this case is a GRU, not an LSTM; they're nearly the same thing, so don't worry about the difference. You could replace it with an LSTM and you'd get basically the same results; to do so, simply type LSTM and you're done.

So we need to create an embedding layer, because remember, what we're being passed is the index of each word into a vocabulary, and we want to grab its fastText embedding; then, over time, we might want to fine-tune, to train that embedding end to end. So to create an embedding, we call create_emb up here, which just says nn.Embedding. It's important that you know how to set the rows and columns for your embedding: the number of rows has to be equal to your vocabulary size, so that each vocabulary item has a word vector. And how big is each embedding? Well, in this case, it was determined by fastText, and the fastText embeddings are size 300, so we have to use size 300 as well; otherwise, we can't start out using their embeddings.

Now, this is initially going to give us a random set of embeddings, so we're going to go through each one of these, and if we find it in fastText, we'll replace it with the fastText embedding. So again, something that you should already know is that a PyTorch module that is learnable has a weight attribute, the weight attribute is a variable, and variables have a data attribute, and the data attribute is a tensor. Now, you'll notice very often today I'm saying "here is something you should know", not so that you think "oh, I don't know that, I'm a bad person", but so that you think "okay, this is a concept I haven't learned yet, and Jeremy thinks I ought to know about it, so I'm going to write that down and I'm going to go home and Google it". This is a normal PyTorch attribute in every single learnable PyTorch module; this is a normal PyTorch attribute in every single PyTorch variable. If you don't know how to grab the weights out of a module, or you don't know how to grab the tensor out of a variable, it's going to be hard for you to build new things, or debug things, or maintain things, or whatever. So if I say you ought to know this, and you're thinking "I don't know this", don't run away and hide; go home and learn the thing. And if you're having trouble learning the thing, because you can't find documentation about it, or you don't understand that documentation, or you don't know why Jeremy thought it was important, jump on the forum and say, "please explain this thing; here's my best understanding of it as I have it at the moment; here are the resources I've looked at; help fill me in". And normally, if I respond, it's very likely I will not tell you the answer, but I will instead give you a problem that you could solve, which, if you solve it, will solve it for you, because I know that that way it'll be something you remember. So again, don't be put off if I say, "go read this link, try to summarize that thing, tell us what you think"; I'm trying to be helpful, not unhelpful. And if you're still not following, just come back and say, "honestly, that link you sent, I don't know what it means, I wouldn't know where to start", and I'll keep trying to help you until you fully understand it.
Okay, so now that we've got our weight tensor, we can just go through our vocabulary, look up each word in our pre-trained vectors, and if we find it, replace the random weights with that pre-trained vector. The random weights have a standard deviation of 1; our pre-trained vectors, it turned out, had a standard deviation of about 0.3. So, again, this is the kind of hacky thing I do when I'm prototyping stuff: I just multiply them by 3. Obviously, by the time you see the video of this, and we'll have put all this sequence-to-sequence stuff into the fastai library, you won't find horrible hacks like that in there, I sure hope. But hack away when you're prototyping. Some things won't be in fastText, in which case we'll just keep track of them, and I've added this print statement here just so that I can see what's going on; like, why am I missing stuff? I'll probably comment it out when I actually commit this to GitHub. That's why it's there. So we create those embeddings, and when we actually create the sequence-to-sequence RNN, it'll print out how many were missed. Remember, we had about 30,000 words, so we're not missing too many. And it's interesting what's missing: well, there's a special token for uppercase, so it's not surprising that's missing. Also, remember, these vectors were trained on words, so things like l' and d' and 's, which our tokenizer splits out, don't appear either. That's interesting; it does suggest that maybe we could have slightly better embeddings if we tried to find some that were tokenized the same way we tokenize. But that's okay.
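Here is a sketch of that create_emb step (vecs is the pickled word-to-vector dictionary and itos the vocab list from earlier; the multiply-by-3 hack is the one just described):

```python
import torch
import torch.nn as nn

def create_emb(vecs, itos, em_sz=300):
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)  # rows = vocab size, cols = 300
    wgts = emb.weight.data           # module -> weight (variable) -> data (tensor)
    miss = []
    for i, w in enumerate(itos):
        try:
            wgts[i] = torch.from_numpy(vecs[w] * 3)  # pre-trained std ~0.3, random std ~1
        except KeyError:
            miss.append(w)           # not in fastText: leave the random initialization
    print(len(miss), miss[5:10])     # prototyping print: see why we're missing stuff
    return emb
```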
Rachel? "Do we just keep the embedding vectors from training? Why don't we keep all the word embeddings, in case you have new words in the test set?" I mean, we're going to be fine-tuning them. It's an interesting idea; maybe that would work. I haven't tried it. "You could also add random embeddings for those, and at the beginning just keep them random, but it's going to have an effect, in the sense that you're going to be using those words." Yeah, I think it's an interesting line of inquiry, but I will say this: the vast majority of the time, when you're doing this in the real world, your vocabulary will be bigger than 40,000, and once your vocabulary is bigger than 40,000, using the standard techniques, the embedding layer gets so big that it takes up all your memory and all of the time in the backprop. There are tricks to dealing with very large vocabularies; I don't think we'll have time to handle them in this session, but you definitely would not want to have all three and a half million fastText vectors in an embedding layer. "So I wonder: if you're not touching a word, it's not going to change, right? Even if you are fine-tuning, you just have it in memory." It's in GPU RAM, and you've got to remember: three and a half million, times 300, times the size of a single-precision floating point, plus all of the gradients for them, even if they're never touched. Without being very careful, and adding a lot more code and stuff, it is slow and hard, and we wouldn't touch it for now. But, as I say, I think it's an interesting path of inquiry; it's the kind of path of inquiry that leads to multiple academic papers, though, not something that you do on a weekend. I think it'd be very interesting; maybe we can look at it sometime. And, as I say, I have actually started doing some stuff around incorporating large-vocabulary handling into fastai. It's not finished, but hopefully by the time we get there, this kind of stuff will be possible.

Okay, so we create our encoder embedding, add a bit of dropout, and then we create our RNN. The input to the RNN, obviously, is the size of the embedding, by definition. The number of hidden is whatever we want, so we set it to 256 for now; then however many layers we want, and some dropout inside the RNN as well. This is all standard PyTorch stuff; you could use an LSTM here as well. And then, finally, we need to turn that into some output that we're going to feed to the decoder, so let's use a linear layer to convert the number of hidden into the decoder embedding size. In the forward pass, here's how that's used: we first of all initialize our hidden state to a bunch of zeros. So we've now got a vector of zeros, and then we take our input, put it through our embedding, and put that through dropout. We then pass our currently-zeros hidden state and our embeddings into our RNN, and it spits out the usual stuff that RNNs spit out, which includes the final hidden state. We then take that final hidden state and stick it through that linear layer, so we now have something of the right size to feed to our decoder. So that's it. And again, this ought to be very familiar and very comfortable; it's about the most simple possible RNN. If it's not, go back, check out lesson six, and make sure you can write it from scratch and understand what it does. But the key thing to know is that it takes our inputs and spits out a hidden vector that, hopefully, will learn to contain all of the information about what that sentence says and how it says it. Because if it can't do that, then we can't feed it into a decoder and hope it spits out our sentence in a different language. So that's what we want it to learn to do, and we're not going to do anything special to make it learn to do that; we're just going to do the three things and cross our fingers, because that's what we do. So that h is that s: it's the hidden state. I guess Stephen used s for state, and I used h for hidden. You'd think the two Australians could agree on something like that, but apparently not.
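Here is a sketch of that encoder as a standalone module. The sizes match the description (256 hidden, fastText-sized embeddings); the dropout values are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, emb_enc, nh=256, nl=2, em_sz_dec=300):
        super().__init__()
        self.emb = emb_enc                    # the fastText-initialized embedding
        self.emb_drop = nn.Dropout(0.15)      # a bit of dropout on the embeddings
        self.gru = nn.GRU(self.emb.embedding_dim, nh, num_layers=nl, dropout=0.25)
        self.out = nn.Linear(nh, em_sz_dec)   # number of hidden -> decoder embedding size
        self.nh, self.nl = nh, nl

    def forward(self, inp):                   # inp: (seq_len, batch) of token ids
        bs = inp.size(1)
        h = torch.zeros(self.nl, bs, self.nh, device=inp.device)  # hidden state: zeros
        emb = self.emb_drop(self.emb(inp))
        _, h = self.gru(emb, h)               # the final hidden state encodes the sentence
        return self.out(h)                    # now the right size to feed to the decoder
```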
So how do we now do the new bit? The basic idea of the new bit is the same: we're going to do exactly the same thing, but we're going to write our own for loop. The for loop is going to do exactly what the for loop inside PyTorch does here, but we're going to do it manually. How big is the for loop? It's the output sequence length. Well, what is the output sequence length? It's something that got passed to the constructor, and it is equal to the length of the largest English sentence. We're going to run this for loop for the length of the largest English sentence, because we're translating into English, so we can't possibly be longer than that, at least not in this corpus. If we then used it on some different corpus that was longer, this would fail, but you could always pass in a different parameter, of course. So the basic idea is the same: we're going to go through and put it through the embedding, stick it through the RNN, stick it through dropout, and stick it through a linear layer. The basic four steps are the same. And once we've done that, we're going to append that output to a list, and when we're finished, we're going to stack that list up into a single tensor and return it. That's the basic idea.

Normally, a recurrent neural network (here is our decoder RNN) works on a whole sequence at a time, but we've got a for loop to go through each part of the sequence separately. So we have to add a leading unit axis to the start, to basically say this is a sequence of length one. We're not really taking advantage of the recurrent net much at all; we could easily rewrite this with a linear layer, actually. That would be an interesting experiment, if you wanted to try it. So we basically take our input, feed it into our embedding, add something to the front saying "treat this as a sequence of length one", and then we pass that to our RNN. We then get the output of that RNN, feed it into our dropout, and feed it into our linear layer. So there are two extra things to be aware of. Well, I guess it's really one thing, and the one thing is: what is the input to that embedding? And the answer is, it's the previous word that we translated. See how the input here is the previous word here, and the input here is the previous word here. The basic idea is, if you're about to translate, say, the fourth word of the new sentence, but you don't know what the third word you just said was, that's going to be really hard. So we're going to feed that in at each time step; let's make it as easy as possible. And what was the previous word at the start? Well, there was none, so specifically we're going to start out with a beginning-of-stream token, and the beginning-of-stream token is a zero.

And of course, we're doing a mini-batch, so we need batch-size number of them, but let's just think about one part of that batch. So we start out with a zero. We look up that zero in our embedding matrix to find out what the vector for the beginning-of-stream token is. We stick a unit axis on the front to say we have a single sequence length of beginning-of-stream token. We stick that through our RNN, which gets not only the fact that there's a zero, which is beginning of stream, but also the hidden state, which, at this point, is whatever came out of our encoder. So now its job is to try to figure out the first word: what's the first word of the translated sentence? We pop it through some dropout, and go through one linear layer to convert it into a vector of the right size, with one score for every word in our English vocabulary. We append that to our list of outputs. And now we need to figure out which word that was, because we need to feed it to the next time step.
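Here is a sketch of that decoder loop as a module (the sizes are illustrative; beginning of stream is index 0 and padding is index 1, as described):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, emb_dec, n_out, em_sz=300, nl=2, out_sl=30):
        super().__init__()
        self.emb = emb_dec
        self.gru = nn.GRU(em_sz, em_sz, num_layers=nl, dropout=0.1)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz, n_out)   # one score per word in the English vocab
        self.out_sl = out_sl                 # length of the longest English sentence

    def forward(self, h, bs):
        dec_inp = torch.zeros(bs).long()     # start with beginning-of-stream tokens (0)
        res = []
        for i in range(self.out_sl):
            emb = self.emb(dec_inp).unsqueeze(0)   # leading unit axis: a sequence of length 1
            outp, h = self.gru(emb, h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = outp.data.max(1)[1]    # index of the highest-probability word
            if (dec_inp == 1).all(): break   # everything in the batch is padding: done
        return torch.stack(res)              # (seq_len, batch, vocab) of scores
```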
So remember what we actually output here, and look at it: use a debugger, put a breakpoint here. What is outp? outp is a tensor whose length is equal to the number of words in our English vocabulary, and it contains a probability for every one of those words, that it is that word. (Before you look it up in the debugger, try to figure it out from first principles and check that you're right.) Does that make sense? Then, if we now say outp.data.max, that looks in the tensor to find out which word has the highest probability. max in PyTorch returns two things: the first is the max probability itself, and the second is the index into the array of that max probability. So we want that second item, index number one, which is the word index of the largest thing. So now that contains the word, or rather, the index into our vocabulary of the word. If it's a 1 (you might remember 1 was padding), then that means we're done; that means we've finished, because we've ended with a bunch of padding. If it's not 1, we go back and continue. Now dec_inp is whatever the highest-probability word was. So we keep looping through, either until we get to the largest length of a sentence, or until everything in our mini-batch is padding. And each time, we've appended our outputs, not the word, but the probabilities, to this list, which we stack up into a tensor, and we can now go ahead and feed that to a loss function.

So before we go to a break, since we've done one and two, let's do number three, which is the loss function. The loss function is categorical cross-entropy loss: we've got a list of probabilities for each of our classes, where the classes are all the words in our English vocab, and we have a target, which is the correct class, i.e., the correct word at this location. There are two tweaks, which is why we need to write our own little loss function, but you can see that basically it's going to be cross-entropy loss. The tweaks are as follows. Tweak number one is that we might have stopped a little bit early, so the sequence length we generated may be different from the sequence length of the target, in which case we need to add some padding. PyTorch's padding function is weird: if you have a rank-3 tensor, which we do (sequence length by batch size by number of words in the vocab), a rank-3 tensor requires a 6-tuple, where each pair of things in that tuple is the padding before, and then the padding after, a dimension. So in this case, the first pair has no padding, the second pair has no padding, and the third pair has no padding at the start, with as much padding as is required at the end. It's good to know how to use that function. Now that we've added whatever padding is necessary, the only other thing we need to do is this: cross-entropy loss expects a rank-2 tensor, a matrix, but we've got sequence length by batch size. So let's just flatten out the sequence length and batch size; that's what that -1 in view does. Flatten that out for both of them, and now we can go ahead and call cross-entropy. That's it.
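Here is a sketch of that loss; aside from the alignment tweaks just described, it is plain cross-entropy.

```python
import torch.nn.functional as F

def seq2seq_loss(inp, targ):
    sl_t, bs = targ.size()                # target: seq_len x batch
    sl_in, bs_in, nc = inp.size()         # predictions: seq_len x batch x vocab
    if sl_t > sl_in:                      # we stopped early: pad the predictions
        # F.pad's 6-tuple starts from the last dimension, so this pads the
        # sequence dimension at the end
        inp = F.pad(inp, (0, 0, 0, 0, 0, sl_t - sl_in))
    inp = inp[:sl_t]                      # or we ran long: truncate to the target length
    return F.cross_entropy(inp.view(-1, nc), targ.view(-1))
```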
So this is the easiest way to turn a PyTorch module into a fastai model. Here's the model data object we created before. We could then just call Learner to turn that into a learner, but if we call RNN_Learner: RNN_Learner is a Learner which defines cross-entropy as its default criterion. In this case we're overwriting that anyway, so that's not what we care about, but it does add in these save_encoder and load_encoder things that can be handy sometimes. So in this case we really could have just said Learner, but RNN_Learner also works. That's how we turn our PyTorch module into a fastai model, into a learner. And once we have a learner, we give it our new loss function, and then we can call lr_find, and we can call fit, and it runs for a while, and we can save it. All the normal learner stuff now works. Remember, the model attribute of a learner is a standard PyTorch model, so we can pass that some x, which we can grab out of our validation set (or you could use learn.predict_array or whatever you like) to get some predictions. And then we can convert those predictions into words by calling .max(1) to grab the index of the highest-probability words. And then we can go through a few examples and print out the French, the correct English, and the predicted English for things that are not padding. And here we go. So amazingly enough, this simplest-possible, written-largely-from-scratch PyTorch module, on only 50,000 sentences, is sometimes capable on a validation set of giving you exactly the right answer; sometimes the right answer in slightly different wording; and sometimes sentences that aren't grammatically sensible or even have too many question marks. So we're well on the right track, I think you would agree. Even the simplest possible seq-to-seq, trained for a very small number of epochs, without any pre-training other than the use of word embeddings, is surprisingly good. So I think the message here (and we're going to improve this in a moment, after the break) is that even sequence-to-sequence models that you would think are too simple to possibly work, with less data than you'd think you could learn from, can be surprisingly effective, and in certain situations this may even be enough for your needs. We're going to learn a few tricks after the break which will make this much better. Let's come back at 7:50. So one question that came up during the break is that some of the tokens missing in fastText had, for example, a curly quote rather than a straight quote, and the question was: would it help to normalize punctuation? And the answer for this particular case is probably yes; the difference between curly quotes and straight quotes isn't really semantic. You do have to be very careful, though, because it may turn out that people who use beautiful curly quotes like using more formal language, and they're actually writing in a different way. So if you're going to do some kind of pre-processing like punctuation normalization, you should definitely check your results with and without, because nearly always that kind of pre-processing makes things worse, even when I'm sure it won't. Another question: what may be some ways of regularizing these sequence-to-sequence models, besides dropout and weight decay? Let me think about that during the week. AWD-LSTM, which we've been relying on a lot, has so many great... I mean, it's all dropout.
Well, not all dropout: there's dropout of many different kinds, and then there's also (we haven't talked about it much) a kind of regularization based on activations, and on changes in activations, as well. I just haven't seen anybody put anything like that amount of work into regularization of sequence-to-sequence models, and I think there's a huge opportunity for somebody to do the AWD-LSTM of seq-to-seq, which might be as simple as stealing all the ideas from AWD-LSTM and using them directly in seq-to-seq. That would be pretty easy to try, I think. And there's been an interesting paper that Stephen Merity put out in the last couple of weeks, where he used an idea which, I don't know if he stole it from me, but it was certainly something I had also recently done and talked about on Twitter (either way, I'm thrilled that he's done it), which was to take all of those different AWD-LSTM hyperparameters, train a bunch of different models, and then use a random forest to find out, with feature importance, which ones actually matter the most, and then figure out how to set them. So I think you could totally use this approach to figure out, for sequence-to-sequence regularization approaches, which ones are best, and optimize them. That would be amazing. But at the moment, I don't know that there are additional ideas for sequence-to-sequence regularization that I can think of beyond what's in that paper for regular language model stuff, and probably all those same approaches would work. Okay, so: tricks. Trick number one: go bidirectional. For classification, the approach to bidirectional that I've suggested you use is to take all of your token sequences, spin them around, train a new language model, and train a new classifier. And I also mentioned the WikiText pre-trained model: if you replace fwd with bwd in the name, you'll get the pre-trained backward model I created for you, so you can use that. Get a set of predictions, and then average the predictions just like a normal ensemble. That's how we do bidirectional for that kind of classification. There may be ways to do it end to end, but I haven't quite figured them out yet; they're not in fastai yet, and I don't think anybody's written the paper about them yet. So if you figure it out, that's an interesting line of research. But because here we're not doing massive documents, where we have to chunk them into separate bits and pool over them, we can do bidirectional very easily in this case: it's literally as simple as adding bidirectional=True to our encoder. People tend not to do bidirectional for the decoder, I think partly because it's kind of considered cheating. Though, I don't know; I was just talking to somebody at the break about it, and maybe it can work in some situations, although it might need to be more of an ensembling approach in the decoder, because it's a bit less obvious there. Anyway, in the encoder it's very, very simple: bidirectional=True. With bidirectional=True, rather than just having an RNN going in this direction, we have a second RNN going in this direction. That second RNN literally visits each token in the opposite order, so when we get its final hidden state, it's here rather than here. But the hidden state is of the same size, so the final result is that we end up with a tensor that's got an extra axis of length two.
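In code, the change is tiny. Here's a sketch of the encoder side, with a factor of two wherever the doubled state is consumed (sizes and names are illustrative, not the exact notebook):

```python
import torch
import torch.nn as nn

class BidirEncoder(nn.Module):
    """Sketch: the encoder half of the seq2seq model, made bidirectional."""
    def __init__(self, vocab_sz, em_sz_enc, nh, nl, em_sz_dec):
        super().__init__()
        self.nh, self.nl = nh, nl
        self.emb_enc = nn.Embedding(vocab_sz, em_sz_enc)
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl,
                              dropout=0.25, bidirectional=True)
        # nh * 2: the hidden state now holds both directions
        self.out_enc = nn.Linear(nh * 2, em_sz_dec, bias=False)

    def initHidden(self, bs):
        # one initial hidden state per layer per direction: nl * 2 of them
        return torch.zeros(self.nl * 2, bs, self.nh)
```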
And depending on what library you use, often that will then be combined with the number-of-layers dimension, so if you've got two layers and bidirectional, that tensor dimension is now of length four. With PyTorch, it kind of depends which bit of the process you're looking at as to whether you get a separate result for each layer and for each direction. You have to look up the docs, and they'll tell you the input and output tensor sizes appropriate to the number of layers and whether you have bidirectional=True. In this particular case, you'll basically see all the changes I've had to make. For example, when I added bidirectional=True, my linear layer now needs number-of-hidden times two, to reflect the fact that we have that second direction in our hidden state. And in initHidden, it's now self.nl times two. So you'll see there are a few places where an extra two has had to be thrown in. Yes, Yannet? Why is making the decoder bidirectional considered cheating? Well, it's not just that it's cheating; it's that we have this loop going on, you know? It's not as simple as just having two tensors, and then: how do you turn those two separate loops into a final result? After talking about it during the break, I've kind of gone from "hey, everybody knows it doesn't work" to "oh, maybe it kind of could work, but it requires more thought". It's quite possible that during the week I'll realize it's a dumb idea and I was being stupid, but we'll think about it. Another question people had: why do you need to have an end to that loop, a range, rather than just looping until it's done? Oh, I see. It's because when I start training, everything's random, so this break condition would probably never be true. Later on, it will pretty much always break out eventually, but basically, without the range we'd go on forever. It's really important to remember, when you're designing an architecture, that when you start, the model knows nothing about anything. So you want to make sure it's going to do something at least vaguely sensible. Okay, so bidirectional. Let's see how we go here: we got to a cross-entropy loss of 3.58 with a single direction, and with bidirectional it's down to 3.51. So that improved it a bit; that's good. And as I say, it shouldn't really slow things down too much. Bidirectional does mean there's a little bit more sequential processing that has to happen, but it's generally a good win. In the Google translation model, of the eight layers, only the first layer is bidirectional, because that allows it to do more in parallel. So if you create really deep models, you may need to think about which ones are bidirectional; otherwise you'll have performance issues. Okay, so 3.51. Now let's talk about teacher forcing. Teacher forcing comes back to this idea that when the model starts learning, it knows nothing about anything. So when the model starts learning, it's not going to spit out the right word at this point; it's going to spit out some random meaningless word, because it doesn't know anything about German, or about English, or about the idea of language, or anything. And then it's going to feed that down here as an input, and it'll be totally unhelpful.
And so that means that early learning is going to be very, very difficult, because it's feeding an input that's stupid into a model that knows nothing, and somehow it's going to get better. It's not asking too much (eventually it gets there), but it's definitely not being as helpful as we can be. So what if, instead of feeding in the thing I predicted just now, we feed in the actual correct word it was meant to be? Now, we can't do that at inference time, because by definition we don't know the correct word; we've been asked to translate it. We can't require a correct translation in order to do translation. So the way I've set this up is I've got this thing called pr_force, which is the probability of forcing. And if some random number is less than that probability, then I'm going to replace my decoder input with the actual correct thing. And if we've already gone too far (if it's already longer than the target sentence), I'm just going to stop, since obviously I can't give it the correct thing. So you can see how beautiful PyTorch is for this. If you tried to do this with some static graph thing, like classic TensorFlow... well, I tried it. One of the key reasons we switched to PyTorch at this exact point in last year's class was that I tried to implement teacher forcing in Keras and TensorFlow and went even more insane than when I started. It was weeks of getting nowhere. And then on Twitter (I think it was Andrej Karpathy) I saw somebody announce this thing called PyTorch that had just come out and was really cool. I tried it that day, and by the next day I had teacher forcing working. I was like, oh my gosh. All the stuff of trying to debug things was suddenly so much easier, and this kind of dynamic stuff is so much easier. So this is a great example of: hey, I get to use random numbers and if statements and stuff. So here's the basic idea: at the start of training, let's set pr_force really high, so that nearly always it gets the actual correct previous word, and so it has a useful input. And then as I train a bit more, let's decrease pr_force, so that by the end pr_force is zero and it has to learn properly, which is fine, because by then it's actually feeding in sensible inputs most of the time anyway. So let's now write something such that, in the training loop, it gradually decreases pr_force. How do we do that? Well, one approach would be to write our own training loop, but let's not do that, because we already have a training loop that has progress bars, uses exponentially weighted averages to smooth out the losses, keeps track of metrics, and does a bunch of things which are not rocket science but are kind of convenient. It also keeps track of things like calling reset for RNNs at the start of an epoch, to make sure the hidden state is set to zeros; little things like that we'd rather not have to write from scratch. So what we've tended to find is that as I start to write some new thing and realize I need to replace some part of the code, I'll add some little hook, so that we can all use that hook to make things easier. In this particular case, there's a hook that I've ended up using all the damn time now, which is the hook called the stepper.
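Before we get to the stepper, here's roughly what that pr_force test looks like inside the decoder loop. A sketch, not the verbatim notebook:

```python
import random

# Inside forward(): with probability pr_force, replace the decoder's next
# input with the ground-truth previous word y[i] instead of our own guess.
for i in range(self.out_sl):
    emb = self.emb_dec(dec_inp).unsqueeze(0)
    outp, h = self.gru_dec(emb, h)
    outp = self.out(self.out_drop(outp[0]))
    res.append(outp)
    dec_inp = outp.data.max(1)[1]                     # our own predicted word
    if (dec_inp == 1).all(): break                    # all padding: we're done
    if (y is not None) and (random.random() < self.pr_force):
        if i >= len(y): break                         # past the end of the target
        dec_inp = y[i]                                # the actual correct word
```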
And so if you look at our code, model.py is where our fit function lives. The fit function in model.py is, I think, the lowest-level thing that doesn't require a learner; it doesn't really require anything much at all. It just requires a standard PyTorch model and a model data object, plus how many epochs, a standard PyTorch optimizer, and a standard PyTorch loss function. We've hardly ever used it in the class; we normally call learn.fit, but learn.fit calls this. So this is our lowest-level thing, but we've looked at the source code here sometimes, and we've seen how it loops through each epoch, loops through each batch, and calls stepper.step. stepper.step is the thing that's responsible for calling the model, calculating the loss with the loss function, and calling the optimizer. By default, it uses a particular class called Stepper, which has a few bells and whistles, but basically it calls the model (so the model ends up inside self.m), zeroes the gradients, calls the loss function, calls backward, does gradient clipping if necessary, and then calls the optimizer. Those are the basic steps that, back when we looked at PyTorch from scratch, we had to do by hand. The nice thing is that we can replace that class, rather than replacing the whole training loop. If you inherit from Stepper and write your own version of step, you can just copy and paste the contents of step and add whatever you like. Or, if it's something you're going to do before or afterwards, you could even call super().step. In this case, I rather suspect I've been unnecessarily complicated here: I probably could have commented out all of that and just said super().step(xs, y, epoch), because I think this is an exact copy of everything. But, as I say, when I'm prototyping I don't think carefully about how to minimize my code. I copied and pasted the contents of step, and I added a single line to the top, which replaces pr_force in my module with something that gradually decreases linearly over the first 10 epochs; after 10 epochs, it's zero. A total hack, but good enough to try it out. And the nice thing is that everything else is the same. I've added these three lines of code to my module, and the only other thing I do differently is that when I call fit, I pass in my customized stepper class. So that's going to do teacher forcing, and we don't have bidirectional here, so we're just changing one thing at a time. We should compare this to our unidirectional result, which was 3.58, and this is 3.49, so that was an improvement. Great. I needed to make sure I did at least 10 epochs, because before that it was cheating by using teacher forcing. Okay, so that's good; that's an improvement.
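Here's a sketch of that customized stepper, written the simpler way, calling super().step as suggested above (the Stepper import path is an assumption for the 2018-era fastai library):

```python
from fastai.model import Stepper

class Seq2SeqStepper(Stepper):
    """Anneal teacher forcing linearly to zero over the first 10 epochs."""
    def step(self, xs, y, epoch):
        self.m.pr_force = (10 - epoch) * 0.1 if epoch < 10 else 0
        return super().step(xs, y, epoch)

# then pass it in when fitting, e.g.:
# learn.fit(lr, 1, cycle_len=12, stepper=Seq2SeqStepper)
```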
So we've got another trick, and this next one is a bigger trick. It's a pretty cool trick, and it's called attention. The basic idea of attention is this: expecting the entirety of the sentence to be summarized into this single hidden vector is asking a lot. It has to know what was said, and how it was said, and everything necessary to create the sentence in German. The idea of attention is basically: maybe we're asking too much. Particularly because we could use this form of the model, where we output at every step of the loop, to have not just a hidden state at the end, but a hidden state after every single word. Why not try to use that information? It's already there, but so far we've just been throwing it away. And not only that: with bidirectional, we've got two vectors of state at every step that we can use. So how could we use this piece of state, this piece, this piece, this piece, and this piece, rather than just the final state? The basic idea is: let's say I'm translating this word right now. Which of these five pieces of state do I want? Well, actually, let's pick a more interesting word; let's pick this one. If I'm trying to do "loved", then clearly the hidden state I want is this one, because that's where the word is. And then for this little word here (is it a preposition? No, I guess it's part of the verb), I'd probably need this and this and this, to make sure that I've got the tense right and know that I actually need this part of the verb, and so forth. So depending on which bit I'm translating, I'm going to need one or more of these various hidden states. And in fact, I probably want some weighting of them: for what I'm doing here, I probably mainly want this state, but maybe a little bit of that one, and a little bit of that one. In other words, over these five pieces of hidden state, we want a weighted average, weighted by something that can figure out which bits of the sentence are most important right now. So how do we figure out which bits of the sentence are important right now? We create a neural net, and we train that neural net to figure it out. When do we train that neural net? End to end. So let's now train two neural nets. Well, we've actually already got a bunch: we've got an RNN encoder, an RNN decoder, a couple of linear layers. What the hell, let's add another neural net into the mix. This neural net is going to spit out a weight for every one of these states, and we're going to take the weighted average at every step. It's just another set of parameters that we learn, all at the same time. And that's called attention. So the idea is that once that attention has been learned, we can see it in this terrific demo from Chris Olah and Shan Carter: each different word takes a different weighted average. See how the weights are different depending on which word is being translated? The deepness of the blue shows how much weight it's putting on each source word. So when we say "European", we need to know that both of these two parts are going to be influential, and if we're doing "economic", these three parts are, including the gender of the definite article, and so forth. So check out this distill.pub article; these are all nice little interactive diagrams. It basically shows you how attention works, and what attention actually looks like in a trained translation model. So let's try to implement attention.
So with attention, this is all identical: the encoder is identical, and all of this bit of the decoder is identical. There's one difference, which is (let's see where it happens; here we go) that we're basically going to take a weighted average, and the way we take the weighted average is by creating a little neural net, which we see here and here, and then using softmax. The nice thing about softmax is that we want to ensure all of the weights we're using add up to one, and we also expect that one of those weights should probably be quite a bit higher than the others. Softmax gives us the guarantee that they add up to one, and because it's got e to the something in it, it tends to encourage one of the weights to be higher than the others. So let's see how this works. What's going to happen is, we're going to take the last layer's hidden state and stick it into a linear layer, then into a nonlinear activation, and then do a matrix multiply. And if you think about it: linear layer, nonlinear activation, matrix multiply. That's a neural net; a neural net with one hidden layer. Stick that into a softmax, and then we can use it to weight our encoder outputs. So now, rather than just taking the last encoder output, we've got the whole tensor of all of the encoder outputs, which we weight by this little neural net that we created. And that's basically it. So what I'll do is put a couple of papers to check out on the wiki thread. There was basically one amazing paper that originally introduced this idea of attention, and I say amazing because it actually introduced a couple of key things which have really changed how people work in this field. This idea of attention has been used not just for text, but for things like reading text out of pictures, and various computer vision tasks, and so on. And then there's a second paper, which Geoffrey Hinton was involved in, called "Grammar as a Foreign Language", which used this idea of RNNs with attention to try to replace rules-based grammar with an RNN that basically tagged each word grammatically, and it turned out to do it better than any rules-based system. Today that actually seems kind of obvious; I think we're now used to the idea that neural nets do lots of this stuff better than rules-based systems, but at the time it was considered really surprising. Anyway, one nice thing is that their summary of how attention works is really nice and concise. "Can you please explain attention again?" Sure. I like that nice, crisp request; it's very easy to understand. Okay, let's go back and look at our original encoder. An RNN spits out two things: a list of the state after every time step, and the state at the last time step. And we used the state at the last time step to create the input state for our decoder, which is what we see here: one vector. But we know that it's actually creating a vector at every time step, so wouldn't it be nice to use them all? And wouldn't it be nice to use the one that's most relevant to translating the word I'm translating right now?
So wouldn't it be nice to be able to take a weighted average of the hidden state at each time step, weighted by whatever is the appropriate weight right now? For example, in this case, "liebte" would definitely be time step number two: that's what it's all about, because that's the word I'm translating. So how do we get a list of weights that is suitable for the word we're translating right now? Well, the answer is: by training a neural net to figure out that list of weights. And any time we want to figure out how to train a little neural net that does some task, the easiest way is nearly always to include it in your module and train it in line with everything else. The minimal possible neural net is something that contains two layers and one nonlinear activation function. So here is one linear layer. And in fact, instead of a linear layer, we can even just grab a random matrix, if we don't care about bias; so here's a random matrix. It's just a random tensor wrapped up in a Parameter. A Parameter, remember, is basically identical to a PyTorch variable; it just tells PyTorch, "I want you to learn the weights for this, please". So here we've got a linear layer, and here we've got a random matrix. And at this point, where we start out our decoder, let's take the current hidden state of the decoder and put that into the linear layer. Because what's the information we use to decide which words to focus on next? Well, the only information we have to go on is the decoder's current hidden state. So let's grab that, put it into the linear layer, put it through a nonlinearity, put it through one more layer (this one actually doesn't have a bias in it, so it's just a matrix multiply), put that into a softmax, and that's it: a little neural net. It doesn't do anything; no neural net does anything by itself. They're just linear layers with nonlinear activations and random weights. But it starts to do something if we give it a job to do. And in this case, the job we give it is: don't just take the final state, but use all of the encoder states, and take all of them and multiply them by the output of that little neural net. Given that the things in this little neural net are learnable weights, hopefully it's going to learn to weight those encoder hidden states by something useful. That's all a neural net ever does: we give it some random weights to start with and a job to do, and hope that it learns to do the job. And it turns out that, in the end, it does. So everything else in here is identical to what it was before: we've got teacher forcing, and it's not bidirectional.
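A sketch of that attention step in code: W1 and V are the learned random parameter tensors (nn.Parameter), l2 is the linear layer, enc_out is the full stack of encoder hidden states, and h is the decoder's current hidden state. Names approximate the lesson notebook but aren't verbatim:

```python
import torch
import torch.nn.functional as F

# Inside each decoder step:
w1e = enc_out @ self.W1                  # encoder states -> features (can be precomputed)
w2h = self.l2(h[-1])                     # decoder's current hidden state -> features
u = torch.tanh(w1e + w2h)                # linear layers + nonlinearity: the tiny net
a = F.softmax(u @ self.V, dim=0)         # one weight per input word, summing to one
Xa = (a.unsqueeze(2) * enc_out).sum(0)   # the weighted average of encoder states
```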
So let's see how this goes. You can see (actually, yes, here we are) we're using teacher forcing, not bidirectional. Teacher forcing had 3.49, and now we've got nearly exactly the same thing, except this little minimal neural net is figuring out what weightings to give our inputs. And wow, now it's down to 3.37. Remember, these things are logs, so e to the power of this is quite a significant change. So, 3.37; let's try it out. Not bad, right? "Where are they located?" "What are their skills?" "What do you do?" They're still not perfect ("Why or why not?"), but quite a few of them are correct. And again, considering that we're asking it to learn about the very idea of language for two different languages, and how to translate between the two, and grammar, and vocabulary, and we only have 50,000 sentences and a lot of the words only appear once, I would say this is actually pretty amazing. Yes, Yannet? Why do we use tanh instead of ReLU for the attention mini-net? I don't quite remember; it's been a while since I looked at it. You should totally try using ReLU and see how it goes. Obviously, with tanh, the key difference is that it can go in each direction, and it's limited at both the top and the bottom. I know that very often, for the gates inside RNNs and LSTMs and GRUs, tanh works out better, but it's been about a year since I actually looked at that specific question, so I'll look at it during the week. The short answer is: you should try a different activation function and see if you can get a better result. I'd be interested to hear what you find out. So what we can also do is actually grab the attentions out of the model. I added this return-attention flag here; see it here in my forward? You can put anything you like in forward. So I added a return-attention parameter, False by default, because obviously the training loop doesn't know anything about it. Then I just had something here saying: if return-attention, stick the attentions on as well, where the attentions are simply that value a, chucked into a list at each step. So we can now call the model with return-attention set to true, and get back the probabilities and the attentions, which means that as well as printing these out, we can draw pictures of the attention at each time step. And you can see that at the start, the attention is all on the first word, then the second word, then the third word, then a couple of different words. This is just for one particular sentence. So this is the equivalent: when you're Chris Olah and Shan Carter, you make things that look like this; when you're Jeremy Howard, the exact same information looks like this. But it's the same thing. Just pretend that it's beautiful. So you can see, basically, that at each different time step we've got a different attention. And it's really important when you try to build something like this, because you don't really know if it's not working. If it's not working (and, as per usual, my first 12 attempts at this were broken, in the sense that it wasn't really learning anything useful), then it's basically giving equal attention to everything, and so it isn't worse; it just isn't better, or isn't much better. So until you actually find ways to visualize the thing, in a way where you know what it ought to look like ahead of time, you don't really know if it's working. It's really important that you try to find ways to check your intermediate steps and your outputs.
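A sketch of that return-attention hack (the flag name ret_attn is an assumption): collect each step's softmax weights in a list, and hand them back alongside the predictions when asked:

```python
import torch

def forward(self, inp, y=None, ret_attn=False):
    res, attns = [], []
    # ... the usual encoder pass and decoder loop go here, appending each
    # step's output to res and its attention weights a to attns ...
    res = torch.stack(res)
    if ret_attn:
        res = res, torch.stack(attns)   # also hand back the per-step weights
    return res

# usage (sketch): probs, attns = learn.model(x, ret_attn=True)
```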
Yes, Yannet? People are asking: what is the loss function for the attentional neural network? There's no separate loss function for the attentional neural network; it's trained end to end. It's just sitting here inside our decoder loop, and the loss function for the decoder loop is exactly the same as before: the result contains the same outputs, the probabilities of the words, so it's the same loss function. So how come the little mini neural net learns something? Well, because in order to make the outputs better and better, it would be great if it made the weights of that weighted average better and better. So part of creating our output is: please do a good job of finding a good set of weights, and if it doesn't do a good job of finding a good set of weights, the loss function won't improve from that bit. End-to-end learning means you throw everything you can into one loss function, and the gradients of all the different parameters point in a direction that says, basically: hey, if you had put more weight over there, it would have been better. And thanks to the magic of the chain rule, it then knows: oh, it would have put more weight over there if this parameter in this matrix multiply were changed a little bit over here. So that's the magic of end-to-end learning. It's a very understandable question, how this little mini neural net learns anything, but you've got to realize there's nothing in this code that says "this particular bit is a separate little mini neural network", any more than the GRU is a separate little neural network, or this linear layer is a separate little function. It all ends up pushed into one output (a bunch of probabilities) which ends up in one loss function that returns a single number saying this either was or wasn't a good translation. And thanks to the magic of the chain rule, we then back-propagate little updates to all the parameters to make them all a little bit better. This is a big, weird, counterintuitive idea, and it's totally okay if it's a bit mind-bending. It's the same thing as back in lesson one: how did we make it find dogs versus cats? We didn't. All we did was say: this is our data, this is our architecture, this is our loss function; please back-propagate into the weights to make them better. And after you've made them better for a while, it'll start finding cats and dogs. Just, in this case, rather than using somebody else's convolutional network architecture, we've said: here's a custom architecture which we hope is going to be particularly good at this problem. And even without this custom architecture, it was still okay; but when we made it in a way that makes more sense, given what we think it ought to do, it worked even better. At no point did we do anything different other than say: here's data, here's an architecture, here's a loss function; go and find the parameters, please. And it did, because that's what neural nets do. Okay, so that is sequence-to-sequence learning. If you want to encode an image using a CNN backbone of some kind, and then pass that into a decoder which is an RNN with attention, and you make your y values the actual correct captions for each of those images, you will end up with an image caption generator. If you do the same thing with videos and captions, you'll end up with a video caption generator. If you do the same thing with 3D CT scans and radiology reports, you'll end up with a radiology report generator. If you do the same thing with GitHub issues and people's chosen summaries of them, you'll get a GitHub issue summary generator. Seq-to-seq models: I agree, they seem magical, but they work.
You know, I don't feel like people have begun to scratch the surface of how to use seq-to-seq models in their own domains. Not being a GitHub person, it would never have occurred to me that it would be kind of cool to start with some issue and automatically create a summary. But now I'm like: of course. Next time I go to GitHub, I want to see a summary written there for me. I don't want to write my own damn commit message. Why should I write my own summary of a code review, when I've finished adding comments to lots of lines? It should do that for me as well. Now I'm thinking: GitHub is so behind; it could be doing this stuff. So what are the things in your industry where you could start with a sequence and generate something from it? I can't begin to imagine. Again, it's a fairly new area, and the tools for it are not easy to use; they're not even built into fastai yet, as you can see (hopefully they will be soon), and I don't think anybody knows what the opportunities are. Okay, so I've got good news and bad news. The bad news is we have 20 minutes to cover a topic which, in last year's course, took a whole lesson. The good news is that when I went to rewrite this using fastai and PyTorch, I ended up with almost no code, so all of the stuff that made it hard last year is basically gone now. We're going to do something that brings together, for the first time, the two little worlds we've focused on: text and images. And this idea came up in a paper by an extraordinary deep learning practitioner and researcher named Andrea Frome, who was at Google at the time. Her basic crazy idea was to say: words can have a distributed representation, a space (which, particularly at that time, really just meant word vectors), and images can be represented in a space too. In the end, if we have a fully connected layer, they end up as a vector representation. Could we merge the two? Could we somehow encourage the vector space that the images end up in to be the same vector space that the words are in? And if we could do that, what would that mean? What could we do with it? Well, "what could we do with that" covers things like: what if I'm wrong? What if I'm looking at an image of a beagle, and I predict jumbo jet, and Yannet's model predicts corgi? The normal loss function says that Yannet's and Jeremy's models are equally good, i.e. they're both wrong. But what if we could somehow say: no, you know what, corgi is closer to beagle than it is to jumbo jet, so Yannet's model is better than Jeremy's? We should be able to say that, because in word vector space, beagle and corgi are pretty close together, but jumbo jet, not so much. So it would give us a nice situation where, hopefully, our inferences would be wrong in saner ways, when they're wrong. It would also allow us to search for things that don't have an ImageNet synset ID, i.e. a category in ImageNet, like dog and cat. Why did I have to train a whole new model to find dogs versus cats, when we already had something that found corgis and tabbies? Why can't I just say "find me dogs"? Well, if I had trained it in word vector space, I totally could.
Because there's now a word vector for it, I can find the things with the nearest image vectors, and so forth. So we'll look at some cool things we can do with it in a moment, but first let's train a model. This model is not learning a category, a one-hot-encoded ID, where every category is equally far from every other category; instead, let's train a model where the dependent variable is a word vector. Which word vector? Well, obviously the word vector for the word you want. So if it's a corgi, let's train it to create the corgi word vector, and if it's a jumbo jet, let's train it with a dependent variable that says: this is the word vector for jumbo jet. As I said, it's now shockingly easy. Let's grab the fastText word vectors again and load them in; we only need English this time. And here's an example: the word vector for "king" is just 300 numbers. So, for example, little-j jeremy and big-J Jeremy have a correlation of 0.60. I don't like bananas at all, so this is good: banana and Jeremy, 0.14. So words that you would expect to be correlated are correlated, and words that should be as far away from each other as possible are, unfortunately, still slightly correlated, but not so much. So let's now grab all of the ImageNet classes, because we actually want to know which one's corgi and which one's jumbo jet. We've got a list of all of those up on fast.ai, so we can grab them; and let's also grab a list of all of the nouns in English, which I've made available there as well. So here are the names of each of the 1,000 ImageNet classes, and here are all of the nouns in English according to WordNet, which is a popular tool for representing what words are and are not. So we can now go ahead and load that list of nouns, load the list of ImageNet classes, and turn it into a dictionary. These are the class IDs for the 1,000 ImageNet classes that are in the competition dataset; there are 1,000 of them. Here's an example: the first one, n01..., is a tench, which apparently is a kind of fish. Let's do the same thing for all those WordNet nouns, and you can see it turns out that ImageNet is using WordNet class names, which makes it nice and easy to map between the two. WordNet's most basic thing is an entity, and that includes an abstraction, and a physical entity can be an object, and so forth. So these are our two worlds: we've got the ImageNet 1,000, and we've got the 82,000 that are in WordNet. We want to map the two together, which is as simple as creating a couple of dictionaries to map them, based on the synset ID (the WordNet ID). And it turns out that 49,469... let's see what we've got here: syn2wv, synset-to-word-vector. Okay, so what I need to do now is grab the 82,000 nouns in WordNet and try to look them up in fastText, and I've managed to look up 49,000 of them. So I've now got a dictionary that goes from synset ID (which is what WordNet calls them) to word vectors. That's what this syn2wv dictionary is. And I've also got the same thing specifically for the 1,000 ImageNet classes. So save them away; that's fine.
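A hedged sketch of those dictionaries: ft_vecs (a word-to-vector mapping from fastText), all_nouns (WordNet synset ID to noun) and classids (the 1,000 ImageNet synset IDs) are assumed names, not the exact notebook variables:

```python
# synset ID -> 300-d fastText word vector, for every WordNet noun we can find
syn2wv = {sid: ft_vecs[w.lower()] for sid, w in all_nouns.items()
          if w.lower() in ft_vecs}                  # ~49,469 of the 82,000 nouns

# and the same thing restricted to the 1,000 ImageNet classes
class2wv = {sid: syn2wv[sid] for sid in classids if sid in syn2wv}
```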
Now I grab all of ImageNet, which you can actually download from Kaggle now: if you look up the Kaggle ImageNet localization competition, that contains the entirety of the ImageNet classifications as well. It's got a validation set of 28,650 items in it. So for every image in ImageNet, I can use that synset-to-word-vector dictionary to grab its fastText word vector, and stick that into this image-vectors array, stack it all up into a single matrix, and save it away. So now, for every ImageNet image, I've also got the fastText word vector it's associated with, just by looking up the synset ID, going to WordNet, then going to fastText, and grabbing the word vector. And here's a cool trick: I can now create a model data object, specifically an ImageClassifierData object, with this thing called from_names_and_array. I'm not sure if we've used it before, but we can pass it a list of file names (all of the file names in ImageNet) and an array of our dependent variables (all of the fastText word vectors). Then I can pass in the validation indexes, which in this case are just all of the last IDs; I need to make sure they're the same ones ImageNet normally uses, otherwise I'll be cheating. And then I pass in continuous=True, which rather puts the lie to the name ImageClassifierData, because this is now really image regressor data. continuous=True means: don't one-hot encode my outputs, but treat them as continuous values. So now I've got a model data object that contains all of my file names, and for every file name, a continuous array representing its word vector. So I have an x, I have a y: I have data. Now I need an architecture and a loss function, and once I've got those, I should be done. So let's create an architecture. We'll revise this next week, but basically we can use the tricks we've learnt so far, and it's actually incredibly simple. fastai has a ConvnetBuilder, which is what gets called when you say ConvLearner.pretrained, and you basically say: what architecture do you want (we're going to use ResNet-50); how many classes do you want, which in this case isn't really classes, it's how many outputs you want, which is the length of the fastText word vector, 300. Obviously it's not multi-class classification; it's not classification at all. Is it regression? Yes, it is regression. And then you can say what fully connected layers you want: I'm just going to add one fully connected hidden layer, of length 1,024. Why 1,024? Well, the last layer of ResNet-50 is, I think, 1,024 long, and the final output I need is 300 long. I obviously need my penultimate layer to be longer than 300, otherwise there's not enough information, so I just picked something a bit bigger. Maybe different numbers would be better, but this worked for me. How much dropout do you want? I found that with the default dropout I was consistently underfitting, so I decreased the dropout from 0.5 to 0.2. And so this is now a convolutional neural network that doesn't have any softmax or anything like that, because it's regression; it's just a linear layer at the end. That's basically it; that's my model. So I can create a ConvLearner from that model and give it an optimization function. So now I've got data and I've got an architecture; and because I said I want 300 outputs, it knows there are 300 outputs, because that's the size of that array.
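Putting that together, here's a hedged sketch of the data and architecture pieces. The file-name and vector arrays (img_fnames, img_vecs, val_idxs, PATH) are assumed names; the calls are from the 2018-era fastai library:

```python
from fastai.conv_learner import *   # ConvnetBuilder, ConvLearner, tfms, resnet50

tfms = tfms_from_model(resnet50, 224, transforms_side_on, max_zoom=1.1)

# x: ImageNet file names; y: their fastText word vectors. continuous=True
# means "don't one-hot encode": this is regression, not classification.
md = ImageClassifierData.from_names_and_array(PATH, img_fnames, img_vecs,
        val_idxs=val_idxs, classes=None, continuous=True, tfms=tfms, bs=256)

# ResNet-50 backbone, 300 continuous outputs, one hidden FC layer of 1,024,
# dropout lowered to 0.2 because the default 0.5 was underfitting
models = ConvnetBuilder(resnet50, md.c, is_multi=False, is_reg=True,
                        xtra_fc=[1024], ps=[0.2, 0.2])
learn = ConvLearner(md, models, precompute=True)
```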
So now all I need is a loss function. The default loss function for regression is L1 loss, the absolute differences, and that's not bad. But unfortunately, in really high-dimensional spaces (anybody who's studied a bit of machine learning probably knows this), and in this case it's 300 dimensions, basically everything is on the outside. And when everything's on the outside, distance isn't meaningless, but it's a little bit awkward: whether things are close together or far away doesn't really mean much in these really high-dimensional spaces where everything's on the edge. What does mean something, though, is this: if one thing's on the edge over here, and another thing's on the edge over there, you can form an angle between those vectors, and the angle is meaningful. That's why we use cosine similarity when we're looking at how close or far apart things are in high-dimensional spaces. If you haven't seen cosine similarity before, it's basically the same as Euclidean distance, but normalized to unit norm: you basically divide by the length, so we don't care about the length of the vector, only about its angle. There's a bunch of stuff like this that you can easily learn in a couple of hours, but if you haven't seen it before, it's a bit mysterious. For now, just know that for loss functions in high-dimensional spaces where you're trying to find similarity, you care about angle, and you don't care about distance. If you didn't use this custom loss function, it would still work (I tried it); it's just a little bit less good. So we've got an architecture, we've got data, and we've got a loss function; therefore we're done, and we can go ahead and fit.
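And the loss function itself is a one-liner: one minus cosine similarity, so we optimize the angle rather than the distance.

```python
import torch.nn.functional as F

def cos_loss(inp, targ):
    # care about the angle between prediction and target vector, not the length
    return 1 - F.cosine_similarity(inp, targ).mean()

learn.crit = cos_loss
```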
Now, I'm training on all of ImageNet, and that's going to take a long time, so precompute=True is your friend. You remember precompute=True? That's the thing we learnt ages ago that caches the output of the final convolutional layer and just trains the fully connected bit. And even with precompute=True, it takes about three minutes to train an epoch on all of ImageNet, so I trained it for a while longer: about an hour's worth of training. But it's pretty cool that with fastai we can train a new custom head on basically all of ImageNet, for 40 epochs, in an hour or so. And at the end of all that, we can now grab the 1,000 ImageNet classes, predict on the whole validation set, and just take a look at a few pictures. Because the validation set is ordered, everything of the same type is in the same place. I don't know what this thing is. And what we can now do is use a nearest-neighbour search. A nearest-neighbour search means: here's one 300-dimensional vector, and here's a whole lot of other 300-dimensional vectors; which is it closest to? Normally that takes a very long time, because you have to look through every one of those vectors, calculate its distance, and find out how far away it is. But there's an amazing, almost unknown library called nmslib that does it incredibly fast. Almost nobody's heard of it. Some of you may have tried other nearest-neighbour libraries; I guarantee this is faster than what you're using. I can tell you that because it's been benchmarked by people who do this stuff for a living, and it's by far the fastest on every possible dimension. So this is basically a super-fast way to do it. We basically say, look: this is angular distance, so we want to create an index on angular distance, and we're going to do it on all of our ImageNet word vectors. Add in a whole batch, create the index, and now I can query a bunch of vectors all at once and get their 10 nearest neighbours. It uses multi-threading; it's absolutely fantastic, this library. You can install it from pip, and it just works. It tells you how far away the neighbours are, and their indexes.
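A sketch of that nmslib usage: build an angular-distance index once, then query whole batches of vectors at a time.

```python
import nmslib

def create_index(vecs):
    index = nmslib.init(space='angulardist')   # angular distance, not Euclidean
    index.addDataPointBatch(vecs)
    index.createIndex()
    return index

def get_knns(index, vecs, k=10):
    # query a whole batch at once, multi-threaded; each result is (ids, distances)
    return zip(*index.knnQueryBatch(vecs, k=k, num_threads=4))
```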
So we can now go through and print out the top three for each one. It turns out that this bird actually is a limpkin. Interestingly, this one doesn't say limpkin; I looked it up, and limpkin is its fourth guess. I don't know much about birds, but everything else here is brown with white spots, and that's not. So I don't know if that's actually a limpkin or if it's mislabeled, but it sure as hell doesn't look like the other birds. I thought that was pretty interesting: it's kind of saying, I don't think it's that. Now, this is not a particularly hard thing to do, because it's only a thousand ImageNet classes, and it's not doing anything new. But what if we now bring in the entirety of WordNet, and ask which of those 49,000 things it's closest to? Exactly the same code; it's now searching all of WordNet. Now let's do something a bit different, which is to take all of our predictions (basically, our whole validation set of images) and create a k-NN index of the image representations, because remember, it's predicting things that are meant to be word vectors. And now let's grab the fastText vector for "boat", and "boat" is not an ImageNet concept. And yet I can now find all of the images in my validation set whose predicted word vectors are closest to the word "boat", and it works, even though it's not something it was ever trained on. What if we now take the "engine" vector and the "boat" vector, take their average, and look in our nearest neighbours for that? These are boats with engines. I mean, yes, this one is actually a boat with an engine; it just happens to have wings on as well. And we can do the same kind of thing with "sail" and "boat"; by the way, "sail" is not an ImageNet thing, and "boat" is not an ImageNet thing. Here's the average of two things that are not ImageNet things, and yet, with one exception, it's found me sailboats. Okay, let's do something else crazy. Let's open up an image in the validation set. Here it is; I don't know what it is. Let's call predict_array on that image to get its word-vector-like representation, and do a nearest-neighbour search against all the other images. And here are all the other images of whatever the hell that is. So you can see, this is crazy: we've trained a thing on all of ImageNet in an hour, using a custom head that required basically two lines of code, and these searches run in something like 300 milliseconds. I actually taught this basic idea last year as well, but it was in Keras, and it was pages and pages and pages of code; everything took a long time, and it was complicated. And back then I said: I can't begin to think of all the stuff you could do with this; I don't think anybody's really thought deeply about it yet, but I think it's fascinating. So go back and read the DeViSE paper, because Andrea had a whole bunch of other thoughts, and now that it's so easy to do, hopefully people will dig into it, because I think it's crazy and amazing. All right, thanks everybody. See you next week.