Lesson 13 - lucky 13. Welcome to the penultimate lesson.

So, who has heard of the Google Brain residency program? It's pretty famous, right - probably the top deep learning program in the world, the hardest one to get into. The reason I mention it is that one of our students was just accepted, so I'd like everybody to congratulate Sarah Hooker. Sarah is a new Google Brain resident, and I thought we could have a brief chat with her to find out about her path to this program. I'll throw over the green box - try not to hurt anybody.

"Hello, congratulations. Well done - you must be pretty excited."

"Yeah."

"So this is a pretty huge thing to get into. You must be a machine learning PhD who's been coding for a few decades, and so forth?"

"Not at all, and I think that's the exciting thing about the whole process: they're really trying to find a diverse set of fellows and residents for the year, so there are many different buckets. There were definitely PhDs I sat next to in the interview room who were well published and incredibly accomplished. There were also undergraduates for whom this would be their first job after college. My background is that I've been in industry - I started in economic consulting, doing economic modelling and statistics, and now I work as a data scientist."

"So how long have you been coding for?"

"I've been coding for two years - which also makes me fairly typical in this group."

"Wow. Rachel told me you had a fan moment when Ian Goodfellow interviewed you - is that true?"

"It was so brilliant. I think that was the strangest thing about the whole process: you're so nervous the whole time, and so excited, but you're also talking to people whose research you've followed for years. My first Google Hangout interview was actually with Hugo Larochelle, and I started learning about neural nets watching his MOOC. It feels surreal, and I think that's part of the intent of the program: to take very different people at different points in their careers."

"So you told me that before you started this course, your awareness of neural nets was a little bit of theory from a few blog posts. How much has going through this course helped you get the skills you needed to get into this program?"

"I wasn't coding - implementing architectures - before this course. To be fair, I'd been fascinated by deep learning for a while, so I was pretty deep into the theory. But I think part of why it was blog posts was that there wasn't a coherent body of work on the subject. Now there's Ian's book, which I have read - it's really a fantastic treatment. What this course offers, and I sense many of you may have had the same experience, is the implementation, which was very new to me and definitely something I don't think I would have found in a different forum."

"I know from talking to Rachel that some of the interview questions sounded like they came straight out of the course - how to do transfer learning and all that kind of practical application stuff. It seemed like there was a fairly consistent breakdown throughout the process of theory and maths questions, and then project experience."
"With the projects, they're really trying to gauge implementation and your knowledge of concepts we've covered very thoroughly in this class - overfitting, and understanding how to address differences in data distribution. That was standard throughout the whole process: they expect their candidates to take a holistic approach, with both coding and a strong foundation in the underpinnings."

If I could give a pitch for your program: one of the amazing things Sarah has done in her extraordinary life is to create an organisation called Delta Analytics, which applies data science to social-impact projects. Is there room for people here who might be interested in doing that kind of work to contact you?

"Honestly, given the calibre of this course, I would be thrilled to work with anyone from this course going forward. The way it works is that we pair with nonprofits all over the world, and you work with engineers and data scientists on a six-month project. Right now we're involved with eight different nonprofits, and we'll start a new cycle towards the end of this year."

Congratulations again - I think we're all very proud of you.

Talking of great work, I also wanted to mention the work of another student, who also has the great challenge of having to deal with being a fast.ai intern: Brad. Brad took up the challenge I set two weeks ago of implementing cyclical learning rates. You might remember that the cyclical learning rates paper showed faster, and also more automated, training of neural nets - and Brad had it coded up super quickly. If you jump on the forum you'll find a link to it there; maybe, Brad, you can add it to our wiki thread. Do we have a wiki thread for lesson 13 yet, Rachel? Could you create one? You just click at the bottom and say "make wiki".

As I mentioned when we first covered it, this paper has had so little attention that we don't really know yet - but I have a feeling it's going to turn out that this is the best way to train pretty much every kind of neural net in every kind of situation. So I've asked Brad to try as many experiments as possible, and he's going to try to keep automating it. This is exactly the kind of thing Rachel and I at fast.ai are trying to do: if it really works out, it gets rid of the whole question of how to set learning rates, which is currently such an artisanal thing. So congratulations, Brad, on getting that working. It worked really well in Keras - with the callbacks, the code ends up being quite neat as well.

"I made this comment last time, but I'm so excited about callbacks in Keras, because you can set callbacks to stop training if it stops improving and then enter another cycle that does something different, and kind of oscillate between different approaches. The callbacks fully automate the whole process - I implemented similar zigzag schedules. Keras is really close to the point where ten lines of code with these callbacks will take you a long way."

Yeah, absolutely.
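To give a flavour of what such a callback involves, here is a minimal sketch of a triangular cyclical learning rate schedule written as a Keras callback. This is an illustrative version only - not Brad's actual code - and the class name and hyperparameter values are placeholders.

```python
from keras.callbacks import Callback
from keras import backend as K

class CyclicalLR(Callback):
    """Rough sketch of a triangular cyclical learning rate schedule."""
    def __init__(self, base_lr=1e-4, max_lr=1e-2, step_size=2000):
        super().__init__()
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        # position within the current cycle: 1 at the ends, 0 at the midpoint
        cycle_pos = abs(self.iteration / self.step_size % 2 - 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * (1 - cycle_pos)
        K.set_value(self.model.optimizer.lr, lr)
        self.iteration += 1

# usage: model.fit(x, y, callbacks=[CyclicalLR()])
```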
And remember that Keras has great source code: if you look in your Anaconda lib/python site-packages/keras directory, you'll find all the current callbacks. When I gave some tips on how to get started, I said go and look at the existing callbacks and see if that helps. Any time you want to build something in Keras, one of the best ways to get started is to read the source code for something it already has that's somewhat similar to what you want to build - and with callbacks, that's an easy way to do it.

It's been a big week in deep learning. As usual, everything I've taught you in this course is now officially out of date. Hopefully, though, one thing you'll have noticed is that it's all building on stuff we've been learning. Somebody drew my attention to this paper, which is a new style transfer paper that can transfer to any style in real time, so you don't have to build a separate network for each style. This is the kind of thing you could absolutely turn into an app, and obviously nobody has done it yet because the paper just came out. So you could be the first one to ship an app that can turn any photo into any style you like. Honestly, if you compare this third column - the new paper - to the gold standard, the Gatys paper we originally looked at, I actually think it's maybe better: this one looks more stylised to me.

So it's an interesting paper. A lot of the basic ideas are the same, but as you'll see it has some interesting approaches.

GANs have had a big step forward. This is the previous state of the art in face generation, and this is the new one. These are now pretty much photorealistic, at least at 128 by 128; the old state of the art was only 64 by 64. So this is a pretty exciting step forward in GANs. As we've talked about, people mostly use GANs at the moment to create pictures, but they can also be used as a kind of additional loss function for any kind of generative network. One of the things I always look for is what's underappreciated - what's not being used much at the moment - and I would say GANs for a wider range of generative models is one of those. If you want to create a really great super-resolution tool, or a really great automatic line-drawing creator, or a colorization system, or whatever, I think GANs are a good approach.

Perhaps the most amazing one is this paper - again, we'll add it to the wiki - which does bidirectional transfer even without matching pairs. For example, this would clearly not be possible if you required matching pairs: take a Monet and turn it into a photo. We don't have any data for how to do that, because we don't have photos of the scenes Monet painted - but as you can see, it works incredibly well. Turn a zebra into a horse, or a horse into a zebra; summer into winter, winter into summer. Again, this is a GAN-based system creating photorealistic images, or very impressive artworks. I think this approach to style transfer - adding a GAN loss - is perhaps more interesting than the style transfer we've come up with so far, because it really has to create a painting that can't be recognised as not being a real painting; otherwise the GAN will fail. So I think this is a really interesting approach to style transfer as well.

OK - so, a really interesting week.
These are all papers which I think any of you can tackle right now, because they're all built on things we've learned in this course. If anybody's interested in tackling any of them this week, it would be super fun to talk about it on the forum.

"Just to clarify - is this last paper also the GAN from the prior slide, or is it a different paper?"

No, this is a different paper. I'm embarrassed that I didn't actually write down where it comes from. Rather than me looking it up now, maybe somebody can try to find it - I think I showed Brad, and maybe Rachel has already seen it; I'll find it in my Twitter feed and put it on the wiki. CycleGAN - yeah, I knew it was something like that. There we go: CycleGAN. This is the one that turns horses into zebras - not really, just the pictures. OK, thank you for the good question.

All right. We've been talking about mean shift clustering from time to time, and we'll talk more about its applications next week. The main application we've discussed so far is using it for fast pre-processing of really large data items, like CT scans, in order to find objects of interest - in this case lung nodules, i.e. possible cancers. One of the things I mentioned I was interested in was experimenting with combining approximate nearest neighbours with mean shift clustering.

To remind you, the basic idea was this: here's our GPU version. We go through each mini-batch, and for each mini-batch we compute a distance from every element in that mini-batch to every single data item, and then we take a weighted sum of all the data items, weighted by a Gaussian on those distances. I pointed out that the vast majority of data points are far enough away that their weights are so close to zero that we could probably ignore them. So maybe we could try putting an approximate nearest neighbours step beforehand: rather than summing over the entire data set, which could be a million points, we'd just sum over the nearest hundred or so neighbours.

So I actually tried that during the week. There's an existing implementation in sklearn.neighbors, so I haven't written my own PyTorch version - I know one of you on the forum has already started writing one. I just tried a ball tree, which is a particular kind of nearest neighbours structure. The basic idea is you say: here's the data I want you to index, so that I can rapidly do nearest-neighbour lookups; now please run a query on the first ten data points, and for each of those ten show me the three nearest neighbours. That returns something like this: here are the ten data points you passed in, and here are their three approximate nearest neighbours. And of course every point is its own nearest neighbour, so the first column is always a bit boring.
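For anyone who wants to play with this, here's a minimal sketch of the kind of BallTree index-and-query step described above; the data here is random, just to show the shapes involved.

```python
import numpy as np
from sklearn.neighbors import BallTree

data = np.random.randn(100000, 2).astype(np.float32)   # stand-in for our 2-D points

tree = BallTree(data)                    # index the data once
dist, idx = tree.query(data[:10], k=3)   # 3 nearest neighbours of the first 10 points

# idx has shape (10, 3); since every point is its own nearest neighbour,
# idx[:, 0] is just 0..9
```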
That looked good, so I then put it into the loop of our mean shift clustering. Each time we go through another epoch, we index the data we have so far - chuck it into one of these ball trees - and then when we compute our distances, we don't compute them against every point, only against the nearest points. So I do a query here to find the 50 nearest neighbours, and put the result onto the GPU by turning it into a tensor and calling .cuda() on it. That gives me a list of the indexes, though, not the data itself.

Interestingly, the hardest step was the one that seems like it should be the easiest: I then had to look up each of those indexes in the data to return the actual points. That's easy enough if you loop through one item at a time, but we're trying to use the GPU here - when I tried doing it one item at a time, the whole thing took forever, because GPUs are not quick at doing things one step at a time. So I created a little batch-wise indexing function, which in the end Rachel and I realised we could write with very little code. It's basically something where we pass in our array and our matrix of indexes, and it returns a three-dimensional tensor: one entry for every point, for every one of its nearest neighbours, for every dimension. That was the only bit that was remotely tricky; other than that function, the rest of the code is the same.

So this worked in the sense that it sped things up, but it didn't work in the sense of the result. Here was the input we gave it, and here's the output. What's happened is that each of these little coloured dots now actually represents 50 points. What it's done is found the 50 nearest neighbours to each point and created a cluster of 50 points - and once it's created these clusters of 50, it's stuck. There's no way for it to cluster any further, because every time it asks for nearest neighbours, it gets back the 50 points sitting right on top of it.

I mention this mainly to show the kinds of things that can go wrong; hopefully by next week this will be fixed. When I described this problem it seemed like it was clearly going to work, and then as soon as I saw this picture I immediately saw that it couldn't possibly work. That's the nature of trying things out. The interesting thing is that I now realise the solution will actually be way faster than this: the solution is what Rachel describes as "approximate approximate nearest neighbours". That's something which doesn't return the 50 probably-nearest points; it returns 50 points chosen probabilistically, where closer points are more likely to be included, but with basically zero guarantees - we actually want it to be less good.
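The batch-wise indexing helper mentioned above might look roughly like this in PyTorch - a sketch of the idea, not the exact code from the notebook:

```python
import torch

def index_batch(data, idxs):
    """Gather rows of `data` for a whole matrix of indexes in one go.

    data: (N, D) tensor of points; idxs: (n, k) LongTensor of row indexes.
    Returns an (n, k, D) tensor: for each query point, its k neighbours.
    Works on the GPU if both tensors are CUDA tensors.
    """
    n, k = idxs.size()
    flat = torch.index_select(data, 0, idxs.view(-1))   # (n*k, D)
    return flat.view(n, k, data.size(1))
```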
So if you're interested, one of you might want to try creating an approximate approximate nearest neighbours algorithm during the week. I think it'll basically be a case of taking LSH or a ball tree or something like that and removing a lot of the code - removing the code that makes it better. Hopefully next week we'll see this working, but I thought it would be interesting to show you the intermediate stage.

This kind of thing - showing you the failures - matters, because you don't normally get to see most of the failures. I was working with some of the students over the weekend on implementing something we'll see later today, and on Saturday afternoon I was sitting with Melissa going through some of this coding. At the end of it she said, "Wow, it's really interesting to see the process" - because you come in on Monday and it's all just working, whereas we spent the entire Saturday afternoon slowly going through one step at a time, constantly making mistakes, going back and trying to see what happened. The actual process of getting everything you see in class working is full of failures. In fact, Brad is currently working on building one of the two things we're doing for next week, and he came up to me just before class: "Yeah, it finished running, and nothing worked at all." Of course - nothing ever works the first time.

Part of the process - and something I've been trying to talk to Brad about with his work - is recognising that every time you build something new, it won't work, which means you need to build it in a particular way. Write an IPython notebook as if you're going to be teaching it at next Monday's class, because it won't work the first time and you're going to have to go back and ask which one of these steps failed. So at every step you want to be printing out the results, summarising the key statistics, drawing pictures, writing down the reasoning behind why you did things - so that when it doesn't work, you can go back and ask: OK, where did this go wrong? Or even better, hopefully you'll see the mistake earlier on, rather than waiting until it's all finished. There's a whole lot of data science process here which those of you who have worked with data scientists will be pretty familiar with by now; for those who haven't, it's really a case of bringing in the software engineering mindset - lots of testing, lots of iteration. The more data scientists can learn about software engineering practices, the better, I think.

Very interestingly, today or maybe yesterday, Facebook announced that they have implemented an enormous improvement in the state of the art for approximate nearest neighbours. You can check it out: it's called FAISS. By the way, everything I mention here I've mentioned earlier on Twitter, so if you follow me or keep an eye on my Twitter account you'll see all these things first, whether or not the class is running. This was particularly interesting to me because I've been talking quite a bit during this part of the course about approximate nearest neighbours - it's actually really important for deep learning - and you can get a sense of how important it is by how much Facebook have invested in it: this is a multi-GPU, distributed approximate nearest neighbours system that runs about ten times faster than the previous state of the art.
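To give a sense of what using FAISS looks like, here is about the simplest possible example: a brute-force (exact) CPU index, nothing like the approximate multi-GPU setups Facebook describes, with random data standing in for real feature vectors.

```python
import numpy as np
import faiss

d = 4096                                             # dimensionality of our vectors
xb = np.random.rand(100000, d).astype('float32')     # the "database" to index
xq = np.random.rand(5, d).astype('float32')          # a few query vectors

index = faiss.IndexFlatL2(d)   # exact L2 index; FAISS also has many approximate variants
index.add(xb)                  # add the database vectors
D, I = index.search(xq, 10)    # distances and indexes of the 10 nearest neighbours
```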
Now, I happened to get a particular insight into this, because earlier this week I was at a conference at Berkeley - thanks very much to Melissa for letting me know about it, and to the people at the conference for letting me in the day before, when it had been full for months. I chatted to Yann LeCun about something I've been thinking about a lot during this part of the course: transfer learning. Specifically, how come people always seem to use VGG for the best transfer learning results, when there are so many better architectures around? I've really come to the conclusion that the reason is that fully convolutional nets like ResNet and Inception have very large, very redundant intermediate representations. Because they don't have a fully connected layer, in ResNet the penultimate layer is something like 7 by 7 by 2,048, which (a) is huge, and (b) if you look across that 7 by 7 grid, most of the cells are more or less the same as their neighbours, because most parts of an image are similar to the parts next to them. And if you want to go back far enough through enough bottleneck layers to get a fairly general representation - like we do with VGG when we go to the first fully connected layer - you're at something like 28 by 28 by 500, which is far too big to work with and far too redundant.

So I asked Yann LeCun about this. I said: is anybody working on the question of how you capture the benefits of these much more accurate architectures, but create efficient distributed representations? And he said, yeah, absolutely, of course. So I asked, where do I read about it? And he said, well, we use it at Facebook all the time: we take every object in Facebook, create a compressed distributed representation of it, save it in databases, and then give little bits of code to all the teams so they can build simple linear models on top of these distributed representations - it's everywhere at Facebook. So I asked, is this written down somewhere? "I don't know... maybe there's a technical report somewhere." It's one of those things I knew must be happening, and now I know at least one very, very big company where it's happening at huge scale. So when they say that FAISS is used to search for multimedia documents that are similar to each other, if we read between the lines, they're not talking about comparing pixels or samples of audio - they're talking about activations, distributed representations. And we know - from when we talked about perceptual losses, for example - how much better it is to capture similarity using activations than using pixels. So there's a huge opportunity for deep learning here.

"I briefly read a YouTube paper - and I'm asking because maybe you have better insight into how this is done; I've been very interested in understanding it better - where they basically do some kind of embedding of the user, and an embedding of the video, which is probably something similar to this. So they first build this kind of representation of every user and every video, and then on top of that they have another deep learning network that actually does everything else."
Yeah - so at YouTube, my understanding is that most of their embeddings are based more on a collaborative-filtering approach, like we did in lesson four. To my knowledge they're not using actual video features, as far as I know - but there's no reason they shouldn't. What I now know is that Facebook absolutely is.

Thinking about this - and I've thought about it for many years in a medical context - imagine that for every medical image you have, you also have a compressed distributed representation of it, such as the first fully connected layer of VGG, stored in a database. And when I say database, I actually mean a fast, indexed approximate nearest neighbours tree or structure. Now you can go and grab every medical image that displays similar features to this medical image, and so on. So this kind of stuff is really exciting, and as far as I know it hasn't really been written down anywhere and isn't really being used much anywhere - other than at Facebook at least, and probably Google does it too. OK, so that's just a bit of background.

We are going to talk about the bi-LSTM hegemony, and these slides come from Chris Manning - who here has come across Chris Manning before? Chris is a linguistics professor at Stanford; he did his PhD in linguistics. At some point in the last few years he discovered that everything he'd learned about linguistics was basically a waste of time, because now you can throw a bidirectional LSTM with attention at pretty much anything and get a better result than everything from his PhD in linguistics. So nowadays he actually teaches deep learning, and in fact his Deep Learning for Natural Language Processing videos went online today - so if you want to learn more about that, feel free.

At the conference last week, this was one of the slides he put up. He described himself as pretty disappointed about the situation - it's not where he wants linguistics to be - but there it is. And this is what we're going to learn about today: bidirectional LSTMs with attention.

This is what happened when people started throwing bidirectional LSTMs with attention at neural translation. The chart looks a lot like the 2012 ImageNet picture. The red is the approach from two generations ago, phrase-based statistical machine translation; the purple is the last four years or so of the next-generation approach, syntax-based machine translation. Neural machine translation didn't really appear properly until 2015, and this is the trajectory it's now on - and we're probably well beyond that now, because Google's neural machine translation system is online, and a lot of the stuff coming out of that is appearing in papers now.
I actually gave a talk at one of the academic conferences Which is like a introduction to deep learning for people in the statistics academia who maybe weren't familiar with it One of the things I told them was all of you who are in NLP Start learning deep learning now because there's no question in the next three years or so It's going to be the state of the art and pretty much everything and so hopefully some of those people listen to me and today They're very happy like even though We couldn't exactly tell at that point how that was going to happen It was just really really obvious that it's like, you know, it's a it's a system Which you know uses distributed representations and you know has all of the properties that You would expect to see in something we're deep learning of is successful and you know today I Think increasingly people are realizing that large areas of classic statistics and machine learning and the process being replaced So Chris these are all Chris slide still Chris described the four big wins of neural machine translation And I think one of the most interesting ones is number three, which is the statistical approaches tended to use Like n-grams. So like these three words appear these three words appear these three words appear next to each other in these situations You can't get beyond about five grams So it lists of five words because you just get this total exponential explosion of how big that data set is With RNNs remember when we very first learned about RNNs. We talked the reason why stateful memory Long-term dependencies Right. So this is exactly what you want if you want to make sure that your verb tense and everything else lines up with your You know details of your subject. They could be 20 words apart or more, right? So we need to be able to have that kind of state So this is where this works very well interestingly And you know, I'm not deep in the NLP world So I wasn't quite clear on where things are with all of the states that said of the art Chris said in his talk By LSTMs with attention are the state of the art, but basically everything, right? And so Well, we actually have an NLP person here, can I ask you a quick question about this Is this kind of line up with your experience you're like an NLP researcher My research is mostly in information extraction, but you know, first of all, if Chris Manning says that it's true So the other part of this is that in some of those areas while While this is state of the art definitely the state of the art is not that impressive. So that's definitely true But he's right like this is what's up and people have found that for all those tasks And others which are not as severe Which are kind of similar in certain ways that is the state of the art. Okay. Yeah What I was talking about about the kind of Disappointing state of this so like one of these slides actually Chris Manning had a big frowny face and one of the audience is like what's with the frowny face and is you know, and he's basically describing all of the ways in which The state of the art results still a far shorter where we want them to be So, you know NLP is by no means solved, you know, we could kind of say basic image recognition is kind of solved But basic NLP is not really solved But here's a great example Chris gave of something that you know This approach has worked really well for very difficult task. 
The task is to read a story like this, where one of the words or phrases in the story highlights has been deleted, and the neural net has to figure out what it was. So if "Star Wars" is deleted from a highlight like "characters in ___ movies have gradually become more diverse", you have to predict "Star Wars". This is a challenging problem that really requires some in-depth understanding. The work Chris showed - and he was actually a senior researcher on it - basically takes the query and chucks it into an embedding, takes the entire story and chucks it into an RNN, and combines the two with attention. And that's it.

So let's talk about how to do this. There's a really nice article on distill.pub which has a good picture of it. Imagine we're translating English into French; here's our English, and we're trying to translate "the European Economic Area". "The" is "la" - but it's "la" (rather than "le") because of the gender of the noun "area". You see these little purple lines here: they show that this neural network model has learned that when it's translating "the", it also needs to look at the word "area" in order to figure out that it should be "la". The purple lines show the weights in the neural network for translating that particular word. And when it was translating "signed" - this is what I was talking about with long-distance dependencies - it's not just using the word "signed", but also looking at what was signed, in order to figure out the details of how to use the verb. For some things you need to look at combinations: for "was signed" you have multiple words working together.

So we need to come up with an architecture which is capable of learning these attention weights. As we mentioned last week, this really fun paper called "Grammar as a Foreign Language" - Geoffrey Hinton was a senior researcher on this one - has a nice little summary of how attention works. It's not where attention originally came from, but it's a nice summary. So let's go through it - we don't often go through things by looking at the math, but in this case the math is simple enough that it may be good practice.

Let's start with the notation. There's an encoder: remember, the encoder is the RNN (or several layers of RNN) which reads through the original source sentence - in this case the English sentence - and spits out a hidden state. If in Keras we say return_sequences=True, we get a hidden state after every single English word; so if you've got ten words in the English sentence, you'll have ten pieces of encoder state. The encoder state is going to be called h, and - remembering that subscripts and superscripts here work like putting things in square brackets in NumPy - they're telling us that T_A is the number of words in the English sentence. So you've got h_1 through h_{T_A}, each one a piece of encoder state.

The decoder, as it runs through each step of translating into French, creates its own hidden state, which is going to be called d.
And d will go as far as T_B - that'll be the highest index of the French sentence. Our goal, in the end, is to take each of these items of decoder state, written d_t, and create something called d′_t, which represents the result of this attention process: in this case, the representation in the hidden state of the word for "area" weighted quite a lot, "economic" weighted a little bit, and everything else not much at all. Not surprisingly, the way to do that is with a weighted sum. Here, remember, h is our encoder state - the state from the RNN run over the English sentence, so the hidden states for the word "area" and so forth. Yes, Rachel?

"A question, to clarify: in this case, instead of creating one condensed representation vector that captures the entire English sentence, are we just looking at the existing hidden states that get generated as the English sentence is processed word by word?"

Yes, exactly. It's just the difference between return_sequences=True and return_sequences=False. return_sequences=False throws away everything except the last state; return_sequences=True keeps all the ones in the middle. And remember, importantly, this is a bi-LSTM - a bidirectional LSTM: we've got one LSTM going forward through the sentence and another going backward, so every one of these pieces of encoder hidden state represents all of the words before it, in order, and all of the words after it, in reverse order, stuck together. Remember also that it's deep - there are quite a few layers of non-linear neural net layers that this state comes out of - so each element of h is already the result of a pretty sophisticated calculation.

We're going to take those and multiply them by some weights. See how the weight has got two indexes on it? That means the weight depends on two things: t, which word in the French translation I'm trying to create right now, and i, which piece of encoder hidden state I'm currently calculating the weight for. Because this is being calculated with a function - if we were doing this in Python we'd probably write something like get_weight(t, i) - the line above tells you how to go about calculating that weight. And it tells you, not surprisingly, that we're going to use a softmax, applied on top of some other function.

Why softmax? Well, if we're doing a weighted sum, we want the weights to add up to one - that's one reason. Secondly, most of the time in translation the thing you're translating from is largely just one word: "1992" is translated as "1992" (I don't even know how to say that in French).
But sometimes it's mainly one word with a little bit of some others mixed in. A softmax - because it's e to the something divided by the sum of e to the somethings - tends to be very big for just one item in the vector and fairly small for everything else, so by using a softmax we capture that behaviour quite naturally. That's why we use softmax to calculate these weights.

So, softmax of u. u is the result of another function - and what is it? It's just a multi-layer perceptron with one hidden layer: a neural net with one hidden layer. If you have a look, it's a couple of bits of data multiplied by weight matrices, put through a non-linearity, multiplied by another weight matrix, put through another non-linearity - that's the definition of a neural net with one hidden layer. So when it all comes down to it, what this says is: in order to calculate the weights on each of the source (encoder) words for each of the target (translated) words, train a tiny little neural net with one hidden layer which learns to figure out which source words to pay attention to. And remember, one of the things Chris mentioned as a reason why neural machine translation is good is that it's end-to-end trainable: we're going to embed this mini neural net inside the bigger RNN, so the whole thing is just SGD, all in one go. That's what we're going to try to do. I just want to make sure that's clear before we look at the code - any questions?

"Just quickly - what does a hidden state look like? Is it a set of activations?"

Yes, the hidden state is just a normal LSTM hidden state - it's just a vector of activations. Any time you want to remind yourself of what's really going on with RNNs, go back to the lesson 5 RNN PowerPoint, because we went through it step by step: OK, this is not an RNN, here's a basic neural net; we could have a multi-layer neural net; we could have a multi-layer neural net with a second input coming in - still not an RNN; we could have two things coming in at two different times; and then we could tie those weight matrices. In the notebook that goes with that class we do all of this by hand in Keras, with all the weight tying, so you can see every step. And then we realise that the last picture could have been drawn in this rolled-up way. Every one of those circles represents just a bunch of activations - a vector of activations - and we have a bunch of these layers until eventually we keep some final set of activations.

"So then the - I don't know what to call it - attentional model: it's relatively simple, just a single layer. Is there a reason it isn't more complex, some fancier architecture?"

You know, I've been wondering that myself. I guess the answer is that it seems to work; there's no reason you couldn't have two hidden layers. It's easy to go back and look at what the attention weights are, and presumably, so far, people are finding that the attentional model is attending to the correct stuff. If you try an attentional model for some other problem and that doesn't happen, then - yeah, chuck in another layer.
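For reference, here are the three lines of math we just walked through, as they appear in the Grammar as a Foreign Language paper (the softmax is taken over the encoder positions $i$):

$$u^t_i = v^\top \tanh\!\left(W_1' h_i + W_2' d_t\right), \qquad a^t_i = \operatorname{softmax}_i\!\left(u^t_i\right), \qquad d'_t = \sum_{i=1}^{T_A} a^t_i\, h_i$$

So the tiny one-hidden-layer net scores each encoder state $h_i$ against the current decoder state $d_t$, the softmax turns those scores into weights that sum to one, and $d'_t$ is the resulting attention-weighted sum of encoder states.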
In fact, this terrific post shows a really cool example of this: on the bottom here is a sound wave, and on the top is the speech recognition. People are actually using attention models to do speech recognition, by figuring out which parts of the sound wave represent which letters. And one of the super cool things to come out this week was the Tacotron - which is one of the best names for a paper I've come across - and the Tacotron page has some fantastic examples of what it sounds like. Let me see if I can play this:

"Basilar membrane and otolaryngology are not auto-correlations."

The cool thing is that it changes depending on where the punctuation is, pretty impressively, or even on capitalisation. Check this out:

"The buses aren't the problem. They actually provide a solution."
"The buses aren't the problem. They actually provide a solution."

Or questions: "Does the quick brown fox jump over the lazy dog?"

With this kind of end-to-end training you don't really have to build anything special to make that happen; you just need to make sure your labelled data is correct. And somebody pointed out something really neat the other day: if you want to build a speech recognition system, one easy way to get data would be to grab some audible.com audiobooks and the actual books - you could have 40 hours of training data like that. In fact, if you grab Stephen Fry reading Harry Potter, you could have every Harry Potter voice as well. So this is a super amazing technique, and surprisingly enough this single hidden layer seems to be enough to do attention pretty well.

Last week we were looking at the spelling bee, where the inputs were phonemes - things like the phonemes for "key" - and we had to figure out how to spell the word. We tried it without attention and didn't get great results, and we looked at the original attention paper, which showed that with longer sentences in particular you get much better results if you do use attention. So let's have a look at what that looks like. Keras doesn't really have anything to do this effectively, so I had to write something.

Before I show you what I wrote, let me describe what it looks like. Most of it is exactly the same as the original spelling bee model: we have our list of phonemes as input, and our list of letters to spell the word with as the decoder input. We chuck them both through embeddings, then do a bidirectional RNN on the phonemes, chuck an RNN on top of that, and another RNN on top of that. You might remember this get_rnn function from last time - it's just something that returns an LSTM, and by default it says return_sequences=True. Last week the final one was get_rnn(False): we squashed everything into a single little package and fed that to the decoder RNN. But now we leave return_sequences=True all the way through, so that we can pass the whole sequence to this special attention layer I created. We'll look at how it works in a moment, but first let's talk about what an attention layer needs.

It needs to know what kind of RNN layer you want it to create - so I just pass it a function which creates an RNN layer.
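The get_rnn helper mentioned above is basically just a thin wrapper; roughly something like this (the hidden size here is a placeholder, not the value from the actual notebook):

```python
from keras.layers import LSTM

hidden_size = 240   # placeholder, not the real value

def get_rnn(return_sequences=True):
    # hand back a fresh LSTM layer, keeping the per-timestep outputs by default
    return LSTM(hidden_size, return_sequences=return_sequences)
```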
It also needs to know how many layers you want in your decoder - so I say, OK, create three LSTM layers in the decoder. And then, what information is it going to need? It's going to need all of the encoder state, which is just sitting here.

Then we can do something else to make training easier, called teacher forcing. With teacher forcing, as well as the encoder state, we pass in the answer - but the answer from the previous time step. In other words, if we're trying to learn how to spell a word, at step one we don't pass it any previous letter; at step two we pass it the first correct letter; at step three the second correct letter; and so on. We're telling it what the previous time step's correct answer was. Why do we do that? It makes the model much easier to train, particularly early on. Early on it has no idea how to spell, so it gets most things wrong most of the time - and then for the later letters, everything before them is wrong, so how could it possibly know which letter to use next? So with teacher forcing we take our input data - all of those encoder hidden states - and we also tell it: even if you got the previous letter wrong, this is what it should have been. That just makes it a bit faster and easier to train, and all we do is concatenate together the hidden state and that decoder input. You don't have to use teacher forcing, but if you do, training is faster and easier. That's why we pass in this special decoder input.

And just to show you the decoder input: it's all of my labels except the very last one - so that's one less than the full number of letters to spell - and then I concatenate a column of ones onto the front. Why a column of ones? Well, it's actually ones times the "go token", and the go token is something I set up back here: I decided it would be an asterisk, a special character which means "this is the start of the word". So the teacher-forcing input we get passed each time is the asterisk followed by all but the last letter of the correct spelling.

So this attention layer is going to create a three-layer RNN; it gets given all of the encoder state and that special decoder input, and it spits out a list of states again. Then we can do the usual thing of a TimeDistributed Dense layer, with the appropriate vocab size and a softmax, to turn that into our target activations. So, putting aside how we calculate the attention itself, everything else is pretty familiar. We can then go ahead, build the model, and call fit, passing in both the phonemes and those decoder inputs; train it for a while, do a bit of annealing, train it a bit more, until eventually we have a pretty good accuracy: 51%. And here it is.
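Constructing that teacher-forcing decoder input is just a shift-and-prepend; roughly like this (char2idx and labels are illustrative names, not necessarily those used in the notebook):

```python
import numpy as np

go_idx = char2idx['*']   # the asterisk "go" token described above (assumed lookup name)

# labels: (n_words, max_len) array of letter indexes, zero-padded at the end.
# Drop the last column and stick a column of go tokens on the front, so at each
# step the decoder sees the *previous* correct letter.
decoder_input = np.concatenate(
    [np.ones((labels.shape[0], 1)) * go_idx, labels[:, :-1]], axis=1)
```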
The answer is to my knowledge people are always using the last layer hidden state But I kind of accidentally used all of them once and got some better results So I don't know if I screwed something up or if this is a good idea So that might be something if you want to test it, you know to try it out Yeah, I was like after fixing a bug things got worse, which is never a good sign So maybe it's not a bug after all So remember last time we had trouble with a really long words lots of phonemes, but here's 11 phonemes and It's done it perfectly Right, so that's that's a good sign that this is handling some longer ones pretty well And you know, it's 51% good or bad. I don't know right because like spelling is You know It's not exact, right? There's no way you could Algorithmically get a hundred percent spelling right like is it meant to be speak like this or speak like this? This is from a test set remember, right? There's no way for it to know So I think it's done pretty well so The only thing remaining is how to do this attention layer. So the bad news is that the attention layer is the largest The largest Custom layer we've looked at I mean, it's not terrible, but it's it's a hundred lines of code Okay But a lot of it's kind of pretty repetitive So it really depends how interested you are a in custom layers and be in attention as to how much you want to look at this But basically I'll show you the key pieces, right? There are two really important methods that get called in your custom layer and they are build and call Build is the method that's called in your custom layer when your layer is actually inserted into a neural network and That that causes it to get built right when once it's inserted into into a Keras neural network That's the point at which it knows how big everything has to be right because you told it How big the input was and then you've got you know the various layers connected to each other And so eventually it can figure out how big is you know in fact specifically What is the shape of the input right? So when your build method is called you get told? What is the shape of your input and this is the point where you now have to set everything up to work with that? Input shape right so in this case Build gets called as soon as it finishes running this line Right because it knows that it got this Input right and it can figure out this input shape by going all the way back through all this right Now we have two inputs Right, we have the encoder state and the decoder input So we can pull those apart into the encoder shape and the decoder input shape So some things which seem like they might be hard actually incredibly easy to create my three layers of RNNs is a single line of code Right remember I just passed in the RNN generating function, right? 
so I just call it once per layer, and with a list comprehension I now have three RNN layers in a list. That was easy.

On the other hand, the attention weights themselves I have to create by hand - and there are some things being hand-waved over here. For example, a bias term is not included in the equations, but we definitely want one. People often don't write the bias term because you can always avoid it by adding a column of ones to your data, so it's common in the math not to mention it; but generally speaking, any time you see a neural net written down, unless they state otherwise there's probably a bias term there.

So in build we have to create W1, W2 and V. To create them, Keras has a convenient method that comes from the Layer superclass called add_weight: you say what the size of the weight matrix is, what its initialization function is, and give it a name. So here I go through each one: W1 - how big does that need to be? W2 - how big? Here's my bias - how big? Here's my V - how big? And then I've got two other things here, W3 and b3. That's because the equations kind of hand-wave over something: this final answer basically needs to become the hidden state of the next step of the decoder RNN, but it might not be the right size. So I added one more transformation at the end to make it the right size. In some of the attention papers they do lay that out explicitly; I don't know whether Hinton's group just ignored it, or built things so the shapes worked out anyway. So we've got one extra affine transformation.

A minor note: for those of you playing with PyTorch, I'm sure you've discovered how cool it is that you can call self.add_module and it keeps track of everything you need to train - all the parameters. Keras is not so clever, which is why we have to call self.add_weight for every one of these weights: Keras has to know what to train when you optimise. Unfortunately, when we created those RNN layers, they came with lots of weights which we haven't told Keras it needs to optimise. I haven't found any examples of custom layers anywhere on the internet where people have actually dealt with this, but I figured out what you can do: I created a little function that goes through every one of those RNN layers and collects all of the attributes Keras needs to know about - the trainable weights, non-trainable weights, losses, updates and constraints. So if you want to create a custom layer which itself contains some Keras layers, you can copy and paste this code, and it seems to work.

OK, so that's the first main thing you have to create in a custom layer: build. Writing build is really boring, because you have to get all these bloody dimensions to be correct, which is a pain in the ass. When I built this, I had a little bunch of TensorFlow playing-around code where I would keep checking: OK, if this is the size of my x, and this is the size of my W1, let's try doing each calculation, one step at a time.
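To make the add_weight pattern concrete, here is a rough, Keras-1-style sketch of what a build method along these lines can look like. The names, shapes and initializers are illustrative only, the exact add_weight signature and weight-collection plumbing vary between Keras versions, and this is not the actual attention layer from the notebook.

```python
def build(self, input_shape):
    enc_shape, dec_shape = input_shape        # two inputs: encoder states, decoder input
    n = enc_shape[-1]                         # encoder hidden size (illustrative)

    # three decoder RNN layers from the generating function -- one line
    self.layers = [self.rnn_fn() for _ in range(3)]

    # weights for the little attention MLP (the extra W3/b3 reshaping step is omitted here)
    self.W1 = self.add_weight(name='W1', shape=(n, n), initializer='uniform')
    self.W2 = self.add_weight(name='W2', shape=(n, n), initializer='uniform')
    self.b  = self.add_weight(name='b',  shape=(n,),   initializer='uniform')
    self.V  = self.add_weight(name='V',  shape=(n, 1), initializer='uniform')

    # Keras doesn't look inside sub-layers, so build them and pull their trainable
    # weights (and similarly non_trainable_weights, losses, updates, constraints)
    # into this layer so the optimiser can see them
    for layer in self.layers:
        layer.build((enc_shape[0], enc_shape[1], n))   # illustrative input shape
        self.trainable_weights += layer.trainable_weights
```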
So I did each calculation in a cell, one step at a time, looked to see what all the shapes were, and then went back and put the correct sizes in. I don't think any normal mortal can write all of these dimensionalities down and have them work without checking them one little step at a time - so make things easy for yourself.

The other key thing that happens in a layer is call. call is basically your forward pass: "OK, go calculate" - and you get passed in the data to calculate on. In this case we actually have to step through the steps of an RNN ourselves. Keras has a K.rnn function which is very, very similar to theano.scan - you remember theano.scan: it's basically something which calls some function for each step, stepping through each part of your input, with some initial states, and so forth. It's a really low-level thing, and Keras doesn't really have a convenient user-facing API for writing custom RNN code. In fact, this is something nobody's really figured out yet. TensorFlow has just released a new kind of custom RNN API, but there isn't any documentation for it - I was hoping to teach it in this course, but I think it's just a little too early, and I don't know yet whether it's any good. So this is a bit of an open question: how do you create something as easy to use as Keras, but with the flexibility to design your own RNN internals? As you can see, in this case it wasn't convenient at all: I had to go back and run this scan function myself and set everything up from scratch.

Anyway, basically all the work happens in here, which means all the work is happening in this step function I created - this is where the actual calculations are done. And when it gets down to it, this now looks a lot like the Hinton team's equations: you can see the dot products and the bias, here's the tanh, here's the V and the softmax, and here's the bit where we do the weighted sum. Then there's that one extra step I mentioned, to get the result back to the right shape for the RNN. Having done all of that pre-processing - starting with x and ending up with this thing called h - we can now go through all three of those RNN layers, calling step on each of them.

So - this probably shouldn't have been that hard in the end, but it's just the nature of the current Keras API that this doesn't really exist, so we had to go in and create it. All we really wanted to say was: OK, Keras, before you run the three decoder RNNs, take your input and modify it in this way - but we have to do it for every step. That's basically what's missing: an easy way in Keras to say "change the step". I've spoken to François Chollet, the Keras author, about this; he's well aware that it's not convenient right now and he really wants to fix it, but it's difficult to get right and no one's quite done it yet.
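To make that concrete, here is a rough, illustrative sketch of what the attention part of such a step function can look like using the Keras backend - my own reconstruction of the idea, with made-up names and shapes, not the actual notebook code (the decoder RNN layers' own step calls are omitted):

```python
from keras import backend as K

def step(self, x, states):
    # self.enc: all encoder hidden states, shape (batch, T_A, n)
    h = states[0]                                      # previous decoder state, (batch, n)

    # the one-hidden-layer attention net: u = tanh(enc.W1 + h.W2 + b)
    u = K.tanh(K.dot(self.enc, self.W1)
               + K.expand_dims(K.dot(h, self.W2), 1) + self.b)
    a = K.softmax(K.squeeze(K.dot(u, self.V), -1))     # weights over the T_A positions

    # weighted sum of encoder states = the context for this time step
    context = K.sum(K.expand_dims(a, -1) * self.enc, axis=1)

    # concatenate the teacher-forced input x, then the extra W3/b3 affine step
    # to get back to the decoder's hidden-state size
    new_h = K.dot(K.concatenate([context, x]), self.W3) + self.b3
    return new_h, [new_h]

# inside call(), something like: K.rnn(self.step, decoder_inputs, initial_states)
```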
"We have a question: I didn't quite understand the part about getting it back to the right shape - could you explain that again?"

Sure. step is being called for every time step, and at the end of it we have to return the new hidden state, which then comes back in as the state for the next step. In other words, the thing we end up with here needs to be the same shape as what we started with - and that doesn't happen automatically, because we're taking this weighted sum and concatenating the decoder input onto it, so we end up with something of a totally different shape. But you can always change the shape of a state by chucking it through a matrix multiply, where the thing you multiply by has the number of columns you want to end up with. So I just made sure that W3 has the same number of columns as the hidden state h, and as a result, by the end of this, we've got something we can feed back into the RNN again. An RNN step can't change the shape of its state, because it needs to be able to keep going step after step after step - every step needs to be the same.

"You showed how you figured out and tested the tensor shapes. How did you debug the attention class itself as a whole? And are RNNs easier and cleaner in PyTorch - would the attention class have been relatively easier in PyTorch?"

What if I told you that's exactly what we're about to look at? The main thing for me to debug was really build, because the other bit is Geoffrey Hinton's team's equations - there's not too much to get wrong there. So really it was a case of doing it step by step in cells and printing out the shape at each point. You'll see that I created all of these different dimensions at the start and made sure all the numbers were different, so that every time I saw the number 64 I knew it was the batch size, every time I saw 4 I knew it was the time steps, 32 was the input dim and 48 the output dim. So I could go through and see: OK, here's my shape here, here's my shape there. Then all I did was chuck in some print statements - print h·W2, print u, print a, and so forth. I didn't need to see the contents: when you print a TensorFlow tensor like that, it prints its dimensions, and in my experience once your dimensions work, you're basically done - as long as you didn't make the mistake of having multiple things with the same dimension sizes. And if, say, h and W2 didn't line up for rows and columns, the error you get is pretty clear: TensorFlow will say these dimensions mismatch and tell you what they are. So generally speaking, getting these things to match isn't too bad. It's no fun at all, but you get there eventually.

So now that it's done - now that I've written it - feel free to use it. As you can see, it's easy to use, and at the end we get these pretty good results. One thing to point out: when we get the predictions, by just calling model.predict on our test set, you'll notice that a little later on we split the predictions on underscore. Why do we split on underscore?
That's because, when I created the vocabulary earlier on, I set underscore to be the zeroth element, and remember that all of our words are padded at the end with zeros. So when the decoder predicts that this is the end of the word, it's going to spit out zero — that is, underscore. That's how a decoder knows to stop. Now, it doesn't actually stop in terms of computation: the decoder is still going to keep calculating all of the rest of the steps, because we don't have the ability, at least in Keras, to say "stop now" — everything's rectangles. So hopefully the decoder learns pretty quickly that if the previous token was underscore, the next token will always be underscore: once you're finished, you stay finished. So that's a minor issue there. Okay, so that's that. We're going to have a seven-minute break, and when we come back we're going to see this in PyTorch and we're going to use it for actual language translation. I'll see you at eight o'clock.

So, I am so happy about this. One of our students, Vincent — I was at study group a couple of weeks ago and he was writing all these eigenvalue and eigenvector equations. "What are you doing?" He's like, "there's something in this for style transfer — we need to get it out, I can feel it." Interesting, keep going. Then last week he's doing some more, and he gets one of those strange noisy pictures on the screen. Then on Friday quite a few of us got together to do some hacking, and I saw him still doing the same thing — "I know there's something here, it's going to get close" — and he just sent me an email. So here is a photo, here is a painting, and here is the regular style transfer result — it's not bad — and here is what happens when you use his new mathematical technique. So hopefully by next week I will understand this well enough that either he or I can explain it to you. But I know one of the key differences is that he's actually using the earth mover's distance, which is the basis of the Wasserstein GAN. I've managed to avoid teaching about eigenvalues and eigenvectors so far, so I don't know how we're going to do that. Anyway — where's Vincent? This has got to be a paper: you've created a whole new technique, and this is super exciting. Congratulations — I look forward to learning more about it. People just keep doing cool stuff, I love it; you guys are just zipping along.

All right, so let's translate English into French. Now, here's a problem: the teacher forcing is fine — oh, a question. There were two questions from right before the break. One was: could we use a TimeDistributed Lambda layer right before the RNN? No, you can't. What this question is getting at is: why don't we use a Lambda layer to do the attention calculations and then feed that into a standard RNN? But remember, those calculations are being done inside the step function — each step uses attention to calculate the output of that step, which impacts the next step. So it needs to be inside the RNN; you can't just pre-process the whole input first. And the next question was: is there any reason why we used hyperbolic tan as opposed to sigmoid? No, no reason at all.
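In code, that pad-character decoding convention might look something like the following sketch; `model`, `test_x` and the index-to-character list `vocab` are assumed names here, with index 0 being the `_` padding character.

```python
import numpy as np

preds = model.predict(test_x)              # (n_examples, n_timesteps, vocab_size)
pred_ids = np.argmax(preds, axis=-1)       # most likely character id at each timestep

def decode(ids, vocab):
    word = ''.join(vocab[i] for i in ids)  # e.g. 'jour_____'
    return word.split('_')[0]              # everything from the first pad/stop token on is ignored

print(decode(pred_ids[0], vocab))
```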
Hyperbolic tan and sigmoid are basically the same thing — tanh goes from minus one to one, sigmoid goes from zero to one — so either is fine.

Okay. Teacher forcing was this thing where we're concatenating the previous correct answer's embedding with our attention-weighted encoder inputs, in order to help our model keep track of where it ought to be, and that helps the training. Now, there's nothing wrong with training that way, but you can't use it at inference time — at test time — because you don't know the correct answer. So my Keras model here is totally cheating: I'm passing in the previous letter's correct answer to every step, but in real life I don't know that. What we actually need to do at inference time is not use teacher forcing, but instead take the predicted result from the previous step and feed that in for the next step. I have no idea how to do that in Keras — it drove me crazy trying to figure out how — and that was the thing that pushed me to PyTorch. I was so sick of this goddamn attention layer that the idea of going back and trying to build this into it drove me crazy, and I don't even know how to do it.

Furthermore, we actually want this to be dynamic. It turns out that if you use teacher forcing the whole time, you end up with a model that gets sloppy: it learns to take advantage of the fact that it's about to be told what the previous thing should have been. So the best training approach, as it goes through the epochs, is: initially use teacher forcing every time, then halfway through use teacher forcing randomly half the time, and at the last epoch don't use teacher forcing at all. So it learns to wean itself off this extra information. And that kind of dynamic thing is really hard, or maybe even impossible, to do in Keras. So for all these reasons I moved to PyTorch. I haven't done all of these things in PyTorch — I've done most of them, but the dynamic changing of teacher forcing I've actually left for one of you to try out. Anyway, the basic ideas are all here.

So let's look at the PyTorch version. Interestingly, the PyTorch version — in terms of the attention model itself — turns out to be way easier, but we have to write more code, because there's less structure for NLP models. For computer vision stuff there's the torchvision project, which has all the data loading and models and so on; we don't seem to have an NLP version of that yet, so there's a bit more code to write here.

Anyway, let's translate English to French. What I did was download this giga French-English corpus, which I'll put a link to on the wiki. This is a really cool idea: what this researcher did was go to lots of Canadian websites and use a screen scraper to figure out whether there was a little button saying "switch from English to French"; the screen scraper would automatically click the button and assume that those two pages were the same, and then he tokenized them into sentences and provides this corpus of something like a billion words. So this is pretty great. I didn't really want to create a complete English-to-French translator, because I just didn't have the time to run it for long enough, so I tried to think: okay, what's an interesting subset of English to French? So I thought, okay —
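Here is one way that gradual weaning off teacher forcing might look in PyTorch. This is a sketch under assumptions — the `decoder` takes a token, hidden state and encoder outputs, and `targ` holds the ground-truth tokens — not the lesson's actual code.

```python
import random

def teacher_forcing_ratio(epoch, n_epochs):
    # 1.0 at the start of training, decaying linearly to 0.0 by the last epoch
    return 1.0 - epoch / max(1, n_epochs - 1)

def decode_sequence(decoder, dec_input, hidden, enc_outputs, targ, tf_ratio):
    outputs = []
    for t in range(targ.size(1)):
        out, hidden = decoder(dec_input, hidden, enc_outputs)
        outputs.append(out)
        if random.random() < tf_ratio:
            dec_input = targ[:, t]                  # teacher forcing: feed the true token
        else:
            dec_input = out.topk(1)[1].squeeze(1)   # otherwise feed the model's own prediction
    return outputs
```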
What if we learn to translate English questions? Let's start with the WH words — what, who, where, why. When I looked at it, it turned out that that was something like 80,000 sentences — a lot of sentences — and the nice thing is that all of those sentences have a somewhat similar structure. So we're going to learn everything about translating English into French, just on a slight subset. That's why I wrote this regex, which says: look for things that start with "Wh" and end with a question mark in English, where the French can be anything at all but should also end with a question mark. Then I went through the French and English files using this really cool trick we've mentioned once before: once you call open in Python, that returns a generator that you can loop through, and if you zip the two together you've got the English questions and the French questions paired up. Then just go through, run the regexes, and return the pairs where both regexes matched. So here are the first six examples — that looks good, and as you can see a lot of them are really simple, and some are more complex. As per usual, dump that in a pickle file so we can load it quickly later.

All right, so we've got the English questions and the French questions. I'm going to show you all the steps of real-world NLP so you can see all the pieces, and we're going to do everything by hand so you can get a sense of exactly what happens. The next step is tokenization. Tokenization is taking a sentence and turning it into a list of — basically — words. But this is not quite straightforward, because what's a word? Is that a word, or is that, or is that? So I had to make some decisions based on my view of what was likely to work, which is basically "okay, I think that's a word", and I just wrote some regular expressions for doing heuristic tokenization. Now, you can use NLTK — the Natural Language Toolkit — which has a bunch of tokenizers in it; honestly though, I was happier with my hacky rules-based tokenizer than with any of the NLTK tokenizers I tried for this problem. You can see that, for example, if you have any letter followed by apostrophe-s, then I want to make the apostrophe-s its own word, because it's more like "of" than anything else. Or else, if it's a letter followed by an apostrophe followed by a letter, that's probably French, in which case everything up to the apostrophe is one word and everything after it is another word. You can basically see it here: I tokenize the French, and the apostrophes are split exactly where I want them. And here's the English — let's test a very accurate statement like "Rachel's baby is cuter than others", and you can see the apostrophe-s is being handled the right way, as opposed to the French-style apostrophe handling. So check out my tokenizer — it all looks good, and it makes accurate statements about Rachel's baby as well.

So now that we've got that tokenizing, we can go ahead and do the standard thing we do every time we work with NLP, which is to turn our list of words into a list of numbers. We always do it the same way: basically, create our vocabulary — what are all of the possible words, and how often do they appear?
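A hedged reconstruction of those two steps follows; the corpus file names, the exact regexes and the tokenizer rules here are my approximations, not necessarily the notebook's.

```python
import re
import pickle

re_en = re.compile(r'^(Wh[^?.!]+\?)')     # English: starts with "Wh...", ends with "?"
re_fr = re.compile(r'^([^?.!]+\?)')       # French: anything, as long as it ends with "?"

# open() gives a line generator, so zipping the two files pairs up the sentences
pairs = []
with open('giga-fren.en') as en_f, open('giga-fren.fr') as fr_f:
    for en, fr in zip(en_f, fr_f):
        m_en, m_fr = re_en.search(en), re_fr.search(fr)
        if m_en and m_fr:                 # keep only pairs where BOTH regexes matched
            pairs.append((m_en.group(), m_fr.group()))

pickle.dump(pairs, open('fr-en-questions.pkl', 'wb'))

def simple_toks(sent):
    """Hacky rules-based tokenizer in the spirit described above."""
    sent = re.sub(r"(\w)('s)\b", r"\1 \2", sent)    # Rachel's -> Rachel 's
    sent = re.sub(r"(\w)'(\w)", r"\1' \2", sent)    # French l'eau -> l' eau
    sent = re.sub(r"([^\w' ])", r" \1 ", sent)      # split punctuation off into its own token
    return sent.lower().split()
```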
Insert a padding character; insert the start-of-stream character — this is like the asterisk in the previous one. Create the reverse mapping from word to ID using the dict-with-enumerate trick, and then go through every sentence and turn it into a list of token IDs by calling that dictionary. And there's a question behind you, Rachel: "Do you need a stemmer before you convert into numbers?" No, not at all. Words with different stems are different words and have different translations, so we want to keep the whole thing. The tokenizer really is just there to do something we think we can do in a purely rules-based way; the question of how we deal with morphological differences is actually highly complex and varies a lot by language — we want the neural net to learn it. All right, so this is going to end up returning the list of IDs for each sentence, the vocabulary, the reverse vocabulary and the frequency counts for the vocabulary.

Great. The next step is to turn these words into word vectors. Earlier on we used word2vec, because word2vec has these multi-word tokens, but for translation I don't want multi-word things, I want single-word things — so that's why I'm going to use GloVe. So go ahead and turn that into a dictionary from word to word vector. Also grab some French word vectors — I found this really fantastic site that's got some nice French word vectors. And now build a little thing that goes through my vocabulary, creates a big array of zeros, and fills in those zeros with word vectors where it can. Of course, sometimes you look up a word and it's not in your word vector list, in which case you just stick in a random vector. This is all stuff we've done many, many times now. I'm also keeping track of how often I find the word, just to make sure: for English, out of the 19,500 words in the vocab, we find 17,200 word vectors in GloVe, so this is looking pretty good — most of our word vectors are being found — and for French a little bit less, but still not bad. That's probably because my particular tokenization strategy is different from the tokenization strategy used to build those word vectors. ("You still have the audio popping and crackling problem from time to time." Yeah, I still have it in the same mode as last time, so I think we've used all the tricks we know — thanks for letting us know.)

Okay, and then of course the other thing we have to do with NLP is make everything rectangular. Notice here I'm calling a Keras function. If Keras does something you need, please use it, even if you're in PyTorch — it's fine. I've heard a number of people say, "oh, I'm trying to use PyTorch, but I hate that it doesn't have pad_sequences." Well, just import pad_sequences. Same with train_test_split, to grab 10% of the data as train versus test. And here we have it: 47,000 train, 5,000 test, and here's an example of a French sentence and an English sentence with all the padding. We do now have to do all the data loading stuff ourselves, so I've gone ahead and created a get_batch function that returns a random permutation of len(x) — len(x) being the number of sentences — and grabs the first batch_size of them. So this returns 16 random numbers between 0 and 47,097, and then I just return those rows for English and those rows for French. So it's okay if you don't have a data loader.
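Pulled together as code, those steps look roughly like this sketch; the token choices and function names are assumptions rather than the notebook's exact ones.

```python
import collections
import numpy as np

PAD, SOS = '_', '*'                     # assumed padding and start-of-stream tokens

def toks2ids(sents):
    counts = collections.Counter(t for s in sents for t in s)
    vocab = [PAD, SOS] + [w for w, _ in counts.most_common()]
    w2id = {w: i for i, w in enumerate(vocab)}          # reverse mapping via enumerate
    ids = [[w2id[t] for t in s] for s in sents]
    return ids, vocab, w2id, counts

def create_emb(w2v, vocab, dim):
    """Embedding matrix: the pretrained vector where we have one, else a small random one."""
    emb, found = np.zeros((len(vocab), dim), dtype=np.float32), 0
    for i, w in enumerate(vocab):
        if w in w2v:
            emb[i], found = w2v[w], found + 1
        else:
            emb[i] = np.random.normal(scale=0.6, size=dim)
    return emb, found                                   # `found` lets you sanity-check coverage

def get_batch(x, y, batch_size=16):
    idxs = np.random.permutation(len(x))[:batch_size]   # e.g. 16 random rows out of 47,097
    return x[idxs], y[idxs]
```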
This is all it actually takes to create a basic generator — it's basically doing the same thing as a generator, if you don't need any data augmentation. And again, here's a piece of code you can steal: if you need a generator in PyTorch, here's a generator in PyTorch — pass in your data, your labels and your batch size, and it will return a batch of each.

While we have time, let me show you this. I mentioned last time that I created broadcasting functions for PyTorch. Basically what happened was I had all this Keras code that worked and I wanted to port it to PyTorch, and PyTorch didn't have broadcasting, so none of this stuff worked — my PyTorch code was way more complex than my Keras code, because there were .squeeze and .unsqueeze and .expand calls everywhere. So I wrote add, subtract, multiply, divide and dot such that they have the exact same broadcasting semantics as Keras. This is actually pretty interesting, and I really wish I had time to show you, but maybe you can have a look at it during the week and ask questions on the forum, because it's so little code — the amount of code to make all of that work is basically that; it's incredibly little code. But importantly, I also want to show you how I build this stuff: I always use test-driven development for this kind of thing. So basically I created a whole bunch of matrices, vectors, three-dimensional tensors and four-dimensional tensors, and transposed versions of them, wrote something that checks that two things are the same, and then went ahead and tried making sure that all of these things ought to be the same as all of these other things. First of all I wrote all the tests, and then I gradually went through putting in code so that the tests started passing, and then I went back and kept refactoring the code until it was simpler and simpler. So you can see that, in the end, all of my little functions are nice and small, and what this meant was I could now write an attention model using almost exactly the same notation as before. So that's how I created these broadcasting operators.

Okay, so given that they exist, here is a non-attention encoder. You can see that basically all it is is: create some embeddings, create a GRU, and then in the forward pass run the embeddings and then run the GRU on that. And in PyTorch, with a GRU, you don't actually have to write three GRUs next to each other — you can just say num_layers equals so many, and that's going to stack the GRUs on top of each other. Okay, so that's that; pretty straightforward. The decoder is also pretty straightforward: again, create the embeddings, create a GRU, except for the decoder we also need a linear layer at the end — in Keras it would be a Dense layer — which is the correct size for our output vocabulary. Remember that at inference time we don't just want the state: we actually want to get out something that we can do an argmax on, to find out which word we've just translated this into, so we need this layer to turn it into the correct size for the French vocabulary. The forward pass here is a little different, as you'll see when we get to where we actually use it — it's actually just a single step. So this is basically doing one letter at a time, as you'll see when we get there, and here's the actual softmax on the dense layer that we created.
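As a sketch, the plain (non-attentional) encoder and the one-step-at-a-time decoder might look like this in PyTorch; the sizes and names are illustrative rather than the lesson's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderRNN(nn.Module):
    """Plain encoder: embedding followed by a stacked GRU (num_layers stacks them for you)."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)

    def forward(self, x, hidden):
        return self.gru(self.emb(x), hidden)            # outputs, new hidden state

class DecoderRNN(nn.Module):
    """One decoding step at a time: embedding, GRU, linear to vocab size, log-softmax."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inp, hidden):
        emb = self.emb(inp).unsqueeze(1)                 # (batch, 1, emb_dim): a single time step
        output, hidden = self.gru(emb, hidden)
        return F.log_softmax(self.out(output.squeeze(1)), dim=-1), hidden
```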
So it's just embedding, GRU, dense layer, softmax — and we return both the new hidden state, which we'll be using for the next step, as well as the actual output. That's all fine. So here's the attention decoder, and as you can see, rather than being a hundred lines of code, it's a screen and a half. The nice thing is that this is all basically the same as Keras: my W1, W2 and W3, my biases, my V, my GRU, and then my final output. And the forward pass is basically all the same as Keras as well: I've got my dot product, then the multiply, the softmax, the weighted sum, then the concatenation, the W3 product with the bias, then the GRU, and then the dense layer and softmax. And again, we return both the actual predictions and the hidden state.

Now, the thing is, we have to write our own training loop, because this is PyTorch, so we're going to have to do a bit more work here. So basically, here's the code that trains one epoch. Create an optimizer for the encoder and an optimizer for the decoder. The criterion — that's what they call the loss function — is negative log likelihood, which is the same thing as cross entropy. Here's that get_batch function we created ourselves earlier, to grab one batch of French and English; put them on the GPU and then call train — we'll look at that in a moment — and keep track of the loss, printing it out from time to time. All the work is actually happening in train, which is here. So each of these things is less than a screen of code, but we still had to write it ourselves. Encode; then — remember, with PyTorch you have to manually call zero_grad at the start of your training loop, to zero out the gradients — then go through each word in your target (we've already encoded the input) and call the decoder, passing in the decoder input, the hidden state and the encoder outputs. The next decoder input comes out of that; keep track of the loss; and then we have to call loss.backward() manually and the optimizer step manually, and return the loss. So it's not very interesting code, and it's the kind of stuff which I suspect the PyTorch community — and maybe some of us — will be able to contribute to getting rid of over time, boilerplate-wise, Keras style. I'm sure that'll happen. Anyway, for now, there's the code.

Now that it's there, we can create our encoder, create our attention decoder, train it for a while, and then test it. For testing I need another function, because I want to turn off teacher forcing. You'll see in this function — this is not very well refactored, I've copied some code here — that again I encode, and then I go through my target length calling the decoder. But look here: I now take my decoder output and find the top one — this is basically argmax — so this is saying, okay, what word did we predict? We're not using teacher forcing; we're not saying what the actual word was, because that would be cheating. We're saying "what word did we predict?", and that now becomes the input to the next loop.
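A sketch of the hand-written training step that goes with the decoder above — assuming `optim.Adam` optimizers and an `nn.NLLLoss()` criterion — showing the manual `zero_grad`, the per-token decoder loop with teacher forcing, and the manual `backward` and `step` calls.

```python
import torch
import torch.nn as nn
from torch import optim

def train_step(encoder, decoder, enc_opt, dec_opt, crit, inp, targ, sos_id):
    enc_opt.zero_grad(); dec_opt.zero_grad()              # PyTorch makes you zero gradients yourself

    batch_size, targ_len = targ.size(0), targ.size(1)
    enc_out, hidden = encoder(inp, None)                   # None -> zero initial hidden state

    dec_input = torch.full((batch_size,), sos_id, dtype=torch.long, device=inp.device)
    loss = 0
    for t in range(targ_len):
        out, hidden = decoder(dec_input, hidden)            # one target token per step
        loss = loss + crit(out, targ[:, t])
        dec_input = targ[:, t]                              # teacher forcing during training

    loss.backward()                                         # manual backward...
    enc_opt.step(); dec_opt.step()                          # ...and manual optimizer steps
    return loss.item() / targ_len

# typical setup (assumed names, for illustration):
# enc_opt = optim.Adam(encoder.parameters()); dec_opt = optim.Adam(decoder.parameters())
# crit = nn.NLLLoss()   # negative log likelihood on the decoder's log-softmax outputs
```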
So that's how we've turned off teacher forcing. The exercise for one or more of you this week, if you're interested, is to do the thing I mentioned earlier: combine this with the training loop, and make the training loop, as it goes through the epochs, gradually move from always using teacher forcing to, over time, randomly using it less and less, until at the very end it never uses teacher forcing and always uses this. If you get that working — and it won't take you long, I think it's pretty straightforward — you'll get better test results than I'm showing you here.

Okay, anyway, let's test it. To do French to English, we basically take our French sentence, turn it into a list of IDs by tokenizing it and converting to IDs, pad it with zeros, call that evaluate function I just showed you, and then join the result together. And there it is: this was the correct English, this was the French we were given, and this was our prediction. It's not the same, but it's still correct — so that's looking pretty good. So there it is. There's translation.

I guess there are a couple of things I wanted to briefly mention — things you can play with if you're interested. This decoding loop: there are much better ways to do it. What I'm doing here, by taking the top one every time, is assuming in the decoder that the top prediction is the correct one. But what if two words were nearly fifty-fifty? This one's 51%, this one's 49%, and I go "oh, it's definitely the 51%" — that might be a bad idea. So I'm going to steal some slides from Graham Neubig of the Nara Institute of Science and Technology, who has a fantastically simple example of what you could do instead. He's doing something slightly different, which is to say: what if we had a sentence like "natural language processing ( NLP )", and your job was to figure out — not how to translate it, but — what part of speech each of those words is? These weird letters are Penn Treebank part-of-speech tags: NN is noun, JJ is adjective, VB is verb, plus left bracket and right bracket. So the correct answer is that "natural" is an adjective, "language" is a noun, "processing" is a noun, and so forth — this would be the correct path through these options.

Now, how would you create this path? Well, you could start out by figuring out how likely "natural" is, in language in general, to be a noun versus an adjective versus a verb and so forth. Then, having done that for every single one of those, you could figure out how likely it is for every one of these to then be a noun, and then an adjective, and then a verb, and so forth. You could keep repeating this process, adding up the log probabilities all the way along, all the way to the end, and pick the path which was best. The problem, of course, is that that's basically five times five times five — if you already had five choices, it's exponentially complex — and remember, in our case we're not picking from five things.
We're picking from 40,000 or 20,000 or whatever the vocabulary of our French language model is. This is the Viterbi algorithm, and this kind of exact decoding for machine translation is NP-complete — if you haven't come across that before, it basically means "forget it". Okay, so let's not do that. But I'm sure you can see how obvious the answer is: rather than doing the full Viterbi search, why don't we just pick the top few hypotheses so far? So if here are the scores for the word "natural" — it's probably not a left bracket, it's probably not a right bracket, but it might be one of these — let's assume it's one of the top three. Okay, so given it's one of the top three, what might be next? Again, let's just pick the top three combinations, and keep going through that. This is called beam search, and in practice every state-of-the-art algorithm for neural language translation uses this for decoding. Writing this, again, is going to be less than a screen of code. I haven't written it — why not go write it this week, add it to this code, add beam search? Here's the entire pseudocode, and I'm sure you could write it in probably less code than that (there's a small sketch right after this passage). So that's beam search.

There's one more thing to mention — oh, we have two questions first, going back to the notebook. One is: do you know if there are any training methods that capture the fact that "what is the population of Canada" and "what is Canada's population" are very nearly the same? No, I don't, but I'm not sure it even matters, because on average a better system will be one which translates those into one or the other — I don't think the best translation approach is going to vary depending on the answer to that question, so I'm not sure it's that important. All right — and then, could we translate between Chinese and English using the same method? Yes, we can, but it would be better if we used the technique I'm going to show you next.

The technique I'm going to show you next is described in this paper, "Neural Machine Translation of Rare Words with Subword Units". Interestingly, this actually came up today when I was chatting to Brad. Brad was asking me: how do I create an analysis of people's tweets using their particular vocabulary, but make it not fall apart if they use some word in the future that I haven't seen before? That's a very similar question to: how do I translate language when somebody uses a word I haven't seen before? Or, more generally: maybe I don't want to use a hundred and sixty thousand words in my vocabulary — that's a huge embedding matrix, it takes a lot of memory, it takes a lot of time, it's going to be hard to train. So what do you do?
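Going back to the beam-search decoder mentioned just above, here is roughly how small it can be. This is a sketch under assumptions: `decode_step` is an assumed callable that runs one decoder step and returns per-token log probabilities plus the new state.

```python
import heapq

def beam_search(decode_step, init_state, sos_id, eos_id, beam_width=3, max_len=30):
    beams = [(0.0, [sos_id], init_state)]          # (cumulative log-prob, tokens so far, state)
    for _ in range(max_len):
        candidates = []
        for score, toks, state in beams:
            if toks[-1] == eos_id:                 # finished hypotheses are carried over as-is
                candidates.append((score, toks, state))
                continue
            log_probs, new_state = decode_step(toks[-1], state)
            best = sorted(enumerate(log_probs), key=lambda p: p[1], reverse=True)[:beam_width]
            for tok_id, lp in best:
                candidates.append((score + lp, toks + [tok_id], new_state))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]       # tokens of the highest-scoring hypothesis
```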
And so the answer is: you use something called BPE — byte pair encoding. What it's going to do is take a sentence like "hello I am Jeremy ." and basically say: I'm going to try to turn this into a list of tokens that aren't necessarily the same as the words. The first thing I do is look at every pair of letters — "h e", "e l", "l l", and so forth — across my whole training set, and ask which pair of letters is the most common. Maybe the answer is "e l". So I take those two — we start off with every character separate — and combine them into a single entity, "el". Then I do that again, and maybe the next answer is that "a m" appears a lot, so I turn that into "am"; and maybe "ere" is the next most common thing I can find, so now we take that and replace it with "ere". And you can keep doing this — you never cross word boundaries — so in theory we could do this forever, until we end up with whole words again. But instead, with this BPE encoder, you provide a single parameter, which is the maximum number of merges you want to do; a common default would be something like 10,000. So at the end of that you're going to end up with something like 10,000 subword sequences, and we might end up turning this sentence into something like "h e l l o" with a special end-of-word marker, then "I", then "am", then "J", "ere", "m y" — something like that.

Now, the cool thing is that you can do this by going to this GitHub site, downloading it, and running it on your file of prose, and it will spit out exactly that — it'll stick a space between every one of these BPE codes. So in other words, to use this with what I just showed you, you don't have to write any code: you can just take the English and the French sentences, run them through this, and it will spit out BPE-encoded versions of them. Having said that, I think maybe the optimal approach would be to write something which first figured out which are the most common 20,000 or 10,000 words and left those words alone, and only ran the BPE encoding on the truly rare words, because sometimes BPE encoding splits things up in ways that aren't quite what you want.

Anyway, this is a super important technique. And for Chinese — those of you who speak Chinese know that it's actually not at all clear where words begin and end. Not only are there no spaces, but grammatically in Chinese there are things like this: one example would be that you can have a sequence of two verbs — or a verb and an adjective — where the second verb or adjective describes the result of the first one. You could treat that as a single word, which is a perfectly reasonable thing to do, or you could treat it as separate words, which would also be perfectly reasonable, and there's no right answer. And this kind of thing happens all the time in Chinese: you can insert a character into the middle of a two-character verb, and it turns it into a new word which means that that thing can't be done. Is this a new word, or is it now three words?
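The learning loop at the heart of BPE is tiny; this is essentially the algorithm from the Sennrich et al. paper, shown here with a toy vocabulary. The words start fully split into characters with an end-of-word marker, and each iteration merges the most frequent adjacent pair.

```python
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent pair of symbols occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'h e l l o </w>': 5, 'h e l p </w>': 2, 'j e r e m y </w>': 3}
for _ in range(10):                # the one parameter you choose: how many merges to do
    best = get_stats(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
print(vocab)                       # keys are now sequences of subword units
```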
It's very hard to tell. So if you use BPE with Chinese, you can tokenize it in a way that's entirely statistical, and I think that would be the answer to that question.

Okay, so I want to leave translation there, because I desperately wanted to insert a discussion about segmentation. We've kept talking about segmentation, and I really realised in the last week how exciting one particular path of segmentation has turned out to be, and I really, really wanted to explain it to you before the end of the course — so I'm going to do half of it today and half next week, if we can. Let me show you why this is exciting. Segmentation is about taking something like this and turning it into something like this, where every colour represents a thing: pink is road, this light purple is lane markings, this blue is footpath, this purple is car, this red is building, and so forth. This is quite a challenging thing to do, but it's really important for anything that's going to understand the world and react to it — so clearly robotics and self-driving cars absolutely have to do this very quickly and very accurately, but really a whole range of computer vision problems need to be able to do this. For example, last week we saw that hackathon-winning entry from a couple of our students that was able to take this cat and blur it out, or apply style transfer to just this cat, or whatever. You need a way of knowing exactly where the cat is. And for something like "remove the background", which was one of the things I showed, you actually need to do a really good job of segmenting out the cat, because if you get it slightly wrong and then remove the background — you often see this when people use Photoshop badly — you end up with the bit between the ears still there, or the fur looking really spiky, or something. So if you want to create the next-generation Photoshop, you need to be really good at this. There are lots of reasons to need to be really good at this.

Now, it turns out that there's a fantastic way of doing this, with a fantastic name: the One Hundred Layers Tiramisu. The Hundred Layers Tiramisu is a fully convolutional DenseNet, and so we're not going to look at it yet — instead we're going to look at the DenseNet, because we can't understand the Tiramisu without the DenseNet. So here is the paper that introduced the DenseNet. As it turns out, you really need to know about the DenseNet for other reasons too, and let me show you why — it's only recently that I fully appreciated this. Here are the results for the DenseNet, and if you look down here you will recognise that it has been compared to genuinely state-of-the-art stuff: Network in Network, Highway Networks, FractalNet, ResNet, ResNet with stochastic depth, and Wide ResNet. These are genuinely state-of-the-art architectures, and it's being evaluated on some heavily studied datasets — this is CIFAR-10, this is CIFAR-100; "+" is with data augmentation, without the "+" is without data augmentation. On CIFAR-100, the previous state of the art — and this is massively well studied — was about 28, and that itself was way above everybody else; this paper got 19 and a half. You've seen enough of these now to know that you don't see a thirty-plus percent decrease against state-of-the-art results in computer vision nowadays.
So this is a huge advance. Now, the reason this huge advance is important is specifically these no-data-augmentation columns. Here's the CIFAR-10 one — a similar thing, going down from 7.3 to 5.1, which is a 20 or 30 percent decrease as well. What these no-data-augmentation columns represent is basically the performance on a limited amount of data: if you're not using data augmentation, you're forcing the model to work with less data. And I know that a huge number of you are wanting to build stuff where you don't have much data — so if you're one of those people, you definitely need to use DenseNet. At this stage, DenseNet is by far the state of the art for datasets where you don't have much data.

Okay, so I want to teach you about DenseNets for two reasons: first, so that next week we can learn about the Hundred Layer Tiramisu, but also so that we can find out how to create way, way better computer vision models when you have limited data. So let's learn how to do this. I can actually describe it in a single sentence: a DenseNet is a ResNet where you replace addition with concatenation. That's the entirety of what a DenseNet is — but understanding what that means, and why it works, is more involved.

So let's remind ourselves about ResNet. We've looked at this many times, but there's no harm in reminding ourselves. With ResNet we have some input, we put it through a convolution to get some activations, and another convolution to get some activations, and we also have the identity connection — and this is addition. So basically we end up with layer t+1 = f(layer t) + layer t, where f is those convolutions. And then, as I've normally shown after that, the function f is effectively the difference: f(layer t) = layer t+1 − layer t. It's calculating a residual — a function that can find the error. That's what we see every time we look at ResNet.

Okay, so what if we do exactly the same thing but replace that addition with concatenation — just join them together? And remember, this is just one block: we've got block after block after block. So what that means is that after the first layer we have both the result of some convolutions and the original input, because we literally copied it and concatenated it; and after the second layer we've got convolutions on convolutions, plus the original first layer of convolutions, plus the original input. Furthermore, that second layer of convolutions was itself operating on this concatenation — so it was operating both on the original data and on the outcome of the first set of convolutions. So when people draw DenseNets, they tend to draw them like this: they show every layer connecting to every layer after it. Now, I didn't define it that way, because it's much easier in practice to implement it by just saying each layer equals all of the previous layers concatenated with a convolution of the previous layers — and if you concatenate recursively, that means you always have all of your previous layers there as well.
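To make the one-sentence definition concrete, here is a schematic sketch in Keras-2-style functional code (not the notebook's exact implementation): the only difference between the two block types is `add` versus `concatenate`.

```python
from keras.layers import Conv2D, BatchNormalization, Activation, add, concatenate

def conv_block(x, nf):
    # pre-activation ordering: batch norm -> relu -> conv
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    return Conv2D(nf, (3, 3), padding='same')(x)

def res_block(x, nf):
    # ResNet: ADD the block's output to its input (assumes x already has nf channels)
    return add([x, conv_block(conv_block(x, nf), nf)])

def dense_layer(x, growth_rate):
    # DenseNet: CONCATENATE the new features onto everything computed so far,
    # so each layer adds `growth_rate` filters and later layers see all earlier ones
    return concatenate([x, conv_block(x, growth_rate)])
```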
So sometimes people get confused when they see this picture, where everything is shown connecting to everything else, but then they look at the code and it looks like each layer is only connected to the previous layer — and that's because the previous layer itself was connected to the previous layer, which was connected to the previous layer, and so on.

Okay, so because we keep concatenating, the number of filters keeps getting bigger and bigger, so we're going to have to be careful not to add too many filters at each layer. The number of filters added at each layer is called the growth rate, and for some reason they use the letter k for it. They tend to use values of 12 or 24; in the Tiramisu paper they tend to use 16. So every layer has 12 more filters than the previous layer, and you can generally have something like a hundred layers — so after a hundred layers you could have 1,200 or 2,400 filters, which is getting to be quite a lot. Interestingly, although they add up, you actually end up with fewer parameters than normal. You can see that this CIFAR-10 result here — even this is a state-of-the-art result, 7% — is with only a million parameters; it's beating a ResNet with 10 million parameters by a third. This is why it's working so well with so little data.

I'm not convinced this is the right approach for ImageNet — or rather, I'm not convinced there isn't a massively better approach there. This is the picture for ImageNet, and this is the important one: the number of FLOPs — the number of floating point operations, so essentially the amount of time it takes your computer — versus the error rate. You can see DenseNet and ResNet have about the same error rate, and DenseNet is about twice as fast, maybe a bit less — it's still better, but it's not massively better — and there are actually better architectures than ResNet for ImageNet nowadays. So really, if you're working with something more in the 100 to 100,000 images range, you probably want to be using DenseNet; if it's more than a hundred thousand images, maybe it doesn't matter so much.

So, do you want to see the code? Interestingly, this turned out to be something that suited Keras really well — these kinds of things, where you're using standard kinds of layers connected in different ways, Keras is fantastically great for, as you'll see. I'm going to use CIFAR-10; I basically copied and pasted this from the Keras datasets — keras.datasets cifar10. There's an example of CIFAR-10: it's a funny old dataset of 32 by 32 pixel images, and that's what a CIFAR-10 truck looks like. The values are from 0 to 255; I want them to be from 0 to 1, so I divide them by 255. So we're going to try to figure out that this is a truck — we have 10 categories in CIFAR-10.

The less code you write, the less chance there is for an error, so I try to refactor out everything that happens more than once. Even a simple thing like the ReLU activation, I create a function for. Dropout, if you have dropout — there's a function for. Batch norm with a particular mode and axis — there's a function for, and then applying ReLU on top of batch norm — there's a function for that too. Convolutions always have this init, this border mode, this L2 regularisation and this dropout, so there's a function for that; and then batch norm, then ReLU, then convolution and dropout — there's a function for that as well. In the paper, they also have something called a bottleneck layer.
This is a one-by-one convolution where I basically compress the number of filters down to the growth rate times four — it's a way of reducing the dimensionality. You'll see that when they use bottleneck and something called compression — which we'll see in a moment — they call it DenseNet-BC, and you can see that that reduces the number of parameters even more and therefore makes it even more accurate. So generally speaking, you'll probably want to use bottleneck; it's just a one-by-one conv with, in this case, 48 filters, so it's reducing the dimensionality through that.

So basically what happens in a DenseNet is you have a number of dense blocks, and each dense block consists of a number of these convolutions followed by concatenation. I go through each layer: convolution, concatenate — and you can see I'm actually replacing x with the result, so it keeps concatenating onto itself again and again, getting longer and longer. Then from time to time I add a transition block, which is simply a one-by-one convolution followed by a pooling layer. So, just like every computer vision model we're used to: a bunch of computation layers, then pool; a bunch of computation, pool; a bunch of computation, pool. So this looks like a pretty standard kind of architecture. Oh — and the other thing in each transition block is that you can optionally have this thing called compression, which you'd normally set to 0.5. That just says: take however many filters you currently have and multiply by 0.5. So every time you have a pooling layer, you also decrease the number of filters. When you have this bottleneck layer and this compression of 0.5, that's what DenseNet-BC refers to.

"Yes — can we do transfer learning on DenseNet?" Absolutely you can, and in fact PyTorch — a new version just came out yesterday or today — has some pre-trained DenseNet models. Having said that, because the size of the activations continues to increase and increase and increase, we again have this problem that there isn't really a nice small number of activations that you could build on top of, so I'm not sure how practical it would be — but you certainly could.

All right, so here is the whole DenseNet model. Basically, there are four layers which aren't part of the dense blocks: there's an initial three-by-three convolution, there's a global average pooling layer, and there's also a ReLU and a batch norm. So if you subtract the four and then divide by the number of dense blocks, that tells you how many layers you need for every block. So you do your three-by-three conv, then you go through each of those blocks — get a dense block, and for every one except the last we also do a transition block.
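Continuing the sketch from earlier (and reusing its `conv_block`), a dense block and a transition block with compression look roughly like this — again schematic, not the notebook's code.

```python
from keras.layers import Conv2D, AveragePooling2D, concatenate
import keras.backend as K

def dense_block(x, n_layers, growth_rate):
    # each pass adds growth_rate new filters and concatenates them onto the running tensor
    for _ in range(n_layers):
        x = concatenate([x, conv_block(x, growth_rate)])
    return x

def transition_block(x, compression=0.5):
    # 1x1 conv that optionally halves the filter count, then pooling
    nf = int(K.int_shape(x)[-1] * compression)
    x = Conv2D(nf, (1, 1), padding='same')(x)
    return AveragePooling2D((2, 2))(x)
```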
That's the one with the pooling. Finally, do a batch norm, a ReLU, global average pooling, and then a dense layer to create the right number of classes. So that's basically it: you create that, compile it, and fit it.

I ran it last night, and I couldn't quite run it as long as they did, but I did get to 93.23, which easily beats all of the earlier state-of-the-art results — the ones somewhere around six and a bit. So I didn't have time to run it for as long as they did, but I certainly replicated their state-of-the-art result, and as you can see, using nothing but basically two screenfuls of Keras.

I read through that pretty quickly, but honestly this is all stuff you're pretty familiar with. So read the paper — it's a really easy paper to read — and read the code; it's really easy code to read. If you haven't done much implementing of papers with code, this is a great place to start, because there's no math in the paper, it's pretty clear, the Keras code is really easy to read, and there are no new concepts — so this would be a great way to get started. And as I said, some of the students and I basically started on this on Friday and got it knocked out, so I think this is pretty exciting.

All right, it's nine o'clock. Thanks, everybody — I'm looking forward to chatting with you during the week, and I'll see you next Monday for our last class.