So tonight I'm going to be doing the beginners talk, although it's a little bit more advanced than most of the beginners talks that we do. I'm going to be going through sequence-to-sequence models and neural machine translation. Do people know what neural machine translation is? Have people heard the term before? Yes, no? Okay, I'll explain as we go.

So let's start off with sequence-to-sequence models. Sequence-to-sequence models are used for a whole bunch of different things, everything from chatbots, to speech-to-text, to dialogue systems, to Q&A, to image captioning, which Martin's going to be looking at later tonight, to things like Google's smart reply system for Gmail. So why sequence-to-sequence? The key thing is that sequences preserve the order of the input. With normal, basic neural nets there's no really good way, or certainly not as good a way, to represent the concept of time and of things changing over time. Sequence-to-sequence models allow us to process information that has a time or order element to it, and to preserve information that couldn't be preserved with normal neural networks.

There are two main components to a sequence-to-sequence model. You've got your encoder, which takes the input in time steps, creates a hidden state, and sets it up to be passed to the decoder. Then your decoder takes that hidden state and uses it to start predicting things. The other thing that's crucial with these kinds of models is that you need a lot of data, often an unbelievable amount of data.

One of the key ideas behind this is that the aim is to convert a sequence into a fixed-size feature vector. Say we're looking at translation: we take a sequence of words and try to get it into a fixed-size feature vector, which can then be used to predict what that sequence translates to in a different language. It has to do a few different things: it has to remember the important things in the first sequence, and it also has to lose any unnecessary information in that sequence.

So this is what it looks like, and this is what the concept of neural machine translation looks like, too. Over here we've got our text going in, and the prediction at the top. We've got an embedding layer, and I'll go through each of these layers in more detail as we go. We've got a set of weights, which then goes into an LSTM, or a series of LSTMs as it's shown here; this is basically unrolling the encoder and the decoder. On the other side, in the decoder, at the top we've got a time-distributed dense network, which is used for predicting across the vocabulary we're trying to translate into. Now, this kind of model could be exactly the same for chatbots. It can be used for chatbots, it can be used for so many different things. The key thing is that it does everything in a certain sequence, in a certain order.

So, one of the big things: last time Martin went through LSTMs. Who was here last time for LSTMs? Only about half of you. Okay, maybe we should explain a little bit. LSTMs are a type of RNN that is very good at remembering information over time and through time steps.
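Before we go further, here's roughly what that encoder/decoder split looks like in Keras. This is a minimal sketch, not the talk's actual notebook: the layer widths, the 100-dimensional inputs, and the 5,000-word vocabulary are all assumptions. The point is just that the encoder's final hidden state becomes the decoder's starting state:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Encoder: read the input sequence and keep only its final hidden state.
encoder_inputs = Input(shape=(None, 100))        # (steps, embedding dims) - assumed sizes
_, state_h, state_c = LSTM(256, return_state=True)(encoder_inputs)

# Decoder: start from the encoder's state and emit a prediction per step.
decoder_inputs = Input(shape=(None, 100))
decoder_seq = LSTM(256, return_sequences=True)(decoder_inputs,
                                               initial_state=[state_h, state_c])
predictions = Dense(5000, activation='softmax')(decoder_seq)  # assumed vocab size

model = Model([encoder_inputs, decoder_inputs], predictions)
```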
Coming back to LSTMs themselves: I don't have a diagram of the actual makeup of one, but it has the ability to remember certain information and forget certain information. The key advance with LSTMs in the past year or two would be this concept of bi-directional encoders, bi-directional LSTMs. Here what's happening is we've got one series of LSTMs reading the text forwards, and right above that another series coming from the other direction. Our weights, in this case you see the A's, are basically the hidden state, and we end up with two hidden states, one from the forward direction and one from the backward direction. That allows the network to learn about the text and what's actually going on in it. Now, the big thing is that we often have a lot more than just two layers; I'll come back to that.

Let's look at the bi-LSTM secret. Bi-directional LSTMs generally work better than almost everything else for NLP at the moment, especially when combined with attention. They basically seem to be outperforming everything. Does anyone here know Chris Manning? Chris Manning is one of the lecturers at Stanford; he's also one of the key people behind GloVe embeddings and behind a number of quite well-known papers. At a talk recently he basically said it was almost depressing that with NLP nowadays, if you want the best result, you just throw bi-directional LSTMs plus attention at it, and generally that will produce the best result. It's just a simple fact that this seems to be the model that works. That said, the state of the art in NLP is often still very much lacking, which is the sad thing with a lot of this stuff.

I showed a model before with just two layers. Here we've actually got two bi-directional LSTMs, so we've got four layers happening. For bigger models you may have six different bi-directional LSTMs that you're going through. The key thing is that each one of these has a set of weights inside it, and each one is learning. If we're passing the sequences up, each one is influencing the one above it, and as it goes forward in time it produces a hidden state representing everything that's in those words, or ideally everything that's in those words.

Okay, so the decoder. Now, I'll preface this by saying there are a lot of different ways to do sequence-to-sequence models, and quite a few different architectures for them; I'll show some of those later on. But often what you do is come out of your encoder with what's called a context vector. The context vector is basically the one snapshot of the entire sequence that's come before, and that is then used to predict the output. So we basically have a dense layer with softmax, just like a normal neural network that we've looked at quite a few times before. The thing is, though, it's time-distributed, meaning we have one of these for each time step. If you see the circles at the top, they represent your entire vocabulary. If we're doing text and trying to get back to words — obviously everything gets encoded into numbers before we put it in, we'll talk about that — that top layer will have one neuron for every single word in your vocabulary. So that top layer will often be super big.
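In Keras that time-distributed softmax head is basically one line. A sketch — the vocabulary size here is just an example (it happens to match the model coming up):

```python
from keras.layers import Dense, TimeDistributed

# One neuron per word in the target vocabulary, with a softmax,
# repeated (time-distributed) across every step of the output sequence.
output_head = TimeDistributed(Dense(27000, activation='softmax'))
```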
I think the one I'm going to show you tonight is a very simple example, and it's 27,000 neurons wide or something — and that's just a reasonably simple example. The key thing, though, is what happens once it makes a prediction. So here we've got a GO token being passed in, and it's making the prediction "I". Then what happens is we feed that in as the input to the next cell, and it now makes the prediction "am". Then we take that, feed it into the next one, and it makes the prediction "good". In theory, if it gets it all right, it should feed out whatever you're trying to translate, or whatever the answer for a chatbot is. Last time I talked about generative chatbots — this is one of the ways you can actually reproduce a lot of things with generative chatbots, too.

Okay, so the first thing to understand is that the model will have a sequence length. In the one I'm going to show you, that length is 30 time steps. That means any sentence that's 30 steps or less can go into the model — or even if it's more, I just truncate it at the end. But to do that, you have to put in what's called padding. We basically have to pad out the words with some sort of token that the network is going to learn to recognize. It's a very simple concept. If you look at it, we've got the sentence "hello, how are you?" with a question mark at the end. In this case the padded length, the sequence length, is eight steps. So basically I'm passing in: hello, how, are, you, ?, pad, pad, pad. Generally the pad will be a zero; it's just how we fill out the steps. You'll also see this used if you're doing any sort of text work with CNNs — you'll often use padding there as well, so it's a very important concept to understand. All input sequences must be the same length as each other, and all output sequences must be the same length as each other, but input and output lengths don't have to be equal. You could have 30 steps in and three steps out. You might use that for some sort of time-series prediction, where you're feeding in a lot of steps but only predicting three steps into the future — for example, if you were doing something with stocks or anything like that.

Okay, so the concept of tokens. You'll see tokens used in a lot of this stuff. The PAD token is basically just a spacing system. The EOS token is end of sentence. The GO token tells the decoder to start. All these things give conditional information to the network about what it should be doing. Eventually, if your network is training reasonably well, it should start to realize that if it sees a pad token, the next prediction is going to be pad as well, because we don't have pad in the middle of things. We also have an out-of-vocabulary token; I'll talk about that a little more later. Basically, as we reduce our vocabulary, maybe we don't want to train on words that only appear once or twice in the corpus, so we often replace those words with an unknown or out-of-vocabulary token. The last token is a little bit different. I've put an example of this, and I'll talk about it more later: it's used in neural machine translation when you want to say that this phrase is about to be translated into a certain language.
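Going back to padding for a second: in Keras that whole pad-and-truncate step is one call. A sketch, with made-up token IDs, assuming zero as the pad value and padding at the end of the sentence as described above:

```python
from keras.preprocessing.sequence import pad_sequences

# "hello how are you ?" already converted to (made-up) token IDs
seqs = [[23, 7, 41, 9, 56]]

# Pad with zeros out to 8 steps; anything longer than 8 would be truncated.
padded = pad_sequences(seqs, maxlen=8, padding='post', truncating='post')
print(padded)  # [[23  7 41  9 56  0  0  0]]
```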
So, on that last token: often what they'll do is pass in a token saying what the target language is. It's kind of like telling the network: hey, this phrase — I'm going to want you to encode it so that, in this case, Spanish will be able to be understood and decoded out of it. More on that later.

Okay, the other thing we need is lookup tables. We need a way to convert words into numbers — obviously with all of our neural networks, LSTMs, GRUs, RNNs, et cetera, we can't just put words into them. So this is just a simple lookup table where every number relates to a certain word. Often you'll build that based on word frequency, so the more frequent the word, the lower the number — there's different thinking about that. You'll also see here that a question mark gets a number too. To the network, a question mark and a word are basically the same thing: they're just tokens, and all it's doing is looking at them as tokens. So you can see here that the first word in both of these sentences is number 23, and then there are no other similarities between the two.

This is what I was talking about before: we often discard words that are used very little and replace them with an unknown or out-of-vocabulary token. The reason we do that is that the bigger the vocabulary, the harder it gets to predict — or, often, the more compute power is needed to predict. Remember the dense layer at the top: if I've got a vocabulary of 200,000 words, then there has to be a layer of 200,000 neurons at the top, going into a softmax to predict the output word.

Okay, embeddings. Martin covered embeddings a lot last time, so I'm not going to go into them deeply. But embeddings allow us to extract more semantic meaning from the words. Often we'll use pre-trained embeddings like word2vec or GloVe, those kinds of things; in certain cases you'll probably want to train your own embeddings instead. But the whole point of these embeddings is that because they've been trained across billions of words, they're able to spot relationships, and we're basically just leveraging that and bringing it into our network.

Another thing to understand, which I think often confuses people, is that even though we talk about our input as being one time step, we're often feeding a lot more into the network. Here the input is going to be a GloVe embedding, which is 100 dimensions; the sequence length is 30; and we're using a batch size of 64. So my tensor shape is 64 by 30 by 100, being fed in each time.

Okay, some background on NMT. Language translation is a massive business — it's a $40 billion a year business. Google is translating over 100 billion words a day currently, and I'm sure that's going up every single day. Facebook has been working on their own translation systems as well. In e-commerce, eBay is another big company that's starting to use NMT and translation a lot. This is a really big industry. Why is it hard? Translating words is really hard for the simple reason that the correct word depends on all the other words around it. And this is why we need something that takes sequence into account as we're doing it.
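On those lookup tables: the usual pattern is a frequency-ordered dictionary, with rare words collapsed into the unknown token. A sketch in plain Python — the toy corpus, the rare-word threshold, and the reserved IDs (0 for PAD, 1 for UNK) are all assumptions:

```python
from collections import Counter

corpus = [["hello", "how", "are", "you", "?"],
          ["how", "are", "things", "?"]]          # toy corpus for illustration

counts = Counter(tok for sent in corpus for tok in sent)

# Most frequent word gets the lowest ID; 0 reserved for PAD, 1 for UNK/OOV.
word2id = {"<PAD>": 0, "<UNK>": 1}
for word, count in counts.most_common():
    if count >= 2:                                # discard rare words (assumed cutoff)
        word2id[word] = len(word2id)

id2word = {i: w for w, i in word2id.items()}

# Rare words fall back to the UNK token when encoding.
encoded = [[word2id.get(tok, 1) for tok in sent] for sent in corpus]
```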
And on why it's hard: you'll definitely find that word order can change between languages. I'm sure you've all had experiences of that — especially with Asian languages, you'll find the grammar is very different from European languages. Rules don't work. Up until very recently, a lot of this was done with statistical methods plus rules, and while they could get to a certain level, they used linguists to come up with those rules, and the problem was the rules would get more and more complicated. If you're trying to put that into code, it just doesn't work very well. There's a famous quote from someone who worked in the field that basically says: every time I fire a linguist, my accuracy rate goes up. The more they moved away from rules and onto purely statistical approaches, the better they got. But even then, traditional SMT — statistical machine translation — was also very complicated in the way it worked.

If we look at some examples — this is taken from one of Chris Manning's presentations, and it shows phrase-based statistical machine translation, syntax-based translation, and neural MT. You can see that neural MT has only been around for two years or so, and already in that time its accuracy has gone way above the other approaches. So much so that most people working on translation are now all moving to this way of doing it.

Okay, another interesting thing about neural machine translation — and I'll give you an article you can read at the end about this — is that previously, when Google was building translation systems, they were generally trying to translate between about 80 languages, back and forth. That really meant they needed something like 6,400 different systems, because they needed a system that would translate French to Chinese, French to English, French to every single language. The cool thing with NMT is that it kind of makes an interlingua: with those tokens at the start telling it which language to decode the hidden state into, it's able to export to any language. That's one of the things that massively advanced it. And it's very interesting — like I said, I'll give you an article at the end — that in many ways these systems are creating their own sort of universal language, their own lingua franca. Does anyone remember Esperanto? It was tried a long time ago and never really took off, but that's the sort of concept these have created. It's very interesting to look at it mathematically, because — obviously it's not doing it in words, it's doing it in math — the way it plots relationships is the same whether it's Japanese, French, or English. So it allows it to very quickly build systems that can translate language pairs that often didn't even have a system before.

Okay, I've gone through this already, I think. This is just showing that context vector in the middle: as we come out of the encoder, we've got a context vector. Let's look at some code. What I've done is put together a very simplified version — I've actually done a few different ones, but I'm going to walk you through one of them. First of all, let's look at the pre-processing.
I've put it into Keras, so it should be reasonably simple. We're going to be translating from English to French. So — oh, sorry, yes. Oh, that's the wrong one. How's that? Yeah. Okay, so I'm going to be going through it and showing you. We're going to be using pre-made embeddings, and we're basically using a corpus of French and English, and I'm filtering it down just to the questions.

The first thing to do is get the corpus in and create some regex expressions. I'm looking for things in English that are your who, what, where type questions. From that, we come up with about 52,000 pairs. Now, that might sound like a lot, but for this it's not anywhere near enough. Generally, even for a bad system, you're looking at a minimum of three million pre-translated pairs. A lot of the sources for this kind of stuff are places like the UN and the EU, where they actually have all this material translated into so many different languages already. You'll often spot that when you're using Google Translate: certain types of language it's really good at — more formal language — and stuff that's less formal it's definitely not as good at.

Okay, so you can see here I've got my English and my French, and I'm basically just going to pickle it all. What I've done on GitHub is put up the pickle files, so you can just pull those in; you don't need to get the whole corpus, which I think is a couple of gigabytes.

Then we want to make our tokens. We're basically just making a simple token system. You can see here — I'm basically just taking our sentences. Can everyone see that? Yes? All right. And I'm just tokenizing it, very simply breaking things into tokens. Later on we're going to put some padding in, but we're not going to put the padding in while it's still text; we'll put it in once we've converted everything to numbers. Then we come through, make our IDs, and get our word embeddings. I'm using GloVe with only 100 dimensions here. I think the minimum Google uses for a language is a 1,024-dimensional embedding for translation and things like that. So I want you to realize that all of this really should not work: we're using way too little information, our model is way too small, et cetera. Then we bring in some French embeddings, which I'm using for another part of it.

Okay, now we bring it in and use Keras to do our padding. You can see now that we've got our 52,000 questions padded, all 30 long. If a sentence was longer than 30, it's been cut off, truncated; if it was shorter, it's been padded. The cool thing with Keras here is it just does it for you — makes it very simple. And here's an example of one of them. I randomized them a bit, shuffled them, and set up a train/test split. Then we look at one, and we can see, okay, in this sentence — I don't know which word is 18, but we can see we've got 18, 52, it gets to a certain point, and then we've got zeros all after that. So that sentence is only that many words long.

Okay, let's look at the model. In the model, basically, I've pickled everything like I said, so you can load it straight in; this notebook you can just run straight through.
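For anyone who hasn't wired up pre-trained embeddings before, the usual pattern looks something like this. A sketch, assuming the standard glove.6B.100d.txt distribution file and the word2id lookup table from earlier (both names are illustrative):

```python
import numpy as np

EMB_DIM = 100

# Parse the GloVe text file into {word: vector}.
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Build the embedding matrix row by row from our vocabulary;
# words GloVe doesn't know stay as zero rows.
emb_matrix = np.zeros((len(word2id), EMB_DIM))
for word, idx in word2id.items():
    vec = glove.get(word)
    if vec is not None:
        emb_matrix[idx] = vec
```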
Before the model, I've set up some simple Keras callbacks to handle things like learning-rate changes and early stopping. If you haven't seen that before in Keras, here's an example of it. I was mucking around with different types of learning-rate dropping, both manual ones and one that was automatic.

Then I come to the model. Basically I'm using bi-directional LSTMs with return_sequences set to true, so each sequence gets passed up as it's processed, and I've got a normal LSTM at the end. If I wanted to, I could stack four or five of these, but obviously the training time would go up a lot. It goes through that, and then I've got my time-distributed layer coming out, where eventually I'm getting down to the French vocab — the length of the French vocab is the number of neurons there — and then a softmax to get the output. This is what it looks like. It's a very simple, small model, and you're already looking at over eight million parameters.

I run it for 20 epochs and it learns. I wouldn't say it's perfect — far from it — but you can see that it's learning. Now, you'll notice the validation loss kind of plateaus. My guess is that because the vocab is so small, there are certain words in the test set that just aren't in the train set at all, so it has no way of learning those words. If we were doing something with two million sentences or so, there's a much higher probability that the train set would contain every single word in the vocabulary.

Let's see how we went. Who can speak French? Anyone? Okay, for him. How did we go? We can see that it's learned something. I don't know how accurate it is: when I ask it what the size of Canada is, it at least understands Canada. It doesn't understand Australia, though — it thinks it's Canada. If I had to guess, I'd bet that's because the word Australia is just not in the vocabulary anywhere. I also printed some out where we can actually see what's there and then the prediction. We can see we've got some of it right, but it's starting to fall apart towards the end of the sentence. Again here, we've got some words right, but a lot of words that are not right. And this is literally after training for about 30 minutes on a pretty good GPU. Let's see here — what has it got? Now here it's interesting: we've got some words right, but we've got the order wrong. I'll explain why in a minute, but you can see clearly that the model is starting to learn. It still has a long way to go.

This is a very vanilla sequence-to-sequence network, and you'll find some really key things with these kinds of networks: they'll work pretty well on short sentences, but they won't work on anything longer. LSTMs generally can only remember up to about 30 steps. If we add attention and a few other things, maybe you can get out to 50 steps. This is one of the biggest limitations in NLP at the moment: we don't have anything that can really go back in time and remember even a few paragraphs, let alone a whole book.
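To make that model and those callbacks concrete before we move on, here's roughly the shape of what was just described, as a Keras sketch. The layer widths, optimizer, and the fr_vocab_size / padded-array names are assumptions, not the notebook's exact code:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

model = Sequential([
    # Pre-trained GloVe weights, frozen; 30-step sequences of 100-dim vectors.
    Embedding(len(word2id), 100, weights=[emb_matrix],
              input_length=30, trainable=False),
    Bidirectional(LSTM(256, return_sequences=True)),   # forward + backward read
    LSTM(256, return_sequences=True),                  # plain LSTM on top
    # One softmax over the French vocabulary at every time step.
    TimeDistributed(Dense(fr_vocab_size, activation='softmax')),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

callbacks = [
    EarlyStopping(monitor='val_loss', patience=3),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1),
]
# model.fit(padded_en, padded_fr[..., None], epochs=20, batch_size=64,
#           validation_split=0.1, callbacks=callbacks)
```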
So, as I was saying: you find these things drop off very quickly after 30 steps, and if they're not trained enough, like in our case, they drop off even sooner. There are a few tricks you can do to get around that. You can flip the input and train it backwards — backwards going in, forwards coming out — which puts the end words closer together. Or you can go to some more advanced stuff.

So there are a few advanced things that are used in production-level NMT systems. The first is attention. This is from the first attention paper, and you'll see that where a normal LSTM or RNN's ability to remember drops off very quickly, the ones with attention — the top line — go out much further. Even then, though, after about 50 steps it's going to struggle to remember certain things.

This is a graphic I took from one of the blogs I've put in the resources, to show you what attention does. Attention kind of looks at the whole thing and works out: okay, which word is most important for this word? It gives a score to every word in your sentence, and with that it's able to get a sense that certain words rely on other words a lot more than others — there are very distinct relationships there. The good thing with this, and I'll explain it a bit more, is that these do a lot better than just short sentences. So this is some of what we saw before: the previous way of doing it will often generate sentences with good grammar, but will either get names wrong or will repeat itself. We saw that where it was repeating a question mark, for example, or repeating certain words.

So think of attention as a little memory module that sits above the network. There are a number of different ways you can do attention, but think of it as a little memory module that sits above the network, looks at the words, and says: okay, which one is the most important? If we look at this sentence, clearly not all words in it are of equal importance. I've colour-coded them to give you an idea: "last Friday" — that's a very specific relationship, especially across languages. If we don't have the word "last" and it just says "Friday", then often people will think we're talking about the future, or about a different tense. Things that relate to tense are really important here, and they really benefit from attention. Where Google's old translation system used to fall down a lot was in things related to time: one sentence would talk about a sequence of events happening, and the translation would have them in the wrong order, so it would actually read as something totally different from what the English sentence, or the source language, said.

So if you look at what I've put at the top here, think of your encoder, and then attention gives a score to every word: how important is this word, this time step, in this particular sequence? This is really important for more than just translation. Like I said earlier, the state of the art in NLP nowadays is basically bi-directional LSTMs for your encoder with attention on top, and that attention allows it to remember so many more steps going forward.
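To give a feel for the mechanics, here's a minimal NumPy sketch of the simplest scoring scheme, dot-product attention. It's just one of several attention variants, and the toy shapes are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder output: 8 time steps, each a 4-dim hidden state.
encoder_states = np.random.randn(8, 4)
decoder_state = np.random.randn(4)        # current decoder hidden state

# Score each encoder step against the decoder state, normalize to weights,
# then blend the encoder states into a single context vector.
scores = encoder_states @ decoder_state   # (8,) one score per source step
weights = softmax(scores)                 # how important each source step is
context = weights @ encoder_states        # (4,) weighted summary of the input
```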
The next thing you can start to do — a more advanced thing in training — and here I've put an example using GRUs rather than LSTMs: sequence-to-sequence models can be RNNs, LSTMs, or GRUs, and any of those will work. It's just that most of the time you'll generally want to use LSTMs, and at the lower levels a number of bi-directional LSTMs. Okay, so what teacher forcing does is this: as we train, when the decoder makes its prediction up here, we check it and record whether it's right or wrong, and we use that when backpropagating through the network — but we don't feed the prediction into the next time step, we feed in the correct answer. We call this teacher forcing, because we're forcing the decoder to not just use the output and the last hidden state, but to actually use the correct answers. It improves your training a lot. Obviously, you have to turn it off when you're using the model to do real predictions, because then there are no correct answers to feed in. Often what you'll do is train with teacher forcing, have the same model twice — one with teacher forcing, one without — take your weights, save them, load them into the version without, and then use that to do normal predictions. But this is something that can help the network learn a lot faster and a lot more accurately.

The next thing is what's called peeking. Normally we would just feed the hidden state — our context vector — straight through into each step of the decoder's RNN or LSTM. Actually, this one here — you can see this one shouldn't be an A, it should probably be a B. You can see that each of these is our weights, so that hidden state gets changed each time it goes through the weights. What peeking does is take the version that's getting changed and use that, but also give every step the original version that came out of the encoder, so that it can kind of check itself. Think of it as a system for checking itself and making sure it's getting things right. Again, it's something that can improve accuracy. This is just a different picture of it, from the paper about it.

Okay, I'm almost done. If you're playing around with Keras, earlier this week I found a third-party library called seq2seq for Keras. I think they're currently talking to François about merging it into Keras itself, so at some stage it may come in. They have a bunch of models there, and I suggest you go and play with it, because it's much easier to set them up. The challenge I found, though, was that they weren't as accurate as the one I just made the first time. That could be because I didn't give them enough time to train and things like that; they seemed to be a lot slower at converging, even though they had attention and peeking and so on.

Okay, resources. I've put all of this up on GitHub, but if you want to read some things, the first article, from the New York Times, is really interesting — it's basically about how Google built their NMT system.
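Back on teacher forcing for a moment: in data terms it just means the decoder's input at each step is the correct previous word, not the model's own guess. A sketch, assuming the token conventions from earlier (0 for PAD, and a made-up ID for the GO token):

```python
import numpy as np

GO = 2  # assumed ID for the <GO> token

# Target French sentence as (made-up) token IDs, already padded.
target = np.array([[17, 4, 95, 6, 0, 0]])

# Decoder input = target shifted right one step, with <GO> in front,
# so at step t the decoder sees the true word t-1 instead of its own guess.
decoder_input = np.concatenate(
    [np.full((target.shape[0], 1), GO), target[:, :-1]], axis=1)

print(decoder_input)  # [[ 2 17  4 95  6  0]]
print(target)         # [[17  4 95  6  0  0]]
# Train with: model.fit([encoder_input, decoder_input], target, ...)
# At inference time, you feed back the model's own previous prediction instead.
```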
That first New York Times article isn't technical in any way, but it does show you some of the interesting things about the mathematics of this middle language that only the machine understands. There's also distill.pub — if you don't know that site, you definitely should; it's got a lot of good articles on it. This one's from Chris Olah, all about augmented RNNs; he goes into attention there and talks about using it for different things. If you get a chance, the other two things I'd say to check out would be Chris Manning's NLP with Deep Learning course from Stanford — I think they've now put all the videos up on YouTube, so you can watch them — and a really good talk Chris Manning gave at the Simons Institute earlier this year, going through what the state of the art in NLP is and how it's done. The last one is a video of a talk that Quoc Le gave about sequence-to-sequence. Quoc Le is kind of the guy who originated a lot of the sequence-to-sequence work: he wrote a number of the papers about it, he works at Google, and he's one of the key people behind smart reply for Gmail. In the talk he goes through — not code, but the math and an explanation of how these systems and architectures work.

Sequence models are still something that's being developed and improved as we go along. There are now a number of different types of attention that approach it in different ways — Martin's going to show one thing about attention later on which is only a week old or something. But check these out; they're really good. And then some papers, if you're interested in reading papers: these are some of the key papers on how Google built their system. The last one is actually kind of an interesting paper — it's basically a 60-page guide to sequence-to-sequence models and all the different types of architectures for them, if you're interested.

That's it. All right, we can do questions at the end, or — yeah, okay. And you didn't want to come up? Does anyone have questions? I can't hear you, sorry. Yep. No — okay. So, Google is translating 100 billion words each day. Right — not that there are 100 billion distinct words. Oh, a very high percentage, I'm sure. I'm sure it's like an 80/20 rule, where even just 20% of the language — probably 5% of the language — makes up 90% of the translations they're doing. But I give you that number just to show you how big this is becoming. And the thing is that from here, as these things get more and more accurate, you're going to find them being built into everything.

Does anyone know the startup Waygo? I feel sorry for them, because they had a great product. They were one of the first people to come up with a product that — they're from Taiwan, right? US? Okay, all right. Anyway, basically their product was OCRing Chinese characters and then straight away showing the text in English. Their technology was really good at the time they came out. Of course, Google's come along and given it away for free, and it's gone everywhere. So you do realize that this is a very big business as well. Any other questions? Yep. [Audience: what if a sentence is sequence-agnostic?] Very few will be sequence-agnostic, though. I can't think of one.
Okay, give me an example. Okay. Look, language has to have an order, right? If I say "the cat sat on the mat", it's very different from "the mat sat on the cat". Well, it has a meaning — okay, yes — and actually, you bring up a good point. One of the things I didn't explain that's also a big part of these models is what's called beam search, which basically uses probability to work out what is a good sentence and what is not. Often in a production model you'll put that in as well, so that you'd find — just like the lady said — we wouldn't say "the mat sat on the cat". Beam search would find that that sequence — the n-gram of those three or four words, whatever it is — appears virtually never in normal language, so it wouldn't predict it. This is where beam search comes in really useful.

Yes — they share the structure, but there are actually several ways to do it. Okay, so they share all the weights for the encoder, and often one set of weights for the decoder. You can do it either way, but generally you share the weights. So take your sequence — let's say it's 10 LSTM steps, one for each step in the sequence. The first LSTM takes the first word, and the second takes the second word plus the state encoded from the first one. Are they sharing weights, those two LSTMs? Yes — and actually they're just the one LSTM. Don't forget I'm unrolling. This is where LSTMs and RNNs get really confusing: when you see it drawn like that, you're seeing an unrolled version of something that really is just one unit that loops around and feeds back into itself. So yes, they're sharing the same weights, but you do often have different weights for the encoder and the decoder.

Why would you use an encoder at all — why not just feed in the words of the whole sequence? Because if you do it like that, the model won't have a sense of the order; you lose that whole sense of order. But I agree — I've tried to make this reasonably simple, because it's something that can get very complicated and very confusing. The thing is, when we talk about an LSTM, we're just talking about one particular unit, but it loops around on itself for the number of steps in the sequence, and then it throws out that context vector, which we can feed into something else and use there.

There was another question. Yeah — so sequence modelling at the level of words is pretty good, but what about the other two boundaries, the level of paragraphs and the character level? Okay: paragraphs, it just doesn't work, not really — you can use skip-thoughts and things like that for that kind of thing. Character level, yes, totally. The challenge, though, even with character-level things: one of the big breakthroughs recently in NMT is a thing called zero-shot learning. For example, remember I talked about unknown words in the vocabulary? Often those will be things like names. So one of the things Google's worked on a lot is having the ability to detect that, oh, this is a name — so even though it's not in my vocabulary list, I'm going to try and find a way to pass it along.
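Since beam search came up a moment ago, the core idea fits in a few lines. This is a sketch: the step function here is a stand-in for the decoder (any function returning log-probabilities over the vocabulary would do), and the beam width, length limit, and EOS ID are assumptions:

```python
import numpy as np

def beam_search(step_fn, beam_width=3, max_len=10, eos_id=1):
    """Keep only the beam_width most probable partial sentences at each step."""
    beams = [([], 0.0)]                         # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:       # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)            # decoder log-probs for next token
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # Prune back down to the best beam_width hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy usage: a random stand-in for the decoder (a real one would condition
# on the encoder's context vector and the tokens generated so far).
rng = np.random.default_rng(0)
print(beam_search(lambda seq: np.log(rng.dirichlet(np.ones(20)))))
```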
Back to those unknown names: one of the ways they often handle them is by predicting characters. But the problem is, if we're predicting characters and can only go 30 steps back, we can't predict a lot of words. This is where — okay, I will say this: don't be confused by the fact that people often present deep learning as if everything is one nice, beautiful end-to-end model. In production, it's almost never like that. They'll have different models that do different things, get a variety of different predictions, beam-search it all, stick it all back together at the end, and deliver that to the user. So even though I'm showing you a simplified end-to-end model, a real-world production system isn't going to be like that. A good example of that is Google's smart reply. The first network an email gets fed into has nothing to do with predicting the reply; it just decides: can I predict a reply for this email or not? Because there are going to be lots of types of emails they can't predict replies for, and they don't want to mess up their data with those and try to predict weird stuff. So the first thing they do is make a prediction of whether they can actually work out a smart reply at all. That's very common in all of these things: you'll have multiple networks working together to get the result.

Any other questions? One more and then we'll swap across. Just — when people build these networks, why can you maybe only get out to 50 steps? Okay, the problem is this. Martin talked about LSTMs and RNNs last time. One of the big challenges with these is what happens to the gradient as you pass it back through the weights across time steps. Think about it: if a number is above one and we keep multiplying it by itself, it explodes; if a number is below one and we keep multiplying it by itself, it vanishes. So this is where you get exploding or vanishing gradients, and keeping things from exploding or vanishing 30 steps out becomes a big challenge for the network. It's something where I do expect that at some stage someone will come up with a new way to do this that can maybe get 10x what we can get now — maybe we could get to 300 to 500 steps. That would be a radically big move. Radically big. Because suddenly we could go from just predicting short sequences to literally doing things like Q&A systems over whole paragraphs. And if you could play around with the paragraphs of a book, pretty quickly you'd be able to do Q&A for a whole book. Anyway, go on.

It's time, yeah. So you've basically got time and the number of words — you're talking about the attention chart, right? You've got the number of time steps along one axis and then how many words it's accurately getting right, and it's using what's called a BLEU score. BLEU basically looks at n-grams and says: okay, based on the corpus that I have, what's the probability of these being accurate? Like we talked about with "the cat on the mat" versus "the mat on the cat": the probability of "the mat sat on the cat" appearing anywhere — except in a transcription of this talk — is not going to be very high. So it uses that to work out when it's getting things wrong, and it generally slopes off quite quickly. And actually, I'd say that graph is kind of showing you the best case, too.
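That multiply-it-by-itself intuition from the gradient question is easy to check with two lines of arithmetic — a toy illustration of the effect, not an actual gradient computation:

```python
# Repeatedly scaling by a factor slightly below or above 1 over 30 "time steps":
print(0.9 ** 30)   # ~0.042  -> the signal has all but vanished
print(1.1 ** 30)   # ~17.45  -> the signal has exploded
```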
And on that chart being a best case: often you'll find things sloping off after about five steps, like we saw in the model I showed you — it could get the first four or five right, and after that it was starting to either repeat itself or get words wrong. Often it repeats itself because it gets those weights stuck and just keeps predicting the same thing, thinking it's right. That's where teacher forcing can be really helpful, to kick it and get it going better. Anyway, Ian's up next. I'm happy to take questions later on, if anyone wants to come up and talk or anything like that.