Today we have Mike Lewis. He's a research scientist at Facebook AI Research working on natural language processing. Previously he was a postdoc at the University of Washington, working with Luke Zettlemoyer on search-based structured prediction. He completed his PhD at the University of Edinburgh on combining distributional and logical approaches to semantics. He has a master's degree from the University of Oxford and won a best paper award at EMNLP in 2016. So without further ado, let's get started with today's presentation.

Okay, well, thank you very much for the introduction, Alfredo, and for inviting me to talk. In this lecture I want to give a high-level view of how deep learning is used for natural language processing these days. There has been really dramatic progress in NLP over the last few years from applying deep learning. These days you can get machine translation systems that produce translations which blind human raters prefer to ones produced by professional translators. You can build question answering systems that you ask a question, they go to Wikipedia, and they give you answers that raters judge more accurate than the ones people give. We have language models that can generate several paragraphs of fluent text. If you'd asked me maybe five years ago, I'd have told you there was absolutely no way any of this would be possible in 2020, but a few techniques have been introduced that have really made a dramatic difference.

One really nice property is that you can achieve all of these things with fairly generic models: the models for all these tasks actually look very similar, and there are just a few high-level principles which are really useful. I'm going to be covering a lot of ground quite quickly, so please interrupt me often with questions. One thing I should say is that, with all the progress we've been seeing in NLP, a lot of what I tell you now will probably be out of date in a year or two; I hope we continue to make progress on the things that are still rough today. So as well as explaining some of the models we're using, I want to try to give you some intuitions about which kinds of principles are working well, so hopefully you come away with a sharper picture.

Okay, so the first topic I want to cover in this lecture is language modeling. Language modeling isn't necessarily a useful task in itself, but it's a really good building block for introducing all the techniques we'll need later on. I don't know if you've seen this before; it's an example from a language model called GPT-2, which came out in 2019. What's going on here is that a human wrote a short introduction about scientists finding a herd of unicorns in the Andes, apparently speaking English, and then, given that text, we ask the language model to continue writing from it. The text we get is actually quite impressive; everyone was really shocked last year that language models could work this well. You can see the continuation seems quite plausible for a news article about these unicorns. The text is very fluent and grammatical.
There aren't really any obvious flaws. It invents quite a lot of details, like the name of the scientist who discovered them. Obviously it's all complete nonsense, nothing here is true, but also none of it looks like anything the model was ever trained on: I'm pretty sure this paragraph doesn't appear anywhere on the internet. This is all completely new language, and it's high-quality text. I'm not going to read it all out, but if you read the rest of the article it wrote, there are some flaws, but they're quite hard to spot, and generally this seems to be quite a good language model. So I'm going to try to show you the kinds of techniques you need to build a language model that works this well.

Okay, so very briefly, what is a language model? A language model is basically density estimation for text: we assign a probability to every possible string, and hopefully the model puts more probability on strings which are fluent English than on other strings. How do you model this density? Obviously there are a lot of possible sentences, exponentially many, so we can't just build a classifier that predicts over them directly. There are different techniques for this, but the one I'm going to talk about, which is the most popular, is to factorize the distribution using the chain rule. All we do is predict the first word, then the second word given the first, then the third given the previous two, and so on:

p(w1, ..., wn) = p(w1) · p(w2 | w1) · p(w3 | w1, w2) · ... · p(wn | w1, ..., wn-1)

This is an exact factorization; it doesn't cost us anything to write it this way. Really we've turned the density estimation problem into a series of classification problems of the form: given a bunch of text, predict the next word. That's going to be a theme through a lot of the techniques in this talk. More concretely, from the example I showed you before, we have the string "the scientists named the population, after their distinctive horn, Ovid's ..." and we want to predict the next word, which in this case is "Unicorn".

At a high level, all these language models look something like this. We feed the text into a neural network somehow; the network maps all this context onto a vector, and this vector represents the next word. Then we have a big word embedding matrix, which contains a vector for every word the model knows how to output. All we do is compute a similarity, just a dot product between the context vector and each of these word vectors, and that gives us a likelihood for the next word. Then we train this model by maximum likelihood in the obvious way. One detail is that we often don't deal with words directly; we deal with things called subwords, or even characters, but all the modeling techniques remain the same. All the skill here is in this context encoder: how do you build it?
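To make that setup concrete, here is a minimal sketch (my own illustration, not the lecturer's code) of the prediction head just described: a context vector is compared against every row of a word embedding matrix by dot product, and a softmax turns the scores into a distribution over the next word. The encoder that produces the context vector is left abstract, and the sizes are made up.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 50_000, 512

# One output vector per word the model knows how to produce.
output_embedding = torch.nn.Embedding(vocab_size, d_model)

def next_word_distribution(context_vector):
    """context_vector: (d_model,) summary of the text so far, produced by
    some encoder (CNN, RNN, transformer, ...)."""
    # Dot product of the context with every word vector gives one score per word.
    scores = output_embedding.weight @ context_vector   # (vocab_size,)
    return F.softmax(scores, dim=-1)                     # p(next word | context)

# Training then just maximizes the log-probability of each observed next word.
```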
So the first approach I'll mention is convolutional models. Convolutional models encode the inductive bias that language has a kind of translation-invariance property: we should interpret a phrase the same way no matter what position it occurs in. A typical model looks like this. First, every word is mapped to a vector, which is just a lookup into an embedding matrix, so a word gets the same vector no matter what context it appears in. Then we apply a stack of 1D convolutions followed by non-linearities, until eventually we end up with a vector representing the context, where "representing the context" really means "what should the next word be". I think the very first neural language model, from Bengio in 2003, was essentially of this form, and these convolutional approaches were shown to work quite well by Yann Dauphin and colleagues in 2016, applying modern deep learning techniques. These models are very fast, which is great; speed matters a lot for language modeling because we typically use huge amounts of training data. But they come with one downside, which is that they can only condition on a bounded receptive field. In this toy example, the word "unicorn" can only condition on the previous five words, because of the kernel width and the number of layers used here. Realistic convolutional models have a much bigger receptive field than this, but natural language has extremely long-range dependencies. For an extreme example, imagine building a language model of a complete book: it might help to condition on the title of the book at every time step, and the title could be hundreds or thousands of words earlier. It's quite hard to build a convolutional receptive field big enough for that.

Okay, so how else can we condition on long contexts? The most popular approach until a couple of years ago was recurrent neural networks. This is a conceptually straightforward idea: at every time step we have some state coming in from the previous time step, which represents what we've read so far; we combine that state with the current word and use it to update the state, and we iterate this for as many time steps as we need. This seems like quite a natural model of reading; for the most part people read left to right and maintain some kind of state as they go. And at least in principle, you can model unbounded context this way: in principle, the title of a book could affect the hidden state at the last word of the book. In practice there are some fairly significant issues with this model. First, there's no free lunch: the effect of maintaining this state is that we compress the whole history of the document into a single vector at each time step, and once you've read a word you can never look at it again, you have to have memorized it. That means you have to cram a huge amount of information into a single vector; it's fundamentally a bottleneck model.
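As a rough sketch (my own, with made-up dimensions), the recurrence just described is a for-loop over the document: each step folds one more word into a single fixed-size state vector, and no step can start before the previous one has finished.

```python
import torch

d_model = 512
embed = torch.nn.Embedding(50_000, d_model)
rnn_cell = torch.nn.RNNCell(d_model, d_model)   # could equally be an LSTM or GRU cell

def encode(token_ids):
    state = torch.zeros(1, d_model)              # everything read so far has to live in here
    for t in token_ids:                          # inherently sequential: step t needs step t-1
        word = embed(torch.tensor([t]))
        state = rnn_cell(word, state)            # fold the new word into the fixed-size state
    return state                                 # used to predict the next word
```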
So, beyond the question of how much information you can really store in one vector, there's also a practical learning problem. You get what's called the vanishing gradient problem: every time you go through one of these steps there's a non-linearity, which means the effect of words in the past gets exponentially smaller with each time step. Once there is essentially no gradient flowing to a particular word in the past, it's very hard to learn later on that this word was important. One final issue I want to mention with RNNs is that they're actually quite slow to train. The reason is that, in order to build your state for a particular word, you have to have built your state for every previous word first. Essentially you have a big for-loop over your entire document, and the longer the document, the bigger the loop. Most of these operations can't be computed in parallel; you have to do them sequentially, and GPU hardware is really built around doing operations in parallel.

So the convolutional network didn't have this problem, everything is parallel, but you get a bounded receptive field. The recurrent models give you, in principle, an unbounded receptive field, but they're slow to train. The solution to this is what's now called the transformer, which is the model used in all the state-of-the-art NLP systems these days, so I want to go through transformers in quite a lot more detail than I did the RNNs or CNNs. Transformers were introduced in 2017 by Ashish Vaswani and colleagues in the famous paper "Attention Is All You Need", and they really revolutionized a lot of NLP. I've included a figure here from the original transformer paper. I won't pretend it's obvious; when I first saw this figure in 2017 it took me quite a while to get my head around it, and there are a lot of details going on in these boxes, so I'm going to slowly drill into them.

All right, so what's going on? Basically, you see we have an input stage, this "N ×" transformer block, and an output stage. The "N ×" just means we stack the same block, with different parameters, a certain number of times; I think the original transformer paper used six layers, which seems quaint these days, when people train models with billions of parameters and many dozens of layers.

Okay, so let me drill into this block in more detail. The core of the transformer, the transformer block, is composed of two different sub-layers, which are both very important. The second sub-layer is maybe the more obvious one: it's just a big feed-forward network. It could be any MLP, but it's important.
It's actually quite expressive. Beneath it we have the multi-head attention module; multi-head attention is the key building block behind transformers and the reason they work. These sub-layers are connected by the boxes labeled "Add & Norm". The "Add" part means there's a residual connection, which helps gradients flow in large models. The "Norm" means layer normalization; I'm not going to go into it in detail here, but it's very important for making these models work, and there are some subtleties about exactly where you apply the layer normalization that make a big difference in practice.

Q: Excuse me, I have a question. It isn't immediately clear, but could you talk a little more about the intuition behind using multi-headed attention as opposed to a single head? Presumably each head learns something different and attends over its input differently, but what was the intuition behind that?

A: Let me defer that question a little. I'm going to go through exactly what multi-headed attention is first, and then I'll try to give some intuition for why it's a good thing to do; if that doesn't answer your question, please follow up in a few slides' time. Any other questions at this stage, by the way?

Q: I have a question. You said that the transformer module uses layer normalization. Can you provide some intuition into why that works better than group normalization or batch normalization?

A: I don't think I can give a very satisfying technical answer to this; a lot of it is quite empirical. In NLP layer norm works great; in computer vision batch norm works great. One nice property of layer norm is that it doesn't depend on the batch dimension, which batch norm does, and in practice that's quite a big advantage, because it's hard to train with large batches for very large models. People have written lots of papers on why something like batch norm even works in computer vision, and as far as I know there is still some debate about what it's actually doing; the intuition given in the original paper for why batch norm works is maybe not right. Personally I'd say this is one of the slightly unsatisfying things in deep learning: it works, but it's a little unclear why.

Q: Okay, thank you. That was a more satisfying answer than you give it credit for. Another question coming in: do transformers share weights across time steps, like RNNs and LSTMs?

A: Yeah, great question, I should have made that clear. All these weights are shared across time steps, so it's convolutional in that sense: you have one block and you apply it at every time step. You can actually also share the weights across layers, and that works quite well too, but it's not what people normally do. Any other questions so far? I think that's all the questions on the chat.

So what is this mysterious multi-head attention thing? Here's another figure; I don't know if this helps. Basically, we compute three quantities, called the query, the key and the value (Q, K and V), feed them into this scaled dot-product attention operation, and then calculate the outputs. Drilling into the scaled dot-product attention, and eventually we will run out of boxes to expand, it looks something like this: we take the query and the keys, compute dot products, apply a softmax, and use the result as weights to sum up the values.
Don't worry if that doesn't quite make sense yet; let me go through it in more detail. Take this example, where the context is "These horned, silver-white ..." and we're trying to predict the next word, which in the earlier example was "unicorns". For the word we're trying to predict, we compute a quantity called the query; for all the previous words, we compute a quantity called the key; and these are just linear layers applied to the current state at this layer. Tomorrow you'll be coding this in practice, so you'll see all the small details in the code as well.

You can think of the query as the model asking a question of its context so far, a question that will help it predict what the next word should be. The query could be something like "tell me the previous adjectives", or "tell me what the previous determiner is", which would be a word like "these" here. The keys are things that label each previous word, telling you some information about it. They could be saying "this word is an adjective", "this word is a determiner", "this word is a verb", and so on; they can also encode something more complex, like an arbitrary relation such as co-reference. So the model computes this question, the query, takes a dot product with all the keys, and puts the results through a softmax, and this induces a distribution over all the previous words. Here you can imagine a query like "tell me the previous adjectives", and the attention would produce a distribution over these three previous words that puts most of its probability mass on "horned" and "silver-white". We also compute one more quantity, called the value, for all the previous words, and the value tells us something more about the content of each word. Then we compute the hidden state by marginalizing over the attention distribution: the hidden state is a weighted sum of the values of all the previous words, where each word is weighted by the probability the attention assigned to it. That's basically what's going on in the left side of this figure. I left out one detail, the scaling factor, which is just there to make the gradients more stable.

Okay, but there's another detail here, which is that what I've described so far is single-headed attention. We're actually going to use multi-headed attention, which just means we compute the same thing, with different queries, keys and values, multiple times in parallel. So there was a question about the intuition behind that. The idea is that to predict the next word you need to know lots of different things at once. As an example, say the next word here should be "unicorns", plural. To know it should be "unicorns", you probably want to know both that the thing is horned and that it's silver-white, because the conjunction of those makes "unicorn" more likely. But you also want to know that the determiner was "these" and not "a": if it were "a horned, silver-white ..." it would be "unicorn", singular; the fact that it's "these" means it should be the plural "unicorns". So you actually need to look at all three of those words at once to have a good idea of what the next word should be, and multi-headed attention is a way of letting each word look at multiple previous words.
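Here is a minimal sketch (my own, with assumed shapes) of the single-head scaled dot-product attention just described: queries, keys and values come from linear layers, the query-key dot products go through a softmax, and the output is the resulting weighted sum of the values. Multi-head attention just runs several of these in parallel with separate projections and concatenates the results.

```python
import math
import torch
import torch.nn.functional as F

d_model = 512
to_q = torch.nn.Linear(d_model, d_model)
to_k = torch.nn.Linear(d_model, d_model)
to_v = torch.nn.Linear(d_model, d_model)

def attention(x):
    """x: (seq_len, d_model) hidden states at this layer."""
    q, k, v = to_q(x), to_k(x), to_v(x)         # (seq_len, d_model) each
    scores = q @ k.T / math.sqrt(d_model)       # query-key similarities, scaled for stable gradients
    weights = F.softmax(scores, dim=-1)         # one distribution over positions per query
    return weights @ v                          # weighted sum of values = new hidden states
```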
Q: The question here is, why do we actually need the softmax? Why use a softmax?

A: That's a good question. Firstly, having a normalization is probably good in itself: otherwise, when you go to longer sequences, this summation would just get bigger and bigger the further you went. The normalization also lets the model discard information, to say "this word just isn't relevant", which is useful. I have seen people experiment with things like ReLUs instead, which give a different way of discarding information, but I think the evidence is that softmax works best.

Q: Other questions: you may have missed this one, about the mask there, in pink. Could you say briefly what that is?

A: Right, sure, the mask is actually important, and I was going to come to it. One of the really big wins of this whole multi-headed attention setup is that it's extremely parallelizable: none of the computation of queries, keys and values at a given time step depends on what you're doing at any other time step, so unlike a recurrent network you can compute all of them simultaneously, which plays very well with the kind of hardware we have these days. So not only do we compute all the different heads at once, we compute all the time steps at once, in a single forward pass. That's great, except that if you're computing all the time steps at once, there's nothing to stop words looking at the future. In the autoregressive factorization we're dealing with here, we only want words to condition on previous words, but as I've described the model so far, words could look at future words too, which is a problem, because then they can cheat and use the future context to get a good result. The solution is what we call self-attention masking. The mask is just a triangular matrix that has zeros in the lower triangle and negative infinity in the upper triangle; we add it to the attention scores, and the effect is that every word to the left gets a much higher attention score than every word to the right, so in practice the model only ends up using words to the left. It's just a fixed mask, without trainable weights: the values are either zero or negative infinity.

Q: So you only mask in the case of an application-specific training task, correct? If you just had to build representations, for example, you wouldn't need to mask, because it wouldn't matter.

A: Yes, great question, and we'll get to more general representation learning later. If you just want a text encoder, you don't need to mask, and bidirectional context is absolutely helpful there. In the case of language modeling, which we're working through so far, the mask is crucial to make the model mathematically correct and compute the right factorization.
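A minimal sketch (my own) of the causal mask just described: an upper-triangular matrix of negative infinities is added to the attention scores before the softmax, so each position can only attend to itself and to positions on its left.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    # 0 on and below the diagonal, -inf strictly above it (the "future" positions).
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)

scores = torch.randn(5, 5)                          # stand-in for q @ k.T / sqrt(d)
weights = F.softmax(scores + causal_mask(5), dim=-1)
# Row i of `weights` is now zero at every position j > i.
```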
Okay, so one of the details we need to make all this work is to add something to the input about word order. As I said, self-attention has very little inductive bias: the model knows very little about language, and the inputs could be anything and it would still work. In particular, you could model a set, or a graph, or anything like that, and it would be fine. But we know that in language there are some properties which are useful; for example, there's an ordering to words, which is really important to how you interpret them, and this model doesn't know anything about that. That's in contrast to the convolutional models and recurrent models I showed you earlier, which both have their own ways of encoding the order of the text. So one of the techniques introduced in this paper is called positional embeddings. There are different ways you can do this; they describe something slightly unusual in the paper, which I won't go into, but it works just as well to simply learn a separate embedding for every time step. For every position in the document, zero, one, two, three, four, five, you learn a separate embedding and add it to your inputs, so your input is now the sum of the word vector and a positional vector. It's very simple, but it gives the model the order information it needs, and it works great.

Okay, so why are these models so good? Why does everyone use them? I think the really powerful thing is that the model gives you direct connections between every pair of words: each word can directly access the hidden state of every previous word. That contrasts with a convolutional model, which can get at the states of words inside its receptive field but nothing further back in time, and with a recurrent model, where the state has to go through a bottleneck at each time step: you can only directly access the literally previous word, and anything further in the past has had to be compressed somehow, so information can be lost. Self-attention can, in principle, put a hundred percent of its attention on any word in the distant past and see exactly what was there, and this makes it a really powerful model; it avoids issues like vanishing gradients quite effectively, and it can learn very expressive functions very easily. The other great thing is how parallelizable it is. On the one hand, this model does quite a lot of computation: the self-attention operation is quadratic, basically because every word can look at every other word, and that sounds slow. But the really nice thing is that you can do it in parallel: because all these operations are independent of each other, you can do them as one big matrix multiplication. Even though in some sense you do more multiply operations than you would with the equivalent RNN, you can do them much faster because you do them all at once rather than sequentially, so it's a really good trade-off.

I also want to quickly mention some of the other things in this paper. Multi-headed attention and positional embeddings get all the attention when transformers are discussed, but transformers also came along with a whole bag of other tricks, and these tricks are all really important to making this stuff work; I think this paper really modernized NLP practice. For example, I mentioned layer normalization before, which is really helpful. There are also things like the learning rate schedules.
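Going back to the positional embeddings described a moment ago, here is a minimal sketch (my own, with assumed sizes): the model's input at each position is simply the word embedding plus a learned embedding of the position index.

```python
import torch

vocab_size, max_len, d_model = 50_000, 512, 512
word_embed = torch.nn.Embedding(vocab_size, d_model)
pos_embed = torch.nn.Embedding(max_len, d_model)    # one learned vector per position 0..max_len-1

def embed_tokens(token_ids):
    """token_ids: (seq_len,) integer word ids."""
    positions = torch.arange(token_ids.size(0))
    return word_embed(token_ids) + pos_embed(positions)   # (seq_len, d_model) transformer inputs
```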
On those learning rate schedules: to make transformers work well, you have to linearly warm up your learning rate from zero to your full learning rate over several thousand steps. People do use warmup with other architectures and settings, but transformers really, really need it to work. Things like the initialization also really do matter; some initializations simply don't work. And there are other tricks like label smoothing at the output, which wasn't invented in this paper but turns out to be quite helpful for tasks like machine translation.

Right, so to give you some idea of how well these models work, here are some results on a language modeling benchmark. The number on the right is what's called perplexity, which is a measure of the likelihood of held-out data, and lower is better. You can see that an LSTM from 2016 gets a perplexity of about 48, and that Yann Dauphin's convolutional models from 2016 do quite a bit better, at about 37. People also played around with a whole bunch of variations; you'll see dozens of papers on variations of LSTMs, and some of those get below 30 as well. Then transformers were introduced and you get a really big jump, down to about 18 to 20, and in terms of language modeling that's an enormous jump in performance. I should say these gains were particularly large on long-context language modeling. There are benchmarks where you just make predictions within a single sentence, but on this task you get a whole Wikipedia article, potentially thousands of words, and transformers really shine when you have thousands of words of context and need to retain information across all of them.

Okay, and here's a quick comparison just to visualize how the transformer and the LSTM look, which illustrates some of the points from before. In the LSTM you have all these connections between adjacent words; everything is very sequential, left to right. The transformer has none of this: every word is directly connected to every other word. I should say this is in some sense a slightly unnatural model of reading: it suggests that every time the model reads a word, it goes back and re-reads every other word very quickly. But it's very effective.

Okay, one other good thing about transformers is that they scale up extremely well. With language modeling you get essentially infinite amounts of data, hundreds of billions of words out there, far more than you'd ever need. That means that to actually fit this distribution you need very big models, and as you keep adding parameters, transformers just keep working better and better. The examples I showed you before were from the GPT-2 model, with about 1.5 billion parameters, which was quite big for 2019; by 2020 we're up to 17 billion, and there are rumors that hundred-billion-parameter models will be coming along soon.
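As an illustration of the warmup mentioned above (my own sketch; the exact schedule varies by implementation, and the decay shown here is just one common choice), the learning rate rises linearly for a few thousand steps and then decays:

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=4000):
    """Linear warmup from 0 to peak_lr, then inverse square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```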
Q: Excuse me, I have a question. You said transformers are really good for scaling up. I was wondering, in the language modeling task, if we have say a 10,000 or 20,000-word document, with an RNN we can just insert a word step by step and we don't need a lot of memory. For a transformer, the attention would need to cover something like the full length of the sequence. If we have a really long sequence, can we still model these long-term dependencies?

A: Yeah, that's a really great question; I actually meant to mention this point. Two things to say. Firstly, you're absolutely right: because self-attention is quadratic, the expense grows super-linearly with length, and that is a problem in practice. Mostly, transformers use something like a 512-token context, so that it's affordable on fairly standard GPUs. The largest language models do more than that, maybe a few thousand tokens, but it does limit what we can do, and it's definitely the case that a vanilla transformer can't model, say, a 50,000-word book at all. There's been a whole cottage industry recently of building transformer variants which can handle long sequences, and it's a very hot topic right now. There are a bunch of things you can do. One is to replace the full self-attention with something like a nearest-neighbor search, so that you don't do the self-attention in quadratic time, which makes it faster. There are also versions that do a kind of sparse attention, where you can't attend to every previous word directly, but you have some dilated set of previous words you can look at; you lose the direct connection to every word, but you can guarantee short paths between any pair of words. And there are things like the Compressive Transformer, which try to compress the distant past into shorter representations.

Now, to your point about RNNs: at inference time, absolutely, an RNN can model unbounded context at no additional cost, which is great; you can feed in a million words and it will keep producing output just fine. The real question is whether it actually uses that context, and the answer is probably not. At training time you can't get this for free: you have to backpropagate through what's called backpropagation through time, so for the LSTM to make use of a long context, the gradient has to propagate all the way back through all the recurrent steps to reach the distant past. In practice, the gradient will vanish well before 2,000 steps, and this is also very expensive.
So at training time this isn't free, and the backpropagation gets more and more expensive the longer the sequences you're modeling. That means the model can't really learn to use the distant past: even if keeping that state around cost nothing, the model wouldn't know what to do with it, and in practice it forgets it anyway, because you just can't remember that much in a single state.

One more quick point on this, which I think is interesting: one case where RNNs do have an advantage is certain algorithmic tasks. If you aren't modeling language, say you're doing something like addition, or modeling the parity of a string, where you're given a string of zeros and ones and asked whether it has an even number of ones, then you really do want to apply literally the same operation at every time step, and you don't need much memory, because your state really just needs to be a zero or a one. In those cases RNNs work very well: you can train them on short sequences and they generalize well to long sequences on these kinds of toy problems, and I've seen it argued that transformers actually find that kind of generalization much harder. But that only really applies to these algorithmic problems; for modeling natural language, it seems clear that transformer variants are much more effective than recurrent nets.

Q: Thank you, that was really helpful.

A: Any other questions on transformers?

Host: I've been addressing the questions I could by text, so I think we're all good right now.

Okay, all right. The next topic I want to cover is what's called decoding, or inference, in these language models. We've trained this language model, and it hopefully puts probability mass on things which are good English and no probability on ungrammatical or nonsensical strings. But if we want to create samples like the ones I showed you before, how do we actually generate text? Often, when you think about inference in a graphical model, what you'd like to do is find the max: the output sentence which maximizes the model's probability. Unfortunately, as I mentioned before, there are a lot of possible English sentences, and we can't just score them all to find the max. These models also don't admit a dynamic programming trick. Sometimes you can find the max over exponentially many structures when your model factorizes in a way that lets you build a dynamic program and share state across different hypotheses, but these models don't decompose in a friendly way: whatever choice you make for the first word can affect all the later decisions.

So, given that, one thing to do is greedy decoding. Here we just take the most likely first word, then, given that word, predict the most likely second word, then the most likely third word, and so on. That's okay, but there's no guarantee it gives you the most likely sequence overall, because if you happen to make a bad step at some point, you've got no way of backtracking your search to undo previous decisions. The slide just says that exhaustive search is impossible.
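A minimal sketch (my own; `model` and the special token id are placeholders) of the greedy decoding loop just described: at each step, pick the single most likely next word and append it.

```python
import torch

def greedy_decode(model, prefix_ids, max_new_tokens=50, eos_id=2):
    """model(ids) is assumed to return next-token logits of shape (vocab_size,)."""
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))
        next_id = int(torch.argmax(logits))   # most likely word given everything chosen so far
        ids.append(next_id)
        if next_id == eos_id:                 # stop at the end-of-sentence token
            break
    return ids
```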
A middle ground is what's called beam search. Beam search is a way of keeping track of an n-best list of hypotheses, and at every time step we update this list with the new words we've added. It's probably easiest to show you an example; this slide is from Abigail See's course at Stanford. We start by outputting, say the beam size is two, the two most likely first words. Then, for each of these words, we work out the two most likely next words. Obviously whether the first word was "I" or "he" affects what the next word should be, so we get different hypotheses for each. And at every step we compress these back down to a list of two that we continue with.

Q: So we are looking for the lowest total sum, is that correct?

A: These are log-likelihoods, so we actually want the highest sum, if that makes sense: zero would correspond to probability one, so every score is going to be less than zero, and we want the sequence where the sum of the scores is highest. Did that make sense?

Q: It does, sorry, I had the sign flipped; I meant the smallest number in magnitude.

A: The smallest in magnitude, sure, yes. Sorry, maybe this wasn't the best way to show it.

Q: How deep does this tree go in beam search? When do you stop looking for candidate sequences?

A: Right, so one detail I haven't told you is that there's an end-of-sentence token, and once you output that, the hypothesis is finished. The aim is to find complete hypotheses, from the start to the end token, with the highest possible score, so we keep generating new hypotheses until we have the k complete hypotheses we need.

Okay, cool. There's another question here: why do you think that in NMT a very, very large beam size will most often result in empty translations?

A: Oh, great question; I hope I answer it with this. Beam search is good in the sense that it will give you a higher-scoring hypothesis than the greedy search I mentioned on the previous slide. But there's a catch, which is what happens at training time. At training time we're typically not using a beam: we normally just use the autoregressive factorization I showed you before, where, given the n previous correct outputs, we predict word n plus one. What we're not doing is exposing the model to its own mistakes.
So when we do beam search, you can get all kinds of nonsense showing up in your beam, because if you have a very big beam, then probably some of it will be garbage, and in those garbage states the model has no idea what to do, because it was never trained in that situation. There's no reason to expect the model to generalize well, to make great predictions after some completely nonsensical sequence of words that's far outside its training distribution, and in those cases the model can do all kinds of weird and undesirable things, like putting very high probability on something it shouldn't. A classic example of this, which I don't have here but you may have seen, is that these language models get stuck in feedback loops where they end up repeating the same word or phrase over and over, potentially forever. I think that's an instance of the same problem: once the model starts going into this kind of loop, it doesn't really know what to do, and the easiest thing for it is just to keep on looping.

So yeah, I think the issue with beam search is one of not exposing the model to its own mistakes at training time, so it ends up putting probability mass on all kinds of things it shouldn't. The obvious solution is: why don't you use a beam at training time? And the short answer is that it's expensive. Running this whole inference procedure at training time firstly gets rid of all the nice parallelism we get from the transformer, and secondly the search gives you many more things to score for every training example. So in practice people tend to just ignore this problem: they train a pretty big model for as long as they can, exploiting the fast parallelism you get from the autoregressive formulation, and then at test time people will often tune the size of their beam to get the best performance. For something like translation, increasing the beam normally helps up to a point, and then performance gets worse as you start uncovering these weird degenerate outputs. It's an unsatisfying thing people have to do. Sorry, that's quite a long answer to your question.

Host: Let's see what the student says; that was a question from a student I know. And there's another question, a small one, on the current slide: why are the "a" and "one" in green, on the right?

A: I have to admit I don't know; I took this slide from Abigail See and I'm not quite sure what point she's trying to make there.

Host: Okay, that's okay. Oh, someone has answered: it's because they are interchangeable. Regardless of which one you pick, you get both outputs, "pie" and "tart"; whether you go for "a" or "one", both of them lead to "pie" or "tart".

A: I'm not sure. Even if that's the case, you can't merge them; it's not like a dynamic program where you could collapse those hypotheses, because the hidden states of the two hypotheses would still be different depending on which path you took to get there. But hopefully that observation is useful. Okay. Any more questions on this?
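Before the recap below, here is a rough sketch (my own, much simplified) of the beam search loop: keep the k best partial hypotheses, extend each with its most promising next words, and keep the k best of the results, scored by summed log-probability.

```python
import torch
import torch.nn.functional as F

def beam_search(model, prefix_ids, beam_size=2, max_new_tokens=20, eos_id=2):
    """model(ids) is assumed to return next-token logits of shape (vocab_size,)."""
    beam = [(0.0, list(prefix_ids))]                    # (summed log-prob, token ids)
    for _ in range(max_new_tokens):
        candidates = []
        for score, ids in beam:
            if ids[-1] == eos_id:                       # finished hypotheses carry over unchanged
                candidates.append((score, ids))
                continue
            log_probs = F.log_softmax(model(torch.tensor(ids)), dim=-1)
            top = torch.topk(log_probs, beam_size)      # only the best few extensions can survive
            for lp, tok in zip(top.values, top.indices):
                candidates.append((score + float(lp), ids + [int(tok)]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(ids[-1] == eos_id for _, ids in beam):   # stop once every kept hypothesis is complete
            break
    return beam
```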
Okay, this is just a bit more description of how the algorithm looks. Basically, at every time step you generate a distribution over next words for each hypothesis you have, then you take the top k extensions across all your hypotheses and keep those before you move on to the next word.

All right, so beam search is sometimes the right thing to do, but often it actually isn't. This is the result of applying beam search to the example I showed you before, the GPT-2 example about scientists finding unicorns in the Andes. You can see that the model starts out putting in some good stuff, then gets stuck in this weird feedback loop where it just repeats the same phrase over and over and over again, and it will probably keep repeating this phrase forever. I guess what's going on here is that once you've said this phrase twice, maybe saying it a third time really is the most likely continuation, and then these repetitive hypotheses get very high probability even though they're not good.

But there's also a slightly different problem, which is that in some cases we don't actually want the most likely sequence after all; maybe what we want is something interesting. You see this problem a lot in things like dialogue response generation. Say you're trying to build a model that holds a conversation with someone: if you do this kind of beam search, what you often get is the most generic response to anything you say. Whatever you say, it will reply "oh, that's interesting, thanks". Maybe that genuinely is the most likely response, because responses like that are fine in most situations, but it doesn't make for a very good system.

So how about if, instead of taking the max, we sample from the model's distribution instead? This is conceptually quite appealing, but it doesn't actually give you very good outputs. This is, again, the result of sampling on that same input. Some of it is kind of good, but it gets more weird and degenerate as it goes on, and again you get an out-of-distribution problem: once you sample a bad choice, the model is in a state it was never in during training, and once it's in a state it was never in during training, it's more likely to give you more bad output, and you get stuck in these horrible feedback loops.

Okay, so here's something that actually does work, the technique that was used to get those nice outputs I showed you at the start. Unfortunately it's not a very satisfying technique, but it's here for full disclosure. It's called top-k sampling, and it was introduced by Angela Fan a couple of years ago. In top-k sampling, we truncate the distribution to just the k best next words and then sample from that. The advantage is that it gives you diversity, because you're choosing randomly among good options, while trying to stop you falling off the manifold of good language by sampling something bad. The idea is to just chop off the long tail and sample from the head of the distribution.

Q: And is this the sampling used within the beam search?

A: Sorry, no, this isn't beam search; this is just generation, so there's no beam here, just one hypothesis. I guess you could integrate it with beam search too, but this is pure sampling: I sample a word with this method, then use it to generate the next word, and so on.
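A minimal sketch (my own) of top-k sampling as just described: keep only the k highest-probability next words, renormalize, and sample from that truncated distribution.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=40):
    """logits: (vocab_size,) next-word scores. Returns one sampled token id."""
    top = torch.topk(logits, k)                       # chop off the long tail, keep the k best words
    probs = F.softmax(top.values, dim=-1)             # renormalize over the head of the distribution
    choice = torch.multinomial(probs, num_samples=1)  # sample among the good options
    return int(top.indices[choice])
```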
Okay, so when you do all that, this is finally the technique that was used to generate the nice sample I showed you. Obviously top-k sampling is a bit of a hack; it's not very satisfying. I was an author on that paper, so I can't complain too much about the method, but it does seem to work quite well. One thing to be aware of, when you see great samples like this, which OpenAI were understandably very happy to put in their publicity, is how they were actually made: this is not a true sample from the model's distribution, and it's not the output the model put the most probability mass on; it's something generated by running a slightly odd inference procedure on top of the model.

I also want to quickly cover evaluation: given some generated text like this, how do we know if it's any good? Evaluating a language model as such is quite easy, because language modeling is a density estimation task, so you just look at the likelihood of held-out data. But if you want to take some text the model produced and say whether it's any good, that's non-trivial. People tend to use automatic metrics like BLEU and ROUGE, which measure n-gram overlap with a reference, but they're not very satisfying, and there's ongoing research into better automatic metrics. I should probably speed up a bit.

All right, so that's unconditional language models, which generate samples of text. That actually isn't a very useful thing to do in itself; what's much more useful is conditional language models, models which, given some input, generate an output. For example: given a French sentence, translate it into English; given a document, generate a summary; given a dialogue, generate the next response; or given a question, output the answer. These are called sequence-to-sequence models, because you're given some input sequence and you have to generate some output sequence. The first models of this kind were introduced by Ilya Sutskever and colleagues, and they looked like this, built from recurrent neural networks: you'd have an encoder network which reads your input and produces a single vector, what I'd call a bottleneck vector, and then you'd use this to initialize your decoder, which generates tokens one by one. Hopefully you're picking up the theme here that having these kinds of bottlenecks and recurrences is not a great thing to do, and you'd rather have a more expressive model which can see everything. So there's a variant of the transformer for sequence-to-sequence models. Here we have two stacks, an encoder stack and a decoder stack. The encoder stack is the same as what I showed you before, except without the self-attention mask, so every token in the input can look at every other token in the input. The decoder stack is similar, with one addition.
As well as doing self-attention over its own previous outputs, the decoder also attends over the complete input. This means every token in the output has a direct connection to every previous token in the output, and also to every word in the input, which makes these models very expressive and powerful. When transformers were introduced, they got quite a nice improvement in translation scores over the previous recurrent and convolutional models.

To train these models, typically we rely on labeled text. To train a translation system, for example, you try to get lots of manually translated text; it turns out one of the best sources is parliamentary proceedings, because bodies like the European Parliament translate their proceedings into lots of different languages, and you just use those as inputs and outputs. However, not all languages are represented in the European Parliament, and these transformers are very data-hungry: the more text we can throw at them, the better they do. So the question is how we can use monolingual text, text without input-output pairs, to improve them. One way to do this is a technique called back-translation, which is conceptually quite simple. Say our goal is to train a translation system that takes German as input and outputs English. First, we actually do the opposite: we train a reverse translation model that takes English and outputs German. Then we run this model over all the English text we can find, and we can find a lot of English text on the internet, and translate it all into German. That gives us lots more pairs of English and German text, and then we train a forward model to translate this German into English. The nice thing to see here is that it doesn't really matter how good the initial model is: if your reverse model makes mistakes, then your final training data will contain somewhat noisy German paired with clean English, which might even help regularize the model, but shouldn't hurt its performance when you show it clean German data.

Q: What's bitext?

A: Oh, sorry. Bitext just means parallel text: the same sentences in two different languages.

Q: Okay, thanks.

A: The nice thing about this is that the outputs you train on are always high quality, because they aren't the outputs of a system; they're real sentences you found in the wild on the internet. You're just creating some noisy inputs to pair with those outputs.

Q: Could you go back a slide? The third point, translating billions of words of English to German, is that through the reverse translation model?

A: Exactly, yeah.

Q: Okay, cool. And you're saying back-translation helps generate higher-quality translations because of the regularization. Is that it?

A: It's not the regularization.
The really useful thing is that it gives you lots of clean output-side data. If you want a good German-to-English translation model, you need to understand German, but you also need to be able to write lots of fluent English with sensible English grammar, and back-translation gives you a way of incorporating tons of additional English data beyond what you have translations for. In effect you're combining a translation model with a language model. You can also iterate this procedure: use the whole setup I described to train a better model, then use that to generate better back-translations, which you can use to train again, and this can be really helpful. It helps even on English-German, but it's particularly helpful in cases where you don't have much data. These are results on Burmese-to-English translation, where there isn't a lot of parallel data, but you can get really large improvements just by iterating this back-translation; this is from recent work at FAIR, which I forgot to include a reference for. And here are some results on, I think, English-German, again showing good improvements.

One direction machine translation people are exploring now is massively multilingual MT. Rather than translating between two languages, people are taking dozens or a hundred languages and training a single neural network that can translate between all of them. If you do this, you start to see big improvements, particularly for languages where you don't have much text; presumably the model is learning some kind of more general, language-independent information.

Okay, so the last topic I want to cover is really important, which is self-supervised learning. You're probably sick of seeing Yann's cake by now, but I think it's actually a good image for this. The idea is that most of the information we need, most of the learning we do, has to be unsupervised: we have huge amounts of text without any labels on it, and just a little bit of supervised training data. In the picture, the cake itself is the unsupervised learning, and the supervised learning is just a little bit of icing on top. I think the recent progress in self-supervised learning for NLP has really shown this metaphor to work.

Okay, so I'm going to describe quite a few methods for doing self-supervised learning in NLP, so you can try to get some idea of what's actually working. The first one is word2vec.
Um So the idea of words back was trying to I think it's really the best paper that showed Uh Got people excited about self-supervised learning in NLP Uh, but having some previous work from a cover in western, which it also shows good gains um So the goal here is going to be trying to learn what's called word embedding so effect space representations for words and the Pope is that By just by looking at unlabeled English text we can learn something about what these words mean And so the intuition behind all this is that if two words are Okay, okay, close together in the text and they're likely to have some kind of relationship between each other So the pre-training task we're going to do is going to be a filling in the blanks task so, um in this sentence, I'm going to Mask out this word in the middle, which is unicorns and try and predict what this word should be based on the surrounding context And hope would be that words like I know unknown silverhead or horn will somehow Are more likely to occur in the context of a unicorn and they are I know A word like some other animal Um, so basically this is going to be very simple model where basically we're going to take all these context words We're going to apply some linear projection to these and map them all down to a fixed size context And then just do a soft mess over abicabury Uh, this is so it looks a little bit like a convolutional language model The only difference is for predicting the word in middle not the word at the end And the practice of this model was just a shallow linear projection and it was not a very deep model okay um, so One of the things people find interesting like this was these word embeddings and to show some Kind of surprising stretches to them. Uh, I'll show you here. It's fun. There's like people debate about how meaningful this is um, but basically The claim was that if you took your embedding to the word king, which you train like this These track you're ready for word man, then you add the embedding to the word woman You'll get something that's fairly close to the embedding for the word queen so somehow um It's just this kind of unsupervised fill in the blanks learning task Is inducing this kind of linear structure with kind of meaningful differences between the vectors okay, so I mean this was great and the really good thing about this was there's a really really fast thing to do so you can train this on billions of words of text to back in 2013 um, but there's a big limitation which is these word embeddings are independent of the context so um You get like one back to the word and even the capillary um But it doesn't know anything about how that word plays out the words and we know that to in language like um A sentence is more than just a bag of words. It depends each word interacts with other words somehow And these interactions are in some ways a really powerful thing so in more examples You know, obvious example is like ambiguous words. So lots of words can have many different meanings And these word vectors won't capture that Or the best logistic position of all the meetings Okay, so how do we add context to these? Well, um Let's see The most obvious way is to Do a language model. I think I'm missing a slide here. Uh, basically Uh, what we do is train a conditional language model. Sorry an unconditional language model. 
Okay, so how do we add context to these? The most obvious way is to use a language model. I think I'm missing a slide here, but basically what we do is train an unconditional language model, exactly the kind of model I described earlier in this lecture. The language model outputs a hidden state at every time step to predict the next word. For self-supervised learning, what we then do is replace those outputs with some other output that depends on our task. The pre-training phase is just "predict the next word", and then the icing on the cake, the supervised learning, is predicting some other property. I'll show an example here for a task called part-of-speech tagging, which puts a label on every word, so, for example, labelling "scientists" as a noun and "distinctive" as an adjective.

You can actually fit all kinds of tasks into this framework. For example, here's a sentiment analysis task, where you're given some text from an Amazon review and have to predict the rating. This review says "what can I say about the banana slicer that hasn't already been said about the wheel, penicillin, or the iPhone", and it got five stars. Here we predict one output from the language model, at the end of the sequence, which is the task-specific label.

One of the really nice things about this approach, called GPT, was that it pretty much eliminated task-specific modelling. Suddenly we have one model which you pre-train and then fine-tune to do basically any task that involves classification. Before this, there were a few years when people were building all these crazy architectures: you'd build one architecture for a question-answering model and a different one for a sentiment analysis model. Now you pre-train one big model and it's really easy to fine-tune it to do whatever you like. That was a really big step forward.

Unfortunately, that model has a fairly obvious limitation. I said the important thing was to contextualize words, to build word representations that depend on the context. But if you pre-train just a language model, you can only condition on the leftward context: the representation for each word can't depend on the representations of any future words, and that limits what the model can do quite a lot.

There's one fairly obvious fix, which is the approach taken by ELMo. Rather than training one left-to-right language model, ELMo also trains a second language model which operates in the reverse direction, so it starts with the last word in the document and keeps predicting the previous one. Then you get your word representations by concatenating the output layers of the left-to-right model and the right-to-left model. This model is in some ways better, in that your word representations can now condition on both the left context and the right context, and that's really helpful for lots of tasks. But it's still limited.
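Here is a rough sketch of that pre-train-then-fine-tune pattern, under the assumption that you already have some pretrained language model returning a hidden state per position. This is not the actual GPT code; the toy LM below is just a stand-in so the example runs end to end.

```python
# Hedged sketch of LM pre-training followed by task fine-tuning (my own simplification).
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Stand-in for a pretrained left-to-right language model."""
    def __init__(self, vocab=10_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
    def forward(self, ids):
        out, _ = self.rnn(self.embed(ids))
        return out                                    # (batch, seq, dim) hidden states

class LMClassifier(nn.Module):
    """Replace the next-word output with a small task-specific classification head."""
    def __init__(self, pretrained_lm, hidden_dim, num_classes):
        super().__init__()
        self.lm = pretrained_lm
        self.head = nn.Linear(hidden_dim, num_classes)
    def forward(self, token_ids):
        hidden = self.lm(token_ids)                   # contextual states from pre-training
        return self.head(hidden[:, -1, :])            # classify from the final position

model = LMClassifier(ToyLM(), hidden_dim=256, num_classes=5)   # e.g. 1-5 star ratings
logits = model(torch.randint(0, 10_000, (8, 20)))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))
loss.backward()
```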
With ELMo you just get a shallow concatenation of the left representation and the right representation; you don't really model interactions between these contexts, and what you'd really like is rich interactions between the left context and the right context. All of this brings me to BERT, which you may have heard of, and which made a very big difference in NLP.

BERT actually looks quite a lot like word2vec: it's basically a fill-in-the-blanks task. You take some text, you hide some tokens by masking them out, and then you just try to fill in the masks. So you get text like "[MASK] is a golden [MASK] Muppet", and you fill in the blanks. The thing I want you to notice is that this really does look a lot like word2vec, which was also "given some text, fill in the blanks". The reason it works much better is that in word2vec you just had a linear projection of the context words, whereas in BERT you have a very large Transformer, which can look at much more context and model much richer interactions within it.

There's a question here: how are context representations maintained when fine-tuning for a specific task? I guess it's not clear that they are maintained. When you fine-tune for a particular task, hopefully the model has learned enough general stuff during the pre-training task, and then during fine-tuning it probably forgets a lot of the things it doesn't need for that particular task. So if you're fine-tuning on sentiment analysis or something, you probably lose a lot of this information during fine-tuning, and that seems fine. Thanks.

Okay, so BERT worked very well. It gave quite large improvements on a bunch of tasks, and it was actually matching the performance of humans, or at least humans as approximated by Amazon Mechanical Turk, on some important question-answering benchmarks. But it was definitely not the end of the story, even though a lot of people were very excited about self-supervised training. Just to quickly summarize the details: it's a very simple model where you mask out 15 percent of the tokens and try to fill in the masks.
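On that "mask out 15 percent of the tokens" detail, here is a minimal sketch of the data-preparation step. It's simplified relative to the real BERT recipe, which also sometimes keeps or randomly replaces the chosen tokens instead of always using [MASK], and the mask id here is an assumption about the tokenizer.

```python
# Rough sketch of the masked-LM data step (not the BERT reference code).
import torch

MASK_ID = 103          # assumption: whatever id the tokenizer reserves for [MASK]
IGNORE = -100          # positions the loss should ignore (the usual PyTorch convention)

def mask_tokens(token_ids, mask_prob=0.15):
    token_ids = token_ids.clone()
    labels = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_prob
    labels[~is_masked] = IGNORE          # only score the model where we masked
    token_ids[is_masked] = MASK_ID       # hide the chosen tokens from the input
    return token_ids, labels

inputs, labels = mask_tokens(torch.randint(1000, 2000, (4, 12)))
# `inputs` goes into the Transformer; the loss is cross-entropy against `labels`.
```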
To build on that, there was follow-up work at Facebook, led by Yinhan Liu, which looked at scaling this up. BERT actually has a second pre-training objective, next-sentence prediction, which we showed didn't actually help.

Can I ask a question about this slide? There were three bars, and I think I missed one: what is the dark blue? Thank you, sorry, I should have explained: dark blue here is the previous state of the art, which was probably ELMo. So the previous state of the art was definitely already using self-supervised training, but BERT improved on it by having these bidirectional interactions. And GLUE, actually, is a benchmark that was created here at NYU. Exactly, yeah, students here were involved; GLUE is a big benchmark, and it's very important.

Okay, so it turned out that all you had to do on top of BERT was, firstly, simplify the training objective, and then just scale it up. Scaling up here means bigger batch sizes, huge numbers of GPUs, and more pre-training text. You then get very large gains on top of BERT, in fact much larger gains than BERT had over the previous ELMo-style work. Yellow here is this new RoBERTa model: on question answering RoBERTa is actually superhuman by quite a few points, and it's also ahead on this GLUE benchmark. And this isn't really about doing anything smart; it's just taking self-supervised training and doing more of it.

There was a question: on this slide there's a very large improvement between BERT and RoBERTa on GLUE, but not such a huge change on SQuAD, or is that just how the chart is scaled? Maybe it's partly how the charts are scaled; those bars on the left are taller. But I think the point is that if you compare to human performance, BERT was only about 0.6 points better than people, whereas RoBERTa is about three and a half points better. By that metric it's actually quite a big jump.

Okay, so let me quickly mention a few of the other things people have been doing in self-supervised training. There's a model called XLNet. In BERT, when you predict your masked tokens, you predict all the masks conditionally independently; XLNet has a trick that lets you predict them autoregressively, but in a random order, and they claim some improvements from doing this. There's also SpanBERT, where rather than masking out individual words you mask out a span of consecutive words (a toy sketch of this follows below). There's ELECTRA, where rather than masking words you substitute them with similar ones and then have a binary classification problem of telling which words were changed. There's ALBERT, which is BERT but with the weights tied across layers. And there are XLM and XLM-R, which look at doing this multilingually: it turns out that if you run the BERT pre-training objective but, rather than just feeding in English text, you feed in text in every language you can find, it does a great job of learning cross-lingual representations.

The key thing I'd like you to take from this is that these all vary somewhat, but in the end lots of different things work. The important things are that you have big models, that you have bidirectional interactions between the words, and, if anything, the scale you do this at is the most important factor.
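As promised, here is a toy illustration of the span-masking idea mentioned above. This is my own sketch, not the actual SpanBERT code: instead of masking independent positions, you hide a short run of consecutive tokens and ask the model to reconstruct it.

```python
# Toy illustration of span masking (not SpanBERT's implementation).
import random

def mask_span(tokens, span_len=3, mask_token="[MASK]"):
    start = random.randrange(0, max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    targets = tokens[start:start + span_len]          # what the model must reconstruct
    return corrupted, targets

sentence = "the scientists named the population after their distinctive horn".split()
print(mask_span(sentence))
```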
One limitation of the models I've described is that they only really handle classification problems, but often we want problems where the output isn't a class label but is actually more text. So, pre-training for sequence-to-sequence modelling: two papers came out at about the same time on this, one of which I was involved in, called BART and T5. These models pre-train sequence-to-sequence models by denoising text. The pre-training objective looks like this: you take some text, corrupt it somehow by applying some kind of masking scheme, and then, rather than filling in the blanks in place, you feed the corrupted text into a seq2seq model and try to predict the complete, uncorrupted output. This is nice because you can go beyond just masking and apply any random corruption of the input you want: you can shuffle the sentences, mask out whole phrases, or insert new phrases; the seq2seq framework is very flexible. It turns out that simple masking actually works about as well as anything else.

If you do this, then as well as doing well on benchmarks like SQuAD and GLUE, which are classification-style benchmarks, you can also get state-of-the-art results on tasks like summarization, where the output is text. On the left here we have a long document we fed in, and the model produces a summary of it. It does a great job: it actually uses context from across the whole document, resolves things like coreference, and generally seems to show some understanding of the input.
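For a concrete picture of that denoising objective, here is a toy corruption function in the spirit of what was just described. It is my own sketch, not BART's or T5's actual noising code: mask out a phrase in each sentence, shuffle the sentence order, and train a seq2seq model to reproduce the original text as the target.

```python
# Toy denoising corruption for seq2seq pre-training (not the BART/T5 implementation).
import random

def corrupt(sentences, mask_token="[MASK]"):
    noisy = []
    for sent in sentences:
        words = sent.split()
        start = random.randrange(len(words))
        length = random.randint(1, 3)
        words[start:start + length] = [mask_token]    # hide a whole phrase with one mask
        noisy.append(" ".join(words))
    random.shuffle(noisy)                             # permute the sentence order too
    return " ".join(noisy)

original = ["the scientists discovered a herd of unicorns",
            "the unicorns spoke perfect English"]
source = corrupt(original)          # encoder input: corrupted text
target = " ".join(original)         # decoder target: reconstruct the full original
print(source, "->", target)
```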
Okay, we're running out of time, so briefly: I don't think this is the end of the story, and I don't think NLP is now solved. A few open questions I think are interesting. How should we integrate background knowledge: do we just want these models to try to memorize the whole Internet, or should we build in memory mechanisms somehow? As was brought up earlier, how do we model long documents? We're typically working with 512 tokens here; how could we handle, say, a whole book at once? One unsatisfying thing is that the same model architecture can solve all kinds of problems, but it tends not to be able to solve them all at once: typically you fine-tune a separate model for each task, and it would be great to have one model that solves everything. Related to that, we basically get human-level performance on tasks where you have, say, a hundred thousand labeled examples to learn from, but can we build models that do well with one thousand, or ten, or one labeled example? And finally, people keep raising the question of whether these models are actually understanding language, or whether they're just really good at gaming benchmarks.

So, to wrap up this lecture, I think the main takeaways are these. Low-bias models like Transformers work great; we shouldn't try to explicitly model linguistic structure; we should have very expressive models, show them lots of text, and let them learn whatever structure they need. Predicting words in text is a great unsupervised learning objective. And if you want to understand language, it's crucial to represent words in context, in particular bidirectional context. Okay, that's all I have, so thank you very much for listening. Let's see if we have some questions now; I think there will be some.

Thank you, Mike. There's been a whole bunch of discussion while you were talking, with links to various papers and explanations of various concepts in the background. Okay, any more questions?

Yeah, I had one, on one of the open questions: understanding whether or not these models are actually understanding the language. What are some ways that exist right now to quantify and measure that, and how would we do that?

Okay, so what typically happens is that someone says, "these models aren't understanding language; if they could, they'd be able to solve this new task I'm introducing", and then they introduce some new task which BERT can't do, and they say it's just gaming the benchmark. Then the next week someone trains a bigger model and it actually reaches human performance on that task. I think what's really happening is that some people just have the intuition that these neural networks can't be understanding language, that they must just be gaming datasets somehow, and that to the extent these models do well, there must be some kind of weird biases in our datasets that the models can exploit without really understanding anything. It's definitely true that a lot of our datasets do have biases in them, and it's kind of hard to make ones that don't, especially at scale. On the other hand, people are failing to find good counterexamples of things these models can't do.

A good example in recent times was the Winograd schema results. Winograd schemas are sentences that are ambiguous: there's a pronoun that refers to one of the words, and you can only tell which word the pronoun refers to if you know something about how the world works. The standard example is "the trophy doesn't fit in the suitcase because it's too large" versus "the trophy doesn't fit in the suitcase because it's too small"; in one case the pronoun refers to the trophy, and in the other case it refers to the suitcase. There's a list of those, and people created a dataset of them. Until about two years ago the best results were around 60 percent for computers, while humans get 95 percent or so, and now I think it's about 90 percent.

Yes, something like that, right. And you don't even get any training data for this; it's a purely unsupervised problem.

Right. So it's clear that those systems have learned something about the roles of objects, and a little bit about how the world works, just by observing statistics about text. But it's relatively superficial. It's pretty obvious when you look at the text these models generate: we were talking about unicorns, and the first sentence says the unicorns have four horns, which of course doesn't make sense, because unicorns have only one horn; that's kind of the point of being a unicorn. So the whole problem of learning common sense has not been solved, very far from it, but the systems work surprisingly well. It's surprising how far you can go just by looking at text.

Yeah, learning common sense is super hard, because in some sense the things you want to learn are exactly the things that aren't written down. No one ever writes down common sense knowledge. Probably no one ever writes down that a unicorn has exactly one horn and not four, just because everyone knows that.
So there are probably limitations to what you can learn from just looking at text.

But can you tell us something about some of the work you're doing on grounded language?

Sure, yeah. I put nothing about grounding in this talk, but it's a whole interesting topic. People don't produce text just for its own sake; they use language for a purpose, as a representation of the world. One topic I've been really interested in is trying to ground dialogue in usage and goals. Rather than building chit-chat systems that talk to each other or to people just to have a conversation, can you build systems that use dialogue to try to achieve some goal? We had some work a couple of years ago on learning in the context of negotiations: two agents have a conversation to try to come to some agreement with each other, and the only way to reach an agreement is by having a dialogue in natural language. Some outcomes are good for you, some are good for them, and you have to negotiate a compromise. I'm interested in that setting because it seems like actually using language for a purpose is going to be really crucial to really understanding things. I think there are limits to what we'll learn from purely observational use of language, from just seeing text out in the wild that other people wrote. To really understand things, you want agents that are using language to achieve goals, interacting with people, seeing what works, and learning from that kind of signal as well. Maybe that's just the cherry on top of the icing on the cake, or whatever, but I think it's important that it's there. You need a cherry.

More questions from the audience? Come on, guys, don't be too shy.

I have a question about the first point in the open questions: how should we integrate world knowledge? The way I was thinking about it is that these billion-parameter Transformers have so much information about the world in them, and if we try to fine-tune or train the model on some new data, could it forget some of the earlier concepts it had learned? And how would we quantify what concepts the model has forgotten, apart from using a validation set? Does that make sense?

Yeah, so probably it will forget. You can probably fine-tune this model on something that doesn't need world knowledge and it will happily forget all the knowledge you taught it. But there is some evidence that these models are memorizing quite a lot of facts. There's a remarkable recent result from the paper on Google's T5 system, which I think is around 11 billion parameters. You train it with the usual self-supervised objective, and then you fine-tune it to answer questions, which could be questions about anything, but you don't show it Wikipedia or anything else at test time; you just see how much it has memorized, which lets you test how much knowledge is in the model. It's not quite state of the art on open-domain question answering, but it's kind of scarily good.
It turns out that if you have 11 billion parameters and you train for a long time, you can just fit huge numbers of facts into these models. I'm not sure that's necessarily the most desirable way to store knowledge, but it seems somewhat effective.

Okay. Thank you everyone for attending, and thank you so much, Mike, for giving a guest lecture. It's good to hear this stuff straight from the horse's mouth. We'll see everyone again tomorrow, right, tomorrow we're going to implement everything from scratch, so don't forget: tomorrow you get the gory details from the code side of all of this. And then Monday you'll hear about graph neural networks.

Actually, about graph neural networks and graph knowledge: is graph knowledge used for language as well somehow? I don't really know that part of the field.

Well, to some extent you can view all of this self-supervised training on text, word2vec, BERT, and so on, as using a graph, where the graph is how often words appear next to each other, or at some distance from each other, in the text. The graph of similarity between words is basically determined by how often they appear nearby, and that's what decides what you put in the input of a neural net, because the words appear within the same sequence. So you can think of those self-supervised learning systems as basically a very simple form of graph neural network, where the graph always has the same structure: it's always linear, and it always connects you to your neighbors in the text.

I see. All right, thank you so much, everyone. See you tomorrow. Bye-bye.