Hardeep's talk was about advances, a kind of evolution story, and mine is also an evolution story, but hopefully mine will demonstrate that things have got easier, not more complicated, because of all the work done at these other places. So I'm going to talk about current trends in natural language processing, motivated in particular by the paper Hardeep mentioned, which came out about a week ago from OpenAI. I was planning on talking about something else, but that can be wrapped into the same story, so here we go.

If you've been to one of these before, you probably know who I am. I have a background in machine learning, with start-ups and finance; I was in New York for a long, long time. In 2014 I moved to Singapore and basically took a year off doing nothing but fun, or nothing but deep learning and fun, in the Hardeep sense of fun. Since 2015 I've been doing serious natural language processing at a local company here, and I've been fortunate enough to have a couple of papers. I run this group with Sam, who's the notably absent guy tonight. We're also running a developer course, which I'll say more about later. And congratulations on being in this TensorFlow group: the Singapore TensorFlow group, according to Meetup, is probably the largest in the world, which is pretty insane. Go Singapore; I'm pretty proud of that. Well, no, it's you, after all.

Here's the outline of what I'm talking about. I'll talk a little about word embeddings, which is the more beginner-orientated part; then about adding some context to those; then I'll mention what a language model is, which in this narrow sense was news to me; then this whole fine-tuning thing, which is the latest trend; and hopefully there'll be a demo. That depends on Google Colab, so it may be a bit of a work in progress, but fingers crossed.

So, word embeddings. The idea, and this goes right back to the beginning, is that we want to feed words into a network. We have a sentence, we want to put it into some neural network, and we need to translate that sentence into numbers. The key insight, which came from the 1950s, is that words which mean the same thing tend to occur in the same contexts. So if we had a huge amount of text and ran a little window across it, the stuff that appears together is similar in some sense, and the stuff that appears far apart is dissimilar. Maybe we could use that to train something.

Here's a little example. The source text is "the quick brown fox...". I want there to be some relationship between "the" and "quick", and between "the" and "brown". As I slide the window across, I get quick-fox, brown-fox, fox-jumps, fox-over: a whole set of training examples which lets me say "this stuff is related". I can even take words at random from the rest of the text and say "this stuff is not the same"; in all likelihood it will be different.
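To make the window idea concrete, here's a minimal sketch (just an illustration, not code from the slides) of generating those training pairs:

    # Slide a window of radius 2 over the text and pair each centre word
    # with the words around it: the "this stuff is related" examples.
    text = "the quick brown fox jumps over the lazy dog"
    words = text.split()
    window = 2

    pairs = []
    for i, centre in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                pairs.append((centre, words[j]))

    print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]

Negative examples would then just be random words sampled from elsewhere in the corpus.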
So let's put this into some kind of algorithm. Every word, whether it's fox, sits, jumps, Singapore, any word at all, gets a 300-dimensional vector. Essentially one row in a spreadsheet: every word has a row, and initially I make them all random. The numbers in each row will be about zero on average, plus or minus one, that kind of number.

What I'll do is slide a window over a huge amount of text, where huge is like a billion words. It could be many more, but a billion is a good number to get started with. For every pair of words in the window, I nudge their vectors towards each other; then I slide the window along and keep nudging. The algorithmic details of "towards", and what that really means, we can leave to the packages, but basically I'm going to nudge the word vectors around until I'm as unsurprised as possible. If I'm given "the quick brown ___ jumps over", I'm quite likely to want the word fox; I'm unlikely to want tractor, and I'm unlikely to want carpet. Those nonsense words don't fit as well as fox does. By nudging these things around, the whole word embedding, this whole spreadsheet of numbers, gradually hones in on good embeddings, and I keep iterating until it's good enough.

The idea is that this whole vector space of words organises itself, without being told anything about English (or whatever language) apart from "here is a bunch of valid text". Wikipedia, for instance. If I run this over Wikipedia, I can then look at it in the TensorFlow Embedding Projector. I haven't got a demo of it, but it's a beautiful thing: 100,000 words of the English language projected into three dimensions, which you can spin around. If I look at the neighbourhood of the word "important", I get "significant", "particular", "essential": nice, relevant words nearby. It has done the job without being given any actual knowledge of English; it read the text and, unsupervised, produced nice vectors.

There are a couple of nice techniques for this, one of which is called Word2Vec. If you go and find my download link (I think we'll post all the slides in the meetup write-up), the slides have links to the papers and code. In Python there's a nice package called Gensim which will do Word2Vec in a couple of lines of code; you can either train an embedding on your own text or download a pre-existing one.
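And it really is a couple of lines. A minimal sketch with Gensim (gensim 3.x-era arguments, so "size" rather than the newer "vector_size"; the toy corpus and hyperparameters are just illustrative):

    from gensim.models import Word2Vec

    # Any iterable of tokenised sentences will do; in practice you would
    # stream something Wikipedia-sized rather than this toy corpus.
    sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
                 ["the", "cat", "sat", "on", "the", "mat"]]

    model = Word2Vec(sentences, size=300, window=2, min_count=1, sg=1)  # sg=1: skip-gram
    print(model.wv.most_similar("fox", topn=3))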
Here's their diagram of what the word embedding process is doing. Given "the cat ___ on the mat", the middle word could be sat, or ate, or drank, but there are lots of things it couldn't be, and the word embedding picks up on exactly that. This is all in Python, with Cython underneath for speed.

Another popular one is called GloVe, which was invented at Stanford. There's an important name hiding inside that "Pennington et al.": Richard Socher, who's now at Salesforce. You'll probably see him in another et al. later, and he's a key person behind the Stanford NLP lectures you might see on YouTube. GloVe is from 2014, and Word2Vec was 2013, so by now you get a blog post as well as the code, which is a nice thing.

I've also got some diagrams. Here's what some of these embeddings look like if I take the plane through the family relationships. The interesting thing is that the direction between "man" and "woman" is pretty much the same as between "sir" and "madam", and between "king" and "queen". So the embedding seems to have learnt something general about gender on its own, which is interesting because it picked this up without me ever saying I wanted to know it. Similarly, here's another one where it's picked up adjectives: short, shorter, shortest. This angle thing is a geometric property it found all on its own. Comparatives are pretty regular, I suppose, but it will do the same even for verbs and their tenses. It knows how all this stuff works just by reading a billion words or so of, say, Wikipedia. OK, so that's GloVe.
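You can play with those directions yourself. A minimal sketch using a pretrained GloVe model via Gensim's downloader (the model name is one Gensim actually ships; exact neighbours will vary):

    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-100")  # roughly a 130 MB download

    # king - man + woman should land somewhere near queen
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))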
So, the good points about word embeddings. They work, and things didn't really work before, so that's positive. They let us give text as input to these models, so we can finally do something with text. If we were talking about images it would be obvious what to do: we have pixel values, and we can feed pixels to neural networks. It was never so obvious what to do with text, which is why people have gone through RNNs and all these other permutations, searching for good ways of doing it. We can also train this in an unsupervised way, so we never had to give it any information about English. And we've got tons of data: it's easy to download Wikipedia just by finding the right link, and in Singapore, one click and in less than a minute you'll have six gigabytes of data. Pretty cool.

The bad points: first, you get a 300-dimensional vector, and who's to say what each element means? You can really only compare it against other vectors; the meaning is opaque. That troubles some people, though it doesn't have to trouble us. The bigger problem is that each word has just one vector associated with it. Take the word cat: apart from its use in "concatenate", cat is a fairly simple word, and so is dog. But bank could be going to the bank, or banking in an aeroplane, or the bank of a river, all sorts of usages, so it's not clear at all where its vector should sit. Should bank be close to savings? Yes, it should. Should it be close to river? Yes, it should be close to both. By forcing all of those senses to be the same vector, you get a distortion in your embedding space, so maybe we could fix it up somehow, because it's actually quite a big problem. One approach people have looked at is splitting each word into several sense vectors, somehow detecting that there are different kinds of bank, but I really don't want a dictionary with humans telling me how many senses there are, because that gets away from the whole learn-from-data thing. I'm going to skip over that because it's super difficult.

Another approach is to use other data, or other models, to infer the meanings, or just to use more context, because when you actually use a word in a sentence it's pretty much unambiguous. If I talk about going to the bank, it's clear what I'm doing, unless I'm in a boat on a river. So, using other data: this is an idea from just last August or so. We know that translation models can be pretty good, and fortunately, ambiguous words often have different translations. I had prepared a nice example with German: the German for the bank of a river ("Ufer") is entirely different from the German for a financial bank ("Bank"), and banking an aeroplane is a different word again, so the translation depends strictly on the meaning. If I can use a model which also knows German, it can back-inform me about which sense of the English word I'm using, the aeroplane kind or the river kind, and I can use that knowledge from the translation model to create better embeddings on my side, depending on what the word is doing in the sentence. Then I can throw the translation away: I never cared about the translation, only about the embedding, which I can then use for whatever I actually wanted.

There's a nice Socher hidden in this et al. too; Salesforce bought his start-up, so he's not doing too badly. It's a nice paper, and it goes back to the encoder-decoder idea Hardeep was talking about. You have a thing taking in English and spitting out German: your pure GloVe embeddings go in, the model applies attention in the appropriate way to spit out German, and at each point the internal states are aligned with the English words but have German sensibilities. The state for "bank" will also carry its chequebook-ness, because that's what's needed to produce the right German word. Once you've trained this model to translate, you can throw away the translation part and just keep the fact that it produces this much more informative embedding of the word "bank". Naturally there's a paper, there's a blog, and there's code, which is in some other framework.

OK, now let's talk about using more context, because translating into German does seem like overkill: I now need a ton of sentences in both languages, and there's a lot going on there. Why not just use a pure language model? This is something that took me by surprise. Generally we've been talking about models of language in a very broad sense; here I mean a language model, whose only function is: given a phrase, predict the next word. That is the entire model; that's it, only predict the next word. So "the domestic cat is a small, typically furry..." would have a nice completion, and "there are more than 70 cat breeds" is probably what you'd say next.
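So a language model is literally just a classifier over "which word comes next". As a minimal sketch of the shape of such a model (a toy illustration, nothing like the real models discussed here), in Keras:

    import tensorflow as tf

    vocab_size, embed_dim, seq_len = 10000, 128, 20

    # Read seq_len words, then output a distribution over the whole
    # vocabulary for the next word: that is the entire job description.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=seq_len),
        tf.keras.layers.LSTM(256),
        tf.keras.layers.Dense(vocab_size, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")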
Or take "this talk is extremely...": boring, complicated, easy all fit, and cat doesn't seem like a good word there. Although I have made the talk fairly cat-orientated, so with the full context of the talk, cat may not be such a bad word here; it depends on how much context I take.

Language models are now receiving more attention. It used to be super difficult to do this one-word prediction problem; with all these fancy attention models it's getting a bit easier, and people are beginning to find surprising benefits. For one thing, we can train using tons and tons of data: there are corpora of novels, corpora of all sorts of stuff you can just download. There are lots of new attention techniques. And people have now discovered that fine-tuning these models works unfairly well. "Unfairly well" also applies to ImageNet, the big image competition: apparently learning to do that task well has had huge impacts on, say, radiography, which the models were never trained for. These models transfer extremely well to other domains, so we can leverage other people's trained models. And it turns out this works for language too.

So here's what we do, and I'll show you three of these models. You take an existing pre-trained language model. Then you take a classifier for your task. Your task could be detecting the sentiment of a complaint, tagging parts of speech, detecting whether my product is mentioned, all sorts of things. You then train those classifier weights very, very quickly, because the very big model is providing the knowledge of English, and your small classifier only needs to learn the small amount of extra knowledge on top, from very little data. What this has led to is the sudden breaking of multiple state-of-the-art records.

The picture is: you take a full language model trained on huge amounts of general English. Then (this example is IMDB, so movie reviews) you take the language in your movie reviews and fine-tune the model a bit, so that it now knows more about movies than just general English. Then you train your movie-review sentiment classifier on top. So you fine-tune a rather big thing, but you don't need labelled data until the very last stage.

First, then: a model catchily called ELMo, which came out in February, this year. They've got a blog, they've got code in TensorFlow, it's now a TensorFlow Hub module, and there's some other code and a tutorial. This is a ground-breaking paper; it's where people really started talking about this whole fine-tuning, improving-word-embeddings business. Here's a diagram of a diagram of a diagram: they take your input sentence, pass it through a whole stack of LSTMs, and at the top there's a thing which just predicts the next word. I think Michael just took a picture of the picture of the picture. So there's a language model sitting on top of all of this, which is what lets them train all of those weights. But you don't have to train them; they've already trained those weights for you.
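And since those weights are published, you can pull them down yourself. A minimal sketch using the TF Hub module I mention in a moment (this is the TF1-era hub API; the URL is the published ELMo module):

    import tensorflow as tf
    import tensorflow_hub as hub

    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
    embeddings = elmo(["the cat sat on the mat"],
                      signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        print(sess.run(embeddings).shape)  # (1, sentence_length, 1024)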
The ELMo vector for each word is then just those middling states added up with some little learned parameters: you superpose all these layers, and that's a much better embedding than the one you started with, because it takes context from the entire sentence, in both directions. Basically it's certified to be a good idea. There's also this thing called TF Hub, which is Google's new way of distributing not just a model but a model with data and tests, all this nice stuff, built into something you can reuse and compose with other models. So with two lines of Python you can take that big model, use its outputs and treat it as a black box. That's kind of neat. But that was February.

Now there's another catchily-named one, called ULMFiT. They published the paper in January but didn't really push it until May. This is the people from fast.ai, who have some good videos; they have a particular way of thinking as well, but this is nice research. Of course there's a blog, there's code, and they've got 400-megabyte models you can download. They're encouraging open-source people to contribute models in their own languages; I don't know whether that works or not, but it's there.

This is from their paper: there's the pre-trained model, then you do a bit of fine-tuning with your own language samples, even if you don't have any labels. Suppose I wanted movie reviews in Singlish: I just pile in tons of Singlish here, which gets the model to understand a few extra words and a few more nuances, the "lah"s and whatever, that the standard English model wouldn't. Then, instead of the language model head, I build a very small classifier which says: is this a good review or a negative review? Going through the three stages: the first one I get for free; the second needs massive data, but not massive labelled data; and the last can use a fairly small sample, because it's leveraging all the rest. And this is shown to work pretty well. Here's the number of training examples you need: a model from scratch with a hundred training examples would be terrible; as you go up to 20,000 examples, now you're talking, you can train from scratch, but it's still strictly worse than taking one of these pre-built models and fine-tuning it. And if you use unlabelled data plus labelled data, you can do pretty well with just a hundred labelled samples. So this is a turning point: transfer learning works for text pretty well now.
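For the ULMFiT side, the later fastai v1 API (which post-dates this talk, so treat this as a sketch of the three stages rather than what was shipped at the time; "data" and "reviews.csv" are placeholder names) makes the pipeline look roughly like:

    from fastai.text import *

    path = 'data'  # hypothetical folder containing reviews.csv

    # Stage 2: fine-tune the Wikitext-pretrained language model on your
    # own, unlabelled domain text, e.g. those Singlish movie reviews.
    data_lm = TextLMDataBunch.from_csv(path, 'reviews.csv')
    lm = language_model_learner(data_lm, AWD_LSTM, pretrained=True)
    lm.fit_one_cycle(1, 1e-2)
    lm.save_encoder('ft_enc')

    # Stage 3: a small classifier on top of the fine-tuned encoder,
    # trained on the much smaller labelled set.
    data_clas = TextClasDataBunch.from_csv(path, 'reviews.csv', vocab=data_lm.vocab)
    clf = text_classifier_learner(data_clas, AWD_LSTM)
    clf.load_encoder('ft_enc')
    clf.fit_one_cycle(1, 1e-2)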
Now, here's the paper Hardeep talked about. I'm definitely not going to explain it in this setting, but basically there's magic in here, and they have twelve stacked sets of this magic, which lets them build a language model on top of which you can add any little classifiers you like, to do all of these different tasks. Naturally they've got a blog and code in TensorFlow, which is great, plus pre-trained models and other stuff, and they get great results. Here's a whole bunch of datasets with the previous state of the art. Let's pick this one, ROCStories: the previous state of the art is 77.6, with people working for years to eke out 0.1% at a time, and they suddenly beat it by another 8% in one go. This is a huge improvement over state-of-the-art results, in one sweep, using one model, which is kind of insane. They've got some nice blog posts.

The other thing they mention, which I thought was neat, is some tricks. Instead of fine-tuning the model, which is a good thing to do but does sound like hard work, you can do a little black magic: just use the language model to rate the problem statements themselves. Let me explain. Here's the trick for sentiment. The normal problem is: the review is "I love this movie", and the question is whether this is positive, so I'd build a classifier to answer that. The trick is to hand my language model two sentences, "I love this movie. very positive" and "I love this movie. very negative", and just ask which is the more likely sentence. If my language model is super good, it will know the first sentence is wholly more likely than the second. And there's my movie-review sentiment analyser, in one trick, without training anything: an off-the-shelf language model giving results straight away.

Here's another trick, for things called Winograd problems. A Winograd problem is like: "the fish ate the worm; it was tasty". So what was "it", the fish or the worm? The trick is to substitute for the "it". I'm interested in what was tasty, so I substitute the fish, then the worm: "the fish ate the worm; the fish was tasty". I pass each version into my language model, and if my language model is super-duper, it will know which one is more likely. And we know this is a really good method because there's a Google paper, "A Simple Method for Commonsense Reasoning", which came out with code. Winograd problems are really difficult for computers to do: people have come up with databases of common-sense knowledge, with whole schemas for building knowledge, and were scoring around 53%, which is not great when you've mostly got two options (some have three, but still, not great). Then this paper came along, did the language-model thing, and scored in the 60s. A wholly better way of doing it, except you have to have tons of data: earlier today we went to download this, thinking, great, TensorFlow, I can do this, and there's a 200-gigabyte download for the model, because it's an ensemble of massive Google-ish models. So I'm not going to do that in Colab right now. This really does work, but at Google scale, so far.
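Both tricks come down to one primitive: which of these sentences does the language model like more? A minimal sketch, where lm_logprob is a hypothetical stand-in for whatever scores a sentence under your language model:

    def zero_shot_sentiment(review, lm_logprob):
        # Append both verdicts and let the language model pick the
        # likelier sentence; no sentiment training data involved.
        pos = lm_logprob(review + " very positive")
        neg = lm_logprob(review + " very negative")
        return "positive" if pos > neg else "negative"

    def winograd(template, candidates, lm_logprob):
        # Substitute each candidate for the pronoun and keep whichever
        # substitution the language model scores as more plausible, e.g.
        #   winograd("The fish ate the worm. The {} was tasty.",
        #            ["fish", "worm"], lm_logprob)
        return max(candidates, key=lambda c: lm_logprob(template.format(c)))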
OK, so now I'm going to do a demo, if it works; thank that guy if it does, and if it doesn't, it's Google's fault. Do I need to rerun everything? No, OK. In case you haven't seen this, this is Google Colab. It's free, and one of the nice things about it being free is that it comes with a GPU already attached, so everyone in this room essentially has a GPU available to them. But only for 12 hours at a time: after 12 hours the machine dies, or more frequently. Still, as long as you're saving your checkpoints somewhere else, you have a beautiful thing. My guess is that Google had data centres already set up for this from four years ago, and now that TPUs have come along they have spare capacity, so this is a huge resource for running models.

I've got the OpenAI language model here. There's code online, you can download it, it should be easy. What I'm doing here is the OpenAI model laid out in one long flow: some initialisation stuff, some encoding stuff for the datasets, and then the training bits. Here's the data: I download both a sentiment dataset and a Winograd dataset, there's some preprocessing, there are the Winograd schemas, and here's the model. They've written this all out in full. Instead of using libraries, they've actually written out how Adam works, they define attention, all of these things. If you're just beginning, rest assured that this stuff is normally stuffed away in libraries you can use in a one-liner; for some reason OpenAI has made it super open and won't even trust Google to remain solid. On the other hand, this attention model, the Transformer model Hardeep explained, which is black magic times 12, is fully defined in this one file. It's not that I've paged through hundreds of thousands of lines of code: there's maybe 100 lines, or 200 at the most, yet it fully defines the model, which is state of the art in many, many ways, written out in very basic operations.

Having done that, I can then do some initialisation, load in the parameters, and run the sentiment classification task. We wrote a function which shows this thing gets 67% accuracy on this dataset without any fine-tuning at all: it's just using the trick. Here's how the trick works, "I loved this movie", "I hated this movie", in this little function here (this code will be available tonight on my GitHub or wherever). It takes the language model, takes the logits at the last step, and asks whether those logits point more towards the word "positive" or the word "negative". It's just rating: if I put the word "very" at the end of my review, would the next word more likely be "positive" or "negative"? So this will actually score reviews for me, and you can play with it yourself; it's interesting.

Similarly for Winograd, and hopefully you can see this. If we run a whole bunch of little tests, and this is a loss, so lower is better: "the cat sat on the mat" has a loss of 5-ish. "The dog sat on the mat" it actually likes more, which I don't understand; it must be a dog person. "The mat sat on the dog", on the other hand, is much less likely, and "mat sat dog the" is very unlikely. So the language model understands what good sentences and bad sentences are, and it understands that mats don't sit on dogs, which is kind of interesting. And if you have a look further, these are the Winograd problems laid out for the substitution trick, and you can see charts for "the fish ate the worm; the fish was hungry" and so on: maps of whether the model is more likely to say the word fish or the word worm in each of these slots. So this allows you to explore the thing, and it's quite easy to use.
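Those sanity checks boil down to ranking sentences by language-model loss. With a hypothetical lm_loss helper (a stand-in for the notebook's sentence-scoring function), it's just:

    # lm_loss is hypothetical: average per-token cross-entropy of the
    # sentence under the language model, so lower means more plausible.
    for s in ["the cat sat on the mat",
              "the dog sat on the mat",
              "the mat sat on the dog",
              "mat sat dog the"]:
        print(lm_loss(s), s)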
I could just press run. I'm not actually going to do that now, but apart from one cell which loads spaCy, an excellent language library for Python, the whole thing runs; it's a five-minute job. You can quite easily click the link, which loads this in Colab, click on the GPU, run it, and see it live. So that's a nice demo from that guy. That guy.

OK, so, wrapping up. The old way for your problem: you build a whole model, you use some GloVe embeddings, you train it, and you need tons of data. The new way, which has really come to the fore in the last couple of months: you take a pre-trained language model, you fine-tune it on your own data, which needn't have labels (it's just to get a feel for the language you're going to be using), and then you train on labelled data. That can be a much smaller dataset, so you don't need nearly as much data, and you can expect better results. That's the trend.

So, to wrap up: suddenly, this transfer learning works for text. Good models are available; you can just download them, particularly in English. If you're in Chinese, we have yet another intern working on that problem, so it's something that can be tackled; if we go to more esoteric languages it's going to be harder to get the data, but it's very doable. These are quite generic problems people took on. The models are pretty big, though: for the ELMo thing, sorry, the ULMFiT thing, we're talking about 400 meg, which will just sit in memory churning away in less than a gig. For the Google one there's the 200-gig problem, which may be more of a challenge on my laptop. I have a repo, and we'll leave links for the slides and my code. My KPI for this is probably GitHub stars, so add a star if you like it.

So that's me done, but I've got some little ads. This is the TensorFlow meetup group, and since you're here, you know about it. The next one is probably going to be early-to-mid July, but we have to see whether that can be coordinated; Sam will be back for that, so fingers crossed. Hopefully, if you're just starting out, there's been something interesting for you, and if you're the type who follows every little tweet about every paper, hopefully there's been something for you too. We've also got a lightning talk: that guy, wherever he is, unless he ran away... ah, so this is a demonstration that it can happen.

One thing I should mention is some Google news from the last week. They've come out with some GPU news: they've now got these new preemptible prices, which means that for less than a dollar an hour, less than a Singapore dollar, you can get one of these V100 GPUs, an insanely fast thing from Nvidia. It may be taken away from you, but as long as you're saving checkpoints you don't really care; you just get another one. Even better news is that they've started releasing TPUs, their own silicon for computation, at less than two US dollars an hour. Roughly speaking, these things are like 15 times faster than a 1080 Ti, so if you can coordinate the flow of data into the thing, which is the main problem, the actual floating-point performance could be running your models 15 times faster than your desktop GPU. Clearly they can scale this up to massive sizes; you'll also be paying the money, but this is pretty cool.

Now, some people have been asking about the deep learning developer course that Sam and I have been running.
We did one last September, and we're going to do another run, roughly September-October-November. The first module of it we've been calling Jump Start, and I'm not sure people have connected that this talk is like the first part of that module, to get everyone going. The long course last year was eight weeks, twice a week, with projects involved; it was kind of exhausting. Hardeep was one of those people, so ask him whether it can improve your prospects; he enjoyed it, I think. One of the key factors is making sure the Singapore government can help Singaporeans and PRs, so we're working on that.

So this first module, of Jump Start, sorry, of the full developer course: we've run one of these, we are running one more in July, and probably one more at the beginning of the September full course. It's two weekdays plus some nights, and you get to play with real models. But the key point, which people didn't really realise when they came, is that you have to do your own project. It's not that there's coursework and you get a grade from us according to the grader, which is fine for Coursera or whatever. This is more: here's your own project, something you care about, so that when you go to an employer, your next employer or your current one, and say "I built this", you actually did build it. It wasn't pre-made for you with some blanks to fill in, and it wasn't that you were in a team where you were the one who fetched the coffee while everyone else built the model. So this is a good opportunity to build your own thing, and making it work will be a key learning. Clearly the full developer course gets harder and deeper.

So there we go. All right, that's me. Thank you, and let's head for the next guy.