learning with BERT. And this is intended to be a more beginner-oriented talk, though my impression of what counts as a beginner may not quite match yours. What we're trying to do is, probably next month, have something which is definitely for beginners, like how to make your first CNN model, that kind of thing. So next month will definitely be more of a beginner session, and hopefully there'll be tips and tricks which are useful to everyone. Tonight is billed as somewhat beginner, but it's also something of a revelation in terms of what's happened to natural language processing.

So, about me: I have a background in machine learning, startups and finance. I came here in 2013. In 2014 I was just having fun reading papers and playing with robots and drones. Since 2015 I've been serious about natural language processing. I'm a Google Developer Expert for machine learning. I organise this meetup with Sam. We've been writing some papers, including this year; we've now got four papers in, so that's quite good. We're just a small company, doing quite well, and we also run a developer course, which I'll talk about in a little bit. Red Dragon, and I think Sam will talk more about this, is a Google partner. We quite like to develop prototypes for people, but we're also very interested in conversational computing, natural voices and knowledge bases. So that's what we do when we're not so busy.

So I've done the "who am I". What I'm going to talk about is traditional deep natural language processing. Now, "traditional" here only goes back about four or five years, because this stuff didn't exist before 2013; that's when these techniques were developed. So when I say traditional, I mean the style that most of the courses online will teach you. It has all changed this year, and in particular this summer, which is why in-person courses are quite good. I'll talk about the traditional approach, then what the innovations have been this year, and then the new hotness, which is BERT, coming from Google. There's probably no time for actual code, but I'll give you some hints.

OK, so: traditional deep natural language processing. The key elements here are embeddings, which are super useful, bidirectional LSTM layers, and then initialization and training. A traditional model will look something like this. Now, how many people know what I mean when I say LSTM? Oh, quite a lot. How many people know exactly what this diagram is? Not exactly. How many people have a good idea of what this does? Yeah? It should be pretty much the same people. So basically this is a fairly standard model: words go in at the bottom, and each word is converted into an embedding. The embedding feeds into an LSTM RNN going one direction, and another one going backwards as well. Those two outputs are then added up, and the result is fed into some kind of classifier, or here a CRF plus a classifier. This is how people have been doing it for a number of years now, and it's been quite effective. Clearly it's been beating the traditional natural language processing methods, TF-IDF and so on, which look totally antiquated at this point. But this approach is becoming less useful now, too.
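To make that diagram concrete, here is a minimal sketch of such a model in Keras (the talk doesn't show code; the layer sizes are placeholders, and a plain softmax stands in for the CRF layer mentioned above):

```python
import tensorflow as tf

# Placeholder sizes -- not taken from the talk.
vocab_size, embedding_dim, num_tags, max_len = 20000, 300, 10, 50

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,), dtype="int32"),
    # Each word index at the bottom becomes a dense vector (the word embedding).
    tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
    # Forward and backward LSTMs over the sentence, summed at each time step.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True), merge_mode="sum"),
    # A per-word classifier; the tagging models described above would put a
    # CRF here, but a softmax keeps the sketch short.
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```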
So, who here has heard about word embeddings? Quite a lot, OK. The idea of word embeddings, just as a recap, is that words which appear close together in a big corpus should be similar to each other in some way; words in the same neighbourhood indicate each other. That's the big theme, and the principle should apply to all text. To make a word embedding, you assign a vector, initially random, to every word. You can think of this as an Excel spreadsheet: down the A column you'd have every word. You'd start with "the", "and", "of"; a little further down you'd have "city"; further down still "space station"; and as words get rarer and rarer you go further down the list. In English you'll have a couple of hundred thousand words. Across the row, though, you'll have something like 300 numbers, which are the vector representation of that word.

Now, to figure out what those numbers should be, you take a window and slide it across your text. Within that window, you say that the vectors should kind of average out to indicate each other; they should predict the words inside the window, and words outside the window get pushed away. There's a way of sampling this so that, as you read through a huge corpus, the vectors all get pushed and shuffled around. And what happens, given a decent-sized corpus, Wikipedia for starters, is that the vector space self-organizes. This is a visualisation from TensorBoard, which is part of TensorFlow, and it can display the cloud of essentially all words in the English language. You can also say, well, I'm interested in words which are near, say, the word "important", and because the space has self-organized, you'll find that "significant", "particular" and "essential" are all pretty close. So essentially for free, without any linguistic knowledge apart from handing it a copy of Wikipedia, you've self-organized a whole map of the English language.

What you can do with those word embeddings, that numeric representation, is feed them into an LSTM, which is essentially a network that is the same at every time step but passes a state forward each time. As it reads the sentence, the state gradually gets more and more feature-rich. One issue with that is that the end of the sentence becomes very feature-rich, while at the beginning you hardly know anything about what's going to happen, which is why you run one of these in both directions, so you get features from both ends. Another problem with the LSTM unit is that, in a big computation, you don't know the answer at the end until you've calculated everything from the beginning. That forces a speed limit on how fast you can train these things, which is embarrassing if you're using a TPU, which doesn't want to work that way: unrolling forces a sequential calculation.

In terms of initialization, most people just say, well, I've got all of this really good stuff in the embedding, and that's the only free lunch I get. Apart from that, the weights in my network are just random, and I'm going to use the data for my training task to learn the task.
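As a concrete aside (the talk doesn't show code): one common way to train embeddings like this is word2vec via the gensim library, roughly as below. The toy sentences are made up; a real run would use something like a tokenised Wikipedia dump.

```python
from gensim.models import Word2Vec

# `sentences` is an iterable of tokenised sentences -- a real corpus would
# have millions of them, not this toy list.
sentences = [
    ["the", "city", "council", "approved", "an", "important", "measure"],
    ["a", "significant", "change", "to", "the", "space", "station"],
]

# 300-dimensional vectors, a 5-word sliding window.
# (Older gensim versions call the first argument `size` instead of `vector_size`.)
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# After training on a real corpus, nearby vectors are semantically related:
# the neighbours of "important" come out as "significant", "essential", ...
print(model.wv.most_similar("important", topn=5))
```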
And typically that from-scratch approach needs quite a lot of training examples. Suppose I'm interested in movie sentiment. If I just give the network a collection of words, it may have a good embedding for each of those words, but it doesn't know about the syntax of a sentence or how things connect together. Your sentiment examples basically have to teach it the semantics of the English language from the beginning, so it probably needs a lot of training examples. OK, so that was the old style as of 2013.

Let's talk about the innovations. There are several things which have come together, and I'm going to go through each one. There's a thing called byte pair encoding, or SentencePiece, which has been released open source. There are transformers; so, who here knows what a transformer is? OK, that's a more select crowd. There are language modelling tasks, which have come to the fore mainly this year. And then there's the concept of fine-tuning, which is more like an April thing so far.

So, byte pair encoding. This has existed for quite a while, but it's a useful technique because, with the word embeddings I mentioned before, what people do is take a whole list of every word in the English language they can think of, or every word in their vocabulary, and train it from scratch. If I then came along with a new word that's not in the vocabulary, I would just have to assign it an "unknown" label; I would have no idea what its embedding should be. So if I knew the word "book" and the word "booking", but I wanted the word "rebooking", I would have no clue. Whereas if I could split words up into little units, and English does this quite often, "rebooking" should be very related to "booking", with the "re" doing something to the whole of "booking". By using this kind of technique, I can make a vocabulary which is effectively infinitely large.

What I do is start out with a character encoding. So I'd say, OK: one for A, two for B, three for C, giving me 26 characters, plus an end-of-word character. And suppose my initial vocabulary is this: "low" repeated five times, "lowest" twice, "newer" six times, "wider" three times, and I want to represent it more efficiently. What I'd notice, if I did some counting, is that "r" plus the end-of-word character occurs nine times, because of "newer" and "wider". So I'd say "r + end-of-word" is a new symbol: at the end of my A, B, C up to Z, I'd have a new symbol for "r + end-of-word", which could be symbol number 27 or whatever. I've just done a merge, and I've replaced all the "r end-of-word" occurrences with this new symbol. And I can do this again and again: "e" followed by "r end-of-word" would also be merged nine times, so now I've got an "-er" ending, which is a useful English ending. You could also imagine that, in a big corpus, "the" would be collapsed into a single symbol very quickly. If I just do four steps of merging, that's already enough for an out-of-vocabulary word like "lower" to be composed of "low" plus "er".
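Here is a minimal sketch of that merge procedure on the toy vocabulary just described (low ×5, lowest ×2, newer ×6, wider ×3), along the lines of the byte pair encoding paper:

```python
import re, collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with a single merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words split into characters, with '</w>' marking the end of a word.
vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
         'n e w e r </w>': 6, 'w i d e r </w>': 3}

for step in range(4):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(f'merge {step + 1}: {best}')
```

Running this learns four merges; applying them in order to the unseen word "lower" splits it into "low" + "er</w>", as described above.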
And that's exactly what you'd want. What people have found is that this works really well for the English language, and it enables you to build an effectively unlimited dictionary out of components which arise naturally. You can learn it without any linguistic knowledge; you now all understand how it works, and you can also read the paper and download the code from GitHub. SentencePiece, which implements this technique and a couple of others, has been released and is a good way of creating these vocabularies. All of these slides will be on the meetup link, underneath the meetup discussion, so you're welcome to take pictures, but the slides will be there.

OK, so SentencePiece is released. The next thing is the transformer. For a single sentence piece, we do something with attention, written in the box, then various norms and feed-forwards, and at the end some kind of prediction. So this is a purely feed-forward kind of network, and in these papers you typically stack 12 sets of these; you build quite a large pile of them. These blocks are parallel, feed-forward computations rather than RNNs. When you get to a proper model, you take in the words at the bottom with some embedding, and then you have this attention mechanism. Instead of an RNN, where you're rolling forward through the sentence the whole time, attention looks simultaneously at all the words in the sentence to find out which are most relevant. If I'm talking about going to the bank to make a cheque deposit, the sense of the word "deposit" depends on the word "bank", and the sense of "bank" depends on the word "cheque". It's not really a linear relationship between these words; all of the words essentially vote for being picked. This attention flows up through the model, and it's an extremely powerful way of doing things. There's a paper called "Attention Is All You Need" where they show excellent results using just this attention mechanism, without any RNNs at all.

Now, the other trick that people have cottoned on to is that, as with embeddings, we're going to train on a huge corpus of text, but this time we're interested in getting the context a bit more solid. So instead of training one vector at a time, we train the entire model, embeddings included, over the entire text, and the kind of thing we do is get the model to predict the next word. I'll give it a sentence: "I went to the bank to deposit a ...", and I expect it to tell me "cheque". If it doesn't, I say, no, you're wrong, the answer is "cheque". Then what's the next word? Probably a full stop. And after that? There's a whole set of things you might plausibly say next, and a whole lot of words you would not say. By doing this, you use a plain corpus of text to roll out an understanding of the English language in a fully unsupervised way. So that "predict the next word" objective is a language modelling task. There's another one called a cloze task, where you take a sentence, delete some words, and ask the model to fill in the blanks. Another thing you might do is take two passages, switch between them, and ask the model to tell you where the switching point is: if the text starts off about one thing and switches to another, and the model can identify where, then it understands the language better.
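Side note: the attention mechanism mentioned above boils down to a few lines of linear algebra. Here is a minimal NumPy sketch (the toy input is made up, and in a real transformer Q, K and V are separate learned projections rather than the raw input):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every position attends to every other position at once,
    instead of rolling forward one step at a time like an RNN."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relevance of each word to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over the sentence
    return weights @ V                                          # weighted mix of the other words

# Toy example: 6 words ("I went to the bank to ..."), each a 4-dim vector.
np.random.seed(0)
x = np.random.randn(6, 4)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)   # (6, 4): one context-mixed vector per word
```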
Back to the pre-training tricks: another one uses novels. You take a sentence from one part of a novel and follow it with a sentence from a different novel, and ask: is this a plausible sequence of sentences or not? If the model can detect which is which, it has actually captured sense across larger passages of text. All of these tricks can be trained almost for free in terms of linguistic knowledge. It seems obvious that this should work, but even a year ago people weren't really playing this game.

The next thing that came along, partly due to fast.ai hustling these models forward, is that you can take a model which is pre-trained on a large corpus, the "predict the next word" language model, say, and fine-tune it. Say I want to do movie sentiment. I'd take a whole bunch of movie reviews, and even if I don't know whether they're good or bad, I take as many reviews as I can find anywhere and train the language model further, so it understands movie-review language better. I'm not telling it good or bad yet; I just want it to understand what people say about movies. Then, at the very last moment, I say: OK, here are a hundred labelled movie reviews, learn to say good or bad in response to each one. This approach can work much better than the previous approach of training from scratch, where you might need 10,000 movie reviews, because there you have to learn the language of movie reviews at the same time as the sentiment task. Here, the model already has a very good grasp of the English language, and you're just fine-tuning the last little bit. So this is a super effective technique, and it's why these things are getting very good.

In terms of recent progress: in February 2018 you've got a thing called ELMo. ULMFiT came in May 2018. OpenAI came out with their transformer model in June 2018. And now the new hotness is BERT. BERT is clearly the successor to ELMo, though I don't think they mention that. It came out in October, so it's fairly new. So what do you do with BERT? Basically it's one of these models: full of transformer layers, you do the sentence-piece thing at the beginning, you have the whole transformer stack, and then you train it to do lots of different tasks.

One of those tasks is SQuAD, a question-answering dataset. To do the SQuAD task, I have a question and I need a response. The question might be "When was the beginning of World War II?", and I'd give it the Wikipedia article about World War II, and I know the extent of where the answer is in terms of words. The answer for SQuAD is: tell me the start word and the end word. So to train this task, you put in the question, then a separator, then the passage, and force the model to point at "I'm the start" and "I'm the end". All of these common language tasks can be forced into the same model, so rather than building special models for every different task, you can say "I want this to perform in this way" and just use a standard model.
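As a rough illustration of that input packing (the example question and passage are made up, and plain whitespace splitting stands in here for BERT's real WordPiece/SentencePiece tokenizer):

```python
question = "When did World War II begin ?".split()
passage  = "World War II began in 1939 when Germany invaded Poland .".split()

# One sequence: [CLS] question [SEP] passage [SEP]
tokens = ["[CLS]"] + question + ["[SEP]"] + passage + ["[SEP]"]
# Segment ids mark which tokens belong to the question (0) and the passage (1).
segment_ids = [0] * (len(question) + 2) + [1] * (len(passage) + 1)

# The model's job is then just two classifications over token positions:
# which token starts the answer span, and which token ends it.
start_label = tokens.index("1939")   # the answer span here is the single token "1939"
end_label   = tokens.index("1939")
print(tokens, segment_ids, start_label, end_label, sep="\n")
```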
And this slide is way too small, but I'd encourage people who are interested to read the BERT paper, because it's actually quite well written and lays things out nicely. Basically you've got the previous states of the art at the top, including the BiLSTM approach you've heard about and the OpenAI model, and these new scores are very much higher than they used to be. People had been slowly climbing: if we look back in history, the old approaches might have scored, say, 50; when the deep learning stuff arrived it jumped to 60; then people clawed their way up to 70. That's the kind of impression to have. And then the BERT results are suddenly a sea change. This year, these states of the art have been beaten by wide margins as we've gone through the year.

What's nice is that BERT has working code on GitHub, under an Apache 2 licence, so anyone can use it wherever you are. They have scripts to reproduce all the results in the paper. They've released pre-trained models in both base and large sizes; large is really for the TPU people, and for base you need something like a 12 GB GPU, so this is not a light model. Alongside the English models, they've also released a multilingual model which does 102 languages simultaneously, and a Chinese one. They've been hard at work with their TPU pods and pre-trained all of this, and we know that producing one of these models costs a very large amount of money. On the other hand, fine-tuning one of them can take under an hour, or even five minutes, depending on the size of your task. So the fine-tuning stage is very fast compared to building the thing initially on Google-sized data. The other nice thing is they've got a ready-to-run Colab. I won't do it now, but you click on this and you get a whole notebook showing how to run the model on a Cloud TPU, a machine which already has TPUs attached, all for free. Colab has a TPU for free, so you can just run this and have a play with it. So that's kind of interesting, when it works.

So, if you've got a problem: the old way was that you build your model, you take a GloVe embedding, one of the freely available embeddings on the internet, you train it up, and you need lots of data. The new method is you take a pre-trained BERT-style model, you fine-tune it on all the unlabelled data you can find, maybe it's emails, so you dump in your Gmail or whatever, and then you put in the few labelled emails that you actually care about. You need far fewer of those than you would with a regular model; you don't need as much data, and you should expect better results.

So, as a wrap-up: BERT is the latest innovation in this NLP trend which has only really taken place this year. It has beaten all the state-of-the-art benchmarks. It's fully released, with all the results out in the open, as code should be. And people are describing this as the ImageNet moment for NLP: with ImageNet, people trained the thousand-class image models until they were basically better than humans.
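To make that "new way" concrete, here is a purely hypothetical sketch of the final fine-tuning step in Keras. `bert_encoder` is a stand-in for a layer wrapping a pre-trained BERT checkpoint and returning one pooled vector per input sequence; it is not a real API from the BERT repository, just a placeholder for whichever implementation you use.

```python
import tensorflow as tf

def build_classifier(bert_encoder, num_classes=2):
    # `bert_encoder` is a hypothetical pre-trained layer: its weights are reused.
    token_ids = tf.keras.Input(shape=(128,), dtype=tf.int32)
    pooled = bert_encoder(token_ids)
    logits = tf.keras.layers.Dense(num_classes)(pooled)   # the only brand-new weights
    model = tf.keras.Model(token_ids, logits)
    # A small learning rate, so fine-tuning nudges the pre-trained weights
    # rather than overwriting them.
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    return model

# Usage sketch: a few hundred labelled examples is often enough.
# model = build_classifier(bert_encoder)
# model.fit(train_token_ids, train_labels, epochs=3, batch_size=32)
```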
What people didn't realize, or didn't pick up on so quickly with ImageNet, was the transfer learning of those models: once you have a superb ImageNet model, you can use it to recognize mammograms, all sorts of different things. You've built a very good vision model even though you trained it mainly on dogs, and that model can be applied to lots of tasks, and it's all released for free. This is a very similar kind of moment.

OK, so that's my talk; I'll take questions in a little bit. Oh, I will do a couple of adverts though, because I'm that guy. You've already found the Deep Learning Meetup group, because you're here. We'll have the next one in December, probably here, and typically we want it to have talks for people starting out. Hopefully, if you're a beginner, you found something interesting in what I said, but next month in particular we're going to focus on having something good for beginners, plus something from the bleeding edge; this talk is fairly bleeding edge as well. And hopefully we can get some lightning talks, so if you want to give a five-minute talk about anything at all related to this kind of subject, just have a word with us at the end. Also, we now have over 3,000 members in this group, making it one of the largest TensorFlow groups on Earth, which is kind of crazy. So that's down to you, right?

Sam and I run deep learning workshops, which we call jumpstart workshops, and there are other courses beyond that for more advanced stuff. Basically what we're trying to do, over two days plus online content, is force people to actually do stuff on their laptops and also do some kind of project of their own devising, or one of our sample projects if you want. We feel the hands-on thing is super important, because just clicking through some pre-made modules is not so helpful. If you're a Singapore citizen or PR, the cost is heavily subsidised, apparently up to 100%; find out if you're one of those lucky people by going to SG Innovate, and if you've got to the right place on SG Innovate it will look like this.

There's also the Deep Learning Developer course. This is a full course, and we ran one of these last year with quite a number of people; some people even swapped jobs after the course, to a better job, which is an awesome return on capital. The dates are as yet unknown, but there's definitely more to come after the jumpstart. What we want people to do is take the jumpstart before the full thing, just so they know how to run these models and we don't have to explain everything from the beginning again every time.

And one final thing: Sam and I have a little company called Red Dragon, and we're on an active intern hunt. We're looking for people who want to do deep learning stuff as much as possible, though we recognise that people have academic obligations. We've had someone working with us over the last couple of months, and he's now appearing at NIPS, which is an interesting thing for a Singaporean student at the moment. When we originally announced the intern hunt we said this was a possibility, and now it's an actuality. So we're very grateful to him, but please do approach us if you're such an intern. OK, up next is that guy. OK.