So, just because we have some time: for anyone who has already copied the VirtualBox appliance, what you can do is, if you start VirtualBox and do Import Appliance, the whole thing should start running and offer the opportunity to run Jupyter and everything else. So, the talk today is extracting names using RNNs. I will very quickly introduce myself and then tell you roughly what's going to happen. So I've been doing machine learning, kind of startups, kind of finance, for a long time. I moved from New York City to Singapore in 2013. In 2014 I just read papers and wrote code. Since 2015 I've been doing a natural language processing project for a local company here in Singapore and writing a paper or two. So here's a quick overview of what we're talking about. NLP here is natural language processing. I will have one slide on why you might want to do this. Then we'll launch into neural networks, first the very basics, just in case you don't know what's going on with neural networks. And then talk about word embeddings and recurrent neural networks, which are kind of the meat of the talk. And then the workshop, which is where, if you have a laptop, you'll be able to follow along and see how it works; we'll be implementing something I call uppercase NER, which is kind of an interesting project for learning to use an RNN. Then I might talk about some more exotic examples briefly. So natural language processing: this is the one slide if you need to know why. Basically, it would be nice to be able to work with text as an input. While a lot of the deep learning stuff is to do with visual images or self-driving cars or gameplay, the vast majority of people working in offices in Singapore are working with text. Text is kind of the de facto, human-optimised best way of transmitting information. Very rarely do you try and draw pictures of what you would describe. Basically, everything's text. If we can do a good job with this, we have conquered a huge amount of white-collar busy work. So there's some text analysis you might want to do: finding out what people want from their customer requests. You might want to do translation, particularly in this kind of region. You may want to extract knowledge. So the particular topic which I'm involved in for the local company is we read PDFs, where a PDF is just basically text at x-y coordinates on each page. And from that, we want to extract who is in the document, who works where, which company owns which other company, who went to which university, when, and essentially take that PDF or annual report and produce a nice structured database as the output. So this is knowledge extraction. And that's all I'm going to say about natural language processing. If you want to do it, you probably know already.
So I will quickly digress into basic neural networks, just so that we're all on the same page. And for the people who just came in, if you have laptops, you're going to want to have one of these USB keys. Looking at the USB keys won't help you; you need to pick up a USB key to install it on your laptop. Take some and pass them back. So basically, neural networks: the basic building blocks of these things are not at all magical. We take simple mathematical units and we combine them repeatedly to compute complex functions. So here's the typical example of a single neuron. Basically, at the bottom, we have the inputs, which we call x1, x2, x3. At the top, we have our output, which we call y by convention. And in the middle, what we're doing is we take a weighted sum of the various inputs and we then apply a nonlinearity. In all of this deep learning stuff, this is the very basic unit, but it is replicated many millions of times. So here's a very simple diagram of how to make this a bit more complex. Again, we've got three inputs at the bottom, but instead of just having a direct connection between the input and the output via this one cell, we compute interim values. Now, these interim values in the first hidden layer here, hidden one — we've got five different interim values. And the point of these is that, if we are very lucky, when we train this thing it's going to actually learn to represent the data in some more sophisticated way. And back in the 90s, this was where all the mystique was, because people couldn't really train these very well, and this caused one of the AI winters. What has happened since the mid-2000s is that people have worked out how to make this deeper, how to make the training work, and they now find it's kind of embarrassingly easy to make work. So the way you train this, for our purposes, is going to be supervised learning. Basically, we're going to pick a training case. So we've got a particular set of inputs and a particular target value that we want to return. So it may be that the inputs will be the pixel values of a picture and the output is: this is a cat. The alternative would be: this is a dog, say. So you basically pick an input picture — I want this to say it's a cat — and then I say, well, what does the network say? So I evaluate the output for these inputs. And then I modify all the weights in this network so that the output y is closer to saying this is a cat. Now, the beauty of this is that we haven't really specified what it means to be a cat. Basically, if we have sufficient data and sufficient time, and we jiggle the weights in this network just right, this thing starts to say cat more often when there's a cat in the picture. That's at least the big hope. Now, one of the key concepts here is a thing called back propagation. Basically, how do we jiggle these weights, or how do we modify these weights? If the weight is connected to an output, it's pretty easy: it's easy to assign blame for why that weight is giving us the wrong idea or not. But otherwise, what you need to do is start propagating the blame for why it's getting it wrong — or congratulation messages for why it's getting it right — back through the network, which is why it's called back propagation. The errors back-propagate through the network. So people have been doing this for a long time, since the 80s at least. The trick though is that now it works.
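To make the weighted-sum-plus-nonlinearity idea concrete, here is a minimal numpy sketch of one such unit; the input values and weights are made up purely for illustration (in practice the weights are exactly what backpropagation adjusts during training):

```python
import numpy as np

def neuron(x, w, b):
    """One unit: a weighted sum of the inputs followed by a nonlinearity."""
    z = np.dot(w, x) + b                 # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid squashes it into (0, 1)

# Three inputs x1, x2, x3 and some made-up (normally learned) weights.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w, b=0.05))
```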
So, let's have a quick look at what's going on inside. First, we'll have a quick look at the input features, then what each neuron is learning and how the training converges. So, if everyone's got their laptop open, there is a TensorFlow example in there. Or, if you're connected online, you could have a look at TensorFlow Playground. If you search for TensorFlow Playground, you'll get a picture somewhat like this. Now, does anyone have a laptop and not understand what's going on? Because I'm just going to have a quick play with this. So, this is a beautiful thing from Google. Google has got a deep learning library called TensorFlow, but this is kind of a JavaScript-y front end — it's all in JavaScript, so I'm not sure it even has proper TensorFlow embedded. What we have here, on the left-hand side — so basically the picture that we had before is rotated this way — on the left-hand side we've got the various features which it's trying to learn from. And then we have some layers, and then we have an output. Now, what we're trying to learn here is, if you look at the top over here, basically there's a little scatter of — I'm going to call them orange and blue, I'm probably wrong — a little scatter making a target shape. And the idea here is: can we use the features, x1, which is just position along this axis, and x2, which is position along this axis, to learn how to classify, on the right-hand side, whether it should be a blue area or an orange area? Forgive my colour blindness. So, the idea here is, if I press the big go button, it will use these two inputs to create these midway internal features and then use those, gathered together, to make something. So, basically if we do this — you can see over on this side, this is the training error. Basically, as the weights for these internal things have moved around, it has now actually constructed an internal representation which has managed to classify the two areas, one from the other. If we go for something like this — so up in this corner we have essentially quadrants — if we make it do this, what happens is it has a tough time, okay. So, you can see that basically, very quickly, by manipulating these weights, it has learnt some internal representation. And the interesting thing here is, even though its inputs are just vertical and horizontal, it has actually tried to learn this shape by learning some diagonals. So, it's managed to construct this shape out of various pieces of diagonal. And so, even though I didn't tell it what a good internal representation would be, it's figured it out on its own. Now, if you want, you can play around with this. We can add some more things. We can add another hidden layer. We can do something tricky. And here there's a whole bunch of different things which you can try. So, I know for a fact this one is going to struggle. So, gradually it's going to — whoa, okay. So, you can see that out of just a few internal units here, it can start to construct something which fits better. It is not clear that it actually understands what the image is about or why it's doing what it's doing, but it certainly is trying to score as many points as possible on some kind of loss function. Now, if we start this again, you'll also find that sometimes it doesn't figure it out. No, it's going to figure something out. So, this is also a stochastic process.
There's no particular reason to be sure it's going to converge. There is mathematics around this, but it peters out very quickly, so have a go. Having a go with these things is the reason why people, hopefully, have brought laptops with them, because actually seeing this work is the proof that convinces most people — it's as good a proof as you'll get, mathematically. Okay, somewhere in here. Okay, so, in a sense this is the story for neural networks for images. Basically, if you have a bunch of features as your input in some kind of fixed array and you have classes that you want to find — there are other tricks called convolutions, you can make deeper networks — but basically this is all you need to know to make neural networks work, in some sense. You can map images to classes. You can do a great many interesting things. The problem when it comes to text is that text isn't the same shape as a picture. When you have a sentence — let's have a sentence which is 'the cat sat on the mat' — okay, how do you represent the word cat as a feature? So, this is one basic problem: the thing you want to manipulate is in entirely the wrong format. Another thing is that sentences aren't in fixed formats at all. How long is a sentence? We don't want to impose a predetermined maximum length of a sentence; I want to be able to read a sentence of any length. So, the question is how do we pass in this stuff — words — which we don't yet know how to encode, and let me assure you, having just pictures of the word is the wrong way to do it. How do we go from essentially the sense of words as written into something which can be manipulated by these neural networks, and how can we do that for something of indeterminate length? So, in a way I was showing you the basic neural networks just to get a clear understanding of how we'd be mapping from features to output, and how we'd use the errors in the output to feed back via back propagation. So, now let's talk about in particular how to set this up for natural language. So, when it comes to handling words: English as a language, as a base case, let's say has 100,000 words. Typically a vocabulary would be more than this, but you can see that 100,000 words would be a very lossy — sorry, a very inadequate representation, because imagine the very first word. How do we identify this word? Maybe you have what's called a one-hot encoding. Basically it's a one where it is the word and a zero if it's not the word. So, 'the' would be a one and then all zeros, and then 'a' might be zero, one, and then all zeros. So, this is a one-hot encoding, and this is a terrible encoding because it has no idea of the commonality of words. It's 100,000 features for every word location. This is not the right way to do it; it does seem wasteful. But wouldn't it be nice if we could actually learn how words are related to each other from an actual corpus, where a corpus is just a huge quantity of data. So, this is one of the things which people have been trying to play around with since the 90s. They made it much too complicated in the 2000s, and in about 2013 this thing called Word2Vec came along, and suddenly everything made a lot of sense and it became stunningly easy to do this kind of thing. So, word embedding is the major advance here. Word2Vec is one nice implementation; there's another one which I kind of like myself called GloVe.
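To see why one-hot encoding throws away any notion of word similarity, here is a tiny illustrative sketch with a made-up five-word vocabulary: every pair of distinct words is equally (completely) dissimilar.

```python
import numpy as np

vocab = ["the", "a", "cat", "dog", "trampoline"]   # toy 5-word vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0   # a single 1 at the word's position, 0 elsewhere
    return v

# The dot product between any two different one-hot words is 0, so this
# encoding carries no hint that "cat" is closer to "dog" than to "trampoline".
print(one_hot("cat") @ one_hot("dog"))         # 0.0
print(one_hot("cat") @ one_hot("trampoline"))  # 0.0
```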
The basic idea is that for every word you will have a 300-dimensional vector. Now, where do you get these numbers from? Well, you get the numbers by reading text; you don't actually know a priori how the words relate to each other, you just put in large amounts of text. So, what you do with this text — a typical training set would be Wikipedia, just for starters. So you take Wikipedia and you run a window across the entire Wikipedia which is, say, 10 words long. And what I want in my setup is vectors such that if I take two words and multiply their vectors together, the number I get as a result will tell me how likely the words are to be in the same window within Wikipedia. So, 'the' and 'cat' will co-occur quite often. But 'trampoline' and 'cat' — I don't know, that's a bad example — will not co-occur very often. 'Trampoline', 'vacuum', 'cat', these three words will not co-occur very often at all. So, just by reading the text, I can actually start to build a picture of how likely words are to appear together. And the trick here — and this is what people find — is that if you slide this window across the text, and you start everything off with random vectors, and you nudge the vectors so that words which occur next to each other start to have a higher predicted likelihood of being next to each other, what happens is the entire vector space organises itself so that this just works. So — I'm not sure whether you can see this — basically this is saying: let's take a 100-dimensional vector, we train it over a window of words, and — it's probably difficult to see — if I'm looking for the word France, which is word number 454, the most similar ones to the word France will be Austria, Belgium, Germany, Italy, Greece, Sweden, Norway. So, this is purely by reading Wikipedia. I've given it no actual knowledge of the English language. I haven't told it about countries, but it has, on its own, recognised them. And the reason why it may have done this is that the word France will often co-occur with the word country, okay, but the word country will co-occur equally well with each of these other ones; or 'France has a democracy' — all these ancillary words will boost all the other countries in exactly the same way as France, so they tend to occupy the same space. Jesus is probably a bit controversial. Xbox equally. Reddish — okay, so just by reading Wikipedia it says reddish: greenish, bluish, pinkish, purplish, brownish. So not only has it got the colours, but it's got the parts of speech, the usefulness in speech. Or scratched: scratched, nailed, smashed, punched, popped, crimped, scraped. Or megabits: octets, megabytes per second, bits per second, baud, carrots. So you can see that in these columns it has just picked up some regularities in the English language, just by reading text, without any preconception about what English is. So if you do the same with Spanish, the same thing happens. Do the same with Japanese or Chinese — the same thing happens, in that similar concepts will start to clump together. You can actually form maps between these languages, which is interesting. Here's another freebie that you get out of this, and this is the classic one. If you look at the direction between the vector for man and the vector for woman — the difference between them — you can say, well, this is kind of gender. If you then say, well, what's the difference between king and queen, the crazy thing is that you get vectors in the same direction.
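A minimal sketch of the "most similar word" lookup described here, using cosine similarity; the three-dimensional vectors below are obviously hypothetical stand-ins for real learned embeddings, which would have 100-300 dimensions and come from a large corpus:

```python
import numpy as np

# Hypothetical tiny embedding, purely for illustration.
emb = {
    "france":  np.array([0.9, 0.1, 0.7]),
    "germany": np.array([0.8, 0.2, 0.6]),
    "cat":     np.array([-0.5, 0.9, 0.0]),
}

def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    q = emb[word]
    scores = {w: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for w, v in emb.items() if w != word}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("france"))   # "germany" should score highest
```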
So equally, if you look at the difference from man to king and ask, woman to what? — the 'what' would be queen here. So as well as picking up these semantic analogies, it also picks up a geometric interpretation. So here's another one: the difference between walking and walked is the same as between swimming and swam. So here it's picked up a kind of cunning grammatical feature of English out of no preconceptions. Similarly, on this side, if you look at the cluster of countries, and over here on the other side — in this big 300-dimensional space, or 100-dimensional space — here are all the capital cities, you can actually find that the direction between all the countries and their corresponding capital cities lines up much better than you'd expect. And so it has actually picked this up just because words occurring together indicate something. Okay, so that's word embedding. And we can have a look at that, because there's a whole word embedding embedded in the virtual appliance. Networks on sequences, okay. The problem with sentences is you've got a variable-length input. So you've got all of these words, and it's not like a picture where you have a fixed field of view. The number of words could be any length. So how do you apply a single network to this input? The trick here is basically you imagine the words coming in sequentially and you process each time step separately with the same network. But when you process something, you record your internal state. So after you've seen a word, you update a kind of internal memory cell, and then you look at the next word. So the next word is informed both by the word you see and by what you've remembered. You then move along. So gradually you're building a more and more complex internal state as you move along the sentence. Now the trick is, at the end of the sentence, or wherever it is, you actually get the correct answer — your classification or whatever. And the question is, how do we then backpropagate errors? Basically, all we've done is we've had this one network and we've rolled its state from one place to the next, to the next, to the next. So we can identify the weights affecting this at every stage. If we have a clever unrolling machine, which is what one of these frameworks will do for us, it will allow us to backpropagate errors through any number of crazy network architectures. So let's just say this again. So basically, for a recurrent neural network — and this is why it's called recurrent, because it's applied to itself — you take a network and you apply it iteratively. You have an internal state which you can carry forward step-wise, and everything is still differentiable. And because it's differentiable, we can then assign blame from any errors it makes back to every single weight through the whole of its experience. Equally, we can assign blame to the embedding as well, if you wanted to. So here's a basic RNN, and so people start to draw little diagrams. So basically this is the concept: you have an input, some stuff which feeds back on itself, and here's your output. Or you can consider unrolling this through time. So this will be the first word in the sentence: this will be 'the cat sat', full stop. When you see the word 'the' it goes in, gets processed however, gives you some output. And it will also generate something which is useful to the next word, and useful to the next word, and useful to the next word.
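A minimal numpy sketch of that unrolling: the same weight matrices are reused at every time step, and the hidden state h is what carries information along the sentence. The sizes and random weights here are illustrative and untrained; a real network would learn W_xh and W_hh by backpropagation through time.

```python
import numpy as np

def rnn_forward(embedded_words, W_xh, W_hh, b):
    """Apply the *same* little network to each word, carrying a hidden state."""
    h = np.zeros(W_hh.shape[0])          # start with an empty memory
    for x in embedded_words:             # one time step per word
        h = np.tanh(W_xh @ x + W_hh @ h + b)
    return h                             # a summary of the whole sentence

# Toy sizes: 4-dim word vectors, 3-dim hidden state, random (untrained) weights.
rng = np.random.default_rng(0)
sentence = [rng.normal(size=4) for _ in range(6)]   # e.g. "the cat sat on the mat"
h_final = rnn_forward(sentence, rng.normal(size=(3, 4)),
                      rng.normal(size=(3, 3)), np.zeros(3))
print(h_final)
```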
And at the end of the day, we then use whatever this is to say, okay, how are we going to reward everything which went before it. Now this is the simple version: basically we've still just got an input, some multiplication, a delay, and an output. And because every node knows the history, and the weights are basically the same — the same network has been used again and again — you can think of the actual depth of the network as being through time rather than through depth. Okay, so this is the same diagram: here's your input, here's a little plus, a little non-linearity, here's your next state rolling along and along. Now it gets a bit more complicated, because one of the problems with this basic RNN is a gradient problem: if you have a mistake coming out at the end of the chain and we need to start rolling back these errors over time, every little gradient along the way gives us an opportunity for getting it more and more wrong. And what tends to happen with this basic RNN is that the gradients explode or the gradients vanish. And so if you've got a very long sentence, you'll never train the word 'the' at the beginning of the sentence — assuming that's the first input word — because this is kind of an unbounded problem: these gradients build up and up and multiply on each other. So if you have some factor a here, by the time it has been applied to itself, to itself, to itself, you've got a to the power of 20 at the end of the sentence, and it's very difficult to pass back any gradient information. So what people do is they have gating mechanisms. Basically the input starts to split here, and instead of just applying this to the output, there is a thing which either forgets or remembers or clears, which enables the state to be kind of stored raw. So rather than being multiplied by a number every time, basically it gets either set or reset. People do this and they say, well look, it can be differentiated, and as soon as it can be differentiated, it can learn, therefore we're going to use this. And what people find is this idea of forgetting or remembering is really effective. And it prevents the problem, because the same state can be brought verbatim from one step to the next: if the gate says yes, it will be brought verbatim, and because it's verbatim you have no gradient loss, there are no blow-ups going on. So people do this; okay, this works; this is a one-liner in your framework. The next thing people do is a thing called LSTM units. There's an updating term, there's an input gate, an output gate, there's a forget gate. All of this is a whole bunch of matrix manipulation with a somewhat crazy set of nonlinearities, and it will be implemented in one line in one of these frameworks. If anyone says, oh yes, we're using LSTMs, it's shorthand for this. But basically this has become a detail which just works. Piled higher and deeper — also PhD, so. As well as the LSTMs extending out time-wise, you can then pile them on top of each other. So not only is the RNN producing state for itself, it can produce state for the next layer — higher and higher level representations. So you can stack these, and this enables it to get very sophisticated. On the other hand, the sophistication is not really in the model, it's in the fact that you need a GPU.
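To make the "remember or overwrite" idea concrete, here is a simplified, GRU-flavoured update step in numpy. This is only the core copy-through mechanism: a full GRU also has a reset gate, and an LSTM has separate input, output and forget gates; the weights and sizes below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x, h_prev, Wz, Uz, Wh, Uh):
    """Simplified GRU-style update: gate z decides how much of the old state
    is copied through verbatim and how much is replaced by new content."""
    z = sigmoid(Wz @ x + Uz @ h_prev)        # near 0 = keep old state, near 1 = overwrite
    h_tilde = np.tanh(Wh @ x + Uh @ h_prev)  # candidate new content
    return (1.0 - z) * h_prev + z * h_tilde  # the copy-through path keeps gradients alive

# Toy demo with random (untrained) weights: 4-dim input, 3-dim state.
rng = np.random.default_rng(1)
h = gated_step(rng.normal(size=4), np.zeros(3),
               rng.normal(size=(3, 4)), rng.normal(size=(3, 3)),
               rng.normal(size=(3, 4)), rng.normal(size=(3, 3)))
print(h)
```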
So you have your GPU, or a cluster, and you pile the data through it and you see whether it learns. Basically, if you have something really sophisticated, you'll start off with something simple and you'll pile on more and more until something learns. Once something learns, then you can start to dial it back a bit. What people also do is run them forwards and backwards. So one of the things is, if I've got a sentence and I'm only going forwards, I can never anticipate the verb at the end, for instance if it's German; whereas if I'm running both forwards and backwards, I can carry sense in both directions. People also play games with having a tree-like structure. Anyway, the key point here is that everything is still differentiable, so we can blame any output on the weights which caused it. Because of that, I can then fiddle every little weight and train this thing. Because it's differentiable, it can be trained; because it can be trained, it can learn to do NLP, which is kind of surprising. Are there any questions? Is there any good way you can vectorise that sentence? Yes. So all I'll do is I'll show you, and you'll have it on your machine. The vectorising of the sentence is this word embedding. So this is kind of step one of all of the natural language work: come up with a word embedding, so that every word comes with a bunch of numbers, which is the input. So the word 'the' — well, the word 'the' is not a very good example — will have a series of numbers which is its representation; 'cat' is a much more defined place in space, because cat is near lion, it's a fair distance from dog, but it's very far from trampoline. So we can use the numbers for cat, and as you go along the sentence, you'll have a whole bunch of numbers, and we can then roll this RNN along that. So the words vanish almost immediately into the embedding. Does that make sense? Can this be applied to non-Latin languages like Chinese, for example? So the answer is yes. The issue with Chinese is that your alphabet is very large. On the other hand, from my understanding, the actual vocabulary is kind of small, in terms of hundreds of thousands. English is an extremely rich language, and also has lots and lots of overlapping meanings. Chinese has a segmentation problem, so not only do you have symbols, you have pairs of symbols or combinations of them — but absolutely, people do this all day long. The grammar is also somewhat simplified in Chinese, from what I understand. You said it's kind of surprising — can you elaborate on why the differentiability is kind of surprising? So it's not surprising that it's differentiable, given that these units are set up as just linear combinations of things, and the non-linearities are not even particularly problematic — in fact, they're non-differentiable only at a few places, and you almost never hit them. So you have a huge function which is differentiable. What is surprising is that it can actually learn language — that there is sufficient space, that just by manipulating parameters over this whole space, this thing will actually capture some sense of the language. What about Bayesian analysis? So there is a kind of Bayesian branch of neural networks. The Bayesian guys have proper mathematics, and that's beautiful, and they do the right thing.
The problem is that they also have issues with making this stuff work, whereas the neural net guys will turn up and blithely just make it work. So yes, there's a whole bunch of serious work with Bayesian stuff, and there are people who are trying to marry the two together to say, well look, really this network is doing something which is pretty Bayesian. But the problem that the Bayesian guys have is that the neural net guys come and trample all over them with the next crazy innovation. Is that brute force? Brute force is a very good word for it, yes. But the Bayesian approach in a way is kind of idealised — it's a spherical-cow kind of thing — and it's brilliant, and can give you really beautifully correct answers. But the thing is that the neural net guys want results, which is sad, sorry. Isn't the vanishing gradient a numerical problem, essentially, or is it more than that? It is a numerical problem inasmuch as that's where it finally gets you: you will get lost in numerical noise. It's because you have a lot of small numbers and you can't propagate the error, right? But what if you had a big enough float, a big double? So you're saying if you had enough precision, it wouldn't be a problem. This is one of the things which held up the field a lot in the 90s, in that people would set up the networks with matrices of small random values. The problem is, if all your weights are small and random, if you think about the amount by which your input gets resized, the whole thing just gets smaller and smaller; it's much better to have a matrix with an expansion factor of about one, because then everything kind of takes care of itself. And so this is not quite a condition number, but it's kind of like the expansion factor of this matrix: if you can make it about one, then you don't get this vanishing problem. And this is one of the key things — people were always wondering, well, why have all my weights vanished to zero? — and once they realised you should set them all to be something like one over root n, that's a much better number to make your matrix well behaved. Now what people do now, in order to force that, is take an input vector and normalise it, so it's about the right kind of scale, and they do that at multiple levels through the network, so the neural network has no opportunity to get into this mess. So this kind of renormalisation — batch normalisation — is another trick that people do to avoid getting trapped. So while the numerical precision thing is what gets you in the end, if you can sidestep that entirely, you're in a much happier place. Let me press on. So now, let me just have a quick check. Okay, let me talk. So this is basically the intro to what's on your laptops now, and we can have a quick go through. Basically the motivation here is named entity recognition, NER. The motivation is that part of my job was to build a quality NLP system for a Singapore startup. An essential component of doing NLP is to have good named entity recognition. There are various licensing problems which prevent you from using existing NLP systems, and I didn't want to use any expensive data. And so this is kind of the background as to why someone might care about this. This is a quick example of what NER does. Basically the first sentence is: 'Soon after his graduation, Jim Soon became managing director of Lam Soon.' So there are various problems with the sentence.
Obviously we want to identify Jim Soon as a person, as an entity, and Lam Soon as the organisation that he works for — and to know the difference, because a priori I don't know that Lam Soon isn't a person. The fact that 'Managing Director' has capital letters means I can't really rely on capital letters. The first word, 'Soon', is just naughty. So you can see that this sentence is not so easy. You need to understand a bit about 'becoming': people become managing directors; you are managing director of a thing, an organisation, not a person. Anyway, if you can understand more about English, then you can pick this up. So the question here is: can we train a recurrent neural network to do NER? And the way we would do this is we would create a word embedding, we'd get some kind of annotated data set, and then we'd train the RNN on the data set. But — and this is partly to do with what I've handed out, in fact — it's very difficult, very expensive, to have a human-annotated corpus, in that we actually want to train on a billion words. And in order to have a properly annotated billion-word corpus, I need someone to go through it, marking out where all the names are in a billion words. This will take longer than I have here. So the trick I'm going to use — instead of where people usually spend their time and money, on expensive data, which is definitely something I can't hand out — is this: I'm going to take Wikipedia, and I'm using NLTK, which is the Python Natural Language Toolkit, and I'm going to annotate Wikipedia using this software tool. So it's going to point out where all the noun phrases are. I'm then going to train the RNN on this machine-annotated text to see whether it can learn what NLTK can do. So NLTK has been honed over years; linguists have looked at this. I'm going to abandon all of that knowledge and just learn what it tells me, and then I'm going to see how it performs. But the twist I'm going to make, to make it interesting in some sense, is that I'm only going to let the RNN have lower-case text. NLTK, and we'll see this, loves to have capital letters when it's doing this NER task. The problem comes in a PDF: it may be that someone's name appears in a title which is all capital letters, and NLTK will not give you any good information. But if I can make the RNN learn how to do NER from caseless text, then that's a win. So that's the thing. So for anyone who has a computer with the thing open — can anyone find the Jupyter notebook thing? If you go in — has anyone got this? Has anyone not got this? Basically, you'll see here, if you go into the VirtualBox Jupyter examples, there's a thing called Deep Learning OVA, which you can launch. When you import that appliance, it will start off an entire Fedora machine. That Fedora machine has TensorFlow, and Theano, and Jupyter, and word embeddings, and Inception 3, and all sorts of nice stuff already there, ready to run for you. There's also a password. If you click on this, there will be a password. So this is running on port 8080. The password is 'password'. It's not secure. Sorry, I will do something similar, so excuse me for a moment. So this is the virtual machine which is in there. It's a fully configured Fedora 25, running Python 3. That should be running; let me just get a window. So on here, there is notebook number 9, the RNN tagger.
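The machine-annotation idea can be sketched with NLTK's own out-of-the-box chunker. This is not the notebook's exact code, just an illustration of turning NLTK's output into per-token 0/1 labels that an RNN could then be trained to imitate (it assumes the standard NLTK models — punkt, the POS tagger and the NE chunker — have already been downloaded):

```python
import nltk

sentence = "Soon after his graduation, Jim Soon became managing director of Lam Soon."

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)        # part-of-speech tags
tree = nltk.ne_chunk(tagged)         # NLTK's own named-entity chunker

# Flatten NLTK's tree into one 0/1 label per token: is it inside an entity chunk?
labels = []
for node in tree:
    if isinstance(node, nltk.Tree):  # an entity chunk, e.g. PERSON or ORGANIZATION
        labels.extend([1] * len(node))
    else:                            # an ordinary (word, tag) pair
        labels.append(0)

print(list(zip(tokens, labels)))     # these pairs become the RNN's training targets
```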
So there's a whole bunch of stuff in here: there's MNIST, there's GoogLeNet, there's reinforcement learning, which is kind of cool, anomalies — it's my stuff. So what I'm going to do — I'm conscious of the time; where's the man with the time? He's run away — what I'm going to do, I think, is spend about 20 minutes going through this in some detail, and hopefully you can see some stuff. Basically, we're going to import a whole bunch of good stuff. Theano here is the framework, along with Lasagne, which is a kind of layer library. Is everyone okay? And basically I'm going to import NLTK, and I'm going to use a sentence splitter. So what we have here — this is number four — you can see that it has taken a sentence, this 'Mr Smith' sample text, tokenised it and split it into these tokens. So this is kind of an essential part. Okay, so what have you got? Have you started VirtualBox? Okay, you don't have to log into VirtualBox at all. If you go into a browser and go to localhost colon 8080, it should ask for a password, and in that box you type in the word 'password' and enter, and it should give you a... Has anyone got it working? You can show him? Yes? Found someone? No, no, it's okay. So, okay, let's keep pressing along. So the corpus we've got here is the first 100,000 words of Wikipedia. And this corpus-sentence-tokens thing is something you just pull off in a Python yield way — it's a generator. It just gives me more and more sentences. So here's a sentence, here's another one, here's another one. So you can see that this is pulling stuff out of Wikipedia, as much as I want, basically — an endless supply of Wikipedia. And this is what NLTK will say for this. So, okay, basically it has pulled out... these are the tokens which it can produce. So let's see what parts of speech it finds in this sample text. You can see that this, number 13, is just NLTK saying what part of speech everything in this sentence is. So this is something which comes out of the box with Python, which is nice. What I'm going to do here is — I only care about nouns; for now, I only care about nouns. And what I'll do next is a GloVe embedding. So there is a nice package called glove, which does GloVe embeddings for you. Basically, when you import glove, you can give it these lower-cased sentences and then just fit the thing. So this is creating the co-occurrence matrix, which is basically which words occur with which other words for the first 100,000 words of Wikipedia. And now let's create the word embedding. Now, this takes a little while, and you'll see how long it takes. So this is going to take like 30 seconds. On the other hand, a good word embedding takes a lot longer than 30 seconds, so you'll see. So each time it's going through, it's producing... I think it's using all of my cores. It's going through a bunch of epochs; it's going to do 20. Now, it may seem like it's taking a long time, but the reality is we're trying to learn the English language right now. The difficulty of what it's trying to do here shouldn't be underestimated. If you give a five-year-old — or a zero-year-old — two minutes to learn English, they don't do a very good job. So there we have a 50-dimensional word embedding. And... unfortunately this is not so good. So I have a little function here: word embedding, most similar to the word king — story, Henry... Henry is plausible... a band, and house. Not so good.
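For reference, the glove package usage just described looks roughly like this. This follows the glove-python style of API (Corpus and Glove objects); argument names may differ slightly between versions, and the two toy sentences are obviously stand-ins for the 100,000 words of Wikipedia used in the notebook.

```python
from glove import Corpus, Glove

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]   # lower-cased token lists

corpus = Corpus()
corpus.fit(sentences, window=10)          # build the word co-occurrence matrix

glove = Glove(no_components=50, learning_rate=0.05)       # 50-dim vectors
glove.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)   # so words can be looked up by string

print(glove.most_similar("cat", number=5))
```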
Let's have another: most similar to the word road — radar... channel and route aren't so bad. So basically, in these 30 seconds, by looking at 100,000 words of Wikipedia, it has learnt something — not very convincing — about the English language, but it has learnt something. There are other helper functions. And this is the analogy one: it takes the vector for Paris, subtracts France, and adds the one for Italy. And my hope is that I get the word Rome, but I don't: Moscow, Paris, Connecticut. It's got the idea that it wants cities and stuff, but it hasn't got a very good idea. So the problem is the embedding is poor. Fortunately, there is a pre-baked one, which uses quite a large corpus, trained for quite a while. So if we now look at this — if you work at this a lot harder — the most similar words to road are bridge, highway, route and lane, which is pretty good. This is now saying Rome, Italy, Milan, Turin, Venice, which is quite good. There's a whole bunch of other examples. So this is the man, woman, king, queen thing: queen, daughter, prince and throne are up there. So this is showing that this word embedding stuff works. You can play with it; it's right there. Training this for 30 seconds doesn't do a whole lot of good; you can download pre-trained GloVe embeddings or pre-trained word2vec embeddings and have away with it. So what I'll do now is quickly go through this part-of-speech tagger, mainly to demonstrate that it can be done here. Basically, I have some helper functions to produce batches of sentences. And this function here will produce the NLTK version of NER for given sentences. And this is the embeddings for the sentence. So in here, in this thing, is the main meat of the network. Basically this defines some layers as inputs; here is a forward RNN, which is a GRU layer; here is some special magic; and then here is a backwards GRU layer. And at the end we concatenate them together, we then meld them together, and we produce a single output: is this word a named entity or not? Now, to train this, basically we get all the parameters; this single line works out the gradients of every weight within this whole thing; and then this is the thing which builds the training function, or updates every weight. So when you're doing this, in practical terms everyone will be using a framework, and that framework will enable you to define how all of the weights and all of the inputs relate to each other. But because the framework has complete knowledge of the functions you're building, it can do differentiation all on its own. So you never have to explicitly differentiate anything; you just say, I want my updates to go in the direction of the differential — or there's even more fancy stuff that people do to make this quicker. And eventually this is going to define the RNN model. So the aim of having this in a VirtualBox appliance is that people who are keen can actually delve into it. This is actually running on your laptop. And rather than just watching me, the fact that you could actually take this apart yourselves and play with it, I think, is a win. Assuming it finishes. Now, at this point one thing I should mention is TensorFlow. So the Theano which I'm using here is something produced by a group in Montreal and has been going for quite a while — here we go, it's defined.
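As a rough picture of what that network definition and the "one line of gradients, one line of updates" look like in Lasagne/Theano, here is a hedged sketch. The layer sizes, the Adam optimiser and the sigmoid/binary-cross-entropy output are illustrative choices, not necessarily the notebook's exact ones.

```python
import theano
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, GRULayer, ConcatLayer, ReshapeLayer,
                            DenseLayer, get_output, get_all_params)

BATCH, SEQ_LEN, EMB_DIM, HIDDEN = 16, 32, 100, 64   # illustrative sizes

l_in   = InputLayer((BATCH, SEQ_LEN, EMB_DIM))           # embedded words in
l_fwd  = GRULayer(l_in, HIDDEN)                          # read left-to-right
l_bwd  = GRULayer(l_in, HIDDEN, backwards=True)          # read right-to-left
l_both = ConcatLayer([l_fwd, l_bwd], axis=2)             # glue the two passes together
l_flat = ReshapeLayer(l_both, (-1, 2 * HIDDEN))          # one row per word
l_out  = DenseLayer(l_flat, 1, nonlinearity=lasagne.nonlinearities.sigmoid)

targets = T.matrix('is_entity')                           # 0/1 label per word
prediction = get_output(l_out)
loss = lasagne.objectives.binary_crossentropy(prediction, targets).mean()

params  = get_all_params(l_out, trainable=True)           # every weight in the net
updates = lasagne.updates.adam(loss, params)              # gradients + update rule in one line
train_fn = theano.function([l_in.input_var, targets], loss, updates=updates)
```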
And it's good code — well, it's Python. So what I'm doing here now is some actual iterations. Is it actually training? So what you do for Theano is you define the various layers of your network and the calculations that need to be performed, and when you say run, Theano will spit out C code or C++ code or CUDA code, essentially by doing print statements. It's kind of insane that it works, but it does work, and for a long time people have been hacking away at this thing, and it's amazing that it works. What has happened with Google, though, is they recognised a good thing and they created a framework — partly using the same people that did Theano — and they wrote it in C++ and they actually used engineering principles, and that's much better. So even though Theano and TensorFlow are very rooted in the same kind of concepts, TensorFlow comes along and it's fast and it's bulletproof and they've produced backends for lots of different things. It's not particularly memory-efficient, because Theano was developed at a time when people had much smaller GPUs and cared about every little cycle, whereas the Google developer boxes have 128GB of RAM as a minimum, so TensorFlow tends to be less careful with your memory. On the other hand, it's well done. So at the moment I'm in between doing the Theano thing and moving over to TensorFlow, since I see that as being more of an industry-applicable thing. And even though people are going to want TensorFlow, understanding how Theano works gets you 90% of the way there — it gets you a huge distance, and there's very little barrier to entry to doing TensorFlow instead. This needs to get up to iteration 1000. It's getting there. And this is a typical neural network learning cycle, in that I've defined the inputs, I've defined some outputs, I've defined how to train this network to get closer and closer to those outputs, and it iterates slowly towards a network which is getting better. And you can see that the training loss, which started out as 0.38 — this is some measure of how well it's doing — is now gradually decreasing and decreasing. This is gradient descent acting in our favour. It's 1000 times because that's as long as I can stand. If I'm running this at home — so I was fortunate, I am fortunate, to have a Titan X, which will work — then I can use numbers other than 1000. This is another factor in deep learning: most experiments last about 8 hours, mainly because that's as long as people can stand to wait. If the experiment completes in two minutes, then you're not trying hard enough. If it takes three weeks, then you can't do enough in a year. People tend to run it for as long as they can remain sane. One of the issues with doing it in VirtualBox is that I can't get to your GPU without huge amounts of jiggery-pokery. This whole thing is developed in the open — this is a GitHub repo — and there is a method for running this locally, where you can specify, yes, I want my GPU to be involved, and it will do the whole thing. Here, I can be sure that everyone's got it set up in the same way, and it runs in 244 seconds; I don't really want to do that. Good. Here is the proof of the pudding, and unfortunately it's in debug mode. Here are some sentences, and each word is followed by, firstly, what NLTK thinks and, secondly, what the RNN thinks.
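The 1000-iteration training run just described is essentially the classic loop below — a minimal sketch, where `batch_generator` is a hypothetical stand-in for the notebook's batch helper and `train_fn` is the compiled training function from the previous sketch:

```python
import numpy as np

losses = []
for iteration in range(1000):                     # 1000 because that's as long as
    x_batch, y_batch = next(batch_generator)      # anyone can stand to wait
    losses.append(train_fn(x_batch, y_batch))     # one gradient-descent step
    if iteration % 100 == 0:
        print(iteration, np.mean(losses[-100:]))  # the training loss should drift down
```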
Dr Andrews, 1 1 — it's identified that as being a name; works at Redcat Labs; it perfectly agrees with NLTK. So basically this has worked: it has understood what NLTK thinks about every word here. Let's see what a sentence about assistance and solitude looks like — that's all nothing. 'When are you off to New York?' — NLTK has picked out 'New' in 'New York'; the RNN failed to pick out 'new' as being a special word here. Jitanya is recognised as being a thing. But if we then go into the other mode — basically I've just lower-cased everything — so for 'dr andrews' here, you can see that the RNN doesn't actually care what case this is in; NLTK absolutely does. NLTK is always zero everywhere — it does not know anything about 'NLTK', even though it's part of the library — whereas 'dr andrews' has been picked out, 'redcat labs' has been picked out. So the RNN, without case information, understands about the English language in a pretty surprising way. So this has been learnt. For the embedding, I cheated by having a known-good embedding; on the other hand, the RNN has learnt all of this stuff from Wikipedia as we sat here. So, sorry, there's kind of a QED: it actually works. NLTK is hopeless for this case, whereas the neural network has learnt to do something from NLTK's data which NLTK can't do itself. So — was the word embedding built on the lower-cased corpus? Yes... or rather, I'm not sure where I got the word embeddings ultimately; I think I downloaded them direct from the Stanford site and just carved them up, because I only wanted 100,000 words. You can play around with this; nothing will be destroyed. I think the case might be important, because of how you map the words if you consider the case: I think I must have done something to force everything to lower case, otherwise for New York it would have capital 'York' versus lower-case 'york', or 'New' versus lower-case 'new'. So let me — somewhere in here — excuse me — let's not do that, let's do that, let's not do that — okay. So basically, the picture which you saw the code for is: you've got these input words, you've got a forward-passing GRU, you've got a backward-passing GRU, and then these y's correspond to 'is this a name or not'. So — questions? We can add questions. So let's just go, very quickly, into kind of a wrap-up mode: more exotic stuff. So we can see that sentences as input kind of works, in that we've actually had a concrete example of putting sentences in and getting answers which indicate that it understood something about English. Can we output sentences as well, and what can we do — where does that go? So one trick that people do is that they feed the outputs back in as if they were inputs. So at every node the network produces some output, you then choose an answer, and then you feed that answer in as if it was new input. So here's an example — now, the key thing is this is all differentiable, and since it's all differentiable, we can train it. So here, basically, we take an input of some kind — say the word is 'the' — we do this stuff and pick out the word it's suggesting, and it may suggest the word 'cat'. We then say, okay, given that you said 'the' and my next word is 'cat', what do I say next? And I might say 'sat'. And then you keep going, eating your own output, until you say stop. And that is a way where you can actually start — or you can even start with just a fake hidden state — and say, produce me some text, and the neural network will just spit out words until it says stop, each output being the next input. So I have that — there's a whole thing on here, but there's no time — basically this allows you to do poetry. So this is feeding in the words of Shakespeare: at first it comes out looking like Perl, then it starts to
understand a bit about word and letter frequencies. So this is another thing: it's producing a letter at each time step, and using what it just produced, it produces the next letter and the next letter. So it has read the poems of Shakespeare, I think, a hundred times here — it's not great — and after a thousand it's begun to understand a bit more about Shakespeare; well, 'understand' — it's better at faking it. Now, if you choose a bigger network and give it the plays of Shakespeare — and this is producing one character at a time — it actually has stage directions, and it's figured out how to do a whole bunch of stuff, so this is kind of cool. So this is output: you just train it on text, and every time it comes up with a letter you say, no, no, the letter should have been this other letter, it should have been this other letter, and it gradually gets the idea — that it should indent things properly, say. It's kind of interesting, but what can you do with this? Well, Shakespeare is fun, but what you can do is, instead of starting it off with just a word of Shakespeare, you can start it off with a state here which you generate from an image. So what you can do is present it with an image of a cat sitting on a mat, and you just hope that it spews out the words 'the cat is sitting on the mat', and you do this with lots and lots of images — and this is all differentiable, so it will learn it. So here's some image labelling. This is what Google will do on your image if you upload stuff to your photo folder. So: 'a person riding a motorcycle on a dirt road' — this is produced entirely by the machine spewing words out, at random but guided by what it has seen in the image. So these first ones are pretty good. There are some errors here: this is 'two dogs playing in the grass' — now, the thing is, this network has obviously made an off-by-one error, but it has no concept of what dogs really are, or running on grass, or playing, or anything like this; it has just spewed this out because this kind of thing is associated with this kind of picture. 'A skateboarder does a trick on a ramp', which isn't quite right. 'A red motorcycle parked on the side of the road' — this is not a red motorcycle. This is a very simple — sorry, this is exactly what it is, this is a picture of that thing. It doesn't use anything other than units you've already seen, in that this is a whole bunch of matrix multiplies, a whole bunch of funky gate things, trained using gradient descent, and it seems to understand something. Another game we can play is called sequence to sequence. So basically you can take in some words, and we can use the end state of this to spew out other words. The reason you might want to do this is to translate between English and Chinese, and this is what Google is now doing for translation between all language pairs. So it used to be that people only wanted English to French, and to have a huge corpus from English to French. They've now designed this kind of interface in such a way that any language pair will train all other language pairs. So if I have English to French, which is easy to find because of the Canadian Parliament, it will also help me translate English to Korean, given that I can translate Chinese to Korean and English to Chinese — all these pairs will now reinforce each other. So there's been a step change in how good this translation has got within the last year. There's even more crazy stuff. So this is kind of the maximal crazy that is worth diagramming. Basically, one problem these things have is, if you've got a very long sentence, you
do tend to forget the order of things, and you can imagine a very nasty sentence which would say 'my brother's, mother's, daughter's, sister's... whose cat is this?' — it would be very easy to get lost in this. So what people do is have something called an attention mechanism, whereby you have your original sentence, and it throws up kind of markers which will be part of the hidden states, and it then gives the output side the opportunity to look back at those markers and essentially figure out where it is in the sentence. But the key thing is: these markers are produced by a process which is differentiable, the looking back can be differentiated, everything can be differentiated, so this thing, all on its own, will learn how to pay attention to the sentence. So if you give it an English sentence and ask for the French output, you can actually get diagrams of how it jiggles the words around, because it's shifting all the word order around — so you can actually see some of this kind of mental process. Anyway, questions — this is the home straight now. So: when you are doing recognition of names, most of the time people don't write names correctly — for example, 'Martin Andrews' may be written together — so are you able to find some similar representation to identify that this is a name? Actually, the corpus idea works really well for English names, because my name — Martin, and Andrews — is probably in a 400,000-word dictionary. Similarly with most Chinese names, because even though there's a huge variety of Chinese names, the individual words are not that unusual. But when you get to Indian names, they seem to be agglomerating huge names, and there are lots of unique names. So the problem is, if I read it — I'm not that familiar with Indian names, but if I see an Indian name it's clearly an Indian name — it won't be in my dictionary, and if I'm doing this tokenisation business, this embedding business, I will only have an 'unknown' token associated with it. So that's a problem. So the next step in this is to go one level deeper, which is: let's read every character separately, run the LSTM over all the characters, and get it to learn the embedding implicitly. So it won't understand the words from an external embedding; it will just learn, from the whole of Wikipedia, what every chain of characters means, and then you would have a chance of saying, okay, I found an Indian name. So doing it one level deeper, on a character basis, is kind of where people are heading now. But then it's not so much that you can piece the components together — the embedding is kind of a nice component which you train once across a big corpus. I'd love to try that. But anyway, the problem with this kind of training is that you only train it on valid stuff, and wouldn't it be nice to have a flag which says 'invalid input'? Basically, you will never actually see any invalid input. So this is another kind of big open question, in that there's all this image stuff which will train on a thousand different image classes, but none of it will say 'this is a bad image', or an 'alternative fact' — it's a problem. And also it's a very domain-specific thing, so that if you trained it on news or Wikipedia and you start feeding it tweets, it's going to have a very, very hard time — or movie reviews, those kinds of things. As for the size: the size is basically determined by how much resource you give it to begin with. It's not as if it will auto-scale to be bigger because you gave it a more difficult task — it will
just perform less well. Now, the surprising thing about this little NER task is that just using two units works fine. It's crazy that it works so well. So in some ways, just doing it and seeing whether you get any kind of result with a very minimal model is a very valid first step; but by the time we get up to translating between multiple languages, it's going to need big capacity. So, okay, let me conclude, and I'll be around afterwards. There is a whole deep-learning workshop — this is my repo — and all the code in here will actually create a VirtualBox appliance for you, if you wanted to, but it also basically allows you to work on a local basis. This deep learning stuff you can apply to NLP; it's not such an easy fit as images; it's still in its early stages, and people are trying crazy stuff just to see what will work. Please contact me if you have feedback. I should point out there is a new TensorFlow group — sorry, with apologies to the PyData group. We've started this; its very first meeting is in the middle of February; it's sponsored by Google; it's going to be all TensorFlow. So if you love this stuff and have got your feet wet with TensorFlow, excellent, come along to that, because that's good. On the other hand, it won't be as diverse: all of this stuff here is with Theano, and it's not going to be about Theano and NLTK and stuff, it's going to be TensorFlow, because otherwise we won't get pizza. Hiring: I would love to hire someone full-time to do commercial NLP, and some kind of intern person for more prototyping and demos — just contact me via email, very casual. The end. Thank you. If you have asked a question, you can actually take away one of these — please do pick them up — and if you haven't asked a question, you will still find them over here. Thanks again, thanks a lot, and thank you to Dr. Martin again.