Before we can start to discuss the more exciting stuff: we have Karthik, who's going to talk about sentiment analysis. So if you've got a website and you want to understand the complaints, that's what he'll be talking about. We've got Anusha, who's then talking about summarization, which is an important thing if you're deluged with data. And finally Sam will talk; he's got about a five-minute presentation.

But now, let me go for it while we're getting ready; people are still coming in. So, I'm going to be talking about recurrent neural networks and text, and this is our fourth meetup. If you haven't been to the other three, the introduction to this is going to be rather abrupt.

This is a bit about me. I have a background: I did a PhD ages and ages ago in Machine Intelligence. I was in New York doing finance, kind of a start-up-y thing. I moved from New York to Singapore in September 2013. For 2014, the whole year, I basically had fun: I was doing open source, reading papers, writing code, playing with drones. That was a good time. Since 2015, I've been in serious mode, doing natural language processing and deep learning at a local company. And I'm on particularly good behaviour, and particularly serious tonight, because we have four representatives from the company at the front here. I say it's serious, but actually I have quite a lot of fun anyway. I've also been doing workshops and writing papers, and I'm generally a deep learning enthusiast.

So, in outline, for my piece I'm going to talk a little bit about basic neural networks, then rather abruptly move on to recurrent neural networks, both the basic idea and the problems that immediately arise, and then GRUs and LSTMs, which are essentially the basic building blocks that people use. Then I'll talk a little bit about natural language processing, which is in some sense a trickier subject than just picture processing: tokenisation, and also word embeddings. And then I've got a little application demo about named entity recognition on single-case text, which you'll understand by the time I get there.

So, quick review. A basic neuron does simple computation; layers of neurons can do feature creation. And if you haven't already: if you go to the RedCatLabs presentations page, you can actually see this presentation on your own laptop. You can also look at the slides from the previous talks; they're all online. When I say simple, this is a single neuron. What we have here is some inputs: these X's could be, for instance, the humidity, the precipitation, the sunshine, the temperature, and the output is "is it winter?". This is a very simple idea: in order to map these inputs to this output, what I do is sum up these inputs times some weights, and then apply a non-linearity, and that's the result of one neuron.
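As a minimal sketch of that weighted-sum-plus-non-linearity computation (the feature names, weights, and numbers below are made up for illustration, not from the talk's slides):

```python
import numpy as np

# One neuron: inputs are humidity, precipitation, sunshine, temperature; output: "is it winter?"
x = np.array([0.8, 0.6, 0.2, 0.1])    # example feature values (made up)
w = np.array([0.5, 1.0, -1.5, -2.0])  # weights - these are what training adjusts
b = 0.3                               # bias term

z = np.dot(w, x) + b                  # weighted sum: a linear feature
y = 1.0 / (1.0 + np.exp(-z))          # non-linearity (sigmoid) squashes it into (0, 1)
print(y)                              # closer to 1 means "winter", closer to 0 means "not winter"
```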
Now, if you think about it, by multiplying these weights by the inputs I'm generating a kind of linear feature, and the non-linearity means I'll have something linear on one side and zero everywhere else. So this is a very simple function; you can't learn much with this function on its own. But if you change the weights, it will do different stuff, and you can then combine these in multiple layers. So the question is: how would you train all of the weights? If you did a linear regression, you could easily train this first one, but by the time you put them in layers it's very difficult to see how errors up here translate into errors, or changes in weights, down here.

If you want to understand how this all works, there's a thing called the TensorFlow Playground, which is great fun to play with, at playground.tensorflow.org. It allows you to set some X's, or some features, on one side, and try to classify some points into blue and orange on the other side. You can add different numbers of neurons within a layer, or different numbers of layers. If you press the play button it will train, and you can watch it converge, or not. A very nice little example.

But the basic takeaway, and this is where I finish with plain neural networks: for supervised learning you have inputs and outputs where you know the data. We've seen what a single neuron can learn, but the goal is to train a whole network to predict the outputs from the inputs. What we do is essentially play a blame game. If we're making errors at the output, we can assign how much blame goes to the layer before: how much should we jiggle each of these little weights to fix up what it told us? But equally, that gives us an error at the previous layer, and we can then assign blame backwards again. By assigning blame all the way through the network, iteratively, you can blame every single weight for the error that you got. So if the input was a picture, and you said it was a cat when it was a dog, you can ask "what caused me to say that?", and every single layer through the network gets fixed up a little bit. Of course, that fixing up will change every other answer you've ever given, so you have to iterate this, and this is where you need a GPU, basically to make the thing finish.

The other takeaway is that these deep networks create features on the intermediate layers, and even if you don't understand the structure of the problem yourself, it has often been observed that the network will create features which are useful for solving the problem. The mathematics of this is not well understood. So if you want to show your boss "here's a model which will always work, for these reasons", this is not the right room to be in. If no other model works, then this is the right place to be, because these things often work in practice, though in theory no one really knows why.
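A toy sketch of the "blame game" described above: a two-layer network trained by gradient descent on made-up data. The data, sizes, and learning rate are all illustrative, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # 100 examples, 4 inputs (e.g. weather features)
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]  # made-up target

W1, b1 = rng.normal(size=(4, 8)) * 0.5, np.zeros(8)  # input -> hidden
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)  # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: assign "blame" for the output error to each layer in turn
    d_out = (out - y) * out * (1 - out)   # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # blame propagated back to the hidden layer
    # Nudge every weight a little in the direction that reduces the error
    W2 -= lr * (h.T @ d_out) / len(X);  b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * (X.T @ d_h) / len(X);    b1 -= lr * d_h.mean(axis=0)

print("accuracy:", ((out > 0.5) == y).mean())
```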
Okay, that was simple neural networks and deep neural networks; now on to images. One of the nice things about an image is that you've got some organisation amongst the inputs. Each input here would be the top-left pixel, then the one next to it, then the one next to that, and these are much more related to each other than the temperature and the precipitation and random other variables. Images have coherence amongst them: there's this whole concept of up and down, left and right; you've also got rotations and various other transformations which are self-consistent within the image. Wouldn't it be nice to use an operator which understood that?

So the idea here is that we're going to learn a filter instead of just a single number. The way to think about it is that Photoshop knows about filters: Photoshop filters let me manipulate an entire image at once, in the same way everywhere. I can apply a blur, a sharpen, an edge detector, something like that. These are very simple filters, controlled by very few parameters: you have a very simple kernel which you apply across the whole picture. To produce the next feature you just vary these nine numbers and you get a whole new picture out; you can produce multiple different pictures, multiple different views, and you pile these all together and overall you have a convolutional network. These work super well, and if you want to know exactly how well they work, that was the last three meetups. I'm sure we'll get back to images, but today is text.

So the one thing we haven't addressed is sequences. In the previous examples we've had a set of features that we knew, or an image of a certain size, but a lot of real-world data occurs in a sequence. Whereas before we had fixed inputs, lots of domains have sequences of stuff. For instance text: English text is definitely a series of words, or you can think of text as just a whole string of characters, including spaces. Or you could think of a question and response, an email, a whole dialogue, as a sequence of different events that happen. Equally, you can think of the audio I'm spewing at you as a whole sequence of CD-quality 16-bit values; how would you deal with that? Equally video clips: what you see from me is also a huge sequence of events. So the question is, what kind of technique can we apply to these again and again? Essentially, by a symmetry argument, you want to do the same thing again and again, so you should have the same parameters again and again.

For processing sequences, the variable-length input doesn't fit the models we've had before. What you want to do is run a network on the inputs at a given time step, and then use that same network at the next time step, and the next. It's all very well processing these as independent events, but the whole point of a sequence is that they're linked together, so what you do is pass along a hidden state from one network to the next one. This one will have an output which feeds into the hidden state of the next one, which then feeds on to the next one, and the next one, and the next one.
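A minimal sketch of that recurrent idea: the same weights applied at every time step, with a hidden state carried along. The sizes and the random "sentence" below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 50, 32
W_xh = rng.normal(size=(embedding_dim, hidden_dim)) * 0.1  # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1     # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_dim)

sentence = rng.normal(size=(7, embedding_dim))  # 7 "words", each a 50-d vector
h = np.zeros(hidden_dim)                        # uninitialised, pre-sentence hidden state

for x in sentence:                              # one step per word, SAME weights every time
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)      # new state depends on the input AND the history

print(h.shape)  # the final hidden state summarises the whole sequence
```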
But the trick is that this is all the same network, just used repeatedly, and even though we don't know what the hidden state should represent, this thing will learn to have a nice hidden state, a nice internal representation, because you incentivise it to do so through the whole blame-game thing. You can think of it as a time sequence, but if I've got an answer at the end which I'm trying to get to, I want to assign blame, and that blame will then propagate backwards through time. It's as if I had a very deep network: if I want to blame the very first word in the sentence as an input, I have a very deep network to deal with. So in a sense, time is depth. Hopefully we can learn some features internally which are useful, and it turns out that this often happens.

This is kind of the key point, so I'll say it again: recurrent neural networks. We have one network at every step of the input; it has an internal state that is carried forward each time step, like a history. Because each of these is a mechanical operation of multiplying by numbers, adding numbers, and passing the result forwards, up to the end, at the end you'll have an error, and you can pass it all backwards. Another word for this blame game is back-propagation: back-propagation of errors is the derivative chain rule. Because this whole thing is simple operations which you can take derivatives of, and which things like TensorFlow make it super easy to take derivatives of without even having to think about it, because it's differentiable you can train this to do tasks, and it will learn those tasks.

So here's a picture of the basic RNN. It could be symbolised like this: a network feeding on itself. We can think of it as a chain, where these inputs lead to a network with some hidden state, which passes on to the next one, and this builds up more and more hidden state inside. Each node knows its history, all the weights are tied because it's essentially the same network, and you've got network depth being time.

So this is a plain recurrent neural network. This is the same diagram again: you start with a hidden state, which could be just zero, the uninitialised, pre-sentence beginning. Then, for each word that comes along, however it gets into the network, we do some multiplications, do some adding, and pass along whatever is left over; then the same again. You have this mess which is evolving as each new word comes in, and at the end you say "give me the answer", and the answer could be "is this a positive sentiment?" or whatever. If it gets it wrong, you then blame everything for telling it the wrong information, and you do this for millions of sentences, millions of different blames, and this should learn.

But the problem is that this has a gradient problem. In a long sequence, the early inputs are very deep, in as much as the very first word is a long way back. If I said "not a good movie", the word "not" is kind of further back; with "not a good movie for families with young children", the "not" is a long way back: "a good movie for families with young children" would be a nice sentence to have, while "not a good movie..." is a very bad sentence to have. So you've got to reach all the way back through this chain, but each step in this very simple version is just adding and multiplying, adding and multiplying, adding and multiplying. You're multiplying by the same parameters again and again and again, so if you make a tiny error in a parameter, the numbers will explode. If the parameter is near one it's going to be fine, but if it's 0.5 you'll have no error signal left, and if it's 2 then the error will be exploding. So this is a problem for this architecture, and typically people weren't able to train very long networks.
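A tiny numeric illustration of that point, with made-up numbers: apply the same factor at every time step and the signal either dies or blows up.

```python
# 50 time steps back through the sequence, multiplying by the same factor each step
for factor in (0.5, 1.0, 2.0):
    signal = 1.0
    for _ in range(50):
        signal *= factor
    print(factor, "->", signal)  # 0.5 -> ~1e-15 (vanishes), 1.0 -> 1.0, 2.0 -> ~1e+15 (explodes)
```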
So the solution to this is that instead of always multiplying by a weight, what you want to have is a straight-through path. I want to be able to go directly from my error here to the "not" at the beginning, because that's actually the most important thing, and this would give me a gradient-one path: it would essentially transport my error to the beginning. And what you'd want is to be able to switch that error signal on and off somehow: gating.

So this is what a gated recurrent unit looks like. This is now a whole bunch of matrix multiplies, some of which gate and some of which pass along. Here is my input, this is the hidden state from before, this is a kind of remembering, this is a forgetting, this is a tunnel state; it's a whole mess of stuff built up to gate the signal on and off, and up here is my straight-through path. You can use this in TensorFlow or in Keras: there will just be a function GRU which does what you want. So in one sense this is a difficult diagram; in another sense you just need to know the letters G, R and U.

Here's another one, which is very, very popular: this was invented by Schmidhuber's group in Switzerland, as he will tell you. This is the long short-term memory unit. Here are your inputs, your X's; up here are your hidden states, and going across here there are all sorts of multiplies and non-linearities and stuff. But the only difference between this code and that code is the letters LSTM, as far as, say, Keras is concerned.

OK, so: PhD, "piled higher and deeper". You've got these layers, and you can pile them higher: the hidden stuff in one layer can be the X's of the next layer, why not? If you think you've got some good features this way, but it could be more hierarchical, then have another layer on top; we can just build it up that way. Or, if you've got context which may affect you backwards: suppose I'm speaking German, where my verb is at the end, and that could inform my thoughts at the beginning; I actually want to run information backwards along the line as well. So as well as having my English text going forwards, "not a very good movie for families with children... unless they're girls", that last bit really affects what came before.

The key point is that even having done all of this, with all these matrix multiplies and everything, it's all still differentiable, which means you can train these things. And given that natural language processing is about a sequence of words, or whatever, it's a sequence, so we can probably learn to do natural language processing using this machinery.
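A hedged sketch of how small that GRU / LSTM distinction is in Keras. The layer sizes, vocabulary size, and task (sentiment) are made up; this just shows the API shape, including stacking a second recurrent layer and running a bidirectional one.

```python
from keras.models import Sequential
from keras.layers import Embedding, GRU, Bidirectional, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=50),      # token ids -> 50-d vectors
    Bidirectional(GRU(64, return_sequences=True)),  # swap GRU for LSTM here and nothing else changes
    GRU(32),                                        # stacking: a second recurrent layer on top
    Dense(1, activation="sigmoid"),                 # e.g. positive / negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```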
So: text. That was part one; part two is text. Text is very intuitive to everyone, and that's kind of one of its problems as an AI domain: it's not so obvious what's going on, and I'm going to point out how bad the situation is. For text, you've got documents, you've got paragraphs, you've got sentences, words, characters. That much is obvious. In order to feed this stuff into a neural network of any kind, or into any kind of data science, you need to preprocess it: you've probably got to sentence-split it, you've got to tokenise it, you've got to think about your vocabulary and what to do when there are exceptions. For instance, for the encoding: can you even open the file?

So this is a problem: is your whole pathway Unicode-clean? What character sets are you going to accept? At some point, are you transmitting it over something which will just throw away a lot of your data? Is "é" the same as "é"? Here's a French character which can be written in Unicode in quite a lot of ways; the question is, if I have "Fabergé" written one way, is that the same as a "Fabergé" egg written the other way? And that's a big question for your dictionary: how flexible should it be? What about this one, which is a Japanese quote symbol: is it the same as this double quote, or this other double quote, or is there a difference, and do I care? What about this bullet point? This is all HTML on this page, so this is actually very, very simple encoding; it can get very hairy. What about this one, which is a nice one: one of them is a ligature. It's not obvious what's going on here, but there's a special thing in typography called a ligature where the "f" and the "i" are actually connected together. It doesn't look as though this character set has made the distinction, but in a better character set those two characters are joined together into a single symbol which is different from "f" followed by "i". This is a problem in finance, because "finance" has one and "profit" has one; there are all sorts of problems like this across all domains.

Sentence splitting: it's pretty obvious when a sentence ends, because it will have a full stop at the end. Except that this example sentence has only one true full stop; the rest are an abbreviation, a decimal point, and an acronym. But there are nice libraries which can do this: NLTK doesn't have many uses, but it can do this quite well.

Tokenisation: we've decided we've got a sentence, and now I want to break it into pieces. It's a good idea to have just a single standard throughout your code base, otherwise you'll be fighting yourself over spaces and commas and stuff. There's the Penn Treebank tokenisation, which is nice and often used. But what do you do about Chinese text, where there are basically no spaces? What do you do about Japanese, which has a whole different idea about what punctuation symbols to use? Another thing which will hit you is that a lot of this research is very English-centric, and this is one of the reasons why an NLP-based startup here is a different animal from one based in the Valley: we actually have to care about these things already. And here's an example which isn't so easy: this is tokenised like the Penn Treebank, and it does have a space here, because that's the way it should be in a Penn Treebank tokenisation.
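A hedged sketch of those two preprocessing steps using NLTK; the example sentence is made up, and "punkt" is the sentence-splitting model mentioned later in the demo.

```python
import nltk
nltk.download("punkt")  # one-off download of the sentence-splitting model

text = "Dr. Smith paid $3.5m for I.B.M. shares. Then he sold them."
sentences = nltk.sent_tokenize(text)                 # 2 sentences, despite all the extra full stops
tokens = [nltk.word_tokenize(s) for s in sentences]  # Penn-Treebank-style tokenisation
print(sentences)
print(tokens)
```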
Vocabulary: suppose we've converted our sentence to tokens. What we do is build a dictionary and convert the tokens into numbers, because we obviously want to be in a numeric domain eventually. A very simple frequency analysis will tell us what the stop words are: extremely frequent words which have fairly little semantic value, like "of", "this", "the", "a". A lot of what I say is stop words. Then there are common, normal words, which will be the vast bulk of it; you can probably get up to something like 10,000 or 15,000 words which are just all common. Then you move into the domain of pretty rare words, which will be fairly infrequent: you can imagine "TensorFlow" will be very infrequent, just because it's a fairly new word. There will also be typos and junk; in any given text there will be junk. And then you want a special unknown symbol, often called UNK.

Understanding text: now we've got it in some kind of manageable format, basically a sequence of numbers. For English you probably want a vocabulary of about 100,000 words as your minimum sensible size; people go up to like 2 million, but then you're just grappling with web addresses and junk, and you've now promoted the junk to being officially a word. Two simple ways of representing this are bag of words and word embeddings.

Bag of words: you convert a sentence into a set of words; you basically throw away the ordering of the words, because it's irrelevant to you, and you do a simple statistical analysis: basically, how surprising is every word in this sentence compared to the rest of the documents? Doing it this way is what people have been doing for the last 30 years. It's surprisingly effective, and many people who claim to be doing natural language processing AI are only doing this, because it works well enough to fool you until you're familiar with it. But it's got no idea that "jumps" is a different word from "jump"; it has no idea that "jump" is similar to "spring", that "spring" is similar to "summer" or to "winter", that "winter" has something to do with jumpers, and that jumpers have something to do with jumps.

So another way to do it is to go for word embeddings. This was a major advance: word2vec, GloVe, these things came out since 2010, maybe 2012. The idea is that words which appear close together in text should have representations which are close. At the beginning we start off with just numbers; now we want to convert each word into a vector, so each word will have, say, a 300-dimensional vector. The way you generate this is to slide a window over your text, and everything within that window gets nudged towards everything else in the window. This nudging process is kind of like a back-propagation idea, and if you do this over enough text, and when I say text we're talking about a billion words, so Wikipedia would be a good starting place, and you do it for multiple iterations, these things converge into something which is surprisingly interesting. I'll show a little example of how these things self-organise into a very interesting space, whereby things which are similar to each other align, and things which don't have much to do with each other don't. There are also other things you can do with the geometry of this space, which is even more crazy, but doesn't seem particularly valuable at this point.
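Going back a step, here's a minimal sketch of the dictionary and bag-of-words ideas above: frequency counts, an UNK bucket for rare or unseen words, and a count vector that throws word order away. The tiny corpus and the rarity threshold are made up.

```python
from collections import Counter

corpus = [["the", "cat", "jumps"], ["the", "dog", "jumps"], ["a", "cat", "sleeps"]]
counts = Counter(token for sentence in corpus for token in sentence)

vocab = {"UNK": 0}                  # reserve an id for unknown / rare words
for token, count in counts.most_common():
    if count >= 2:                  # anything rarer collapses into UNK
        vocab[token] = len(vocab)

def bag_of_words(sentence):
    # Order is thrown away: "cat jumps the" gives exactly the same vector as "the cat jumps"
    vec = [0] * len(vocab)
    for token in sentence:
        vec[vocab.get(token, vocab["UNK"])] += 1
    return vec

print(vocab)
print(bag_of_words(["the", "cat", "jumps", "tensorflow"]))  # unseen word -> the UNK bucket
```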
So, text demo time. Let me go and... oh, that's me. The reason there's a picture of me there, apart from it being excellent, is that here is my GitHub repo, and in it there's a thing called deep-learning-workshop. That has all of this code: there are notebooks which you can download, and I've made it so that if you download this code you can run it, and it will pull in everything it needs. It's all right there, so there are no secrets here; you can run this on your own if you want to, and I encourage you to run it at home.

It also has some comments which I won't bother reading. Basically, here I'm pulling in (it's a long way away, I'm sorry) a thing called Punkt from NLTK, which enables you to do sentence splitting, and there are some examples of how it actually does that, and a bit which shows how it tokenises a little sentence. I've also got a corpus of Wikipedia right there: a very small corpus, just the first 100,000 sentences, I think. This notebook actually allows you to create a GloVe embedding yourself, but I'm going to skip that because of time. One of the things you'll learn is that from 100,000 sentences, even if you run this GloVe training for lots of iterations, and that takes maybe 30 seconds to 2 minutes, so it's not too bad, you'll get a very bad word embedding. A better way of getting a word embedding is just to download one off the shelf, and off the shelf means people have published these: it's a big download, but once you've got it, you've got a nice word embedding.

So I've downloaded a word embedding; I've just taken the first 100,000 words of a GloVe embedding, a 50-dimensional embedding, so this is actually a fairly small one. And I can ask, with a simple call, what are the most similar words to "king", and it's telling me "prince" and "queen" and "ii" (whatever that is) and "emperor". So it sees that "emperor" and these other things appear in the same contexts all the time. Equally, I can then test analogies, and this is the geometry. This immediately shows you that "man is to woman as king is to...": queen, daughter, prince, throne, and "queen" is the first selection, so it likes that. So this word embedding has captured not just similarity within language, it's actually captured something about relationships. The next one is "Paris is to France as Rome is to Italy", so it's actually captured geography, just from reading Wikipedia and doing a simple nudging across a window. "Kitten is to cat as puppy is to dog": something about animal relationships. And "understand is to understood as run is to ran", so it's actually picking up some kind of crazy grammatical thing, just by reading and reading and reading. And that's it.

Let me just check I'm running it. What I can do here is use a nice thing called TensorBoard. TensorBoard is part of TensorFlow; it lets you graph out your data, but it will also graph out your embeddings. So this is the word-embedding cloud formed by the vectors I've just loaded in, and I can look around in this cloud. It's kind of difficult to see what's going on, but if I type in, say, "king", and just show those points, here's where all the king-ish words are: you've got "queen" over here, "father", "imperial". So this is kind of interesting: it can get all this out of pure data. It doesn't actually know anything about the English language apart from where to split a sentence or whatever, and it can do this by reading Wikipedia, so you're getting all this stuff for free. So let's kill that, and I might actually kill this too. Okay, so that was the text demo.
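For anyone who wants to reproduce the analogy test outside the notebook, the trick is just vector arithmetic plus cosine similarity. A hedged sketch: the `vectors` dict and the helper names below are mine, not the notebook's or any particular library's, and `vectors` is assumed to map each word to its GloVe vector, loaded however you like.

```python
import numpy as np

def closest(vec, vectors, exclude=()):
    # Return the word whose vector has the highest cosine similarity to `vec`
    best_word, best_score = None, -1.0
    for word, v in vectors.items():
        if word in exclude:
            continue
        score = np.dot(vec, v) / (np.linalg.norm(vec) * np.linalg.norm(v))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def analogy(a, b, c, vectors):
    # b - a + c : take the "a -> b" direction and apply it starting from c
    target = vectors[b] - vectors[a] + vectors[c]
    return closest(target, vectors, exclude={a, b, c})

# e.g. analogy("man", "woman", "king", vectors) should come out as "queen",
# and analogy("paris", "france", "rome", vectors) as "italy", if the embedding is any good.
```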
What we've understood so far, and you can prove it to yourselves: we can tokenise stuff, and we can convert it to word embeddings, which is arguably a much better way to do it than TF-IDF now. So now each of my words corresponds to a 100-dimensional or 50-dimensional vector of numbers; I can take that as one word, then another word, and another, run a network of some kind across these numbers, and I'm entirely in a matrix-multiply kind of domain.

So let's try an application of this. The application here is recurrent neural networks for text. To build a quality natural language processing system, like I have been for these people over here, an essential component is named entity recognition, which I'll call NER just for elegance. This has to be flexible and trainable, partly because we're in a region where there's lots of new stuff coming, and we need to train it differently from the guys in the US, because there are quirks here. For instance, in NER terms Singapore may be fairly rational, but in Malaysia you've got the Dato' titles and so on, which are one whole thing; Chinese names can be rearranged in strange ways, and that's just for people. On the other hand, one thing which the American systems will be robust against is that Richard can be called Rick or Dick, or Charles can be called Charlie or Chuck or Chaz; there's a whole bunch of different things which Westerners do which aren't necessarily done here. But systems in papers, and the corpora, are typically oriented to the Western case.

So here's a quick example of what NER is. Take a sentence like "Jim Soon, soon after his graduation, became managing director of Lam Soon". You can see the problem here, right? We want to know that Jim Soon is a person and Lam Soon is an organisation. The key things you're watching for: "his graduation" kind of means there's probably a person coming up; the word "became" is maybe more of a person thing than a corporate thing; and "managing director of" is almost always followed by an organisation. Now, I may be managing director of my daughter, but apart from that, "managing director of" means a company. So there are signposts in here which you could get just from the data: if I lined up every sentence in a huge corpus containing "managing director of something", almost all of them would agree that the next thing is probably an organisation, except it could be "managing director of a company he founded", in which case you'll have to give it a pass.

So the question is: can we train an RNN (RNN being the generic term) to do NER? What we want to do is create, or probably just download, a word embedding; get an NER-annotated training set, either human-annotated, where someone's actually gone through and said this is a person, this is an organisation or a location or a date, or created in some way by automatic labelling, which is the basis of a paper; and then train an RNN on this data set and see whether we get any decent results.

This is going to be a demo. The human-annotated corpuses, or corpora, are difficult for me to distribute: I do this workshop thing where I hand out USB keys, but I can't do that with any official corpus, because these are expensive and difficult to license. So what I'm going to do here is kind of cheat. I've got 100,000 lines of Wikipedia, and I'm going to use NLTK to annotate it, as if NLTK were the truth, which it isn't, but I'll try. Then I'm going to train the RNN on the machine annotations, and just look at how well it performs compared to its trainer, which is NLTK.
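A hedged sketch of that "cheat": using NLTK's built-in taggers to generate training labels. The example sentence is made up, and the actual workshop notebook may lean on the part-of-speech tags rather than the chunker shown here; this just illustrates the idea of a machine teacher.

```python
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)  # models NLTK needs for tagging and chunking

sentence = "Dr. Andrews works at Redcat Labs in Singapore."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags
tree = nltk.ne_chunk(tagged)    # NLTK's named-entity chunker: this becomes the "teacher"

# Flatten the tree into one label per token (entity vs not), the form the RNN will be trained on
labels = []
for node in tree:
    if isinstance(node, nltk.Tree):  # a named-entity subtree, e.g. PERSON or ORGANIZATION
        labels.extend((word, "NE") for word, pos in node.leaves())
    else:
        word, pos = node
        labels.append((word, "O"))
print(labels)
```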
This seems like a fairly fair baseline, but in order to make it more interesting, I'm only going to let the RNN look at text in a single case: I'm going to transform all the text into upper case or lower case. The trick is that if this works, it will be great, because I know that NLTK will totally fail if the text is all lower case: NLTK relies heavily on whether a word is capitalised or not.

Here's a quick network picture, and this is something we can all understand now. I'm going to use a bidirectional GRU recurrent neural network. Here are my X's, but each X is going to be a vector from my embedding; they go up into two different networks, which are then combined, and this thing at the top is a label: is this a named entity or not? Going this way is the hidden state in the forwards direction, and this one is going in the backwards direction. This lets me say: if I know I'm starting a name, I can get to the end of it, and if I've found the end of a name, I can get back to the start. Basically, it's good to have both directions.

There's another notebook in the workshop which lets you do this, and I know I'm running low on time, but this is something I've almost got time for. Basically, this is a tagger. I'm going to pull in the same tools as before, pull in the corpus as before, and a reference tagger, which is the NLTK one. That can do some job of annotating parts of speech; let's see what the part-of-speech output looks like on a simple sentence. It identifies where the nouns are, and it does quite a good job of the actual part-of-speech thing. But I'm going to make it all lower case, and I'm just going to learn where the nouns are. I'll load in some GloVe and fix it up so that the word embedding is Keras-compatible. Keras, which is what I'm going to be using here, is a layer on top of TensorFlow, which means that instead of specifying every little matrix multiplication I'm going to do, I can just say GRU, or I can say Bidirectional, and it will do just the right thing to construct the whole TensorFlow graph. I think you're going to find that more of our stuff becomes Keras-focused, because it just makes things easier, and if you're starting out, Keras is probably the way to go. So this part is fixing the embedding up to be a Keras thing.

And here is the magic. This page takes the token input at the top, which is my sentence-length sequence, converts it using an embedding into a sequence of embeddings, then does a bidirectional GRU, concatenates the two directions, and at the end does a dense layer for the softmax. This defines the entire deep learning model, but you've seen how much extra stuff we've had to do because this is text: text is quite tricky. And Keras will helpfully tell us how many parameters are involved: we've got 5 million parameters in the embedding, while the rest of the model only takes about 30,000 parameters, so this is actually a kind of small model.
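A hedged sketch of the model just described: token ids, then embeddings, then a bidirectional GRU, then a per-token softmax over tag classes. The sizes and the number of tag classes are illustrative, and the real notebook loads the GloVe vectors into the Embedding layer rather than leaving it random as here.

```python
from keras.models import Sequential
from keras.layers import Embedding, GRU, Bidirectional, TimeDistributed, Dense

vocab_size, embedding_dim, n_tags = 100000, 50, 3  # made-up sizes

model = Sequential([
    Embedding(vocab_size, embedding_dim, trainable=False),  # ~5M parameters live here (the word vectors)
    Bidirectional(GRU(64, return_sequences=True)),          # forward and backward passes, concatenated
    TimeDistributed(Dense(n_tags, activation="softmax")),   # one tag prediction per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # shows the big frozen embedding vs the small trainable recurrent part
```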
And now we're going to train the model. Another feature of the workshop is that these things should train in five minutes or less, so I will now dance. This is going to do a thousand epochs; each epoch is, I think, 64 sentences, so it's going to read 64,000 sentences and run them through. We're now at 150 of a thousand, and it's going to finish. This is a nice feature of Keras: instead of writing big training loops, which is kind of crazy stuff, you get a nice progress bar, and it will also let you output graphs for TensorBoard; it has some very nice features. So we'll just let this run; we're at 300 now.

This is being trained on my laptop. My laptop's got an i5 CPU and 8 GB of memory; it's not a super-duper laptop. It does have a GPU graphics card, but it's basically switched off, because GPUs are like delicate flowers: when you update your machine, everything stops working. If you've got a desktop, or a server, it's maybe worth it; on this laptop I'm not going to touch the GPU. It does work, but it makes life very different.

[Question from the audience:] Is there a way to stop it early if the error reaches a given point? So, what you can do is add something called callbacks: you pass it an array of callbacks, and at the end of every epoch it will run your callbacks. One of the callbacks is keras.callbacks.EarlyStopping, so you can add an early-stopping criterion, and you can add different things to look at how you're doing, or to save my state, or whatever, so you can do all of that kind of thing. Also, as part of this, you can define which loss functions and which optimisation functions you use, so you may have something which also flags up whether it's going backwards on itself. There's a whole bunch of variability, but in Keras all of the defaults are good: basically, they've read all the papers, so if you just put in the standard stuff you're probably about middle of the road, whereas in raw TensorFlow you may forget a parameter and find you're way out in left field, when everyone else has decided that, say, dropout should be a half. Anyway.
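A hedged sketch of that callback idea: stop when the validation loss stops improving, keep checkpoints, and log for TensorBoard. The patience value, filenames, and the commented-out fit call are illustrative, not from the notebook.

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

callbacks = [
    EarlyStopping(monitor="val_loss", patience=3),           # stop after 3 epochs with no improvement
    ModelCheckpoint("ner_weights.h5", save_best_only=True),  # keep the best weights seen so far
    TensorBoard(log_dir="./logs"),                           # graphs for TensorBoard, as mentioned
]
# model.fit(X_train, y_train, validation_split=0.1, epochs=1000, callbacks=callbacks)
```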
We're now at epoch 1000. We don't have to save these weights or load them; I've got a little function, and here we have some output. It's very difficult to see, I'm sorry, this screen is fighting me. I've got some sample sentences here, and there are, well, three sets, which if you look at them are slightly different. The first set is plain sentences: "Dr Andrews works at Redcat Labs", "Let's see what Part of Speech analysis looks like", "When are you off to New York, Chaitanya?" (one interesting thing about Indian names is that they're all kind of unique). What this is showing on each word is what NLTK says versus what the RNN says. NLTK thinks that "Dr Andrews" is a named entity, and this RNN also thinks that "Dr Andrews" is a named entity, and it also thinks "Redcat Labs" is one, so both of them agree here, which is good. NLTK thinks that "New York" is a named entity, which is right; the RNN just thinks that "York" is, so that's not quite right. "Chaitanya" they both pick up. So you can see that one has learnt from the other: you wouldn't expect the RNN to be better than the NLTK one, because its only teacher has been NLTK. But if you now go to the set where it's using proper case, where every word starts with a capital letter, "Dr Andrews Works At Redcat Labs...", NLTK thinks that the name is "Dr Andrews Works"; "Redcat Labs" is fine; and it thinks that "Speech Analysis" is a name. So NLTK has all of these problems, whereas the RNN, because it did all its learning on the lower-case version, sees no difference between these sentences. So, provably, it learnt to be almost as good as its teacher, but it's actually superior in many ways, in that you can apply it to text on which the teacher totally fails. There is a section that looks at the statistics, but it's hardly worth going through, because the output is just so much better.

So, to wrap up: text processing is messy; embeddings are magic; RNNs can be applied to lots of things, and text is what we're focusing on now; and having a GPU is very helpful, because I just did this on 64,000 sentences, but our typical training run would be 100 million sentences, which is kind of an 8-hour, overnight job on a GPU. You've also seen the workshop repo; please add a star, I like that. I will take questions, but before that, there's a feedback form that Google would love us to fill in, at bit.ly slash tf-sg. If you fill it in before the end there will be prizes; if you fill it in after the end, no prize for you. So there we go, and I will take questions. I will not accept friend requests, but I will accept LinkedIn. Okay, Karthik, do you want to...? I'll stand here; he'll do the next one.