We are running a few minutes late, so we should get started. It's my pleasure to introduce Douglas Bagnall, talking about recurrent neural networks. Enjoy the talk. Thank you. That's enough introduction, so I'll just go to the first demo, which is using a recurrent neural network to read the kernel git logs and generate new patches. It's using a modified form of the git log, slightly different from what you get from git log -p. You get the commit line, which I've taken off, because the commit hash is a random string and a recurrent neural network can't predict that. At the moment it has just started reading, and it's predicting that this would be a good patch; it will gradually get better over time. I've taken out the merge messages, and I've taken out the author and the date, because if it generates a good patch, those would be a lie: the author would be me. For the same reason, I've also taken out the Signed-off-by and CC lines. So it ends up with something like this, and while you're listening, it's going to try to learn to generate patches. It probably won't succeed, because it's just a small net. Anyway, while it's getting better over here, I'll talk about what a recurrent neural network is. But first, what a neural network is, because that's the basis. The simple neuron that almost everyone uses looks like this. Numbers come in from somewhere, and you try to make some sense of them. They're multiplied by weights, they all get added together, and then they get put through a nonlinear function, which does some magic. The bias is another number that comes in. In practical terms, it's easy to think of the bias as just another weight whose input is always 1: then you just put it on the end of your array.
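The neuron described here, inputs times weights, summed with a bias, then put through a nonlinearity, can be sketched in a few lines. This is a generic illustration, not the speaker's code, and the function names are made up:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, plus bias,
    then a nonlinear 'squashing' function (tanh here)."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)

def neuron_folded(inputs, weights):
    """The same neuron with the bias folded in as an extra weight
    whose input is always 1, appended to the end of the arrays."""
    return math.tanh(sum(x * w for x, w in zip(inputs + [1.0], weights)))
```

The second form shows the trick mentioned above: append a 1 to the inputs and the bias becomes just the last weight in the array.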
So if you had a neuron that was trying to decide whether something was a cat, and the three inputs were its fluffiness, how dark it was, and whether it was barking: being fluffy would probably have a slightly positive weight, but it's not definitive. How dark it is: there are maybe more black cats than white cats, but there are a lot of ginger cats, so it has almost no meaning. And if it's barking, it's probably not a cat. So if these were your features, those might be the weights you'd use to decide whether it was a cat. That's all a neuron is: adding things up and then mangling them at the end, which I'll get to. Now if you have the same features and two neurons, and one of them is trying to decide whether it's a sea lion, then the barking is a positive thing there. So if you compare the outputs for cat and sea lion: if it's barking, it's definitely a sea lion, definitely not a cat. But if it's not barking, then it comes down to the other features, and it's hard to be sure, because they're both neither black nor white. In a neural network, you just have lots of these neurons, and typically the inputs, how fluffy it is, how dark it is, and so on, come from a previous layer. So this is showing another layer: the neurons down here are working out the features that go into this layer, and you just stack these layers up. The conventional way to draw this is with the inputs at the bottom and the output at the top. The good thing about that, I think, is you can say the top ones deal with high-level features and the bottom ones with low-level features, and that makes sense, though it goes against the flow of text and everything. And I'm going to start using the diagrams towards the right-hand side, because drawing all the little bits is hard, and they're getting more abstract.
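As a toy version of the cat versus sea-lion comparison, with invented weights that follow the reasoning above (fluffiness mildly positive for cat, darkness near zero, barking strongly negative for cat and positive for sea lion); the talk doesn't give exact numbers, so these are illustrative only:

```python
import math

def score(features, weights, bias):
    """A single neuron's output for one candidate class."""
    return math.tanh(bias + sum(x * w for x, w in zip(features, weights)))

# features: [fluffiness, darkness, barking], each roughly in [0, 1]
cat_w      = [0.5, 0.0, -2.0]   # barking is strong evidence against cat
sea_lion_w = [-0.5, 0.0, 2.0]   # barking is strong evidence for sea lion

barking_thing = [0.1, 0.5, 1.0]
is_sea_lion = score(barking_thing, sea_lion_w, 0.0) > score(barking_thing, cat_w, 0.0)
```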
So in the old days, like five or ten years ago, "neural network" usually meant what was called a multi-layer perceptron, which meant a two-layer perceptron; in neural network language, "multi" means two. You can count it as three if you count the input layer, or as one if you only count the hidden layers. The input layer is just an array of numbers, so you don't really need to count it. And a deep neural network, which is all the rage these days, just has more layers in the middle. If you've got more than two layers, it's deep; if you've got two layers, it's multi. I'll just have a look and see how that's going. Hang on. That's what I wanted to do. Yes, it's a bit bigger and easier to see. So now it's trying to generate patches, and all the lines with a minus at the beginning look like good things, but where it's got a plus, it's not doing so well. A recurrent neural network is a neural network where the outputs of the hidden layer from the previous time step come around and go in as extra inputs. I won't try to explain that picture because it's impossible; I'll just go to this one, which is the same thing unfolded in time. So time flows this way, information flows that way. The hidden layers feed themselves forward that way, the inputs keep coming in, and the outputs keep coming out. That's the basic simple recurrent neural network, and from now on I'm going to use this kind of picture because it's easier to draw. The training procedure for a neural network is to start off with random weights, usually special random weights, but still random. You take plenty of examples and you just see what the neural network gives you, and then you adjust the weights slightly to make that better. So here it is with an ordinary neural network, or a deep one. The minus sign is showing that you take the difference between the two of them.
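The unfolded picture corresponds to a step function where the previous hidden state comes back in as extra inputs. Here is a minimal sketch of one time step of a simple recurrent layer, with invented weights; this is a generic illustration, not the speaker's C implementation:

```python
import math

def rnn_step(x, h_prev, W_in, W_rec, bias):
    """One time step of a simple recurrent layer: the previous
    hidden state h_prev is fed back in as extra inputs."""
    h = []
    for j in range(len(bias)):
        total = bias[j]
        total += sum(xi * W_in[j][i] for i, xi in enumerate(x))
        total += sum(hi * W_rec[j][i] for i, hi in enumerate(h_prev))
        h.append(math.tanh(total))
    return h

# two inputs, two hidden units; all weights are made up for illustration
W_in  = [[0.5, -0.3], [0.1, 0.8]]
W_rec = [[0.2, 0.0], [0.0, 0.2]]
bias  = [0.0, 0.0]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):   # "unfolding in time" over two steps
    h = rnn_step(x, h, W_in, W_rec, bias)
```

Note that the same W_in and W_rec are used at every step; that single shared weight array is what makes training a recurrent net different, as described next.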
And then you work out how to change the weights to make that difference smaller, to make it closer to the right answer. Those areas poking into the blue lines represent the weight arrays. With a recurrent neural network, the current answer is based on all the inputs right back to the beginning of time, so you have to change the weights all the way back to the beginning of time, because it's just the same weight array. There's only one set of weights, those purple ones; all those lines actually represent the same weight array. So you work out what you need to change at each step, then add them all together. And if you go back to the beginning, that can be a long, long time, so usually people truncate it. In this case, it's going back 30 generations, so it's learning from 30 characters before. There's something I forgot to say about this one: it's reading the characters one at a time and predicting the next character, and that's what it's learning to do. These numbers here are measures of entropy. The V stands for validation entropy, and the T is the training entropy. The T is on the stuff it's reading and training on, and the V is on things it has never seen, which it only sees when it's being tested, which gives you a more valid view. So what it's doing is reading the text, guessing what the next character is, and then seeing what the next character actually is. The difference between what it guesses and what it sees is the error, and it has to learn to minimize that. OK. So now I'll talk about this nonlinear function that you have at the end of your neuron. If you don't have it there, then all those layers, I won't go into the maths, collapse into one, because you're multiplying by a whole series of matrices, and if there's no nonlinearity, that's just like multiplying by one matrix. If you have a nonlinearity, then it actually has to do something more complex, and it can't collapse.
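The V and T numbers measure exactly this: the average number of bits by which the model's guess misses the actual next character, i.e. the cross entropy. A toy calculation, assuming a made-up model that gives explicit per-character probabilities:

```python
import math

def cross_entropy_bits(predictions, text):
    """Average -log2 p(actual next char), in bits per character.
    `predictions` is one dict of char -> probability per position."""
    total = 0.0
    for probs, actual in zip(predictions, text):
        total += -math.log2(probs[actual])
    return total / len(text)

# A toy model that always guesses 50/50 between 'a' and 'b':
guesses = [{'a': 0.5, 'b': 0.5}] * 4
entropy = cross_entropy_bits(guesses, 'abab')   # 1.0 bit per character
```

A perfect model would score 0 bits; the training procedure nudges the weights to push this number down, which is what the falling T and V figures in the demo show.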
And one of the things you do when you're training is multiply the training signal by the slope of the nonlinear function at the value you got. So traditionally, these top ones were used, the logistic and the hyperbolic tangent, and they have a nice feature where they squash numbers down. In a recurrent neural network, because you're multiplying by the same matrix over and over and over, it's like multiplying by a number repeatedly. If you multiply by a number slightly bigger than one, raised to the power of a zillion, it overflows your floating point. So if the matrix works out to be amplifying the signal over and over and over again, it blows up. If it's just reducing it slightly, that works well, because new signal keeps coming in from the inputs. For the odd times when it goes out of control and starts to blow up, these squashing functions help with that. But it has been found in deep neural network research that the simple one here, called ReLU, or rectified linear unit, works much better. It trains better and it's faster, because you're multiplying by the slope of it, which is always one when it's going up like that. So you either multiply by one, which is doing nothing, or by zero, and multiplying by zero is even simpler than doing nothing, because you don't even need to work out what you were going to multiply, so you can skip that calculation. And I've started using this rectified square root. I've actually got the wrong calculation written there, never mind. So, how I came across these things: I had to do an artwork in an art gallery in Dunedin, in the South Island, and I promised them that I'd make something that would listen to what people were saying and modify the video in response to what they were talking about. If you saw me talking in Canberra in 2013,
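For reference, here are the nonlinearities being discussed: the squashing logistic function, and the ReLU, whose training slope is simply 0 or 1. These are the standard textbook definitions, not the speaker's implementation:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes any number into (0, 1)

def relu(x):
    return x if x > 0.0 else 0.0        # rectified linear unit

def relu_slope(x):
    # The slope used in training is just 0 or 1: multiply by 1
    # (do nothing) or by 0 (skip the calculation entirely).
    return 1.0 if x > 0.0 else 0.0

# Why the squashers matter for recurrent nets: repeatedly amplifying
# a signal blows up, but tanh keeps it bounded.
x = 1.0
for _ in range(100):
    x = math.tanh(1.1 * x)   # stays finite, unlike plain 1.1 * x
```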
I was talking about speech recognition and all kinds of things like that, trying to get a New Zealand accent recognized by a speech recognizer. I couldn't do it, and I couldn't do it, and I was running out of time to do this show. But along the way I read that Tomas Mikolov, I'm probably saying that wrong, had used a recurrent neural network to make a language model, which is something used in speech recognition to model the flow of language. And that's just what this thing here is doing: it's modelling the flow of a patch. And then I thought I could make an artwork that used recurrent neural networks directly on the audio and video streams, which would have the same effect for the art gallery as the thing I'd said I'd do. So that's what I got into. But I'll go into language models first. So with a word-based language model, your inputs are words, each with an index into the input array, and the outputs are words too. You tell it the next word, and it guesses the word after that; then you tell it that word, and it keeps guessing words. The way it guesses is that it makes a probability distribution over what the next word might be, and then you can sample from it to generate nonsense like this. If you imagine you turn that bar chart into a pie graph and throw a dart at it, then you're picking a word according to that probability. Now, if you do a word-based language model, there are thousands of words, which makes your arrays very large and makes it slow to learn. And in the input, the blue line is a matrix of weights, and the input array has a single one with the rest zeros, so multiplying is the same as just feeding in that one row as the input, which amounts to assigning a vector to each word. Now, you might have heard of a program called word2vec, which Google wrote, which calculates a vector for each word.
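The "pie graph and darts" step is sampling from the predicted distribution. A minimal sketch, with a made-up three-word vocabulary:

```python
import random

def sample(pairs):
    """Throw a dart at the pie chart: pick an item with probability
    proportional to its share of the distribution."""
    r = random.random()
    cumulative = 0.0
    for item, p in pairs:
        cumulative += p
        if r < cumulative:
            return item
    return pairs[-1][0]   # guard against floating-point round-off

# a made-up distribution over the next word
next_word = sample([('the', 0.5), ('cat', 0.3), ('sat', 0.2)])
```

Generating text is just repeating this: sample a word, feed it back in as the next input, sample again.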
And then they can do arithmetic on those vectors and find relationships between the words. It's actually the same person: Mikolov's PhD was on these recurrent neural networks, and then he went on to do word2vec, so it's a related concept. So the network learns to output the right vectors, such that when they go into the recurrent neural network, they'll predict the next word. What I am doing here is just doing it character by character. So if I scroll up, I don't know if you can see: these are the characters it's using, and these ones here are the ones it's ignoring and just mapping onto one character. So this entropy number, the Shannon cross entropy, tells you how much information the model misses, and it should be going down. It is going down; it's just not going down very fast anymore. The estimates of the entropy of printed English go back to Shannon in 1950. He estimated there was between 0.6 and 1.3 bits of entropy per character in printed English, and actually nobody's made any great advance on that, except that I think it's now more generally accepted that the entropy of English depends on who's writing, who's reading, and the context. And I've used this for the next bit. If you're live tweeting, and you're a New Zealander, and you know what this is about, please stop for a bit and do something else with your hands, or be vague. If you don't know what I'm talking about, that's OK. So there was a recent scandal, where an attack blogger posted posts that attacked mainly health researchers, and some just seemingly random people, and no one could quite work out why, except that they thought he was nasty. But in fact, he was being paid by a PR man who was sending him these posts, with the names changed to attack various people, and who was in turn being paid by tobacco and alcohol companies, those kinds of companies, to attack people who were trying to regulate tobacco and alcohol, or were just studying them.
And then through other channels, the same PR people would feed friendly stories. Nobody knew this was happening, until somebody hacked the blogger's computer and emails were found from the PR man to the blogger, with a post that was to be posted the next day. But only a few of the emails were found, so there are more: these posts had been posted every few days for several years. So I tagged the text. Now, if you run this through a character-level language model, what it would focus on is the names and the topics, because those are good indicators. If you're trying to work out authorship, those would be the obvious signals: that it's talking about alcohol, or it's talking about tobacco. So I used part-of-speech tagging to replace the content words, the nouns, verbs, and adverbs. Well, not all of the content, but as many content words as I could find. I made a whitelist of various parts of speech, because words like "is" are verbs, and you don't want to replace those. So by combining those two things, and using Armenian characters as tokens, I made a stream of contentless text that represented these blogs. Then I trained a recurrent neural network on the blogger's posts, which are the grey dots, and another one on the PR man's posts, and subtracted the cross entropies between the two of them. Entropy is nice in that you can add it up and subtract it. In the difference, the ones lower down are better modelled by the model trained on the PR man, and the ones up the top are better modelled by the blogger, or at least by the unknown, unattributed posts, because nobody really knows who wrote the grey dots. And the other colours are other people who were also up to mischief. It's not very good, because there weren't enough posts to train the model on; there weren't enough PR man posts, so it's a very wonky model.
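The attribution trick, train one model per candidate author and subtract their cross entropies on the disputed text, can be sketched with a stand-in model. Here a unigram character model substitutes for the talk's recurrent networks, and the training texts are invented; only the scoring idea is the same:

```python
import math
from collections import Counter

def train_char_model(text):
    """Stand-in for the talk's RNNs: a unigram character model
    giving each character a probability."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def cross_entropy(model, text, floor=1e-6):
    """Bits per character the model needs to encode the text."""
    return sum(-math.log2(model.get(c, floor)) for c in text) / len(text)

blogger = train_char_model("aaab" * 50)   # invented training data
pr_man  = train_char_model("abbb" * 50)

disputed = "aaba"
# Negative means the disputed text is better modelled by the blogger.
difference = cross_entropy(blogger, disputed) - cross_entropy(pr_man, disputed)
```

Because entropies add and subtract cleanly, the sign of the difference gives a per-post vote for one author or the other, which is what the scatter of dots in the plot shows.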
It's worse than this one, which has only been training for 20 or 30 minutes. Actually, I'll stop doing that, and let's just see if it can make a patch. Well, that's not a very good patch. It doesn't really have a proper header, but you can see it's kind of getting the context: you take things out, you put things in. And when it's speaking English, it stays speaking English for a while, and it sometimes ends with a comment. You need to know C and the patch language to see that it's bad. If it was bigger, and it was trained for longer, it would do better, but it wouldn't really do all that well. They start doing things like balancing the parentheses, so it knows that it's opened one and needs to close it, that kind of thing. So that's one thing you can do with a recurrent neural network. This is another, and this is actually another way of looking at it; it wasn't done with a recurrent neural network, which kind of confirms it: the two models confirm each other even though they're using completely different systems. So if I can make a big enough model of patches... I'll move on from the next screen, I may not need it. And actually, while I was looking for that, I found this quote of the week, talking about software replacing driver developers. But anyway, I'm about 25 minutes in, is that right? For the show in Dunedin, I needed to make an artwork that made video that responded to audio. That's what I told them I'd do. So I made a recurrent neural network that watched the video and listened to the audio, and fed them both into a recurrent neural network. It was supposed to learn to take the previous frame and make a higher resolution version of it; it takes the previous frame and a surrounding set. The reason it does that is that I applied it recursively, so as well as being recurrent, it was recursive. It makes an overall picture that predicts the high resolution version, and that is used to predict the next level up.
One of the reasons I didn't just take the whole frame at high resolution and put out another whole frame is that it would have to be too big. I'll do the demo of that one. So this is watching a squashed Louis Theroux video, starting to learn how the video moves, and it's trying to produce video that moves in the same way. Now, it ran for four months, and it didn't get better than this. That was with the code I had back then, because I had five minutes to finish it; if it ran for four months now, it would get better than this. I had to put in all kinds of checks to make sure that the numbers never got to infinity: if you're dealing with matrices and multiplying them by themselves, and one number touches infinity, then your whole thing turns to infinity. And it was screwed to the ceiling, and I was in a different city, so I couldn't go back and check on it; I just had to leave it. So that's that one. Then I did another one. You've heard of a cellular automaton, like the Game of Life. This one watches the video, and it learns how each pixel changes in relation to its neighbours over time. Right at the beginning it's actually not very much like the real thing, but if you run it for long enough, it starts making things that look sort of like a Louis Theroux video in the fine detail. But it doesn't have the overview, which the other one had because of its recursive character, where the whole thing sort of looked realistic if it kept going. Another thing I've done some of is audio classification. So this is a GStreamer plugin; those last two were GStreamer plugins too. To train it up, I feed it a whole lot of audio files and tell it what class the audio belongs to at each point in time, 200 files at a time. The reason I did that many at a time is that ordinary neural network training uses stochastic gradient descent, where stochastic just means you throw the examples in in a random order.
With a recurrent neural network, you can't really throw the examples in in a random order, because the order is the whole point. So unless your class is changing at every point, which is the case in the character prediction thing, where the target doesn't go A-A-A-A but changes constantly, you have a problem: if you just train it on one file, it'll get trained all the way over to one class, and then the next file will train it all the way back the other way, and it won't learn to do anything good. If you train it on 500 files at a time, it's being pulled in all different directions at once. So that's why I did it that way, and it works quite well. Getting the answers out is one file at a time, and this is the other demo. Now, this one is listening to the radio. There are radio stations that are funded to speak a certain number of hours of Maori language per day, and if they're not speaking that much, people want to cut the funding off. And the people who actually have the job of listening to the radio don't want it anymore; they want to be concentrating on the quality, not on counting minutes. So we made this machine. Down the bottom: this is music, Maori on that side, English on that side. This is one of the notices, I don't know: "The party branch annual general meeting is being held on the 10th of September", or the 6th of September, I don't know, it sounds like I'm Australian, at 6 PM, at 38 Richmond Street in Maraenui, and you can contact one of the numbers. So you can see it's not perfect, but it does well enough. We told them it could do 95%, so half an hour a day or something, an hour a day, might be misclassified, but if it gets close, they can examine it properly. This slide is about how I preprocessed the audio. How much time do I have? All right, I'll just skip that and stick to the overview.
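The many-files-at-once training can be pictured as interleaving the sequences, so that every weight update sees all classes at once rather than drifting toward whichever file came last. A sketch; the function name and data are made up:

```python
def interleave(sequences):
    """Yield one batch per time step, taking the t-th element of
    every sequence in parallel. Each update then mixes examples
    from all the classes at once."""
    length = min(len(s) for s in sequences)
    for t in range(length):
        yield [s[t] for s in sequences]

# three invented single-class "files"
batches = list(interleave(["aaaa", "bbbb", "cccc"]))
# every batch contains one step from each class
```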
So: just taking little windows of the audio and putting them through frequency bins. Now, as well as people, we also identified birds. For the humans, you feed it through a filter like that to get the right shape. Then, sorry, here: audio classification lag. Now, if you ask the recurrent neural network to say what language is being spoken at the very instant that it hears it, it's got no time to think about the context. So you actually give it a lag of a fraction of a second, maybe half a second, so that it has a chance to consider what it's heard and take into account what followed it. So for this purple bit of waveform, it's thinking about it over there, and what it's got in its mind stretches back, possibly forever, but it's maybe one bit of information from a minute ago and a lot of bits from recently. And the information from over here, it's just storing up for when it gets to it. Other people do speech recognition, and they're getting quite good. People are using recurrent neural networks to label images. They use GPU clusters and deep recurrent neural networks, which just means having more layers above or below the recurrent layer, or both, or maybe having more than one recurrent layer. And they use bidirectional ones, where if you have an audio file, you can go backwards in time and forwards in time at the same time. If you're doing continuous speech recognition, you can't, but they're always in labs, so they keep doing it. And they use high-level libraries. And they publish on arXiv.org: as a field, they research so fast that they can't ever wait for the journals, so everything you need to read about is there. And this is my code, which I don't think anyone else is using, which suits me because I can keep changing it. It's C and a little bit of Python; these three are plugins, and it's LGPL. That's about all. Thank you. That was fascinating. Thank you. We have time for maybe one quick question.
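The classification lag amounts to pairing each label with the audio a fraction of a second later, so the network answers for a slightly earlier moment, after it has heard some of what followed. A sketch with invented frame data:

```python
def lag_targets(frames, labels, lag):
    """Pair each label with the frame `lag` steps later: the network
    is asked about a moment slightly in the past, once it has heard
    some of the following context."""
    return list(zip(frames[lag:], labels[:len(labels) - lag]))

# invented data: 4 audio frames and their true language labels
pairs = lag_targets([10, 20, 30, 40], ['mi', 'mi', 'en', 'en'], lag=1)
```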
Otherwise, you're welcome to speak to Douglas afterwards. So, this software is doing just what you wanted it to do, but do you have other packages you could recommend for people who are wanting to get started for the very first time? I haven't used any, but people do use Theano, and that uses the GPU. I don't have a GPU; I've got an Intel laptop from a few years ago, so there's no point for me. Or Torch, which uses Lua. Theano is Python, so the fast bits go on the GPU. Please move quietly if you're moving. We have one more question and then we have to move on. Go ahead. The speech recognition, or the Maori example you showed just then: I was just curious how many neurons you had for that example. Yeah, I think 299 or 399. I always do a multiple of four minus one, because GCC seems to go better if you do that. OK, we do have to wrap up now. That was very interesting. Thank you. Thank you. Thank you.