So, I'm talking about generating video with recurrent neural networks. This is something I did by accident a few years ago, and it worked. I used to be an artist, or at least I used to make a little bit of money being an artist. The way it worked is I'd propose some crazy artwork that involved computers and things, I'd get funding for it, and then I'd muck around doing whatever I wanted and try to make something that vaguely resembled what I'd proposed. This talk is about one of those cases: the last one, the one that completely ruined my finances and my credibility with the Arts Council in New Zealand. It also led to the job which has got me here.

What I proposed is that there'd be a computer that listened to people as they came towards the room the show was in, and it would use speech recognition to work out what they were talking about, and then it would generate video related to what they were talking about. Then they'd go into the room and see the video, and they wouldn't know the machine was being sneaky. They'd just think the work was really topical, that it was a really clever artist who knew what people were talking about, and they'd go away and never know. This was in 2011, I should say. I didn't know that a few years later people were going to actually install things like that in their own houses, but that's a different question.

So it looked like this: there's a microphone outside the room, a projection inside, and a TV tuner, because the way it was going to generate video related to the words is it was going to watch TV and read the subtitles (the subtitles are there at least sometimes), and it would associate the visual qualities of the video with the particular words that coincided with those pictures. Then when it came to generate new video, it would just try to make something that matched what it had learned about which words go with which video. I didn't know how that was going to work, and that was the exciting thing; I had no idea. The whole point was that I was going to spend my whole time fiddling around trying to generate video based on these words.

I was going to use PocketSphinx for the speech recognition, but what I found out is that New Zealand English doesn't work with open source speech recognition. If any of you can understand me, I can tell you you're doing better than open source speech recognition did. Actually, it's better now that Mozilla's made one; it works a lot better. And the video part, as I was saying, was going to use subtitles, but just as I was starting, they turned off the analogue broadcast in New Zealand. The analogue broadcast actually had digital text for subtitles, whereas on the digital broadcast the subtitles are an image, a bitmap that just goes on top, so I'd need to use OCR to read what the words were. Not impossible, but it was getting more complicated. And the video generation, I still didn't know how I was going to do that.

So anyway, investigating speech recognition, I was looking into what you need to do, and basically you need thousands of people to read sentences. You need hundreds of hours of speech with exact transcriptions.
You need sentences with an exact transcription of what was said. Any disfluency, any "ah"s, any kind of misstart in the sentence needs to be included, or else that sentence needs to be chucked out. And it's a hard problem: you need to be an expert in grammars and things, and I'm not that kind of person, or else you need to hire somebody who is.

But anyway, it was really fun. I was having a good time reading about how this was done, and I started reading about recurrent neural networks, which were then being used for language modeling, which is another part of speech recognition; I won't get into that. This slide I'm not going to explain very much; there are lots of tutorials and things about recurrent neural networks. Usually people use a thing called long short-term memory now, but in 2012 people weren't really using that, and I'm not using them at all. This guy, Tomáš Mikolov from the Czech Republic, used a recurrent neural network for language modeling. The architecture comes from the '80s, and it has just one hidden layer. You multiply things together, squash them a bit, and remember the state from each time step to the next. I'm going to skip over it; I can talk more about it later if I have time.

The simplified view looks like this: there's a state at each time step; you get a new input, it mixes the input into the state, it emits an output, and it keeps going on like that.

Anyway, I was thinking about these things, and the art gallery was asking when this artwork was going to turn up, and I had no speech recognition, no subtitled video, and no algorithm. But I was interested in recurrent neural networks. So I came up with this idea, where you've got your video coming in, and you feed the video into the recurrent neural network along with the audio. Because I'd promised there'd be some association between audio and video and stuff, I needed the audio in there. It gets frames in order, and it's meant to predict the next frame from the current one and the state, which is a summary of the entire history up to that point, or in practice the last few frames. So it's meant to get a sense of how the video moves. The way you train it is you give it the input, you see what it creates, and you punish it for how wrong it is, for how much it differs from the actual next frame. That's machine learning.

So that's the training process. Then, when it's in the gallery, sound from the microphone comes in and mixes with the previous frame that the network generated. So it's kind of feeding on its own output, but with the sound from the room steering what it's doing.

The first problem with that is that video frames, when you blow them up, have millions of pixels, across three colour planes and so on, and to train something with millions of inputs and millions of outputs you need millions of examples. If you imagine pixel number 100,002 being trained individually, you need gazillions of instances, because otherwise the network can't really learn the relationships between the pixels particularly well.
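To make that concrete, here's a minimal sketch in C of the kind of simple recurrent network described above: mix the new input into the hidden state, squash it, remember it for the next step, and emit an output. The sizes, names, and the tanh squashing are illustrative choices, not the actual code from the talk.

```c
/* A minimal Elman-style recurrent network step.  Sizes are made up:
 * the input might be audio features plus a small video patch, and the
 * output the predicted next patch. */
#include <math.h>

#define N_IN  16
#define N_HID 32
#define N_OUT 12   /* e.g. a predicted 4x3 patch */

typedef struct {
    float w_in[N_HID][N_IN];   /* input -> hidden weights */
    float w_rec[N_HID][N_HID]; /* hidden -> hidden: the recurrence */
    float w_out[N_OUT][N_HID]; /* hidden -> output weights */
    float state[N_HID];        /* carried from each time step to the next */
} rnn;

static void rnn_step(rnn *net, const float *in, float *out)
{
    float new_state[N_HID];
    for (int i = 0; i < N_HID; i++) {
        float a = 0.0f;
        for (int j = 0; j < N_IN; j++)
            a += net->w_in[i][j] * in[j];       /* mix in the new input */
        for (int j = 0; j < N_HID; j++)
            a += net->w_rec[i][j] * net->state[j]; /* and the old state */
        new_state[i] = tanhf(a);                /* "squash it a bit" */
    }
    for (int i = 0; i < N_HID; i++)
        net->state[i] = new_state[i];           /* remember for next step */
    for (int i = 0; i < N_OUT; i++) {
        float a = 0.0f;
        for (int j = 0; j < N_HID; j++)
            a += net->w_out[i][j] * net->state[j];
        out[i] = a;                             /* the predicted next frame */
    }
}
```

Training, as described, would compare `out` against the actual next frame and push the error back through the weights; in the gallery, `out` would be fed back in as the next input, mixed with the live audio features.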
So this was the problem, because I had a few weeks, I didn't have any training data, I didn't have a GPU of any account, and it wouldn't have run on the Core 2 machine that I had anyway.

What I ended up doing was making a dodgy assumption: that video is self-similar, that if you look at a tiny close-up patch of a video, it moves in a similar way to the entire scene. Which is maybe true of the natural world, but it's not really true of video, because video is made by people who, overall, are trying to show you something somebody wants you to see, whereas little parts of it aren't like that.

The trick I had is that the network would have a 4x3 input and an 8x6 output, and it would learn from little spots all over the video at different scales. It would kind of jump about, look at one part for a while, then jump to another, so that in real time, watching one video, it could learn 20 different things because it was looking at 20 different points. So it was like multiplying the corpus of video that I had.

And then, to make it more interesting... I mean, I wouldn't mind an 8x6 video, I quite like that kind of look, but I just thought it would be more interesting to try to add more detail. So it recursively goes in. Sorry, I really want to jump over this, but I know you want to hear it. Around the first 4x3 there's another layer of nodes, which kind of smears out the edges so there's extra context to fill the detail in from; I didn't really explain that well. It generates video from that, then it goes into each quarter of that, and it keeps going down, recursively adding detail using the outer layer from the previous level. On this laptop I can go seven levels deep; when I show it to you it'll be five.

This slide is about taking features out of the audio, but you don't really care about that; this isn't an audio conference.

And then there's the code. I use GStreamer. Basically it's a plugin that is a video filter and an audio filter at the same time. For training, you feed in a video source that carries the audio and the video, and you put the output into what they call a fakesink, I think: you just chuck away what comes out, and it just learns state. And then when it comes to the show, you do this. I'm not sure that diagram's actually accurate. Anyway.

So the original picture looked like that. Now, there was meant to be no delay between what you said and what it did; it was meant to be a work where you'd say something and the video would change straight away. So the microphone had to come into the room itself, otherwise you'd just get the effect of whatever someone else in the building was saying.

But then I got to the town where the show was meant to be, and the art gallery there is a huge concrete building. There was no TV reception in there, so the TV part came off. And the other problem: I didn't bring a microphone. I think I thought that they would provide a microphone. So it ended up more like this.
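Here's a rough sketch of that recursive detail pass, as I understand the description: predict a coarse 8x6 block from a 4x3 summary of a region, paint it back, then descend into each quarter of the region at half the scale. The greyscale float image format, the blending, and all the names are assumptions; `net_predict` stands in for the trained recurrent network.

```c
/* Placeholder for the RNN: here it just doubles the 4x3 input up to
 * 8x6 so the sketch runs; the real network would hallucinate detail. */
static void net_predict(const float in[3][4], float out[6][8])
{
    for (int r = 0; r < 6; r++)
        for (int c = 0; c < 8; c++)
            out[r][c] = in[r / 2][c / 2];
}

/* Average the (8*scale x 6*scale) region at (x, y) down to 4x3. */
static void sample_4x3(const float *img, int w, int x, int y,
                       int scale, float out[3][4])
{
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int j = 0; j < 2 * scale; j++)
                for (int i = 0; i < 2 * scale; i++)
                    sum += img[(y + r * 2 * scale + j) * w
                               + x + c * 2 * scale + i];
            out[r][c] = sum / (4.0f * scale * scale);
        }
}

/* Paint the 8x6 prediction back over the same region. */
static void blend_8x6(float *img, int w, int x, int y,
                      int scale, const float in[6][8])
{
    for (int r = 0; r < 6 * scale; r++)
        for (int c = 0; c < 8 * scale; c++)
            img[(y + r) * w + x + c] = 0.5f * img[(y + r) * w + x + c]
                                     + 0.5f * in[r / scale][c / scale];
}

static void refine(float *img, int w, int x, int y, int scale, int depth)
{
    float in[3][4], out[6][8];
    sample_4x3(img, w, x, y, scale, in);
    net_predict(in, out);
    blend_8x6(img, w, x, y, scale, out);
    if (depth == 0 || scale < 2)
        return;
    /* descend into the four quarters at half the scale */
    refine(img, w, x,             y,             scale / 2, depth - 1);
    refine(img, w, x + 4 * scale, y,             scale / 2, depth - 1);
    refine(img, w, x,             y + 3 * scale, scale / 2, depth - 1);
    refine(img, w, x + 4 * scale, y + 3 * scale, scale / 2, depth - 1);
}
```

A top-level call like `refine(img, 512, 0, 0, 64, 5)` would cover a 512x384 image and recurse five levels down, matching the "five levels on this laptop" demo.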
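And the training pipeline, a video source with both streams going through the learning element and into a fakesink, could look roughly like this from C. The `recurvideo` element name and its `training` property are stand-ins for the actual plugin, which isn't named in the talk; the rest are standard GStreamer elements.

```c
/* Sketch of the "train on a video, throw away the output" pipeline. */
#include <gst/gst.h>

int main(int argc, char **argv)
{
    gst_init(&argc, &argv);
    GError *err = NULL;
    GstElement *pipe = gst_parse_launch(
        "filesrc location=tv-capture.mkv ! decodebin name=d "
        /* video path through the (hypothetical) learning element */
        "d. ! videoconvert ! recurvideo name=net training=true ! fakesink "
        /* audio path into the same element */
        "d. ! audioconvert ! net.",
        &err);
    if (!pipe) {
        g_printerr("parse error: %s\n", err->message);
        return 1;
    }
    gst_element_set_state(pipe, GST_STATE_PLAYING);
    /* block until the file ends or something goes wrong */
    GstBus *bus = gst_element_get_bus(pipe);
    gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                               GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
    gst_element_set_state(pipe, GST_STATE_NULL);
    return 0;
}
```

For the show, the same element would presumably run with training off, a microphone source instead of the file's audio, and a real video sink instead of fakesink.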
So I'd done a whole lot of work synchronising the video and audio, and I'd done a whole lot of work making a slide about how I did that, but I didn't include that slide either. So anyway, it ended up just being the video generation: this thing watching a video, trying to learn the qualities of the video, and generating video in the same style. In a few minutes I'm going to show it to you. It's not going to learn anything in that time, but that's alright, because it wouldn't actually learn that much even in the long term. It might still do something interesting.

Anyway, the show went on for a few months in my hometown, and it turned out after a while that the hard drive died, so the picture actually froze. But it looked alright, a picture projected on a board, and people didn't mind.

Then a little while later I was asked to make another work in the gallery, which was good. I'd been using the recurrent neural networks, and I had the idea of using them to drive a cellular automaton. I'm not going to explain cellular automata; you've probably seen Conway's Game of Life, little bits flying around your screen, a sort of biological-looking thing. There are other ones that people use. But in mine, each pixel has got a tiny recurrent neural network: it's got its own tiny internal state, and it looks at its neighbours and generates the next step. It can learn from watching video, and it can do something interesting. It looks more like this. If I left it running longer it would do more varied things, but at the moment it's really just churning around getting started.

So this is another way you can do it. Now, if I was going to do this now (this was all five or six years ago), I would have a different generator at the top level generating the scene, and things like this filling out the details. And I'd use one-bit neural networks, which is the thing I'm interested in at the moment. But actually, what I'd probably do is muck around until the last minute again and end up with something the same. Or different, but lots of fun and sort of a failure.

So the code is C, with GStreamer. It's quite fast, because the activation function I use is cheap (it doesn't need exponentials, for instance), and a lot of the time you'd otherwise be multiplying things by zero or one, and there are shortcuts for that. And I made everything completely aligned for SSE, as you did in those days, just using these macros. It's pretty simple stuff: you tell GCC things it can rely on, so it never has to stop and check and slow down, and it just streams through. And that can generate video on a crappy old computer.

How much time have I got now? OK. Anyway, I ran out of money doing all this, and I got a job using my recurrent neural network from this work: identifying birds at night, kiwi and morepork and other birds, and then identifying the language people were speaking on the radio. Some radio stations are funded to speak Māori, and it seemed some of them weren't, and if they're not speaking enough, their funding gets capped. So the recurrent neural network that ran these artworks is actually still being used to identify what languages are being spoken.
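A sketch of that per-pixel idea, under my own assumptions about the details: every cell runs the same tiny shared-weight recurrent net, keeps a small hidden state, reads its 3x3 neighbourhood from the previous frame, and emits its next value.

```c
/* Cellular automaton where each cell is a tiny recurrent network.
 * Sizes and names are illustrative, not the actual artwork's. */
#include <math.h>

#define STATE 4       /* tiny per-cell hidden state */
#define NEIGHBOURS 9  /* 3x3 neighbourhood, including self */

typedef struct {
    float w_in[STATE][NEIGHBOURS]; /* shared weights: one net, many cells */
    float w_rec[STATE][STATE];
    float w_out[STATE];
} cell_net;

/* One update of the whole grid: prev/next are w*h pixel values,
 * states is w*h*STATE floats of per-cell memory. */
static void ca_step(const cell_net *net, const float *prev, float *next,
                    float *states, int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            float in[NEIGHBOURS];
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)      /* gather neighbours */
                for (int dx = -1; dx <= 1; dx++)
                    in[n++] = prev[(y + dy) * w + (x + dx)];
            float *s = &states[(y * w + x) * STATE];
            float new_s[STATE], o = 0.0f;
            for (int i = 0; i < STATE; i++) {
                float a = 0.0f;
                for (int j = 0; j < NEIGHBOURS; j++)
                    a += net->w_in[i][j] * in[j];
                for (int j = 0; j < STATE; j++)
                    a += net->w_rec[i][j] * s[j];
                new_s[i] = tanhf(a);
            }
            for (int i = 0; i < STATE; i++) {
                s[i] = new_s[i];                  /* update cell memory */
                o += net->w_out[i] * new_s[i];
            }
            next[y * w + x] = o;                  /* the next pixel */
        }
    }
}
```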
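As for the speed tricks just mentioned, the talk's description is vague, but the general shape of that kind of C is well known: a cheap activation that's just a comparison rather than an exponential, and macros that promise GCC the buffers are aligned and don't overlap, so the hot loop vectorises without run-time checks. The macro name and function are mine, not the actual code's.

```c
/* Hedged sketch of the micro-optimisation style described above. */
#define ALIGNED(p) __builtin_assume_aligned((p), 16)

static inline float relu(float x)
{
    return x > 0.0f ? x : 0.0f;  /* a max, not a sigmoid's expensive exp() */
}

/* y = relu(W . x), W row-major; all buffers 16-byte aligned for SSE. */
static void dense_relu(const float *restrict weights,
                       const float *restrict x_in,
                       float *restrict y_out,
                       int n_in, int n_out)
{
    const float *w = ALIGNED(weights);
    const float *x = ALIGNED(x_in);
    float *y = ALIGNED(y_out);
    for (int i = 0; i < n_out; i++) {
        float a = 0.0f;
        for (int j = 0; j < n_in; j++)  /* GCC can vectorise this freely */
            a += w[i * n_in + j] * x[j];
        y[i] = relu(a);
    }
}
```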
And it's good for that, because I made it all fast and small, so it can do 1,500 radio stations on one computer. I also used it to identify anonymous authors. This is a screenshot from the radio thing: each colour band is a different language; some of them are music, some are Māori and some are English. And that's all I'm saying.