It's my pleasure to welcome you all to the third talk in the 2017 Gibbons Lecture Series. For those that don't know me, I'm Gill Dobbie, here in the Department of Computer Science at the University of Auckland. Our speaker today is Marcus Frean. Marcus is from Victoria University in Wellington. I wondered how long he'd been working in the neural net area, so of course I googled. His most cited paper is about a method for constructing and training feed-forward neural networks, and it was published in 1990. So while people think this is all new stuff, Marcus has been at the forefront. His research is driven by the challenge of making computers learn by themselves, and the continuing mystery of how brains achieve the same things. So hopefully you're going to touch on some of this this evening. Hopefully. We'll see.

Hi. Thanks for having me. It's great to be here. What a pleasure. So, "what's missing": that's a large country. It's a big place. And I've realised that I threw myself a hospital pass when Bob asked me, back in December or whenever you first approached me, what title I should give, and I said, oh, deep learning; let's say, "what's missing". OK, that was my first mistake, because obviously I can't answer that: if I knew what was missing, it wouldn't be missing any more. So I've done a whole Gödel thing straight off the bat. But I'm going to pick up on a couple of specific things that I feel are missing. I'm going to partially fill in one of them, and the others are just question marks to plant in your head. And I'll mention towards the end some other aspects that I think are just, I don't know, holes; they're really mostly holes in understanding in the current world of deep learning. So, what is this deep learning? Let's get right into it, eh? Deep learning and machine learning. If you can read that, it says "hype watch": big data is saturating on the hype curve, and the red guy coming up there is machine learning.
And I guess the question is: how big is this curve going to get? What's going to happen, not just in hype, but in terms of application, success, stuff? Is it going to plateau, or are we just in the early days? It's hard to know, because there's a lot going on. There are a lot of things behind this curve. There are some new insights, some wonderful ideas, and at the same time there are applications of those ideas that are very worthwhile doing, and therefore big corporates jumping into the space that used to be occupied by universities, and huge resources pouring in, and then more ideas flowing from that. So we're in that exponential phase there, and who knows where we're going. But I think part of the excitement is driven (let's get rid of that hand) by machines recently, and I'm talking about just the last couple of years, achieving near-human performance on certain selected tasks. And those tasks are spread across the spectrum, from image recognition to speech recognition, text translation, gameplay, dot, dot, dot; you'll have your own things you've read about in the press. So these are exciting, and they're also slightly frightening. I love this sentence from The Listener in April; I'll read it out: "It's about to get worse. The job-sucking behemoth of artificial intelligence stalks the horizon." And I think that's beautiful, you know: it's stalking the horizon, this silhouette of the job-sucking artificial intelligence. So that's scary, and there are aspects we should be thinking about; I think some things are a little scary. There's also tons of promise. There's enough promise that lots of academics want to jump in on the game. In particular, this phenomenon of deep learning, which is the subject of my talk, is the part of machine learning that's actually driving the red hype curve up. All these achievements that are close to human-level performance are deep learners.
And everyone's saying, oh yeah, of course, I've been doing deep learning forever. I don't know if you have colleagues like that. OK, so let's just get into this. What's a neural net? Deep learners: what it means is big neural nets. And probably, if you're in a talk like this, you know what a neural net is. But here's a neural net. What we have is a bunch of input features or values, so that forms a vector. Let's turn them all on. A vector of input values, and some target output; in this case there are two outputs. And in between we're going to have a bunch of neurons. What's a neuron? A little unit that takes a weighted sum of its predecessors in the previous layer, computes that weighted sum, and then pushes it through a very simple nonlinearity. So, if you like, let's make this thing go. Let's give it a problem, a hard problem, and make it go. OK, so what you see on the right there, in that little diagram, if you can see it, is the loss: the error of the network as it attempts to get to grips with this learning problem I've given it, which is actually a pretty tough one, to learn to classify the data points on the spiral. All right, so what's happening here is there's a learning algorithm which is changing the connections between all these neurons. That's all that's happening. The nonlinearity is staying the same. The inputs are coming from some training set of examples and outputs, an output pattern that goes with each input pattern, and it's simply learning a mapping from the input to the output. OK, with some success; it's still going. OK, fine. So that's what a neural network is. I should say there's another part of this red curve rising, which is a bunch of fantastic new tech being built, including TensorFlow, which is behind that demonstration there. But here's my more reduced picture. This is my shorthand for that last picture of what a neural network looks like, which I'll use for other bits of the talk.
So we have an input, that's a vector, that blue line, and an output, another blue line, a vector. I hope you can see that; apparently we can't dim the lights to half way, it doesn't work like that. But anyway: input through to output, and it's going through each of these... parallelogram, trapezoid, something; that four-sided thing is representing a weights matrix, all the weights from one layer to the next. And the sigma in a circle is representing that nonlinearity. So we have the weights matrix, and then the element-wise nonlinearity. And that's all a neural network is. And it's deep if you string a bunch of these together: it's deep if you have more than three layers. Then you're allowed to say "deep", I think; in the old days you could say that. And now we can train things that are hundreds of layers deep. OK, so how do we train this? We look at the error. Each input pattern has a desired output. You stick in the input, something else comes out, not the desired output, so there's an error. We take the difference between them, a loss function, compute the gradient of that, and push it back through the network. And that's basically just asking: how would I change some parameter inside this matrix, some weight value, such that the performance on the training set has fewer errors? Error back-propagation, it's often called. It's an old algorithm, from the 80s. OK, so that's a neural net; game over, really. OK, so here's an example of a neural net, a reasonably deep one. It takes the input. This blue vector here is an image. That could be just a huge vector, right? It's pretty big, hundreds and hundreds of elements. But an image is a vector. It's going through this neural net (don't worry about the details here), and out the far side is coming the output: in this case, labels, words, like car, truck, airplane, ship, horse.
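The pieces just described, a weights matrix followed by an element-wise nonlinearity, repeated layer by layer, can be written down in a few lines. This is only a toy sketch in NumPy, not the TensorFlow demo from the talk: the layer sizes and the tanh nonlinearity are arbitrary choices, and the "how would I change this weight?" quantity that back-propagation computes is approximated here by a finite difference rather than a backward pass.

```python
import numpy as np

def sigma(z):
    # Element-wise nonlinearity (here tanh).
    return np.tanh(z)

def forward(x, weights):
    # One pass through the net: weights matrix, then nonlinearity, per layer.
    a = x
    for W in weights:
        a = sigma(W @ a)
    return a

rng = np.random.default_rng(0)
layer_sizes = [2, 8, 8, 1]                      # input -> two hidden layers -> output
weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

x = np.array([0.5, -0.3])
y = forward(x, weights)                         # the network's output for this input

# "How would I change some weight so the error goes down?" -- the quantity
# back-propagation computes efficiently, here approximated by a finite difference:
target = np.array([0.2])
def loss(ws):
    return 0.5 * np.sum((forward(x, ws) - target) ** 2)

eps = 1e-6
base = loss(weights)
weights[0][0, 0] += eps
grad_00 = (loss(weights) - base) / eps          # d(loss)/d(one weight in layer 1)
weights[0][0, 0] -= eps
```

Back-propagation gets the same derivative for every weight in one backward sweep, which is what makes training nets with millions of weights feasible.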
And in this case, thankfully, it's figured out that this thing is more likely to be a car than a horse. So, full marks. And it did that by training on probably millions of pictures of cars. So all the weights in this big neural net have been trained by this algorithm. OK, that's deep learning. Awesome. That wasn't too hard, was it? I should tell you about a particular kind of neural network known as an autoencoder, which is being used in very interesting ways these days. With an autoencoder, what you do is take the input, push it through the neural net, and train it to reproduce the input at the output. So the target is just the input; hence "autoencoder". Why on earth would we want to do this? The identity map does this; it's not that hard to get the input from the input. But suppose we push it through a neural net where there's a layer that is very narrow, that little blue dot in the middle there. Sorry, I'm being very right-ist here with the pointer. If we go through that bottleneck layer, then what's the neural network trying to do? It has to push this picture of a car through the network and recover the car picture at the far side, but along the way it's got to go through this really skinny layer in the middle. Imagine that was only two units; that would be really nasty. It means the network has got to push the image down to a two-dimensional vector which represents that car. That's obviously pretty tricky, right? But perhaps you can imagine that those two dimensions are going to be pretty interesting features of "car", and of the training set, if you can achieve it. Hard to achieve, but if you can do it, it's pretty interesting. OK. So here's just another picture about the use of this. Here's an example; this is a different kind of picture. Each dot here on the left is a data point, schematically representing a picture of a car.
They're being pushed through this network just like before, and out is coming an output, and I'm learning to match that output to the input. So the output is really going to be a distribution. And along the way, it's forced right down into just two dimensions, say: a very low-dimensional version. So each dot here actually has its own little dot down in the two-dimensional space, right? And then it recovers back, and you get the car back. So the precise position of each dot means a lot. OK, so now you can imagine: I can just pick a dot here and make a car picture, or I could move along a line through that sort of central position. As I move along that line, what do I get? I get different pictures of cars, and they'll change, probably in some discernible way. So let's forget cars and do it with faces instead. This is my colleague Tom White, from Victoria's Design School, actually. Let's see if I can do this. Good, OK. So I'll just go straight to the horse's mouth. He's looking at... Oh, let's look at the earlier one first. So this is almost literally that two-dimensional picture of little dots, and as we move, you pick a dot in the two-dimensional space and go and make a face from it. So the whole network has been learning faces: millions of faces and faces and faces. So it knows about faces. You pick a dot here, you get a face. As you move along a line, the faces change. OK, so that's exactly what I was saying is happening here, and you can see what's happening: there are recognisable things happening as you move through this two-dimensional space. And here he's made it more than two-dimensional, and he's holding two of them fixed and moving along a third one, and it seems to be the concept vector for a receding hairline, is what he says. Loads of fun.
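This is not Tom White's face model, but the bottleneck idea can be sketched very cheaply. A linear autoencoder with a two-unit bottleneck has an optimal solution given by PCA, so encoder and decoder can be read straight off an SVD; the synthetic 100-dimensional "images" below stand in for the cars and faces, and all names and sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "images": 100-dimensional points that really live near a 2-D subspace,
# standing in for the pictures of cars or faces in the talk.
latent_true = rng.normal(size=(500, 2))
mix = rng.normal(size=(2, 100))
data = latent_true @ mix + 0.05 * rng.normal(size=(500, 100))

# A linear autoencoder with a 2-unit bottleneck is solved optimally by PCA,
# so the encoder and decoder weights come directly from the SVD.
mean = data.mean(axis=0)
_, _, Vt = np.linalg.svd(data - mean, full_matrices=False)
encode = lambda x: (x - mean) @ Vt[:2].T        # 100-D input -> 2-D code
decode = lambda z: z @ Vt[:2] + mean            # 2-D code   -> 100-D reconstruction

# Reconstruction through the two-unit bottleneck stays close to the original...
x = data[0]
rel_err = np.linalg.norm(decode(encode(x)) - x) / np.linalg.norm(x)

# ...and moving along a line in the 2-D code space generates new points,
# just like sliding along a line through face space in the demo.
z0, z1 = encode(data[0]), encode(data[1])
samples = [decode((1 - t) * z0 + t * z1) for t in np.linspace(0, 1, 5)]
```

The real face model replaces these two linear maps with deep nonlinear nets, which is what lets the two code dimensions capture much more interesting structure than principal components.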
So these generative models, I guess what I'm really saying is that these generative models are really interesting beasts, and we've recently started to get really good at this. For a long time it was all about supervised learning: learning to classify pictures of cars, maybe cats, you know, things. And now people are moving on to the unsupervised learning problems, which means building generative models, and I just want to show you a couple of other powerful generative models here. You've probably heard of Deep Art. Deep Art is great fun. You can put your own photo into this and have it converted into a style. So if you take a network that's been trained on a particular style, and then you put a real photograph in as the input, what does it do? It pushes it through this bottleneck and then tries to recover the pattern, but it's going to recover a pattern like the ones it's seen in its training set. So, heaps of fun. Just one more. What is that? Let's skip that; this is the one that doesn't work, right? Okay, sorry, I've got to find the other browser. Where are you? There you are. So this is a piece of fun. This application of that general idea of generative models takes a sketch and turns it into a picture. Which is an extraordinary thing to do, isn't it? I hope you can agree that it's just an amazing thing for a machine to be able to do. Of course, if you give it something that's different, it'll get a bit... Let's just move the tail up over a bit. Okay. Yup. That's great. So this learned model knows about cats. So that's great if you have a sketch of a cat. It's a little alarming, the sort of things you can do. You can have a lot of fun with this, right? Let's get rid of the eyeballs. It's going to have to put the eyeballs somewhere; you know it's going to happen, because you can't have a cat without eyeballs. Yeah. Fun. So, this is a toy.
There are some obvious money-making applications, but there's also a whole heap of fun out there in this world of generative models. So I'm a big fan. The top picture there is pictures of bathrooms. This is a net that has seen a lot of bathrooms in its time, from the interweb, and it's now fantasising bathrooms. So those are all fake bathrooms. And that's totally extraordinary. This is April this year. Before that, you got blurry things that you might be able to imagine were bathrooms; now they really look like bathrooms. So I think this is interesting from the point of view that we've all kind of got used to the idea of machines talking to us, you know, automated voices, call centres and so on, and to some extent we're used to auto-generated text. But we're entering a world where we're going to have a lot of auto-generated imagery coming at us, it seems. And for a species that is so visually dominant, that's an interesting challenge, isn't it? I think it's going to mess with us a bit. Okay, now I want to talk about something that's missing in this picture. All that stuff is being achieved with really quite straightforward technology underneath. There are lots of tricks in the learning, of course, but essentially the engine is these deep neural nets, and they are trained by this error back-propagation algorithm. Okay, so there's a fundamental learning problem, identified in 1991 by Schmidhuber and his PhD student Hochreiter, and they were confident enough to call it the fundamental deep learning problem: vanishing or exploding gradients. So what happens is, I take an error signal and I start pushing back the gradient of the loss, the gradient of the difference between the output and the target.
Okay, I start pushing that back through the network, and what happens is that the signals passing back through the network have a very strong tendency either to blow up and start becoming infinite, or to shrink down to tiny, tiny values. And this isn't a minor problem where you might think, "but I can use lots of floating-point arithmetic and we'll be fine". No, it's a real explosion, or a real vanishing. It comes down to linear algebra, eigenvalues; let's not go there. It's a very ubiquitous feature of trying to do this task of propagating the gradients. Okay, so this is supposedly solved by a bunch of things. One is to replace the sigmoids in the old-school neural nets of my youth with rectified linear functions. Rectified linear just says: if I'm a neuron and my total input is negative, my output is zero; if it's positive, then I output whatever the positive sum was that I was receiving. Simple as that, really dumb, and it works really well. So that's part of the trick. Very careful initialisation of the weights can help a lot here. And there's a technique called batch normalisation, which is basically just trying to keep these activations at about the right mean and about the right variance all the time, constantly rescaling and shifting them to keep them in the right ballpark. It kind of makes sense, right?
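The three fixes just mentioned can be sketched together. This is a simplified illustration, not the full recipe: the normalisation step below omits the learned scale and shift and the running statistics of real batch normalisation, and the initialisation uses the variance-2-over-fan-in scaling commonly paired with ReLU.

```python
import numpy as np

def relu(z):
    # Rectified linear: if the total input is negative, output zero;
    # otherwise pass the positive sum straight through.
    return np.maximum(0.0, z)

def batchnorm(a, eps=1e-5):
    # Keep activations at roughly zero mean and unit variance across the batch.
    # (Real batch norm also learns a scale and shift and tracks running stats;
    # this is just the bare rescale-and-shift idea from the talk.)
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 100))                      # a batch of 64 inputs, 100 features

for _ in range(50):                                 # 50 layers deep
    W = rng.normal(0.0, np.sqrt(2.0 / a.shape[1]),  # "careful initialisation":
                   size=(a.shape[1], 100))          # variance 2/fan-in suits ReLU
    a = batchnorm(relu(a @ W))

scale = float(a.std())                              # stays around 1 even at depth 50
```

Drop the `batchnorm` call and the scaled initialisation and you can watch `a.std()` drift away from 1 over the 50 layers, which is the forward-pass face of the same explosion-or-vanishing problem.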
It took a long time for someone to make it really work, but it makes intuitive sense. Those all help; they really do solve that problem. Okay, so here's the thing. I reckon there's another fundamental problem, and this is work with David Balduzzi, who's really the lead creative on all this. There's another fundamental problem with learning in deep nets. It's not exploding or vanishing; we call it shattering, David's term. We've just written a paper called "The Shattered Gradients Problem", and the subtitle says it all: "If resnets are the answer, then what is the question?". I'll tell you about ResNets in just a second. So, here we go: ResNets. Residual nets are, as of last year, the king; they are the state of the art in classifying images, things in images for example. Very, very effective, and they allow you to learn very, very deep nets, hundreds of layers deep. This is great; what's the trick? Instead of just having a straight feed-forward connection through the weights matrices, you insert these little hop-over direct connections, which are just the identity map. And why would that make any difference? Well, if you drew it differently, you've basically got the identity map as the short route through, and these layers are all hopping out to do a bit more work. So you can imagine it's a bit like replacing, well, it is actually replacing the old feed-forward function with x, the input, plus a residual function. So the nonlinearities and the weights matrices are working on the residual, fixing up the errors, but there's this backbone of very direct connections. Okay, so for whatever reason this is a really good idea. There are lots of varieties of it, but basically let's call them residual nets, or ResNets; they work really well. Okay, so it was originally said that they were helping with this explosion or vanishing gradient problem. But if that's true, why are they working so well now on these deep nets? Because really, that problem was solved by those other techniques. So they're doing
something else. These skip connections are solving some other problem. All right, what do you think it could be? Maybe it pays to just have a look at some gradients. So here's a really simple neural net; it's got two layers. What I'm doing is plotting... it's a really simple net: it has one input, a scalar input, every hidden layer has 100 hidden nodes in it, and they've all got these rectified linear units, and the output is just a single scalar as well. So it's scalar to scalar, x to y, which is great because I can just plot it. And we're doing backpropagation on this network. So here's a picture of the gradient as I move along the x-axis (I'm being right-ist again): as I move the input value from 0 to 1, say, how does the gradient change, the gradient I would calculate at the first layer? It's a Manhattan skyline. It's got flat bits; that's because of those ReLU functions. But look at the overall shape: it jumps around. It's sort of a random walk. Below it is a random walk, otherwise known as brown noise; a random walk is someone staggering down the road, left, right, oh god. So one of the things we've shown is that, in the limit of an infinite number of hidden units, this converges to a random walk, the same distribution. Okay, that's two layers. Here's what 20 layers looks like, with this batch normalisation thing which is solving the whole exploding gradient problem. This is a 20-layer neural net. That's white noise: totally uncorrelated white noise. Okay, so that's what we mean by shattered: the gradient is being shattered into tiny, tiny pieces that are completely unrelated to their neighbours. And you can start to imagine what that would do to a learner. How can you learn when the gradient information, which is the learning signal, looks like that? You can't. All right, here's what a ResNet looks like: it's somewhere in between. It's not brown noise and it's not white noise; it's somewhere in between. It's got local
structure, but it's not just a straight line either. All right, so here's our idea. Here are the intriguing facts: the gradients of a deep feed-forward net look like white noise. Wow. And that's only 20 layers; imagine when you get to 200. And ResNets don't shatter: the gradient does not get shattered. All right, so let's look at some more pictures. What I'm looking at here is, as I go left to right along one of these black and white lines, I'm just asking: is this neuron, here's a neuron in a hidden layer, is this activation on? I mean, is it outputting a real value other than zero, or is it outputting zero? That's what black and white means. Don't worry about the coloured ones; just look at these hashy pictures there. Okay, so they all look the same at the two-layer level. Fine. You go to 10 layers, things are starting to change. Let's go all the way to 50 layers. At 50 layers, what's happening? We should be able to understand this. In the plain feed-forward network, pretty much all the neurons are either on all the time or off all the time. That's actually what the histograms in blue are showing: neurons are either always active, in other words they're just being linear, which is not very helpful, or they're dead, outputting zero. Neither of those is contributing any computation to the mapping we're trying to learn. So that's no good. In a feed-forward network with batch normalisation, the shattering has broken up the activity, so every neuron is pretty much uncorrelated across different points of the input. As I vary the input, any one of these neurons is flicking on and off all the time, pretty much randomly. It's not random, it's deterministic, but it looks random. And the ResNet is somewhere in between. We can prove theorems here: we can talk about the covariance of the gradient at different points as we move along the x direction, and you can prove that it dies away as one over two to the
power of the number of layers. So it's exponentially dying away. When the covariance is zero, there's no signal left: you have totally white noise, no connection between neighbouring points in input space. As far as the learner is concerned, that basically makes the problem unlearnable. ResNets go the other way: they actually rise, but it's only a slow rise, like the square root of the number of layers. Really different characteristics. So what's going on? Shattering is occurring; it's a feature of deep learning, and it's a really important thing: it actually stops you learning things. And one way to solve it is to have these skip connections; that's really what the skip connections are doing. Okay, so with this insight there was a nice immediate payoff: understanding something provokes further ideas, and so we came up with a new initialisation, which we call the "looks linear" initialisation, which enables us to train really deep nets, up to 200 layers, with hundreds of hidden units per layer. So, really deep, without any skip connections at all. It's the first time anyone's trained a network that deep without using skip connections; and skip connections essentially shortcut the network, so they reduce the effective depth. Okay, so I like this. We've attracted a Reddit page, which is awesome, and one of the comments on it sums things up really nicely: "so we really don't understand how to train neural nets, do we?" It's such a basic thing, a fundamental aspect, and it's just coming to light really. Okay, I want to switch tack to the other flavour of neural nets. We've been talking about deep neural nets as feed-forward enterprises: they go from input to output. People are really interested in modelling structure in time, modelling sequences, and you can do that with neural nets that are very similar, with one difference. The idea is that we allow the hidden layer to feed back on itself; that makes a first-order Markov system out of the device, if you like. So now what I'm saying is, the hidden
layer: we're still mapping inputs to outputs, but the hidden layers in the middle (there may be more than one) are receiving two kinds of input. They receive input from the actual pattern that's coming along at this moment in time, but they also receive a copy of their own state from a moment ago. In that way, in principle, they could learn to use their weights to model temporal processes. It's not a done deal that this is going to work, but that's the idea: a sort of minimal model of a system that could model sequences. Okay, so you can think of it like that, or you can unroll it over time. I think this is a good way to think of it. That's just the same picture, and you can see that it is like a deep net: if I try to propagate an error signal (something's gone wrong here) all the way back, to change that weights matrix it's got to propagate through a whole lot of weights matrices. It's a deep learning problem. We're only unrolling here; these weights are all copies, the A matrix or whatever is the same at all times. All right. It has been pointed out that this really generalises the family of things you can learn. So here's the old input-goes-to-output; sorry, it's the other way around now, input at the top, output at the bottom. That's one-to-one: the standard old feed-forward network, if you like. But with this temporal process unrolled, we can imagine a one-to-many mapping; that's, like I say, perhaps a signal, a word, making a sequence. Here's the opposite: you could take a sequence and map it to a single value, such as classifying a sequence: who is the speaker, or what are they saying? Many-to-many is also possible, and this is perhaps what's going on in Google Translate, which takes in a sequence and then, having had the whole sequence, generates a new sequence. And the most general one, I guess, is many-to-many
: a streaming input coming along, that's the red squares, all through time, and we're having to produce outputs all through time as well. All right, so lots of possibilities, lots of scope for fun. Here's one rendition of the fundamental element inside one of these recurrent nets. It looks more complicated than it probably is. Think of two paths through this. One is that the input is going to the output, and in this picture it's going through two layers of neural net; so that's a two-layer neural net going from input to output, and this is going to be replicated over time, a whole bunch of these strung together, if you like. The other path through goes left to right, and that's the hidden state, which we might as well call a memory now, because it's being propagated over time. The memory passes from time t to time t plus one, being changed along the way, and it's changed in the light of the current input. So you take the memory, you take the input, that changes the memory; and here's the memory influencing how input maps to output. So it's got the bare minimum. Okay, this has a bunch of problems. It seems like a sensible way to start; unfortunately it doesn't really work very well. One problem is that it has this exploding and vanishing gradients problem, but there's more to it than that. Very early on, in the 1990s, an alternative architecture was proposed, called the long short-term memory, LSTM, and here's a picture of it; you can tell me how it works. The interesting thing is that it's pretty much inscrutable, but incredibly successful. So this is my second "missing": why is this thing working so well? The LSTM, long short-term memory, looks like this. I'm not even going to try to tell the story. If you spend enough time with someone who really knows their stuff, you can start to believe that bits of this thing make sense, but it's hard. And that's not the same
as saying it's the answer. I don't know; there's something really weird about this, that it's so successful. It's been such a long time: 1997, so it's 20 years old. Incredible. This thing is still the state of the art in recurrent neural nets; with slight tweaks, admittedly, but basically this idea. Google's translation engine runs with a whole bunch of these. State-of-the-art speech recognition, all kinds of cool tech that's either in your browser or on your phone, or is coming soon, is built out of these things. Let's just look at some examples. Here's a language model. I feed one of these recurrent neural nets everything Shakespeare ever wrote. Let's just jump back to that picture back there: I'm feeding it individual characters (in that scenario, more like that one), individual characters, C, A, T... that's not a good example... S. And the network is being trained to predict the next letter. It's just got a little sliding window, and it predicts the next letter. Fine, that's a predictor; I can train it. And then, just for kicks, I run the model, and when it says K for the new prediction, I just write K in. So now it's able to just write text forever. So this is making up Shakespeare. It's a bit hard to see the bigger picture, but it's pretty good: I mean, it's making words, it's making spaces, grammatical sentences, and look, it has the feel, right? Lots of fun. This one is trained on a few hundred essays by someone called Paul Graham, whose startup became Yahoo Store; you can tell, by looking at the fake text coming out of this generator, the sorts of things that Paul Graham writes. Huge fun. If someone wants to ask me a question later, I can show you some fake maths; it's really fantastic. They took the whole LaTeX source of a big book and trained it up, and then they ran the model and got a whole bunch of LaTeX that actually compiled, which means you can make a PDF out of it and look at the proofs. And you've probably seen DeepDrumpf: this is a Twitter account trained on all Donald Trump's tweets, and it just generates tweets all day long. Stuff like this, and they're pretty good, actually. Fantastic. Okay, what other things can we do? That's just the language stuff. You can now take an image and push it to a sentence. So instead of taking this image and just putting labels on it, we'll take that low-level representation of the image and ask one of these recurrent neural nets to make plausible sentences from it. Incredibly, this works. You wouldn't think it would work, but it works pretty well: "a man in a black shirt playing a guitar". It's true. These are slightly cherry-picked examples, but it's pretty effective. And it's taking the whole system end to end: taking a raw image, and generating the text at the far end, and it's all just neural nets all the way along, trained from lots of examples; there's nothing else in there. Okay. Oh, sentence to sentence: this is just a picture to remind myself to say that what's inside that technology now is just a big hierarchical multi-layered LSTM network. That's what's inside it. Up until last year they used to have all kinds of clever linguistic machinery in there, developed painstakingly over years and years and years, a huge thing, and then last year they swapped it out, because this worked better. And this is just one end-to-end thing; it's just a whole bunch of those recurrent neural nets. All right, the one on the bottom right is, oh, taking a sentence and making an image out of it. It is extraordinary that this works; early days as well, but if you can see them: this has seen a lot of pictures of birds, and then it's been given a sentence, "this small bird has a pink breast and crown, and black primaries and secondaries". [Audience: how long has it been doing this? How long?]
This is last year, I think. Either early this year or last year, yeah.

Okay, so text to handwriting. Let me show you this, this is fun. Someone needs to give me a sentence. So it's taking text. This really came from someone trying to build a model that could recognise handwriting, run in reverse. Oh, I can't scroll that down... can you see at the bottom? So, someone needs to give me a sentence. Different samples; you can mess with them all you like, have a go. Lots of fun. So the text comes in, and then a recurrent neural net is directing... the way this is actually done, I believe, is that it's directing the drawing motion, rather than trying to make a whole image of that sentence from scratch. It's actually controlling, if you like, a virtual hand.

All right. Oh, and here's someone... this is just the other day, last week I think. Okay, that's not great. Same idea as Shakespeare: the model is trained to predict the next frame from the last few frames, and we just take that frame and go with it, and keep going. You can tell how this was trained: it's trained on a train. This must be completely new countryside. It's not perfect, but it's pretty amazing. Let's go to the town.

[Audience: how was it kicked off?] I'm not sure. For the Shakespeare one you can kick it off with a blank space and it just goes. So I'm not sure; maybe it's kicked off with a picture, but maybe it just starts making stuff up, I don't know. Okay, where's that going?
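Since the LSTM comes up repeatedly in this talk, here, for reference, is the standard cell in its usual textbook notation (forget, input and output gates, candidate memory, cell update, output); this is the generic formulation, not anything specific to the systems just shown:

```latex
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell update} \\
h_t &= o_t \odot \tanh(c_t) && \text{output}
\end{align}
```

It really is just a handful of equations, even if the story they tell is hard to narrate.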
Like I say, we're going to start mistrusting our eyes. I don't know; the world's going to get interesting.

So here's something that's missing, at least. What's so great about LSTMs? They're almost impenetrable, in that it's very hard to tell a story in a pub about how the LSTM works. It's not a three-point explanation; it's really hard. You can pick out bits, but there's not that very simple story I was trying to tell about the basic recurrent network, where there's two things going on: inputs go to outputs through a neural net, but mediated by the memory; and memories go to new memories, mediated by the latest input. Okay? It's hard to tell that story for the LSTM.

So you have to wonder: how special is the LSTM? How unique is it? There's an interesting story here. Two teams ran two large trials trying to find an alternative. They used genetic programming, which is random search, to try out all kinds of variations. This LSTM, I've drawn it as a diagram, but it's really just five equations, so you can make up equations randomly and try them out on a big machine. They trained it on Google's machines for the equivalent of about a century of CPU time, and found... the LSTM. And a few other things: the GRU, if you know this stuff, is the other alternative, but not much else. So basically this seems to be some kind of special spot in the space they were searching, at least.

Okay, so one idea is to do genetic programming for 100 years of CPU time. Just think about that. We've been trying something else instead. This is work with Paul Matthews, a wonderful honours student. This is last year, and this year we're pushing on with it. Essentially we're trying to think of: what's a minimalist model that's better than the old plain recurrent neural net, but is understandable, unlike the LSTM? And we're going tensors on this.

So there's a mapping from input... the same story, right? The inputs map straight to outputs; in fact, this is just an outer product. Basically, inputs are going to outputs through a one-layer neural net; it's just that we do want that to be mediated by what's in the memory. So we involve the memory by taking an outer product. Another way to say this is that it's a tensor mapping: a three-way tensor takes the input and the memory in, and gives you a new output. Otherwise known as a bilinear form, whereas a straight matrix is just a linear form. So that's that story.

And memory is going to new memory. How's that happening? Well, it's just going straight through, really. This is a little switch that says you either send this information straight through, or maybe you take the other path and take something from the input. And it's an either-or thing; it's not binary, but it's in the 0-to-1 range, a mixing between what's in the memory and what is coming directly from the input. It's as simple as that.

So how's this decision made about which one you want to keep? This is a bit like a read-write head. You can think of it in a simple way: it's saying, should I pass this through, or should I overwrite it with a new bit? How do you decide which things should be overwritten or not? We figure that's a job for both the current input and the memory state, so it has to be a combination of both. Let's do the same thing: take a tensor product of the current memory and the input, put it through a nonlinearity, and that's your switch. So that's the longer story.

This thing has some nice properties. The gradients don't explode or vanish, because basically the memory is propagating straight through here, and we're just swapping stuff out of or into it; that has the effect that the gradients don't explode. And we're avoiding this concatenation of input and hidden states, which always happens in the other models: you sort of push input and hidden together into one space where they don't really want to be; it's almost like a type error to do that. So we avoid that, and we use a tensor instead of concatenation.

Um, you know it's a
machine learning talk when everyone's frowning and there's a graph like this. I guess lots of talks have graphs like this, but in machine learning you've always got one. So: the loss function, with training time going left to right along the bottom, and the training loss, that's the amount of error, the amount of badness, going away. And as with most talks, the badness goes away on the stuff we did and stays high on the stuff other people did. Okay, so the top two lines are the LSTM and the GRU, which was one of those other randomly-found architectures that was as good as the LSTM. Ask me about this task later if you want to know, but there are tasks, very straightforward, simple memory-type tasks, for which this new architecture seems pretty promising. It's early days; we're still chasing this around.

Here's an interesting task which we also happen to do well on. So here's the input pattern, vertically, over time. Okay, a sequence of input patterns. Pay attention to that very last couple of rows: when this neuron comes on, I want the network to look at these other bits, these ones over here, and remember them, and spit them back on the output when that neuron goes off again. So the last of the inputs is a sort of trigger: when it comes on, remember something; when it goes off, cough it up. And here we're doing that with three of them, right? Three memories going in: store that, store that, store the third thing, which overlaps actually, and then cough them up at the end. And it's not perfect. This is our model, trained on gazillions of these examples, but it's having a good go: the second one's close, the third one's actually pretty good. Not quite there yet, so, work to do.

But that's to be compared with the best of the competition. This is the very best of the competition: when we try to store three memories, we get three things out, and they come out at the right times, when these neurons here turn off you get a pattern back, but it's the same pattern every time. So this is the LSTM, the state of the art that's inside your phone doing all these awesome things, all the handwriting stuff: it can't do this kind of task. It finds it conceptually very hard to do. We would argue our architecture finds this a really natural thing to do.

And it's very like variable binding, right? The trigger comes along, you remember some arbitrary thing, hang on to it, and then when I release the trigger, it comes up at the output. I guess what's particularly appealing about it is that it feels like the foundation for other processes: any processes that are able to bind one pattern to another, use it like a key, perhaps manipulate the key in some fashion, and then get back to the original bound item. I didn't say that well, but I guess I'm optimistic, and feel that this is the beginning of something interesting, bridging into a more symbolic realm for neural nets. But it's still your standard neural net; it's just got magic tensors in it.

An even tougher task: when this bit comes on, you have to remember the bit pattern that immediately follows it on the other nodes, and then when the neuron goes off, you have to cough it up again. So, the same task, but now there's all this distractor stuff. That's what the hidden representation looks like, and that's the output after training. If you're clever, you can see that pattern there is the same as that pattern there: 1, 2, 3... oh yeah, it is. Great, so it's learned to do that. We never told it; we just trained it, gave it lots of examples in which that's the pattern, and this is a new example.

So, I'm going to wrap up quickly with some more missing things. One thing I said was missing was why the state of the art in feedforward nets is actually the state of the art: why is it ResNets, and what are they actually doing? I argued that they're solving the shattered gradients problem. We said there is a shattered gradients problem, and they're solving
it. And then I talked about why the state of the art in recurrent neural nets is the LSTM. I don't have an answer, because we haven't found the better thing, but I pointed to a direction for thinking about it. I think that thinking about the problem, trying to gain insight, and building something that is interpretable is a better way to go than randomly searching.

Here are some other questions to throw out there that I'm not going to answer; they're very much missing. Why are deep nets better than shallow ones? This is still up for grabs, actually. There's a proof, just this year, I think it was last year, but it relates to two versus three layers; it doesn't generalise. Why this architecture at all? Why neural nets at all? What's so special about pushing mush through a matrix, having an element-wise nonlinearity, and then going back into a matrix, just doing that and iterating it? Why does that work so well? Why not something else?

Something that is coming, I think, is various ways of making forays into more symbolic-style processing, but within this connectionist, neural-net world. I hinted that our tensor-gate idea is one direction to take into that, but lots of other people are doing things to do with, for example, external memory (having a neural net able to read and write from an external memory), and neural nets able to implement programs. That's a hot area; lots of fun.

And my pet passion, I guess, is trying to understand generative models for really complex, multi-causal data. Imagine you have some data. My go-to example is a face, my face. What you're seeing, the pixels appearing on your retina, is a combination of two completely independent things: one is the physical structure of my face, and the other is the lighting environment. Those two things are completely independent; they interact, and you only get to see the interactions. You never actually get to see a naked lighting environment; you only see lighting environments when they hit things. And you never get to directly apprehend 3D structure; you always have it in the context of a lighting environment. So there are pure things out there in the world, but they get mixed, and we see only the mixtures. Can we build a system to go the other way: having seen a lot of data, disentangle it to discover the underlying causes? That's a big, tough problem, and unsolved. Many, many more things to come. There's a lot missing. Missing is a huge country. Thank you very much.

[Audience: Ray Kurzweil's book about how to create a mind. Do you agree with it?] Sorry, who? [He wrote a book, "How to Create a Mind"; he discusses the neurons in a brain, and that there's 12 layers, et cetera.] I haven't read the book, so I don't think I can comment on that one. Interesting: the announcement at the Google conference today was that the future is to do with the tensor... oh no, that's all right, someone will get there. I mean, for me, I have a scepticism about the idea that if we study individual brains enough, we'll understand entirely how they work. I think that has to be coupled with computational-style thinking to achieve this kind of computation. It's a bit like the analogy with flight: to understand flight you can look at birds, and that helps, but eventually you have to do some aerodynamics.

[Audience: you said it's a good thing that you don't concatenate the inputs and the previous hidden state. Can you elaborate on why?] Okay, so I would direct you to David Balduzzi's paper; "strongly typed" neural nets is the key phrase. His point is this: there's a mapping from the hidden state to the new hidden state, achieved by this weights matrix. So what's the weights matrix actually doing? It's rotating and
skewing directions, right? That's what it's doing. There's a sort of native basis associated with that transformation, and it's a different basis from the one associated with the input pattern; the interaction between the input and the hidden layer has a separate basis. So why would you expect the hidden units' dynamics to be naturally altered just by additive changes coming in from the input? It's a very limiting way to do things. Have a look at David's paper about strong typing; I think it explains a lot.

[Audience: with the sentences-to-images work, are the images genuinely new?] Yes, in theory they're supposed to be brand new, because they're projecting from the space of the sentence down to a much more modest latent space, a low-dimensional space, and generating the image from that space. So in theory they're new images. In practice, have a read of the paper: they're struggling. This is last year, and actually, look up GANs, Generative Adversarial Networks, another kind of generative model. They do seem to suffer from a problem called mode collapse, where the generative model learns to produce only a very narrow range of, what do we have, birds, all rather similar. So that was a problem; it seems to have been solved, so I expect there'll be much more convincing demonstrations of that soon. But no, the whole idea is that they're not supposed to be just stuff you've seen; after all, that would be kind of boring. Most of these papers, what they do, and should do, is test: they're generating new things, so they look at those new things and compare them to items in the training set, and usually there's some kind of demonstration that, yeah, we're really making new things.

Hmm. Hi. Sorry, which one?
Let's go with you. Okay, fine. Sorry about my insufficiency last night. Um, okay, I started seeing this a couple of years ago, and it's much better.

[Audience: a question that came up in a discussion with somebody with a medical background a few months ago. Ignoring the ethical issues, is it within technology to cut through the optic nerve of an animal and attach it to some device that can get the signal out of it?] Ah, not my field of expertise. Can you chop the eyes off something and stick them onto the brain of something else? Not currently doable, and not something I'm an expert on.

What was the question here? [Inaudible audience question about backpropagation and real brains.] I know, it's annoying, isn't it? It's hugely annoying that there's all this progress and cool stuff, and it's all built with backpropagation, and backpropagation is biologically implausible. So that's a damn shame for the idea that you might think about computation and make your way back towards brains and understand ourselves better. Bit of a blow. Various people have tried to think of alternatives, but generally the efficiency goes down. I think it's still a live question: all this stuff is working very nicely, which puts the pressure on the question of how it's done by real neurons. It's still an open question.
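A footnote on the tensor-gate architecture from the middle of the talk. The verbal description (inputs map to outputs through a three-way tensor, mediated by memory; memory passes straight through unless a 0-to-1 switch, itself a bilinear function of input and memory, overwrites it with something from the input) can be sketched as below. This is a reconstruction from the talk, not the published model; all names, sizes and initialisations are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TensorCell:
    """Sketch of the tensor-gated recurrent cell described in the talk.
    Details are reconstructed from the verbal description and may differ
    from the actual model."""
    def __init__(self, n_in, n_mem, n_out):
        # Three-way tensor: input x memory -> output (a bilinear form,
        # replacing the usual concatenation of input and hidden state).
        self.W_out = rng.normal(0, 0.1, (n_out, n_in, n_mem))
        # Three-way tensor: input x memory -> per-unit write switch.
        self.W_gate = rng.normal(0, 0.1, (n_mem, n_in, n_mem))
        # Candidate new memory comes straight from the input.
        self.W_in = rng.normal(0, 0.1, (n_mem, n_in))

    def step(self, x, m):
        # Output mediated by memory: y_k = tanh(sum_ij W[k,i,j] x_i m_j).
        y = np.tanh(np.einsum('kij,i,j->k', self.W_out, x, m))
        # Switch in (0, 1): decided jointly by current input and memory.
        g = sigmoid(np.einsum('kij,i,j->k', self.W_gate, x, m))
        # Memory passes straight through, except where the switch
        # overwrites it with something taken from the input.
        m_new = (1 - g) * m + g * np.tanh(self.W_in @ x)
        return y, m_new

cell = TensorCell(n_in=3, n_mem=5, n_out=2)
m = np.zeros(5)
for x in (np.array([1., 0., 0.]), np.array([0., 1., 0.])):
    y, m = cell.step(x, m)
print(y.shape, m.shape)
```

Because the memory update is a convex mixture rather than a repeated matrix multiply, gradients along the memory path neither explode nor vanish, which is the property claimed in the talk.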