Hi, thanks for having me. It's great to be here. What a pleasure. So, "what's missing": it's a large territory. It's a big place. I realised that I'd thrown myself a hospital pass when I was asked by Bob what title I should give, back in December or whenever you first approached me. And I said, oh, deep learning, let's say "what's missing". OK, that was my first mistake, because obviously I can't: if I knew what was missing, it wouldn't be missing anymore. So I've done a whole Gödel thing straight off the bat. But I'm going to pick up on a couple of specific things that I feel are missing. I'm going to partially fill in one of them; the others are just question marks to plant in your head. And I'll mention towards the end some other aspects that I think are just, I don't know, holes. They're really mostly holes in understanding in the current world of deep learning.

So, what is this deep learning? Let's get right into it, eh? Something's happening in machine learning. If you can read that, it says "hype watch: big data saturating on the hype curve". And the red guy coming up there is machine learning. And I guess the question is, how big is this curve going to get? What's going to happen, not just in hype, but in terms of applications, successes? Is it going to plateau, or are we just in the early days? And it's hard to know, because there's a lot going on. There's a lot behind this curve: some new insights, some wonderful ideas.
And then at the same time, there are applications of those ideas that are very worthwhile doing, and therefore big corporates jumping into the space that used to be occupied by universities. And huge resources pouring in, and then more ideas flowing from that. So we're in that exponential phase, and who knows where we're going.

But I think part of the excitement is driven (let's get rid of that hand) by machines recently, and I'm talking just the last couple of years, achieving near-human performance on certain selected tasks, right? And those tasks are spread across the spectrum, from image recognition to speech recognition, text translation, gameplay, dot dot dot: you'll have your own things you've read about in the press. And so these are exciting, and they're also slightly frightening. I love this sentence from the Listener in April; I'll read it out: "It's about to get worse. The job-sucking behemoth of artificial intelligence stalks the horizon." And I just think it's beautiful, you know. Stalking the horizon, this silhouette of the job-sucking artificial intelligence. And so that's scary, and there are aspects we should be thinking about. I think some things are a little scary.

There's also, of course, tons of promise. There's enough promise that lots of academics want to jump in on the game. In particular, this phenomenon of deep learning, which is the subject of my talk, is the part of machine learning that's actually driving the red hype curve up. All these achievements that are close to human-level performance are deep learners. And everyone's saying, oh yeah, of course, I've been doing deep learning forever. I don't know if you have colleagues like that, but... OK, so let's just get into this. What's a neural net? Deep learning just means big neural nets. And if you're at a talk like this, you probably know what a neural net is. But here's a neural net.
What we have is a bunch of input feature values. That forms a vector, a vector of input values, and some target output. In this case, there are two outputs. And in between, we're going to have a bunch of neurons. What's a neuron? It's a little unit that takes a weighted sum of its predecessors in the previous layer, computes that weighted sum, and then pushes it through a very simple nonlinearity.

So let's make this thing go. Let's give it a problem, a hard problem, and make it go. OK, so that little diagram on the right there, if you can see it, is the loss: the error of the network as it attempts to get to grips with this learning problem I've given it, which is actually a pretty tough one, learning to classify the data points on the spiral. What's happening here is there's a learning algorithm which is changing the strengths of the connections between all these neurons. That's all that's happening. The nonlinearity stays the same. The inputs are coming from some training set of examples, with an output pattern that goes with each input pattern, and it's simply learning a mapping from the input to the output. With some success; it's still going. OK. Fine. So that's what a neural network is.

I should say there's another part of this red curve rising: a bunch of fantastic new tech being built, including TensorFlow, that demonstration there. But here's my more reduced picture. This is my shorthand for that last picture of a neural network, which I'll use for other bits of the talk. We have an input, that's a vector, the blue line, and an output, another blue line, a vector. I hope you can see that; apparently we can't dim the lights to half-way, it doesn't work like that. Anyway, input through to output, and it's going through each of these... well, what is that? A trapezoid?
Some four-sided thing representing a weights matrix: all the weights from one layer to the next. And the sigma in a circle represents that nonlinearity. So we have the weights matrix, and then an element-wise nonlinearity. That's all a neural network is. And it's deep if you string a bunch of these together: more than three layers and you're allowed to say "deep", I think. In the old days you could say that. And now we can train things that are hundreds of layers deep.

Okay, so how do we train it? We look at the error. Each input pattern has a desired output. You stick in the input, something else comes out, not the desired output, so there's an error. The difference between them gives a loss function. We compute the gradient of that and push it back through the network. And that's basically just asking: how would I like to change some parameter inside this matrix, some weight value, such that the performance on the training set has fewer errors? This is often called error backpropagation; it's an old algorithm from the '80s. Okay, so that's a neural net. Game over, really.

Okay, so here's an example of a neural net, a reasonably deep one. The input, this blue vector here, is an image. That could be just a huge vector, right? It's pretty big: hundreds and hundreds of elements. But an image is a vector. It's going through this neural net (don't worry about the details here) and out the far side comes the output: in this case labels, words, like car, truck, airplane, ship, horse. And in this case, thankfully, it's figured out that this thing is more likely to be a car than a horse. So full marks. And it did that by training on probably millions of pictures of cars. All the weights in this big neural net have been trained by this algorithm. Okay, that's deep learning. Awesome. That wasn't too hard, was it?
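To make that concrete, here's a minimal numpy sketch of the whole story so far: a tiny two-layer net, a forward pass, and one error-backpropagation step. Everything here (the sizes, the sigmoid choice, the learning rate) is made up for illustration; it's not the demo from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny net: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(0.0, 0.5, (2, 8))
W2 = rng.normal(0.0, 0.5, (8, 1))

def sigmoid(z):
    # the "very simple nonlinearity" each neuron applies to its weighted sum
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1)   # hidden layer: weighted sums, then nonlinearity
    y = sigmoid(h @ W2)   # output layer
    return h, y

# One training pair: an input pattern and its desired output.
x = np.array([[0.5, -1.0]])
target = np.array([[1.0]])

h, y = forward(x)
loss_before = float(np.sum((y - target) ** 2))

# Error backpropagation: push the gradient of the loss back through the
# net, telling every weight how to change to reduce the error.
dy = 2.0 * (y - target) * y * (1.0 - y)   # gradient at the output layer
dW2 = h.T @ dy
dh = (dy @ W2.T) * h * (1.0 - h)          # gradient pushed back one layer
dW1 = x.T @ dh

lr = 0.5                                  # step downhill
W2 -= lr * dW2
W1 -= lr * dW1

_, y2 = forward(x)
loss_after = float(np.sum((y2 - target) ** 2))
print(loss_before, "->", loss_after)
```

One gradient step on a single example: the error on that example goes down, and real training is just this repeated over a large training set.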
I should tell you about a particular kind of neural network, the auto-encoder, which is being used in very interesting ways these days. With an auto-encoder, what you do is take the input, push it through the neural net, and train it to reproduce the input at the output. The target is just the input; that's why it's called an auto-encoder. Why on earth would we want to do this? The identity map does this; it's not that hard to get the input from the input. But we push it through a neural net where there's a layer that's very narrow, that little blue dot in the middle there. If we go through that bottleneck layer, then what's the neural network trying to do? It has to push this picture of a car through the network and recover the car picture at the far side, but along the way it's got to go through a really skinny layer in the middle. Imagine that was only two units, just to be really nasty. That means it's got to push the picture down to a two-dimensional vector which represents that car. That's obviously pretty tricky, right? But perhaps you can imagine that those two dimensions are going to be pretty interesting features of the cars in the training set, if you can achieve it. It's hard to achieve, but if you can do it, it's pretty interesting.

So here's just another picture of the same sort of thing. How would you make use of this? Here's an example with a different kind of picture. Each dot here on the left is a data point; I'm just schematically representing a vector, a picture of a car. They're being pushed through this network just like before, an output comes out, and I'm learning to match that output to the input. The output is really going to be a distribution. And along the way it's being forced right down into just two dimensions, say: a very low-dimensional version. So each dot here actually has its own little dot down in the two-dimensional space, right?
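Here's a toy numpy sketch of that bottleneck idea: a linear auto-encoder squeezed through two units, trained to reproduce its own input at its output. The data, the sizes and the learning rate are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 "pictures", each a 16-d vector, that secretly vary along
# only 2 underlying directions -- so a 2-unit bottleneck can capture them.
basis = rng.normal(0.0, 1.0, (2, 16))
X = rng.normal(0.0, 1.0, (200, 2)) @ basis

# Linear auto-encoder: 16 -> 2 (the skinny layer) -> 16.
# The target at the output is just the input itself.
W_enc = rng.normal(0.0, 0.1, (16, 2))
W_dec = rng.normal(0.0, 0.1, (2, 16))

def recon_error():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

err_before = recon_error()
lr = 0.01
for _ in range(1000):
    Z = X @ W_enc                  # each data point squeezed to 2 numbers
    R = Z @ W_dec                  # ...and reconstructed back to 16
    dR = 2.0 * (R - X) / len(X)    # gradient of the mean squared error
    dW_dec = Z.T @ dR
    dW_enc = X.T @ (dR @ W_dec.T)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc

err_after = recon_error()
print(err_before, "->", err_after)
```

After training, each row of `Z` is the "little dot in the two-dimensional space" for that data point. (A linear net with squared error ends up finding a PCA-like subspace; the nets in the talk are nonlinear, which is what makes the learned dimensions more interesting.)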
And then it recovers back, and you get the car back. So each dot, the precise position of the dot, means a lot. Okay, so now imagine I pick a dot here and I can make a car picture, and I could move along a line through this compressed representation. As I move along that line, what do I get? I get different pictures of cars, and they'll change, probably in some discernible way. So let's forget cars and do it with faces instead. This is my colleague Tom White, from Victoria's Design School, actually. Let's see if I can do this. Good, okay. So I'll just go straight to the horse's mouth. Oh, let's look at the earlier one first. This is almost literally that two-dimensional picture of little dots: you pick a dot, and as you move around, out come the faces. Constantly rescaling and shifting them to keep them in the right ballpark: it kind of makes sense, right? It took a long time for someone to make it really work, but it makes intuitive sense. Those all help. They really solve that problem.

Okay. So, here's the thing. I reckon there's another fundamental problem. And this is work with David Balduzzi, who's really the lead creative on all this. There's another fundamental problem for learning in deep nets. It's not exploding or vanishing gradients; we call it shattering. David's term. We've just written a paper called "The Shattered Gradients Problem: If resnets are the answer, then what is the question?" The title says it all. I'll tell you about resnets in just a second.

So, here we go: resnets. Residual nets are, as of last year, the king. They are the state of the art in classifying images, things in images, for example. Very, very effective. And they allow you to learn very, very deep nets, hundreds and hundreds of layers deep. This is great. What's the trick?
Instead of just having the feed-forward path through the weights matrices, you insert these little hop-over direct connections, which are just the identity map. And why would that make any difference? Well, if you drew it differently, you've basically got the identity map as the short route going through, and these layers all hopping out to do a bit more work. So you can imagine it's a bit like, well, it is actually, replacing the old feed-forward function with x, the input, plus a new feed-forward function. So the nonlinearities and the weights matrices are working on the residual; they're fixing up the errors. But there's this backbone of very direct connections. Okay. So for whatever reason, this is a really good idea. There are lots of varieties of it, but let's call them residual nets, or resnets. They work really well.

Okay, so it was originally said that they were helping with this exploding or vanishing gradient problem. But if that's true, why are they working so well now, in these deep nets? Because really, that problem was solved by other techniques. So they're doing something else, these skip connections; they're solving some other problem. All right, what do you think it could be? Maybe it pays to just have a look at some gradients.

So here's a really simple neural net. It's got two layers, and what I'm doing is plotting... ah, it's a really simple net. It has one input, a scalar input. Every hidden layer has 100 hidden nodes in it, and they've all got these rectified linear units, ReLUs, in them. And the output is just a single scalar as well. So it's scalar to scalar, x to y, which is great because I can just plot it. And we're doing backpropagation on this network. So here's a picture of the gradient as I move along the x-axis: as I move the input value from 0 to 1, say, how does the gradient change? The gradient I would calculate at the first layer.
And it steps along; it's a Manhattan skyline. It's got flat bits; that's because of those ReLU functions. But look at the overall shape: it jumps around. It's sort of a random walk, right? Below it is an actual random walk, otherwise known as brown noise. A random walk is someone staggering down the road, right? Left, right... oh, God. So one of the things we've shown is that in the limit of an infinite number of hidden units, this is a random walk. These are the same distribution.

Okay, that's two layers. Here's what 20 layers looks like, with this batch normalisation thing which solves the whole exploding gradient problem. This is a 20-layer neural net. That's white noise: totally uncorrelated white noise. So this is what we mean by "shattered": the gradient is being shattered into tiny, tiny pieces that are completely unrelated to their neighbours. And you can start to imagine what that would do to a learner. How can you learn when the gradient information, which is the learning signal, looks like that? You can't. Here's what a resnet looks like: it's somewhere in between. It's not brown noise and it's not white noise. It's got local structure, but it's not just a straight line either.

All right. So here are the intriguing facts. The gradients of a deep feed-forward net look like white noise. Wow, and that's only 20 layers; imagine when you get to 200. And resnets don't shatter: the gradient does not get shattered.

All right, so let's look at some more pictures. What I'm looking at, as I go left to right along one of these black-and-white lines: here's a neuron in a hidden layer, and the question is, is its activation on, I mean, is it outputting a real value other than zero, or is it outputting zero? That's what black and white means. Just don't worry about the coloured ones; look at these hashy pictures there. Okay.
So they all look the same at the two-layer level. Fine. You go to 10 layers, and things are starting to change. Let's go all the way to 50 layers. At 50 layers, what's happening? We should be able to understand this. In the plain feed-forward network, pretty much all the neurons are either on all the time or off all the time. That's what the histograms are showing in blue there. Neurons are either always active or totally inactive. In other words, they're either just being linear, which is not very helpful, or they're dead, outputting zero. Neither of those contributes any computation to the mapping we're trying to learn. So that's no good. In a feed-forward network with batch normalisation, the shattering has broken up the activity: every neuron is pretty much uncorrelated across different points of the input. As I vary the input, any one of these neurons is flicking on and off pretty much randomly. It's not random, it's deterministic, but it looks random. And the resnet is somewhere in between. So that's nice.

We can prove theorems here. We can talk about the covariance between the gradients at different points as we move along in the x-direction. And you can prove that it dies away as one over two to the power of the number of layers: it's exponentially dying away. Covariance of zero means there's no signal left. You have totally white noise, no connection between neighbouring points in the input space, basically, as far as the learner is concerned. It makes the problem unlearnable. Resnets go the other way: the change is slow, going like the square root of the number of layers. Really different characteristics.

So what's going on? Shattering is a real feature of deep learning, and it's a really important thing: it actually stops you learning things. And one way to solve it is to have these skip connections.
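You can watch the shattering happen in a few lines of numpy. This is a rough re-creation of the flavour of the experiment, not the paper's exact setup: random scalar-to-scalar ReLU nets with batch-normalised layers, the gradient with respect to the input evaluated along a grid, and the correlation between the gradient at neighbouring inputs. The widths, biases, and the 1/sqrt(depth) rescaling of the residual branches are my choices for illustration.

```python
import numpy as np

WIDTH = 100
XS = np.linspace(0.1, 1.0, 200)   # sweep the scalar input along a grid

def gradient_curve(depth, skip, seed):
    """d(output)/d(input) of a random scalar->scalar ReLU net, along XS."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(0.0, 1.0, WIDTH)
    b_in = rng.normal(0.0, 0.5, WIDTH)
    Ws = [rng.normal(0.0, np.sqrt(2.0 / WIDTH), (WIDTH, WIDTH))
          for _ in range(depth)]
    w_out = rng.normal(0.0, np.sqrt(1.0 / WIDTH), WIDTH)
    beta = 1.0 / np.sqrt(depth) if skip else 1.0  # tame the resnet's variance

    # Forward pass for the whole grid at once, remembering each ReLU mask.
    PRE0 = np.outer(XS, W_in) + b_in              # (grid points, WIDTH)
    M0 = (PRE0 > 0).astype(float)
    H = M0 * PRE0
    layers = []
    for W in Ws:
        PRE = H @ W.T
        sd = PRE.std(0) + 1e-8
        PREhat = (PRE - PRE.mean(0)) / sd         # batch normalisation
        M = (PREhat > 0).astype(float)
        layers.append((W, M, sd))
        H = beta * M * PREhat + (H if skip else 0.0)  # identity skip: resnet

    # Backward pass (treating the batch-norm statistics as constants).
    G = np.tile(w_out, (len(XS), 1))
    for W, M, sd in reversed(layers):
        G = beta * ((M * G / sd) @ W) + (G if skip else 0.0)
    return ((G * M0) * W_in).sum(1)               # gradient at each grid point

def neighbour_corr(g):
    # How similar is the gradient at x to the gradient right next door?
    return float(np.corrcoef(g[:-1], g[1:])[0, 1])

shallow = neighbour_corr(gradient_curve(2, skip=False, seed=0))
deep = neighbour_corr(gradient_curve(24, skip=False, seed=0))
resnet = neighbour_corr(gradient_curve(24, skip=True, seed=0))
print(f"2 layers: {shallow:.2f}  24 layers: {deep:.2f}  24-layer resnet: {resnet:.2f}")
```

The shallow net's gradient curve stays brown-noise-like, with neighbouring gradients strongly correlated; stacking layers without skips drives that correlation down towards white noise, while the skip connections preserve much more of it.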
So that's really what the skip connections are doing. Okay. With this insight there was a nice payoff immediately: a new initialisation. Understanding something provokes further ideas. So we came up with a new initialisation, with a student, Lennox Leary, which enables us to train really deep networks, up to 200 layers and hundreds of hidden units per layer. So, really deep, without any skip connections at all. That's the first time anyone's trained a network that deep without using skip connections. Skip connections are essentially shortcuts; they reduce the effective depth. Okay. So I like this. We attracted a Reddit thread, which is awesome, and one of the comments on it sums things up really nicely: we really don't understand how to train neural nets, do we? It's such a basic thing, a fundamental aspect, and it's just coming to light really.

Okay. I want to switch tack to the other flavour of neural nets. We've been talking about deep neural nets as feed-forward enterprises: they go from input to output. But people are really interested in modelling structure in time, modelling sequences. And you can do that with neural nets that are very similar, with one difference. The idea is that we allow the hidden layer to feed back on itself. That makes a first-order Markov system out of the device, if you like. So we're still mapping inputs to outputs, but the hidden layers (I should say there can be more than one) are receiving two kinds of input. They receive input from the actual pattern that's arriving at this moment in time, but they also receive a copy of themselves from a moment ago. In that way they're able to capture structure in time. In principle they could learn to use their weights to model temporal processes. It's not a done deal that this will work, but that's the idea: a sort of minimal model of a system that could model sequences.
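That minimal idea, a hidden layer fed by the current input plus a copy of itself from a moment ago, fits in a dozen lines of numpy. All the sizes and weights here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n_in, n_hid, n_out = 3, 5, 2
W_xh = rng.normal(0.0, 0.5, (n_in, n_hid))   # input  -> hidden
W_hh = rng.normal(0.0, 0.5, (n_hid, n_hid))  # hidden -> hidden: the feedback loop
W_hy = rng.normal(0.0, 0.5, (n_hid, n_out))  # hidden -> output

def run(sequence):
    h = np.zeros(n_hid)                      # hidden state starts empty
    outputs = []
    for x in sequence:
        # Two kinds of input: the pattern arriving right now, and a copy
        # of the hidden layer from a moment ago.
        h = np.tanh(x @ W_xh + h @ W_hh)
        outputs.append(h @ W_hy)
    return np.array(outputs)

seq = rng.normal(0.0, 1.0, (4, n_in))        # a length-4 sequence of inputs
ys = run(seq)
print(ys.shape)                              # one output vector per time step

# The same three weights matrices are reused at every time step --
# unrolled over time, this is a deep net whose layers share weights.
```

Feeding the same vectors in a different order gives different outputs, which is the whole point: the hidden state carries history.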
So you can think of it like that, or you can unroll it over time. I think this is a good way to think of it. That's just the same picture unrolled over time, and you can see that it's like a deep net. If there's an error here and I try to propagate it all the way back to change that weights matrix, it's got to propagate through a whole lot of weights matrices. It's a deep learning problem. We're only unrolling here: these weights are all copies, and the A matrix, or whatever is going on in the A box, is the same at all times.

Alright. It has been pointed out that this really generalises the family of things you can learn. So here's the old input-goes-to-output; sorry, it's the other way around now, input at the top, output at the bottom. That's one-to-one, the standard old feed-forward network. But with this temporal process unrolled, we can imagine a one-to-many mapping: say a single input, a word, making a sequence. Here's the opposite: you could take a sequence and map it to a single value, such as classifying a sequence (who is the speaker, or what are they saying). Many-to-many is also possible: this is what's going on in Google Translate, which takes in a sequence and then, having had the whole sequence, has to generate a new sequence. And the most general one, I guess, is many-to-many with a streaming input coming along (that's the red squares, all through time) where we have to produce outputs all through time as well. Alright, so lots of possibilities, lots of scope for fun.

Here's one rendition of the fundamental element inside one of these recurrent nets. It looks more complicated than it probably is. Think of two paths through this. One is that the input goes to the output, and in this picture it's going through two layers of neural net, so that's a two-layer neural net going from input
to output, and this is going to be replicated over time; there's a whole bunch of these strung together, if you like. The other path through here goes left to right, and that's the hidden state, which we might as well call a memory now, because it's being propagated over time. The memory is able to pass from memory at time t to memory at time t+1, being changed along the way, and it's changed in the light of the current input. So you take the memory, you take the input, that changes the memory; and here's the memory influencing how input maps to output. It's the bare minimum.

Okay, this has a bunch of problems. It seems like a sensible way to start; unfortunately it doesn't really work very well. One problem is that it has this exploding and vanishing gradients problem, but there's more to it than that. Very early on, in the 1990s, an alternative architecture was proposed, called the long short-term memory, or LSTM, and here's a picture of it. The interesting thing is that it's pretty much inscrutable, but incredibly successful. So this is my second "missing": why is this thing working so well? The LSTM looks like this, and I'm not even going to try to tell the story. If you spend enough time with someone who really knows their stuff, you can start to believe that bits of this thing make sense, but that's not the same as saying it's the answer. There's something really weird about how successful it is. It's been such a long time: 1997, so it's 20 years old. Incredible. This thing is still the state of the art in recurrent neural nets; varieties of it, slight tweaks admittedly, but basically this idea. Google's translation engine runs on a whole bunch of these; so does state-of-the-art speech recognition, all kinds of cool tech that's either in your browser or on your phone, or coming soon, built out of these things.

Let's just look at some examples. So here's a language model. I feed one of these
recurrent neural nets everything Shakespeare ever wrote. Let's just jump back to that picture back there; okay, more like that one. I'm feeding it individual characters in this scenario, you know, C, A, T... that's not a good example... B, A, S... and the network is trying to predict the next letter. It's got a little sliding window, and it predicts the next letter. Fine: that's a predictor. And then, just for kicks, I run the model, and when it says "K" for the next prediction, I just write K in. So now it's able to write text forever.

Okay, so this is making up Shakespeare. And if you have a look at it, I don't know, it's not that bad. It's a bit hard to see the bigger picture, but it's pretty good: it's making words, it's making spaces, grammatical sentences, and look, it has the feel, right? Lots of fun. This one is trained on a few hundred essays by someone called Paul Graham, founder of Yahoo Store. You can tell, just by looking at the fake text coming out of this generator, the sorts of things that Paul Graham writes. Huge fun. Oh, if someone wants to ask me a question later, I can show you some fake maths; it's really fantastic. They got the whole LaTeX source of a big book and trained it up, and then they ran the model and got a whole bunch of LaTeX that actually compiled, which means you can make a PDF out of it and look at the proofs. They're junk proofs. And you've probably seen DeepDrumpf: this is a Twitter account trained on all Donald Trump's tweets, and it just generates tweets all day long, you know, stuff like this. They're pretty good, actually. Fantastic.

Okay, alright. What other things can we do? Heaps; that's just the start, that's just language stuff. You can now take an image and push it out to a sentence. So we take the representation, the low-level representation of the image, and ask one of these recurrent neural nets to make plausible sentences from it. Incredibly, this works. You
wouldn't think it would work, but it works pretty well. "A man in a black shirt playing guitar": it's true. These are slightly cherry-picked examples, but it's pretty effective, and getting better all the time. And it's the whole system end to end: it's taking a raw image and generating the text at the far end, and it's all just neural nets all the way along, trained from lots of examples. There's nothing else in there.

Okay, sentence to sentence. This is just a picture to remind myself to say that what's inside Google's Translate technology now is just a multi-layered LSTM network. That's what's inside it. Up until last year they had all kinds of clever linguistic machinery in there, developed painstakingly over years and years and years, a huge thing. And then last year they swapped it out, because this worked better. And this is just one end-to-end thing: a whole bunch of those recurrent neural nets.

Alright, the one on the bottom right is taking a sentence and making an image out of it. It is extraordinary that this works. Early days as well, but if you can see them: this has seen a lot of pictures of birds, and then it's been given a sentence, "this small bird has a pink breast and crown and black primaries and secondaries", and it makes some images. How long has it been doing this? This is last year, I think; either early this year or last year. Yeah.

Okay, so: text to handwriting. Let me show you this; this is fun. Someone needs to give me a sentence. It's taking text; this really came from someone trying to build a model to recognise handwriting, but run in reverse. I'll scroll that down. Can you see, at the bottom?
Someone needs to give me a sentence... Different samples; you can mess with them all you like. Have a go, it's a lot of fun. So that's actually generated: the text comes in, and then a recurrent neural net (the way this is actually done, I believe) is directing the drawing motion, drawing that sentence from scratch. It's controlling, if you like, a virtual hand.

Oh, and here's one from just the other day, last week I think. Same idea as the Shakespeare one: it predicts the next frame from the last few frames, and we just take that frame and go with it. You can tell how this was trained: it's trained on footage from a train. This must be completely new countryside. It's not perfect, but it's pretty amazing. Let's go to the town. How was it kicked off? I'm not sure. The Shakespeare one you can kick off with a blank space, and away it goes. So maybe this one is kicked off with a picture, but maybe it just starts making stuff up; I don't know. Where's that going? Like I say, we're going to start mistrusting our eyes. The world's going to get interesting.

So here's something that's missing, or I feel it's missing, at least: what's so great about LSTMs? They're almost impenetrable, in the sense that it's very hard to tell a story in a pub about how the LSTM works, a simple three-point explanation. It's really hard. You can pick out bits ("well, this part is kind of doing such-and-such") but there's not that very simple story I was able to tell about the basic recurrent network: two things going on, inputs go to outputs through a neural net, mediated by the memory, and memories go to new memories, mediated by the latest input. It's hard to tell that kind of story for the LSTM. So you have to wonder: how special is the LSTM? How unique is it? There's an interesting story here. Two teams ran very large trials of alternative architectures. They used genetic programming, which is essentially random search, to try out all kinds of variations.
So this LSTM: I've drawn it as a diagram, but it's really just five equations. So you can make up equations randomly and try them out on a big machine. They trained on the Googleplex for the equivalent of, I think, about a century of CPU time, and found the LSTM, and a few other things. The GRU, if you know this stuff, is the other alternative. But not much else, right? So basically this is some kind of special spot, it seems, in the space they were searching, at least.

Okay. So one idea is to do genetic programming for 100 years of CPU time, and another idea is just to think about it. We've been trying to do that. This is work with Paul Matthews, a wonderful honours student; that was last year, and this year we're pushing on with it. Essentially we're trying to think of a minimalist model that's better than the old basic recurrent neural net but is understandable, unlike the LSTM. And we're using tensors for this. So there's a mapping from input, the same story, right? The inputs map straight to outputs; basically, inputs go to outputs through a one-layer neural net. It's just that we want that mapping to be mediated by what's in the memory, so we involve the memory by taking an outer product. Another way to say this: it's a tensor mapping. A three-way tensor takes the input and the memory in, and gives you a new output; otherwise known as a bilinear form, whereas a straight matrix is just a linear form. So that's that story. And memory is going to new memory; how's that happening?
Well, it's just going straight through, really. There's a little switch that says: either you send this information straight through, or you take the other path and take something from the input. It's an either-or thing; not binary, it's in zero to one, a mixing between what's in the memory and what's coming directly from the input. It's as simple as that. So how is the decision made about which one to keep? It's a bit like a read-write head, right? You can think of it in that simple way: it's saying, should I keep what's there, or should I overwrite it with a new bit? And how do you decide which things should be overwritten or not? We figure that's a job for both the current input and the memory state; it has to be a combination of both. So let's do the same thing: take a tensor product of the current memory and the input, put it through a nonlinearity, and that's your switch. So it's a longer story, but it is a story.

This thing has some nice properties. The gradients don't explode or vanish, because basically the memory propagates straight through, and we're just swapping stuff into or out of it; that has the effect that the gradients don't explode. We're also avoiding this concatenation of input and hidden state, which always happens in the other models: you push inputs and hiddens together into one space, where they don't really want to be. They're different entities; it's almost like a type error to do that. So we avoid it, and use a tensor instead of concatenation.

You know it's a machine learning talk when everyone's frowning and there's a graph like this. I guess lots of talks have graphs like this, but in machine learning you've always got one. So there's the loss function: training time going left to right, and the training loss (that's the amount of error, the amount of badness) going away. And as with most talks, the badness goes away on the stuff we did and stays
high on the stuff other people did so the top two lines are the LSTM actually and the GRU which was one of those other random ones found that it was as good as the LSTM so ask me about this task if you want to know later but I'll there are tasks very straightforward simple memory type tasks for which this new architecture seems pretty promising it's early days we're still chasing this around but there are tasks here's an interesting task which we also happen to do well on it's the task of well okay so here's the input pattern vertically happening over time okay what happens is pay attention to that that very last couple of rows if when this neuron comes on I want you to I want the network to look at these other bits so these ones over here and remember them and to spit them back on the output when that neuron goes off again so the last of the inputs is a sort of trigger and it says when it comes on remember something and here we're doing that with three of them right there's three memories going in store that store that store the third thing which overlaps actually and then cough it up at the end and it's not perfect this is our model trained on gazillions of these examples but it's having a good go so the first pattern comes out pretty perfectly second one's close everyone's actually pretty good not quite there yet so work to do but that's a that's to be compared with the best this is the very best of the competition where we try to store three memories we get three things out and they come out at the right times when these neurons here turn off you get a pattern back but it's the same pattern so this is the LSTM the state of the art inside in your phone doing all these awesome things it can't do this kind of task it's just a conceptual there's something it finds this very hard to do we would argue our architecture finds this really natural thing to do and it's very like variable binding so the trigger comes along you remember some arbitrary thing hang on to it 
and then when I release the trigger you've got to cough it up at the output. I guess what's particularly appealing is that it feels like a foundation for other processes: any process that needs to bind one pattern to another, use one like a key, perhaps manipulate the key in some fashion, and then get back the original bound item. I didn't say that well, but I guess we're optimistic, and feel that this could be the beginning of something interesting, a bridge into a more symbolic realm for neural nets. But it's still your standard neural net; it's just got magic tensors in it. Here's an even tougher task: when this bit comes on, you have to remember the bit pattern that immediately follows it on the other nodes, and then when the neuron goes off you have to cough it up again. Same task, but now there's all this distractor stuff in the middle. That's what the hidden representation looks like, and that's the output after training. If you look carefully you can see that that pattern there is the same as that pattern there: one, two, three... oh yes, it is. Great, so it's learned to do that. We never told it; we just trained it on lots of examples in which that's the pattern, and this is a new example. All right, now quickly, some more missing things. One thing I said was missing was why the state of the art in feed-forward nets is actually the state of the art: why ResNets, and what are they actually doing? I argued that they're solving the shattered gradient problem: we said there is a shattered gradient problem, and they're solving it. Then I talked about the missing explanation for why it's the LSTM. There I don't have an answer, because we haven't found the better thing, but I pointed to a direction for thinking about the better thing. I think thinking about the problem and trying to gain insight, building something that's interpretable, is a better way to go than randomly searching. Here are some other questions to throw out there.
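The gated memory update described above (a switch in the range zero to one, itself computed from a tensor product of input and memory, mixing the old memory with freshly written input) can be sketched in a few lines. This is a minimal illustration with made-up shapes and parameter names, not the actual model from the talk:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and parameters (assumptions, not the real model).
d_in, d_mem = 8, 16
rng = np.random.default_rng(1)
G = rng.standard_normal((d_mem, d_in, d_mem)) * 0.1  # tensor deciding the switch
P = rng.standard_normal((d_mem, d_in)) * 0.1         # projects input into memory space

def memory_step(x, m):
    # Switch s in (0,1): a joint decision by the current input and the memory,
    # made via a three-way tensor contraction pushed through a nonlinearity.
    s = sigmoid(np.einsum('kij,i,j->k', G, x, m))
    # Mix: keep what's in memory (s near 1) or overwrite with input (s near 0).
    # The old memory passes straight through, only scaled, so gradients along
    # this path neither explode nor vanish.
    return s * m + (1.0 - s) * (P @ x)

x = rng.standard_normal(d_in)
m = rng.standard_normal(d_mem)
m_new = memory_step(x, m)
print(m_new.shape)  # (16,)
```

Note the contrast with concatenation-based cells: input and memory never get pushed into one shared vector; they meet only through the tensor contraction.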
Missing things that I'm not going to answer, and they're very much missing. Why are deep nets better than shallow ones? This is still up for grabs, actually. There's a proof, just this year, or I think it was last year, but it relates to two versus three layers and doesn't generalise. So: why this architecture at all? Why neural nets at all? What's so special about getting mushed through a weights matrix, then a nonlinearity, then back into a weights matrix; why does doing that, and iterating it, work so well? Why not something else? Something that is coming, I think, is various ways of making forays into more symbolic-style processing, but within this connectionist, neural-net world, and I've hinted that our tensor-gate idea might be one direction to take. Lots of other people are doing related things: for example, having external memory, a neural net able to read and write from an external memory, that sort of thing, or neural nets able to implement programs; that's a hot area, lots of fun. And my pet passion, I guess, is trying to understand generative models for really complex, multi-causal data. Imagine you have some data; my go-to example is a face. What you're seeing, the pixels appearing on your retina, is a combination of two completely independent things: one is the physical structure of my face (we're 3D), and the other is the lighting environment. Those two things are completely independent, in fact, but they interact, and you only get to see the interactions. You never actually get to see a naked lighting environment, you only see lighting environments when they hit things, and you never get to directly apprehend 3D structure, you always have it in the context of a lighting environment or some other environment. So there are pure things out there in the world, but they get mixed, and we see only the mixtures. Can we build a system which is able to go the other way, and, having seen a lot of data, disentangle
that to discover the underlying causes? That's a big, tough problem, and unsolved. Many, many more things to come; there's a lot missing. Missing is a huge country. Thank you very much.

Thank you. Now for your questions.

Kurzweil spoke about how to create a mind; do you agree with his...? Sorry, who? He's written a book, last year, called How to Create a Mind. He discusses the neurons in the brain, and that there are 12 layers, etc. I haven't read the book, so I don't think I can comment on that. Interesting: the announcement at the Google conference today was that the future is to do with teams. Ooh. No, that's all right, someone will get there. I mean, for me, I have a skepticism about the idea that if we study individual brains enough, we'll understand entirely how they work. I think that has to be coupled with computational thinking: how do you achieve this kind of computation? It's a bit like the analogy with flight: to understand flight, you can look at birds, and that can help, but eventually you have to do some aerodynamics.

Sorry, yes: you said it's a good thing that you don't concatenate the inputs and the previous hidden state; can you elaborate? OK, so I would direct you to David Balduzzi's paper; "strongly typed neural nets" is the key phrase. His point is this. There's a mapping from the hidden state to the new hidden state, achieved by this weights matrix. What's the weights matrix actually doing? It's rotating and skewing directions; there's a sort of basis associated with that transformation, and it's a different basis from the one associated with the input pattern. The interaction between the input and the hidden layer lives in a separate basis. So why would you expect the hidden units' dynamics to be naturally
altered by additive changes coming in from the input? It's a very limiting way to do things. Have a look at David's paper about strong typing.

Yes, the sentences-to-images work: can you explain it? Are the images new? So, in theory they're supposed to be brand new, because the system is projecting from the space of the sentence down to a much more modest, low-dimensional latent space, and from that space generating the image. So in theory these are new images; in practice, have a read of the paper, they're struggling. This is from last year. And actually, look at GANs, generative adversarial networks, another kind of generative model: they do seem to suffer from a problem called mode collapse, where the generative model learns to produce only a very narrow range of, what did we have, birds? All rather similar. So that was a problem. It seems to have been solved to some extent recently, but convincing demonstrations of that are still to come. But no, the whole idea is that they're not supposed to be just stuff you've seen; after all, that would be kind of boring. So most of these papers, what they do, and should do, is test that: they're generating new things, and they should look at those new things.

Just a weird question that came up in a discussion with a friend of mine from a medical background a few months ago. Ignoring any ethical issues, is it within technology to cut through the optic nerve of an animal and attach it to some device that can get the signal out of it? Ah, not my field of expertise. Can you chop out the eyes and stick them onto the brain or something else? Not currently doable, and not something I'm an expert on.

What was the question here? Ah, the mechanism. Hmm. I know, it's annoying, isn't it? It's hugely annoying that
there's all this progress and cool stuff, and it's all built with backpropagation, and backpropagation is biologically implausible. So that's a damn shame for the idea that you might think about computation and make your way back towards brains and understand ourselves better; a bit of a blow. Various people have tried to think of alternatives, but generally the efficiency goes down. I think it's still a live question. All this stuff working so nicely brings the pressure on: how is it done by real neurons, then? It's still an open question.

Yes: I understand that you have inputs, you have a set of weights which you can tweak, and you tweak the weights so as to get the outputs to match some desired output. But what is the point of dividing the weights into layers? What do you get? I think we're back to the same thing. You can tell a story about very shallow nets like this. Imagine a single layer: it's not going to be up to much, because mathematically what each neuron is doing is drawing a hyperplane in the input space. It's going to be on for this side and off for that side, and hyperplanes can't do much. Then you can draw hyperplanes into that space, so a two-layer architecture can make shapes that are much more interesting than hyperplanes. There's a proof that if you make just a two-layer architecture wide enough, with enough hidden nodes, it can approximate essentially anything; what extra depth adds is not well understood. As for how important it is to have it in layers: probably not that important, if I understand the question. These ResNets are really a break with that strict layer-wise thinking. They have these skip connections, so you can think of multiple paths through the network, through many layers, some of which jump over whole layers. Is it because a big complex image, a face, has got to have eyes and a mouth, which have separate images, and then they
probably get put together? Yes, well, that's a good argument, for visual examples at least.

Last question. I look at the sequence of slides and there's a kind of trend that I see. I'm going to make a statement and you can tell me what you think. I see that you have this deep neural net and you've applied it to a variety of problems, but the cases where it doesn't do well, like your Shakespeare example or your other example, are things where it has to somehow represent what some people call knowledge. Shakespeare is not just about words. It's cool to see that it's pure Shakespeare in its words, but it's not making sense, and the same for Trump, simply because this deep net is not well suited to the kind of deep cognitive skills that humans are very good at. This is just connectionist stuff; it's not capturing, say, how humans reason at multiple levels, in multiple steps, which is why on certain kinds of problems it doesn't work.

OK, and what's the question? I'm going to agree with you. You're pointing out that for the language example, it would be nearly silly to assume the deep net would do well, because that's the wrong task to apply a deep network to. It's a demonstration that the neural net captures correlation structure in temporal data, and I would say it's a step towards more than that. But there are things missing, right? For example, there are symbolic-style manipulations that humans achieve somehow, and we don't do them with a CPU and registers; we do them with neurons. So these nets are not there yet, and there's lots to be figured out. It's not just: make it big enough, train it on enough data, and it'll become conscious and actually make sense. As you point out, the Shakespeare didn't make a lot of sense, but I don't think that's the point. Sorry, I didn't catch the
end of that. Perhaps, yes. So currently the memory in these neural nets, the knowledge, is all being stored in the weight values, and, although I didn't talk about it, there's a lot of interest in systems where the knowledge also lives in something the net can read and write: some kind of external memory, an associative memory. It's a hot topic, and it's yielding interesting things.

OK, I think I'll draw to a close now. Marcus can probably still answer a few questions after we thank him. So thank you very much for welcoming us to your country. It does look very big, and it does seem like there's so much missing, but you've certainly shed light on some things that are happening in the area. Thank you very much.

We've just got one announcement, a very brief thing. I'm on the committee of the Auckland branch of IT Professionals New Zealand; ITP is a co-sponsor of the Gibbons lecture series. So thank you, Associate Professor Frean, for your talk tonight. I'll quickly mention that there are two events coming up next week that you can register for on the ITP website. One is of course the fourth Gibbons lecture, on the Ethics of AI; that's next Thursday. And next Tuesday we are having the ITP event, which is also free, down at the Onwindyo Wharf. If you're interested, it's a discussion on the future of money, and we've got representatives from the banks and Xero talking about issues including AI and how AI is affecting, and is going to impact, the financial system. You can register for that also on the ITP website.