So it's a pleasure for me to pass the baton to Yoshua Bengio. You heard me make one brief mention of the Havado program and artificial intelligence. Yoshua is one of the triumvirate of leaders who relaunched deep learning, which was quite popular as a neural network idea in the 80s and 90s; he, Yann LeCun, and Geoff Hinton relaunched it in the 2000s, and it has become basically a monster, which is taking over everything that we do. So we're going to hear more about the applications of deep learning, specifically in the context of neuroscience. Yoshua, I should have said, is from the University of Montreal, and he is going to talk to us about bridging the gaps between brains, cognition, and deep learning. Yoshua.

Thank you. I'm not going to tell you much about applications of deep learning, but more about the connections between deep learning and neuroscience and cognition, or at least some of them, since I only have half an hour. Just to get off the ground: deep learning is an endeavor that's part of research in AI and machine learning. The central goal of AI is to put knowledge into computers, but what has prevented that from happening much earlier is that a lot of the knowledge we have, that we would like computers to have, isn't knowledge we can communicate directly, because it's not consciously accessible. And so computers have to learn from data in order to succeed, and deep learning has been an approach to learning for computers that has been immensely successful, mostly in the area of perception, but to some extent also in everything having to do with language, like machine translation, playing games, driving cars, translating, as in this example, from images to text, all kinds of things.

Let me step back a little bit to one really important thread in neural network research, which started in the 80s and is called connectionism. I've tried to summarize connectionism in one sentence.
It's a pretty long sentence, which should be broken down into pieces. What it's about, and this is of course still at the heart of what deep learning is today, is iteratively training distributed representations. This notion of distributed representation is the idea that we represent information through a pattern of activation, which of course is natural for the brain, through a composition of neurally inspired simple operations, and then, and this part is really important, toward a justifiable training objective. So there's this notion that the learner is optimizing something that's well-defined and makes sense, and that training objective forces the learner to capture the relevant statistical structure of the data.

I have another talk, which I'm not going to give, that goes through a list of what I found to be ideas from neuroscience and cognition that have influenced machine learning, and especially deep learning, over the years. Of course neural networks themselves, plasticity and learning, the distributed representations I mentioned. A lot about the architecture of neural nets, especially convolutional nets, is inspired by the visual cortex. The idea of depth, in other words having multiple levels of representation, is also inspired by what we know about the cortex. The kinds of nonlinearities inspired by the brain, particularly the rectified linear units, which have dominated the field in the last five or six years, were inspired by neuroscience. Spikes: it looks like we don't use spikes in deep learning, but actually there are things like dropout, where you inject binary noise into the system, and also systems that quantize the activations of neural nets, which have some resemblance to spikes.
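The connectionist recipe in that long sentence, a distributed pattern of activity built from simple neuron-like operations and trained iteratively against a well-defined objective, can be illustrated with a minimal sketch. Everything here (the XOR task, the layer sizes, the hyperparameters) is an illustrative choice, not something from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, a task a single linear unit cannot solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Distributed representation: a pattern of activity across 8 hidden units,
# built from simple neuron-like operations (weighted sums + nonlinearities).
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward: compose the simple operations into a prediction.
    h = np.tanh(X @ W1 + b1)          # distributed hidden representation
    p = sigmoid(h @ W2 + b2)          # prediction
    # Justifiable training objective: cross-entropy prediction error.
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Iterative training: follow the gradient of that objective.
    dp = (p - y) / len(X)
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

h = np.tanh(X @ W1 + b1)
pred = (sigmoid(h @ W2 + b2) > 0.5).astype(float)
```

The point of the sketch is the shape of the recipe: no single unit encodes the answer, but the pattern of hidden activity does, and the whole pattern is shaped by the one well-defined objective.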
Curriculum learning is something more on the cognitive side: training not from independent examples but through a sequence of examples that are gradually more difficult, somewhat like a teacher would do with a child. Cultural evolution and distributed training, how multiple agents can learn together and help each other. Notions from psychology of affordances; options in reinforcement learning; notions of exploration, also from reinforcement learning; and notions of controllable factors, the relationship between representations and what an agent can do in the world, what it can control. The notion of attention, which has really become a central tool in deep learning in the last few years. The notion of lateral connections, which comes up in what's called softmax and in clustering; notions of attractors; notions of associative memories, which I will probably mention at the end, since there's more and more research connecting not just a standard neural net, which you can think of as like cortex, but also things that act more like memory. And then of course notions that connect to classical AI, the sort of system-two computations that involve reasoning, planning, and consciousness. All of these things are happening in deep learning and of course have ties to the brain sciences.

There's an underlying assumption behind a lot of these connections, one that has motivated me for the last few decades: that there might be a few simple principles that explain both human and animal intelligence, and that we could use to build intelligent machines. It's not clear, of course, that this hypothesis is true; maybe the brain is just a huge bag of tricks. So it is a hypothesis, but if it's true, then the consequences, both for our ability to understand the big picture of the brain and for our ability to build intelligent machines, could be immense.
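The curriculum idea mentioned above, training through a sequence of gradually harder examples, can be sketched as a data schedule. The difficulty proxy (distance from the origin) and the random-feature model below are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: regress y = sin(x). As a hypothetical difficulty proxy, treat
# examples far from the origin as "harder".
x = rng.uniform(-3.0, 3.0, 200)
y = np.sin(x)
order = np.argsort(np.abs(x))             # easy -> hard

# Curriculum: reveal the data in gradually growing stages, easy ones first,
# like a teacher would with a child.
stages = [order[:50], order[:100], order[:200]]

# Simple model: linear regression on fixed random tanh features.
P = rng.normal(0.0, 1.0, 16)
def features(v):
    return np.tanh(np.outer(v, P))

w = np.zeros(16)
for stage in stages:
    Xs, ys = features(x[stage]), y[stage]
    for _ in range(1000):                 # gradient descent on squared error
        w -= 0.05 * Xs.T @ (Xs @ w - ys) / len(ys)

mse = np.mean((features(x) @ w - y) ** 2)
```

In this convex toy problem the curriculum only changes the order of exposure; the claim behind curriculum learning is that in harder, non-convex settings this easy-to-hard ordering shapes which solutions the learner finds.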
Let me skip these things and go back to the connection between brains and deep learning. If we're trying to understand the brain, there are of course many aspects to it, but one aspect which really would explain a lot is how it learns. You could imagine that the learning mechanisms themselves are fairly simple, in the sense that they could be described by maybe a few equations, but that the consequences of these learning mechanisms could be huge in terms of the functions the brain can perform. When you think this way, you start asking not what particular neurons are doing, but how they learn, what this learning is optimizing, if it is optimizing something, and what kinds of architecture and structure in the circuits make it easier or harder to learn different kinds of things. These high-level explanations are the kinds of questions deep learning researchers are thinking about, and I think it could be useful for neuroscience to start thinking in these terms as well.
If you consider the two extremes, the low-level computation typically studied by neuroscientists on one hand, and the high-level questions about computation in the brain that cognitive scientists study on the other, what's interesting about neural network research is that it can span both of these levels. As I said, inspiration comes from both of those sides, and of course it could flow the other way as well.

I'm now going to focus on some work we've been doing on the brain-implementation side of neural networks, in particular the question: is the brain doing something similar to backprop? Backprop is the workhorse of the success of deep learning. It is just a mechanism to do credit assignment for a large network of neurons interconnected in complicated, arbitrary ways: to figure out how each of them can change a little bit so that some overall objective is gradually improved. For many decades the dominant thinking was that there's no way the brain could do something like backprop, for all kinds of reasons. But in the last few years we've seen a flurry of papers suggesting implementations in the brain that are at least much more plausible, ways to estimate gradients in a style similar to backprop without being exactly the same. One question, if we want to implement something like backprop in the brain, is how we would encode the error signals that are normally propagated through these deep artificial networks. A major obstacle to coming up with a biologically plausible analog of backprop is that we would like the computation that back-propagates those gradients to also be done by neurons; we don't want a separate, biologically implausible kind of computation for the estimation of the gradient. There's a hypothesis, which has been
proposed already in the late 80s, which our work builds on, about how these error gradients could be represented in the brain: the hypothesis is that error gradients are represented by temporal derivatives of activations. One way to think about it is that the firing rates of neurons encode the computation being performed, but those neurons also receive feedback, from potentially various sources, that steers them towards slightly better configurations. So there is a change in the activity, and that small change would encode the error gradients. We don't know whether that's true, of course, but it's an interesting hypothesis to explore, both in theory and, eventually, as we're starting to do, through experiments on animals.

Now I'm going to tell you about a particular algorithm that exploits this hypothesis, which we introduced last year, called equilibrium propagation. It's analogous to back-propagation, with two phases of computation, which I think we can eventually merge into one, but for now think of it as two phases. It's similar to what I mentioned before. In back-propagation there's a forward phase, in which the network takes inputs and produces predictions, or expected rewards, or whatever it is we care about. In equilibrium propagation this is implemented by a relaxation phase, a free phase, where the network, which now has dynamics, with feedback connections, feedforward connections, and lateral connections, is influenced by the input and converges dynamically to some configuration. This is the equivalent of the forward pass. Then there is a backward pass, and this is the place where it's not obvious how brains could implement the equivalent of the backward pass we find in back-propagation. The idea in equilibrium propagation is that the outputs that are producing predictions
are going to receive signals when they make mistakes, and those signals will nudge them towards better values, in the sense of a lower prediction error. Because of the feedback connections, these nudges will propagate through the whole network, again through the same dynamics, and the network will converge to a slightly different state. We've actually made this story formal: we can prove that if you run these two phases and they converge, and you then look at what are called the sufficient statistics, things like Hebbian pre-times-post products of firing rates, and take the difference between the two phases, something fairly similar to a lot of previously proposed learning rules, you can actually estimate the true gradient of the prediction error.

Now, there are several issues left with this approach. One of them is that, for a realistic biological implementation, we would like the computation of the prediction to happen very quickly. We already know that in about 100 milliseconds your visual cortex can detect objects, so almost a single feedforward pass is sufficient to do a lot of things. How do we make sure the feedback connections don't make convergence to the correct answer too slow, since they will, if you want, interfere with the feedforward connections? We have a scheme that we started working on with Walter Senn and João Sacramento in which we use lateral connections that are trained to cancel the feedback connections. You have pyramidal cells, with feedback connections arriving from downstream layers onto the apical dendrites, and now you have lateral connections that learn to cancel that feedback. When the downstream neurons are behaving in a way that's predictable by the current layer, the current area, the lateral connections can predict the feedback that will come from the
downstream layer, so the feedback is cancelled, the feedback connections don't interfere with the feedforward ones, and you have essentially immediate convergence. However, when the downstream neurons are receiving feedback that contradicts what they have been doing, feedback that is not predictable by the current layer, there will be a mismatch between the actual feedback and the lateral prediction meant to cancel it, and that difference will correspond exactly to what backprop wants to compute: the error the neuron is trying to correct, the gradient of the prediction error or reward with respect to that neuron. I'm only sketching the ideas here, and this can be combined with the ideas I described before to make the dynamics converge faster. There is also other interesting work in which the structure of the pyramidal cell, with the apical dendrites on one hand and the basal dendrites on the other, plays an important role in decoupling the feedforward computation, if you want, from the error computation.

There are other issues we're working on. One problem with the theory as it stands is that it requires the network to have symmetric weights: for a connection from neuron A to neuron B, if there's a weight in the feedforward direction there should be the same weight in the feedback direction, which is not biologically plausible. So we've developed a version of the theory with weaker conditions that doesn't require symmetric weights and would still lead to convergence; that's one thing we're working on, and we have an arXiv paper on it. We're also working with Joel Zylberberg, Blake Richards, and Tim Lillicrap on actual experiments on mice to try to test some of these ideas. It's going to take a while before we can test these theories in full generality, but we are starting
small. Some of these questions can potentially be tested. For example, one hypothesis that could eventually be tested is simply that something like gradient descent is happening: that neurons in the middle of a big chain of computation will change their synapses so that at the next trial, say, their behavior causally leads to a better prediction. I think it's actually possible to test this by collecting statistics about the spiking behavior of neurons that are believed to be changing in the context of a surprise, where, say, the animal doesn't expect what is being observed. A related hypothesis, now connected to the notion that errors are encoded in temporal derivatives, is that if we look not at the next trial but in the tens or hundreds of milliseconds right after a surprise, then in a neuron that we know is changing because of the surprise, we should see its average activity move in the direction corresponding to a better prediction. So these are two different kinds of hypotheses: the first effect, at the next trial, is because synapses have changed, whereas the second is because, even before synapses change, the activity has changed, driven by feedback connections pushing the activity of the neuron towards a better value. The second one really gets closer to these ideas about how error would be encoded in the brain. We could also test the ideas I mentioned about lateral connections cancelling the feedback connections. I'm not sure exactly how that could be done experimentally, but I imagine it would be feasible if we can measure what's going on on the path from the apical dendrite to the soma.

Okay, let me now switch to a related problem. Up to now, the kind of implementation of backprop in the brain that I've told you about really has to do
with something like static computation, something that happens within 100 or 200 milliseconds. But the way we use backprop in deep learning allows us to train systems with dynamics that unroll over much longer durations. These recurrent networks, as we call them, can learn to predict or produce sequences, for speech recognition or machine translation or whatever, and they are trained with backprop as well, but with a form of it we call backprop through time, which requires a form of computation that seems totally implausible: basically, it requires storing all of the steps of computation of the network over time and then replaying them backwards in detail, with gradients being propagated. That sounds completely ludicrous for a brain. So what are the options? It's not clear yet. Humans and animals obviously learn not just when there's instantaneous feedback, like a prediction error, but also over longer durations.

One idea we're exploring is to use an associative memory to do the job. The idea is this: let's say you're driving your car and you hear a pop sound. You notice it, but you just continue driving. Maybe an hour later you stop for gas, and you see that you have a flat. Then you realize: oh, the pop sound was probably something that punctured my tire, and I should have stopped and changed it. What has happened there is that your associative memory is recalling an event from, say, an hour ago, bringing back your mental state, or part of it, as it was then, and now you're able to change synapses so that the sort of behavior, interpretation, and decisions you made an hour ago would be different after that change. We implemented something like this in simulated experiments, where we use a simple
associative memory at the level of the hidden representations of a recurrent net. The recurrent net, in the forward phase, just goes forward over time, but at every moment, when you get some kind of error, it's allowed to recall a few events in the past through the associative memory, which simply matches the current state with past states. For those states that are recalled, you're allowed to do backprop into them. The way you do that is that the associative memory can be thought of as containing a prediction of the future given the past, implemented by a piece of network; we can backprop through that piece of network on the spot, and then backprop into how that past state of the world led to actions or interpretations or whatever. This is an interesting path that connects neuroscience, with memory, and cognitive science, and it uses a form of attention mechanism similar to what we've explored in the past.

To summarize on backprop: it's really the workhorse of the amazing successes of deep learning in recent years, and it would be interesting to see if brains are using something analogous, not exactly the same thing, but something that approximates the computation of gradients in an efficient way. If you look at the typical ideas believed by many neuroscientists, that you can estimate those gradients by some sort of perturbation method, these do not scale to the size of the brain; the amount of noise that arises in those estimates does not scale. So we need a mechanism for efficient credit assignment, and we need to discover how the brain does it. I told you about equilibrium propagation, an approach in which the same circuit can be used both for making predictions and for estimating gradients, and about how an associative memory could potentially avoid the need for backpropagation through time, which is absolutely not biologically plausible.

I just want to mention one last bit of research we're doing, which I call the consciousness prior. This connects more to cognition, but also to attention. The idea is that there are things in the world that can be predicted or explained using what is currently in your attentive consciousness, what you're thinking about right now, and these pieces of information are tiny compared to everything going on in your whole brain. It's a very, very low-dimensional object, which can be used to predict what's going to happen next, for example, involving very few variables. For example, if I hold my glasses, I can drop them, and I can mentally predict that I'm going to be able to catch them, and I can make that prediction with almost perfect certainty. That prediction involves only a few variables, my glasses and where my hands are and the fact that I'm standing up, compared to the full state my brain is registering about the world right now. This ability to describe elements of the world using very few dimensions, very few variables, implies something about the representations the brain is building. The representation is not like pixels, where if I pick a few pixels and try to predict some of them given a few others, it's not going to work; but with the right abstract representations I can do it. If I represent things in terms of the positions of these objects and other abstract quantities like that, I can make very powerful predictions. I call this the consciousness prior because the idea is that the constraint that we must be able to make these kinds of low-dimensional predictions imposes something on the representations being learned, so that they have the sort of abstract nature that we would like
to force onto these types of networks. I realize we're a bit over time, but I wanted you to get a sense of this direction of research.
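The two-phase procedure of equilibrium propagation described earlier can be made concrete with a heavily simplified numerical sketch; this is not the exact published formulation, and the network size, the OR task, and all hyperparameters are illustrative assumptions. A tiny network with symmetric weights settles in a free phase, settles again while the output is weakly nudged toward the target, and the difference of pre-times-post products between the two phases serves as the weight update:

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(u):
    # Hard-sigmoid firing-rate nonlinearity.
    return np.clip(u, 0.0, 1.0)

# Tiny input -> hidden -> output net; W2 is reused on the feedback path,
# i.e. symmetric weights, as the basic theory requires.
n_in, n_hid, n_out = 2, 8, 1
W1 = rng.normal(0.0, 0.2, (n_in, n_hid))
W2 = rng.normal(0.0, 0.2, (n_hid, n_out))

def settle(x, y=None, beta=0.0, steps=40, dt=0.5):
    """Relax the leaky dynamics. beta = 0 is the free (forward) phase;
    beta > 0 weakly nudges the output toward the target y."""
    h, o = np.zeros(n_hid), np.zeros(n_out)
    for _ in range(steps):
        h = h + dt * (-h + rho(x @ W1 + o @ W2.T))   # feedforward + feedback
        nudge = beta * (y - o) if beta > 0.0 else 0.0
        o = o + dt * (-o + rho(h @ W2) + nudge)
    return h, o

def eqprop_step(x, y, lr=0.05, beta=0.5):
    global W1, W2
    h0, o0 = settle(x)                  # free phase
    hb, ob = settle(x, y, beta=beta)    # weakly clamped (nudged) phase
    # Contrastive Hebbian-style update: the difference of pre*post products
    # between the two phases estimates the prediction-error gradient.
    W1 += (lr / beta) * (np.outer(x, hb) - np.outer(x, h0))
    W2 += (lr / beta) * (np.outer(hb, ob) - np.outer(h0, o0))

# Train on a linearly separable toy task (logical OR).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [1.]])

def free_error():
    return np.mean([(settle(x)[1] - y) ** 2 for x, y in zip(X, Y)])

before = free_error()
for _ in range(300):
    for x, y in zip(X, Y):
        eqprop_step(x, y)
after = free_error()
```

The key property the sketch illustrates is that the same circuit and the same dynamics serve both phases: prediction is just relaxation, and the error signal lives in the small change of activity caused by the nudge.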