about order from chaos and computation in neural networks. Jonathan.

Thank you very much, and thank you for inviting me. Like Chingiz, I trained in physics and found my way into theoretical neuroscience. But while Chingiz just advocated for understanding deep learning in order to understand the brain, I am going to offer a different perspective on why, perhaps, we should also look elsewhere, and not only at deep learning. The basic argument, which runs through this talk, is that our brain does more than classify cats and dogs or identify digits (or at least the brains of most of us do). If I try to pinpoint what is different from simple feedforward image classification, it is the basic element of time. When we carry out cognitive tasks and computations, we interact with the world, we get feedback from the world, and our actions take time. In deep feedforward networks, this element of time is lost. Another observation is that the brain is dominated by highly recurrent and feedback connections, which are better suited for computation that unfolds in time.

So the framework I am going to advocate is what we call computation through dynamics. The idea is this: we have some task we want to do, some interaction with the world, and it can be solved by some algorithmic, computational solution. The main assumption is that this algorithmic solution can be implemented as a nonlinear dynamical system. That is basically what we are: leaving quantum mechanics aside, we are all some kind of nonlinear dynamical system. And this nonlinear dynamical system that solves the problem can be implemented as a neural network. So we take a piece of cortical network, we give it some input, possibly a temporal input modulated in time; something happens, and we get some output. From this we can get a variety of cognitive and computational abilities: memory, integration, temporal pattern generation, temporal correlation, Bayesian sampling. And it supports different cognitive processes: evidence accumulation, decision making, language processing, motor planning and control, sensory processing. So this is a very rich framework. What I am going to say today will be quite general and high-level, but you can think of it as temporal pattern generation and motor planning, because that is the easiest and most concrete way to think about this dynamical system.

How do the recurrent neural networks that some people in machine learning work on enter neuroscience? A lot of the time they are used as an in silico hypothesis-generating framework: there is some network, and we train it, usually with backpropagation through time, to perform some task or to match recorded neural activity. Training with backpropagation through time is very effective, but there is no reasonable biological implementation of it, for many, many reasons; the brain simply does not do backpropagation through time, at least not as we use it. Nevertheless, this approach has been used in a variety of neuroscience studies that emulate the brain and then study the trained network, and it has been used quite successfully.
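(As an aside, to make this in silico pipeline concrete, here is a minimal sketch of training a rate network with backpropagation through time, written in PyTorch. This is my illustration, not the setup of any particular study: the task of producing a sine wave after an input pulse, the network sizes, and all hyperparameters are arbitrary assumptions.)

```python
# Minimal BPTT sketch: train a rate RNN to produce a target trajectory
# from a brief input pulse. All choices here are illustrative assumptions.
import torch

torch.manual_seed(0)
N, T, dt = 200, 300, 0.1                         # neurons, time steps, Euler step

J = torch.nn.Parameter(1.5 * torch.randn(N, N) / N ** 0.5)  # recurrent weights
w_in = torch.randn(N)                            # fixed input projection
w_out = torch.nn.Parameter(torch.zeros(N))       # trainable linear readout

pulse = torch.zeros(T)
pulse[:20] = 1.0                                 # input: brief pulse at the start
target = torch.sin(torch.linspace(0.0, 12.56, T))  # target: two sine cycles

opt = torch.optim.Adam([J, w_out], lr=1e-3)
for epoch in range(300):
    h = torch.zeros(N)
    outs = []
    for t in range(T):                           # unroll the dynamics in time
        h = h + dt * (-h + torch.tanh(h) @ J.T + pulse[t] * w_in)
        outs.append(torch.tanh(h) @ w_out)       # scalar readout at each step
    loss = ((torch.stack(outs) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()                              # gradients flow back through time
    opt.step()
```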
We have learned a lot about the brain by studying these networks. Because of this huge success, it is very tempting to ask how we could make backpropagation through time more biological, and there have been studies trying to do exactly that. But I feel that is not what we should be looking for: we should try to find how the brain solves the problem, not force it to be something it is not. So instead I am going to propose, and study a little, a different approach.

First, a few observations about the brain. Take an isolated neuron from the cortex and inject current into it. I hope all of you know, or have heard, that neurons communicate by spiking, by discrete action potentials. If we inject a steady current, an isolated neuron fires like clockwork, very stably. But when we look in the brain, and not only in the brain but even in cultured neural networks, we find something very different. We can present the same input to the network again and again, but each time we do (here each row on the y-axis is a trial) we get very different behavior. If we measure the variance of the spike count against its mean, the Fano factor is about one, or even larger. So from very precise single-neuron behavior we get very noisy network behavior. And yet behavior itself is pretty reliable: think about what you need to do every day, and especially about specialists such as tennis players or cellists, who must perform the same action again and again with high accuracy. So this is a bit of a puzzle. Why do we start with something very dependable, pass through something completely undependable and noisy, and come back to very reliable behavior? We are in good company; we are not the first to think about this. John von Neumann asked the question way back, specifically in relation to biological computation: how do we get reliable computation using unreliable components?

The framework I am going to use here relies on reservoir computing, which came essentially from machine learning, so let me give a quick overview. Imagine we have a recurrent network, with neurons going about their day, doing whatever they want to do. Then we put some input into it; say we perturb it in some way. We want to be able to read out a specific output whenever we apply that perturbation. You can imagine the input as a higher-order area of the brain telling us to move the hand, and the output as the actual control signal to the muscles; we want to learn this mapping. If we take a readout unit and just connect it randomly, we will obviously get nothing. So one approach is to repeat the input many, many times, slowly learn, do some kind of regression on the readout weights, and obtain a good approximation. Basically, we use the activity x of these very noisy neurons as a set of basis functions, and from these basis functions we can construct any function. Of course, if the basis were infinite and had nice enough properties this would be exactly true, but networks are not infinite, so we are limited by the basis functions the network happens to give us. To improve on that, there is what we call computing with a closed loop: we close the output back onto the network, and that actually allows us to learn much more.
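(Here is a minimal sketch of that open-loop readout, before the loop is closed: a fixed random network is driven by an input, its activity is collected as basis functions, and only the readout is fit, by ridge regression. The particulars, such as the sizes, the toy task of reading out a phase-shifted sine, and the regularizer, are my own illustrative assumptions, not the talk's exact setup.)

```python
# Open-loop reservoir computing sketch: only the linear readout is learned.
import numpy as np

rng = np.random.default_rng(0)
N, T, dt, g = 500, 2000, 0.05, 1.5
J = g * rng.standard_normal((N, N)) / np.sqrt(N)   # fixed random recurrence
w_in = rng.standard_normal(N)                      # fixed input projection

u = np.sin(0.1 * np.arange(T))                     # input signal
y_target = np.sin(0.1 * np.arange(T) + 1.0)        # desired output (phase shift)

# Run the reservoir and collect its states as "basis functions".
h = np.zeros(N)
X = np.empty((T, N))
for t in range(T):
    h += dt * (-h + J @ np.tanh(h) + w_in * u[t])
    X[t] = np.tanh(h)

# Ridge regression for the readout weights (least squares plus regularizer).
lam = 1e-3
w_out = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y_target)
y_hat = X @ w_out
print("readout MSE:", np.mean((y_hat - y_target) ** 2))
```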
I am not going to go into the details (there are some issues of stability), but the first algorithm able to solve this closed-loop learning very nicely was suggested by Sussillo and Abbott. It is called the FORCE algorithm, and with it they were able, for example, to teach a stick figure to walk. What does that mean? Think of every point on the body as carrying three outputs; given an input, they trained each of these points to follow a specific trajectory, and together the trajectories translate into this walking figure. So we can teach such networks many dynamical patterns.

Now, this learning is most efficient at the edge of chaos. I will soon discuss what is special about the edge of chaos, but it leads to some issues. First, training kills the chaos: once we learn something, the network is no longer chaotic; that is part of what this machinery does. This creates problems for continual learning: if we want chaos, but learning gets rid of it, how can we learn something new afterwards? There are also issues of task interference. And most importantly, which I will touch on only briefly today, the FORCE algorithm itself is not biologically plausible.

So here is what I hope to get through today. First: how can a chaotic network faithfully encode signals and computation? We want the network to be chaotic, or noisy, as we observe in cortex, and still get very reliable computation out of it. Second, and we will probably only get to it briefly: how can such a network learn efficiently and still remain chaotic?

A good place to start is at the beginning: how do we get this chaotic network, and what is so special about it? This goes back to 1988. Sompolinsky, Crisanti, and Sommers showed the following. Define a nonlinear recurrent network, as we have here, where h_i is the input into each neuron, something like its membrane potential. It follows this differential equation, in which the output of a neuron is some nonlinear function phi of its input, and the neurons are all connected through a disordered connectivity matrix. The disordered J_ij is random, with variance scaling like 1/N, very much like a spin glass. The only difference is that J does not have to be symmetric, because these are not really spins: connections between neurons are not, in general, symmetric. What they found is the following. There is a parameter g, which sets the amount of disorder in the system, the magnitude of the disorder, and if you increase g enough the network becomes chaotic. For small g, looking at the different neurons (each blue line here is a neuron), the network settles into a fixed point; nothing happens. But if you increase the disorder, there is a transition and the dynamics becomes chaotic. By chaotic I mean that every neuron shows internally generated fluctuations, and you can calculate a positive Lyapunov exponent: this is genuine chaos. Interestingly, the transition is not exactly, but looks very much like, a second-order phase transition: it is continuous.
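(A small simulation in the spirit of that model shows the transition. I take the dynamics to be dh_i/dt = -h_i + sum_j J_ij tanh(h_j), with J_ij Gaussian of variance g^2/N; below g = 1 the activity decays to a fixed point, above it the network generates its own fluctuations. Sizes and parameters are my choices.)

```python
# Transition to chaos in a random rate network (Sompolinsky et al. 1988 style):
# dh_i/dt = -h_i + sum_j J_ij tanh(h_j), with J_ij ~ N(0, g^2 / N).
import numpy as np

def simulate(g, N=1000, T=4000, dt=0.05, seed=1):
    rng = np.random.default_rng(seed)
    J = g * rng.standard_normal((N, N)) / np.sqrt(N)
    h = rng.standard_normal(N)          # random initial condition
    traj = np.empty(T)
    for t in range(T):
        h += dt * (-h + J @ np.tanh(h))
        traj[t] = h[0]                  # track one example neuron
    return traj

quiet = simulate(g=0.8)   # below the transition: decays to a fixed point
chaos = simulate(g=1.5)   # above the transition: self-generated fluctuations
print("late-time std, g=0.8:", quiet[-500:].std())
print("late-time std, g=1.5:", chaos[-500:].std())
```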
With second-order phase transitions come critical effects, and this critical behavior is often exactly what is favorable for computation: long time constants, strong correlations, and so on. This is one of the reasons the edge of chaos I mentioned before is favorable for computing. We also found something about dimensionality, which will be important at the end of the talk: when the network becomes chaotic, the chaotic activity does not fill the full space. It occupies only a small fraction of it; the fraction grows as we increase the disorder, but it is extensive in the system size, and that extensivity will matter at the end.

Okay, so we saw how we take these clockwork neurons, with no noise anywhere, and still get chaotic behavior. The first question, once we have a chaotic network and want to use it, is how to get rid of the chaos, because, as we said, chaos is not good for computation; it is not reliable. To understand this, I will use the simplest computation we can do: an autoencoder. What is the autoencoder here? The network receives a signal x, a low-dimensional signal. Most of the time I will use one dimension, but everything I show can be done in several dimensions. We inject it into the network, each neuron gets some projection of it, and we assume we train something to read the signal back out. Now, no matter how well we do (you can see this on the right), we never recover the target exactly. Why? Because the chaos acts as noise, so we only get an approximation. The question is how to make this more exact. One way is to take an infinite number of neurons, but again, brains and circuits are limited. Another way is to increase the signal: just push in a stronger signal, so that the signal beats the noise and we can read it out. But this is not a good plan. First, if we increase the signal, we increase the firing rates, which is not efficient. Second, if neurons saturate, we move away from the dynamical regime, and that hurts the computation.

So how can we increase the signal without increasing the firing rate? An easy solution is to take the output and send it back to the input, so that it cancels the strong input. This requires synaptic balance, and I will not go into the details of how it arises; basically, synaptic balance means that the feedforward input and the recurrent input into each neuron cancel each other, so overall the neuron does not increase its firing rate. What do we get? The same setup as before, but now what enters through the input projections W is not the input x itself: it is the difference between x and our estimate x-hat. And as we make this feedback gain B stronger and stronger, we get what we call predictive coding. Why predictive coding? Because what the neurons now encode is not x but the deviation of x from its expectation, the difference between x and x-hat.
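(Here is a toy sketch of that error feedback, under my own simplified scaling assumptions: the network is driven by B times the coding error, x minus x-hat, through the same projection w that defines an instantaneous linear readout, and we check that the readout error shrinks as B grows even though the network stays chaotic. None of this is the talk's trained readout; it only illustrates the mechanism.)

```python
# Toy error-feedback (predictive-coding-like) network: the drive is
# B * w * (x - x_hat) rather than the raw signal x. Illustrative scaling.
import numpy as np

rng = np.random.default_rng(2)
N, T, dt, g = 500, 8000, 0.01, 1.5
J = g * rng.standard_normal((N, N)) / np.sqrt(N)
w = rng.standard_normal(N) / np.sqrt(N)        # coding direction, roughly unit norm

x = np.sin(0.05 * np.arange(T))                # one-dimensional signal to encode

def rms_readout_error(B):
    h = rng.standard_normal(N)
    err = np.empty(T)
    for t in range(T):
        x_hat = w @ np.tanh(h)                 # instantaneous linear readout
        h += dt * (-h + J @ np.tanh(h) + B * w * (x[t] - x_hat))
        err[t] = x[t] - x_hat
    return np.sqrt(np.mean(err[T // 2:] ** 2))  # error after a transient

for B in (0.0, 1.0, 10.0, 50.0):
    print(f"B = {B:5.1f}   RMS readout error = {rms_readout_error(B):.4f}")
```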
This is known to be very efficient, both energetically and in terms of firing rates, and it has also been found in different animal systems; it really does happen. We can see on the right what happens as we increase this effective B, making B larger as we go down: the readout becomes much, much more faithful, with less readout error, but the underlying system remains chaotic.

Now that we have this system, let's do some math, some theory, which we have avoided until now. How do we solve it? This will be the most technical slide of the talk. I am not going to solve the entire mean field, but I want to give you the main conceptual idea of the solution. We have this high-dimensional nonlinear system for the h's, and we divide it into two subspaces. One is the coding subspace, which is basically where all the coding lives, where the input and the output live; the rest we call the bulk. We assume the dimensionality D of the coding subspace is much smaller than the number of neurons, and we also assume the W's are basically orthonormal; for a large network, random vectors would be good enough for that. Now we can solve for the two pieces separately. Look first at the bulk: the extra term that appears there, once the low-rank, low-dimensional part is removed, is only a very small correction to the activity, and what remains looks like the original chaotic equation from a few slides ago. So the bulk just gives us chaotic dynamics. In the coding subspace, we can define u, the fields along the coding directions, and we get a low-dimensional equation for their behavior.

Here is where the trick lies. So far I have taken a nonlinear dynamical equation and broken it up as if it were linear dynamics. Obviously we cannot do that in general, but D is very small. There is interaction between the two subspaces, yet the bulk barely feels the influence of the coding subspace, because it is so small, while all the influence of the bulk on the coding subspace is collected into a noise term. In the mean field we can treat this term as noise and characterize its statistics. The last step is to take the field u and split it into its mean (a thermal mean, a mean over the noise, which you can also think of as a mean over time) and its fluctuations. We then solve a static, first-order mean field for the mean mu, which gives the bias in the estimate of x, and a dynamic, second-order mean field for the fluctuations, which gives the fluctuating part of the readout error. That is the mean-field framework.

What do we get? Here I am showing the solution. First, delta-h is of order one, meaning the underlying neurons are all still chaotic; they are dancing and jumping about. But as we increase B, the overall error decreases, and the error here is not in u; it is already the fluctuation of the actual readout. For reference I have also plotted, from both theory and simulation, what happens if instead of chaos we had noise.
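(For reference, here is the decomposition written out schematically, as I understand it from the talk; take the precise form and prefactors as my assumptions rather than the paper's exact equations.)

```latex
% Full dynamics with error feedback of gain B through coding directions
% W (an N x D matrix with W^T W = I and D << N):
\dot{h} = -h + J\,\phi(h) + B\,W\,(x - \hat{x}), \qquad \hat{x} = \text{readout of } \phi(h).
% Split the state into coding and bulk components:
h = W u + h_{\perp}, \qquad u = W^{\top} h \in \mathbb{R}^{D}.
% Bulk: the back-reaction of the coding subspace is O(D/N) and negligible,
% so the bulk approximately obeys the original chaotic equation:
\dot{h}_{\perp} \approx -h_{\perp} + P_{\perp} J\,\phi(h).
% Coding subspace: a D-dimensional equation driven by the bulk through an
% effective noise \eta that inherits the bulk's autocorrelation:
\dot{u} = -u + B\,(x - \hat{x}) + \eta(t), \qquad \eta = W^{\top} J\,\phi(h).
% Mean-field split u = \mu + \delta u: static (first-order) mean field for \mu
% gives the readout bias; dynamic (second-order) mean field for \delta u gives
% the fluctuating part of the readout error.
```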
The blue curve is the case where, instead of a chaotic network, we simply inject Gaussian noise. The interesting thing is that this framework is much more effective at suppressing chaotic noise than thermal noise, which helps computation; it is actually another reason to want chaotic networks, and it has an interesting explanation involving the time constants of the system. Look at this paper, especially its appendix, if you want to understand the difference.

Okay, one more thing. You might say: great, we can now remove the fluctuations and get a good, reliable readout, so why not increase B to infinity, remove all the noise, and have the most reliable output possible? But that is not what happens, and here we go back to John von Neumann, who noted that every network, or nervous system, has a definite time lag between input and output. This time lag becomes important. Why? Because of hipsters. Why hipsters? I will quote Jonathan Touboul's paper here: when non-conformists try to stray from the mainstream trend, they often end up making the same choices, because they are too slow to spot trends that are no longer popular. 'Too slow' is the key point, and 'too slow' is our delay: imagine that whatever computes this readout (here it is just linear, but it could be something more complicated) takes time, so the feedback comes back with a delay. Where does this delayed feedback enter our equations? You can convince yourself that it is essentially in the coding subspace: the chaotic bulk is statistically stationary in time, so the delay hardly matters there; where it matters is in the coding subspace.

What does the delay do? We can write the characteristic equation for this delayed ODE and look for its stability. The stability boundary is where the real part gamma becomes zero and we are left with only the imaginary part, and it depends on B, more precisely on B times a thermal average of the derivative of the nonlinearity. We can plot this and see a phase transition: below it we have stable dynamics, above it strong fluctuations develop. You can see it here: below the transition we have the noise we saw before, and we can still follow the output; in the unstable, oscillatory regime, strong fluctuations develop, and this depends on B, the amount of feedback. What is new here is that the instability and the oscillations are global, in the mean field: all the oscillation lives in the mean field, while the underlying neurons, if you look at them, are still very much chaotic. The transition depends not only on the delay but also on the noise, which enters through this thermal average over phi-prime. We can change the amount of chaos (the g from before, the amount of disorder in the system), and we see that chaos can stabilize the system: if we add more chaos, the system becomes more stable, while if we remove chaos, it becomes more unstable. The intuition is that the delayed feedback tries to synchronize the neurons, like those hipsters, while the noise and chaos try to desynchronize them, because each neuron gets its own effectively independent noise. These are two opposing effects, and whenever we have two opposing effects, we get a trade-off.
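(To make the stability argument concrete, here is the textbook version of the calculation for a single coding direction with delayed error feedback. I am assuming this schematic form of the equation, with an effective gain beta = B times the thermal average of phi-prime, as described in the talk.)

```latex
% Linearized coding dynamics with feedback delayed by D:
\dot{u}(t) = -\,u(t) - \beta\, u(t - D) + \eta(t), \qquad \beta = B\,\langle \phi' \rangle .
% The ansatz u \propto e^{\lambda t}, with \lambda = \gamma + i\omega, gives the
% characteristic equation
\lambda = -1 - \beta\, e^{-\lambda D} .
% The stability boundary is \gamma = 0, i.e. \lambda = i\omega, which splits
% into real and imaginary parts:
\beta \cos(\omega D) = -1, \qquad \omega = \beta \sin(\omega D).
% Oscillations therefore appear once the gain \beta exceeds a critical value
% set by the delay, with frequency of order 1/D: longer loops, slower rhythms.
```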
And where we have a trade-off, we have an optimum, and we can actually calculate it. Here is the mean-field calculation. We look at the fluctuations of the output. They all come from one source, the chaos, but the result depends on two terms. The first term is what we discussed before: with balance, the error falls like 1/B, so stronger feedback, meaning more balanced synapses, suppresses it; that is the blue curve here, without any delays. But once we have delays, there is a critical level of balance, and there is also a resonance that attracts the fluctuations: as we increase B, the fluctuations, specifically around this resonant frequency, grow. With these two opposing terms we expect an optimum, and you can see it here: I vary the amount of disorder g, and as B increases the total fluctuation passes through an optimal point.

Going back a little to biology: we now predict that at optimality we should see oscillations in the brain, and indeed we normally do see oscillations in the brain. The oscillation frequency depends on the size of the delay, and the delay depends on the whole feedback chain we just discussed. For example, if the feedback is a local loop within the network, with only short delays, we would expect fluctuations in the high-gamma range that we record in the brain; if instead the feedback has to pass through several synapses, or several networks, before it comes back, we would expect oscillations closer to the theta range. One thing we are trying to do now, with experiments, is to check whether these predictions hold.

So what did we get so far? (And I have to run, I guess, because there is not much time left.) We have chaotic cortical activity, which is a basic pattern generator, and we still get a low-error coding subspace that we can use. We can implement an arbitrary linear dynamical system: I have not shown that, only an autoencoder, but within this simple setting we can realize arbitrary linear dynamical systems. And at the same time we find neuron-level disorder together with global oscillations.

Going back to what I wanted to show: the first question was how to encode with a chaotic network; the second is how we can learn. I will go very fast here, because I am out of time. Remember the FORCE algorithm. We may now have understood how to solve the problem of chaos, because in this framework the chaos is kept after learning. But what about task interference and more biological learning? For that we look again at the brain, searching for the kind of feedback mechanisms we used, and we actually find them. The key player is the cerebellum. The cerebellum sits here, at the back of the head, and while it is sometimes called the 'little brain,' it actually holds more neurons than the rest of the brain altogether. It is a very important part of the brain, one that developed alongside the neocortex from early on; in fact, we can find something similar even in the brains of flies. Beyond that, we know it has a very distinctive structure: unlike the cortex, which looks like a random, highly interconnected network, the cerebellum is organized in layers and looks a bit more like a deep feedforward network. And we know that damage to the cerebellum leads to learning deficits, especially, but not only, motor deficits.
The cerebellum is also connected to the cortex in a loop: it receives input from the cortex and sends its output back to the cortex. So, as you may have guessed, the idea is to take the output loop we had in FORCE and replace it with something like the cerebellum, which can learn this output in a more efficient way. One thing you can see is this really huge layer in the cerebellum, the granule cell layer. It is a very large layer, highly conserved and very typical of cerebellar structure, and it implements a huge expansion: a feedforward layer that massively expands the input coming from the cortex. We are going to focus on what this layer does for learning, and I will run through it quickly.

Here is the intuition for what it does. Imagine this is the cortex (ignore the recurrent connectivity for a moment), with some blobs of activity, where each blob is something we want to learn, a representation of a different task. What has been shown in previous work is that as we increase the size of this cerebellum-like layer and the sparsity of its activity, we can read things out from it much more easily. In the cortical layer the different blobs, each again a representation of some dynamical task, are tangled together, but in the cerebellum they end up looking well separated, so it is much easier to read out. Intuitively, the cerebellum looks at the motor cortex with a magnifying glass, and therefore it can learn more effectively.

How can we show this? Without theory for now, just with some numerics: take two networks and make precise what we mean by efficient learning. We have a recurrent network, as in FORCE, and it receives different inputs; each input is a different static pattern, delivered as a pulse, and for each pulse we want to teach the network a different target. Different neurons get activated and we want a different output for each input; one might be 'move your arm this way,' another 'move your arm that way.' We ask how efficiently this can be learned, where we define efficiency by what happens when we add more neurons. In the FORCE-like architecture, the extra neurons go into the cortical, recurrent layer; in the cerebellum-like architecture, we keep the cortex fixed and add neurons to the cerebellar layer. What you see is that without the expansion, without the cerebellum, adding more neurons does not really improve the learning; but if we add neurons to that big expansion layer, the error drops like 1/N, as you would hope. Without going into the theory, why do we see this? Going back to what I said at the beginning about chaos and dimensionality: the chaotic dynamics is low-dimensional but extensive in the system size. So when we add neurons to the recurrent network, we also add dimensions to the error; when we add them to the expansion layer instead, we are able to project the error out.
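(Here is a toy numerical sketch in the spirit of that comparison, under assumptions of my own: chaotic trial-to-trial variability is replaced by additive noise on static cortical patterns, the granule layer is a random projection with a sparsifying threshold, and the readout is ridge regression. The point is only the qualitative trend, that test error falls as the expansion grows; this is not the talk's exact experiment or its 1/N law.)

```python
# Toy granule-layer expansion: noisy cortical patterns -> random sparse
# expansion -> linear readout; test error vs. expansion size M.
import numpy as np

rng = np.random.default_rng(3)
N, P, trials, sigma = 200, 10, 40, 0.5     # cortex size, tasks, trials/task, noise

prototypes = rng.standard_normal((P, N))   # one cortical pattern per task
labels = rng.standard_normal(P)            # one scalar target per task

def noisy_trials():
    """Static task patterns with trial-to-trial noise standing in for chaos."""
    X = np.repeat(prototypes, trials, axis=0)
    X = X + sigma * rng.standard_normal(X.shape)
    return X, np.repeat(labels, trials)

def test_error(M, theta=1.0, lam=1e-2):
    A = rng.standard_normal((N, M)) / np.sqrt(N)        # random expansion weights
    expand = lambda X: np.maximum(X @ A - theta, 0.0)   # threshold -> sparse activity
    Xtr, ytr = noisy_trials()
    G = expand(Xtr)
    w = np.linalg.solve(G.T @ G + lam * np.eye(M), G.T @ ytr)  # ridge readout
    Xte, yte = noisy_trials()
    return np.sqrt(np.mean((expand(Xte) @ w - yte) ** 2))

for M in (100, 400, 1600, 3200):
    print(f"granule layer size M = {M:5d}   test RMS error = {test_error(M):.3f}")
```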
To finish up: what we are doing now, and next, is moving to a fuller cerebellum-like structure, and this may relate to other work here, and to your interest in deep learning: how to learn these dynamical systems in cortex using the power of deep learning with wide layers, which we know a lot about, now connected to the cortex. One more thing we have done, which I did not have time to show: using this expansion we can, at least in some cases, move to local learning, something like Hebbian learning, meaning we can move away from FORCE and use this mechanism to learn in a biologically plausible way.

To summarize: we have shown that disorder and chaos make cortical networks good pattern generators, and we can use them to learn a diversity of patterns; feedback allows reliable, low-dimensional task encoding; and a cerebellum-like expansion allows efficient learning of low-dimensional tasks and representations. Thank you very much. These are my group members and collaborators; the bolded names are the people I worked with on things related to this work. If anyone is interested, please come talk to me; I have openings for PhD students and postdocs. Thank you for your time.

Moderator: Very nice, thank you very much. I am sure there will be questions; I think Ali was first.

Q (Ali): Do you have any idea why FORCE kills chaos? What is the mechanism?

A: When you train the readout, you create this feedback loop, and the strength of the input coming through the loop can shift the transition to chaos; that is the assumption. I have not actually looked at exactly what happens as a function of learning, but especially in these nonlinear tanh networks (it works less well with ReLU), the strength of the loop input enters the calculation of the transition to chaos, so it can shift it. That is also why FORCE works close to the transition: at the edge of chaos, you do not need to move much to become more stable.

Q: So it has a bias somehow, like it prefers not to use chaos?

A: Well, you are trying to produce something that is not chaotic; the output is very reliable.

Moderator: Thank you. Stefan?

Q (Stefan): Thank you. I probably missed it: how did you perform the analysis in the second part of your talk, once there is a hidden layer between the reservoir and the readout?

A: You did not miss it; I did not talk about it.

Q: Can you comment on it?

A: Which part do you mean? This one? Okay. The analysis we have done here is for static patterns. Forget the recurrent activity for now: imagine we receive a set of static input patterns and we just want to train a readout. Biologically, this would be some kind of cortex, this would be the granule cell layer, and this would be a Purkinje cell that has to learn from them. The idea is to let the Purkinje cell be active only for some of the outputs, so that it separates the different tasks. All we need to ask is what happens when I train a readout whose output is positive only for an arbitrary...
...subset of the labels, just an arbitrary subset of them, and we ask how this arbitrary classification can be made, where each output can only listen to, or transmit, some of them. For that you can look at the work of Baktash Babadi with Sompolinsky; we have a paper, I think from 2016, that uses these kinds of ideas, and basically what we are doing is implementing that, but now with recurrent connectivity. I will check the paper reference.

Moderator: There is another question from Alada, and I have a feeling I know which direction it is going to go.

Q (Alada): Concerning the dynamic mean-field theory: when you talk about the two subspaces, the coding subspace and the bulk, how do you determine the noise in a self-consistent way? Is it correlated with the other parts?

A: So, the eta term. The eta term comes from the bath. The back-contribution of the coding subspace is negligible, only of order D/N, much, much smaller, so I can solve the dynamic mean field for the bulk while simply ignoring the coding subspace. The bulk gives us chaos, and once I have that chaos I have the autocorrelation function of the bulk activity; eta basically carries that autocorrelation. So it is not white Gaussian noise.

Q: Okay, that was my question: so it is related to the autocorrelation.

A: Yes, that is what I was referring to; the difference from Gaussian noise comes from the fact that its autocorrelation is not that of Gaussian white noise. Thanks.

Moderator: Thank you. I do not see any more immediate questions, so let's thank the two speakers of this afternoon session once again.

[recording ends]