Can you hear me now? Yes? Good. First of all, thank you very much to the organizers for having me here, and thank you for coming to listen to me. I will start with the most important part: this work was not done just by me. There are at least four people who have contributed substantially to the work that I will talk about. One is Bryan Daniels, who is an assistant professor in Arizona. William Ryu is my long-term collaborator in Toronto, a C. elegans experimentalist. Sam Sober is an experimentalist who works on birds, studying how birds learn to sing. And Damián Hernández, my former postdoc, is now a faculty member in Bariloche, in Argentina.

So what I promised I would talk about is learning dynamics of biological systems, but then I realized that this is actually a very heterogeneous conference, meaning that I don't usually get to speak to these people about many other things. And so I decided to do the thing which I always advise my students not to do, which is to cram too much into the talk, and actually talk about two different, completely independent things, because this is probably the only time when I can talk to all of you about them. The only thing that connects them is that we're using more or less machine learning methods to try to understand what's going on in different biological systems. In the first example, we'll talk about the dynamics of the nociceptive (pain) escape response in C. elegans, the tiny worm which is about a millimeter long. There, what we are going to be using is essentially recurrent neural networks, but in a way that allows us to deliver an interpretable dynamical model of what goes on in the worm, rather than the usual big black box that we get when applying machine learning methods. In the second part, we're going to talk about how the bird controls its own muscles. There, even though we have a substantial amount of data, this data is just not enough, at least in our hands, to use plain-vanilla black-box machine learning approaches. We have to go beyond those, and I will try to point out what we think we were able to do.

So this is the animal that we are going to talk about in the first part. The worm is about a millimeter long. It's the best-studied multicellular organism in the world. We know every single neuron. We know all of the connections between those neurons. We know every single cell. They're all the same in all the worms, modulo the sex of the worm. And yet we have no idea how this worm works, meaning that we have known the connectivity pattern of the brain for about 30 years now, and we still cannot predict how the worm is going to respond to a simple stimulus. The stimulus that you've seen here is this big red dot which hits the worm's head. This is an infrared laser. It goes on very quickly, increases the temperature around the worm very quickly, and then goes off. The worm doesn't like being hot, and it stops, looks around, retreats, then crosses over its own body, makes a random turn in some random direction, and tries to escape, right? And that's the behavior that we will try to model. So how are we going to model this? I'm always tempted to read this thing aloud. This is a short story by Borges, and it captures the philosophy which I have about modeling biological systems. The title of this short story is "On Exactitude in Science".
In that empire, the craft of cartography attained such perfection that the map of a single province covered the space of an entire city, and the map of the empire itself, an entire province. In the course of time, these extensive maps were found somehow wanting, and so the College of Cartographers evolved a map of the empire that was of the same scale as the empire and that coincided with it point for point. Less attentive to the study of cartography, succeeding generations came to judge a map of such magnitude cumbersome, and, not without irreverence, they abandoned it to the rigors of sun and rain. In the western deserts, tattered fragments of the map are still to be found, sheltering an occasional beast or beggar; in the whole nation, no other relic is left of the discipline of geography.

So the reason I'm reading this, the reason I put this slide together to show to you, is that in this worm, again, we know a lot. We know a lot of details about the mechanical organization of the worm, the cell-biology organization of the worm, the neural organization of the worm, and there are attempts, very large attempts, to try to model this worm on a scale of one to one, meaning on the molecular scale, where one would try to explain what the worm is doing by counting every single RNA inside this worm, every single myocyte, and so on and so forth. And that's not what we're interested in doing. So my own — what's going on? I think I'm just too far away from the laptop, that's an issue. My own biases are that if I want to understand something, if I want to build a model of something, I should not be building a model of everything that goes on in the system, but rather a model that is going to answer the specific question that I'm interested in. In this case, the escape response: I don't care about locomotion in general, I don't care about its other behaviors, like feeding behavior, mating behavior, and so on and so forth. There's just one question, one input-output relationship; let's build it. And in that case, I am not interested in trying to build the most detailed model first and then coarse-grain it, throwing out the details which are unnecessary, because then I'm solving a harder problem en route to the simpler problem. That's the quote that I remember the most from reading Vapnik's book back in the late 90s: you never want to solve a harder problem en route to an easier one. So instead, what we will try to do is build a model which is as simple as possible to explain the behavior that we're interested in, and as more details emerge experimentally, we will refine this model rather than coarse-graining it.

So in this case, we're interested in dynamics, in an input-output relation, right? The temporal dynamics of the stimulus goes in — temperature as a function of time — and then on the back end, we need to get the velocity of the worm as a function of time. So how are we going to do this? We will assume the dynamics is temporally local, which means that we should be able to explain this dynamics, to model it, in terms of some kind of differential equations. We don't need to go to time-delayed systems or integro-differential systems and so on and so forth. That's the first assumption we're going to make. Second, we're not going to fit curves. I'm not going to try to predict a trajectory, right? You know, having a trajectory, and then asking: what's the future of the trajectory?
Rather, I will try to learn the dynamics itself, and then from the dynamics I can generate as many trajectories as I want. I will neglect stochasticity at this point; my current students are trying to relax these assumptions. So effectively, what we're interested in is looking at a time series, x as a function of time — in this case a one-dimensional x, velocity as a function of time, though in principle we can do more dimensions — and we're trying to write down systems that look like dx/dt = F(x): sets of first-order, ordinary, nonlinear differential equations that describe this data. And of course, this is not an easy thing to do. I need to be closer to the laptop; this is just not working. It's especially not an easy thing to do automatically, right? How do we enumerate all possible systems of differential equations? How do we search through that space? How do we not overfit? All the usual questions that emerge when we do statistical inference are still going to be there. In this case, there is an additional complication. We know, going back even to late last century — to the structural risk minimization of Vapnik, or to Bayesian model selection, and so on — that we are generally able to find a consistent, provably consistent estimate of the model from the data if the models that we're fitting form some kind of a nested class, right? In this case, forming a nested class of differential equations is rather hard. I could imagine myself starting from a linear system, dx/dt = Ax, where A is some matrix acting on the observations that we're making, but then I can expand it along multiple different, quote-unquote, dimensions, right? I can add nonlinearities to the system, or I can add hidden variables to the system. This worm has about 300 neurons in it, and probably 30 or more are actually responsible for different parts of this locomotory escape behavior, right? I'm just recording a single variable x, but clearly there will be hidden variables, latent variables, in the system. And any two-dimensional set cannot be uniquely ordered. As a result, I cannot find a unique nested set of families of models that I can just look at one at a time and see how they fit the data. So I have to start making assumptions again. What we are going to do is look at this two-dimensional set — more hidden variables along one axis, more nonlinearities along the other — and chart an essentially arbitrary path through that space, starting from the simplest model, the fewest degrees of freedom, the fewest nonlinearities, and ending eventually at infinity and infinity: just drawing a line and exploring model after model along this line, using Bayesian model selection to choose models that are good enough but not so flexible that they overfit the data. And then, if we have chosen a correct hierarchy of models, we can fit very easily early along the path; if it turns out that the hierarchy we've chosen winds through the space in some weird way, it will take a lot of terms in the model until we actually start fitting, right? Hopefully, by seeing which hierarchy provides the simplest, easiest fit to the data, we will actually learn something about the underlying structure of the model — maybe this model is actually not nonlinear, right? Or something along those lines. So, let's skip this.
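To make the model-selection loop just described concrete, here is a minimal sketch in Python. It is not the actual Sir Isaac code: the one-dimensional polynomial hierarchy is a toy stand-in for the sigmoidal/S-system hierarchies discussed below, and BIC stands in for the full Bayesian log-evidence.

```python
# Minimal sketch of the model-selection loop (not the actual Sir Isaac
# code): walk up a nested hierarchy of dynamical models, fit each by
# maximum likelihood, and score it with BIC as a stand-in for the
# Bayesian log-evidence, keeping the simplest model that fits.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def fit_and_score(rhs, n_params, t, x_data, x0, sigma):
    """Fit one model's parameters by least squares; return its BIC."""
    def neg_log_lik(theta):
        sol = solve_ivp(lambda s, x: rhs(x, theta), (t[0], t[-1]), x0,
                        t_eval=t)
        if sol.y.shape[1] != len(t):           # integration blew up
            return 1e10
        return 0.5 * np.sum((sol.y[0] - x_data) ** 2) / sigma**2
    res = minimize(neg_log_lik, np.zeros(n_params), method="Nelder-Mead")
    return 2 * res.fun + n_params * np.log(len(t))   # lower is better

# A toy nested hierarchy: polynomial right-hand sides of growing order.
hierarchy = [
    (lambda x, th: th[0] * x, 1),
    (lambda x, th: th[0] * x + th[1] * x**2, 2),
    (lambda x, th: th[0] * x + th[1] * x**2 + th[2] * x**3, 3),
]

# Synthetic data from dx/dt = -x + 0.5 x^2; the loop should prefer model 2.
t = np.linspace(0.0, 2.0, 40)
x0 = [1.0]
truth = solve_ivp(lambda s, x: -x + 0.5 * x**2, (t[0], t[-1]), x0, t_eval=t)
x_data = truth.y[0] + 0.05 * np.random.default_rng(0).normal(size=len(t))

for rhs, k in hierarchy:
    print(k, "parameters, BIC =", fit_and_score(rhs, k, t, x_data, x0, 0.05))
```

The essential point is only the stopping rule: complexity grows along one fixed path, and the evidence-style score decides where on that path to stop.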
There are two different model families that we are going to explore. One seems pretty familiar, right? These are almost recurrent neural networks, except that we change the order of the nonlinearity and the summation; they are called sigmoidal networks. If the time scales, the taus over here, are constant across all the variables you're dealing with, these are absolutely equivalent, mathematically equivalent, to recurrent neural nets. The reason we chose this form is that it seems a bit easier to map onto biochemical kinetics, which is what we originally developed this model for. The idea here is that we start with the x's, the observed variables; they interact with themselves, interact with other variables, interact with the input signals, which are the I's in this case; and then we grow this model by adding additional x's one at a time, sort of in a column. The number of units increases, they all start interacting with each other, and at some point I say: I finally have the simplest possible network that can explain my data, and I'm going to stop here, right? The other type of model that we are going to be dealing with is something called S-systems. These are again nonlinear recurrent systems; they don't look like neural networks, and they aren't, but it's just a different type of activation function, right? The idea here is that the x's activate each other as power laws, and then they deactivate each other as power laws. This again comes from biochemistry; it was developed by Michael Savageau some 40 years ago. The idea is that in logarithmic space, if I take the log of the x's, this to first order looks like a linear response model, right? And then you start adding additional terms to that. In both of these cases, we know that any dynamics can in principle be modeled by this type of equations. The nestedness of the structure is pretty clear: I just keep on adding additional variables and the system grows. So all the requirements are satisfied, and by using Bayesian model selection I'm guaranteed not to overfit and, as the amount of data goes to infinity, to eventually have a statistically consistent result.
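Written out as right-hand sides, the two families look roughly as below. This is my reconstruction of the equations on the slide, so the exact indexing and the parameter names (tau, W, V, alpha, beta, G, H) are assumptions, not the talk's notation.

```python
# The two model families as Python right-hand sides (a reconstruction of
# the slide's equations; parameter names are my own choices).
import numpy as np

def sigmoidal_rhs(x, inputs, tau, W, V):
    """Sigmoidal network:
    dx_i/dt = -x_i/tau_i + sum_j W_ij sigmoid(x_j) + sum_k V_ik I_k.
    The nonlinearity sits inside the sum, the reverse of a standard RNN;
    with the taus constant across variables the two are equivalent."""
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    return -x / tau + W @ sigmoid(x) + V @ inputs

def s_system_rhs(x, alpha, beta, G, H):
    """S-system (Savageau):
    dx_i/dt = alpha_i prod_j x_j**G_ij - beta_i prod_j x_j**H_ij,
    production and degradation both as power laws; to first order this is
    a linear response model in log space (requires x > 0)."""
    power_prod = lambda E: np.exp(E @ np.log(x))
    return alpha * power_prod(G) - beta * power_prod(H)
```

Growing either model means enlarging the parameter matrices by one row and column per added hidden variable, which is what makes the hierarchy nested.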
So before trying to model the worm, let's see if, with this very simple approach, we can learn something that we already know exists. What we did is simulate a bunch of trajectories of planets moving around the sun — in fact, not whole trajectories, but just two points from each trajectory, right? The planet starts at a certain distance away from the sun, and then some random time later, the planet ends at a certain distance away from the sun. We have a couple of hundred of such pairs: initial conditions, final conditions. Can we learn that there is an underlying universal gravitational law, one over r squared, that explains them? Let's look at these figures over here. What I'm showing is a phase portrait of the true system, right? There are two dynamical variables in the true system, the radius and the radial velocity, r and dr/dt, and this is how the true system looks. This is a planet moving in a circle: radius constant, velocity constant. This is an ellipse, as it looks in these coordinates. This is a hyperbola. Our algorithm applied to the system learns something that looks like this, which of course is not exactly what is on the left; but then, we don't know that the hidden variable should be the velocity rather than velocity cubed or some other function of velocity, right? The important part is that we have discovered that the equations of motion are second-order equations, that you need two dynamical variables to explain the dynamics. And if you look deeper, you will see that the model complexity of both systems is exactly the same. So we have actually discovered the correct dynamics, right? Subject to some small statistical fluctuations. Then, looking on the left, the dotted lines are now planets which we have not trained on, right? We release them with some random positions and velocities and ask where they will go, and the solid lines are what our algorithm predicts. And it's pretty good, right? From about 200 input-output pairs, we have discovered the underlying Newton's laws, right? And so we liked that, and that's why we called the algorithm Sir Isaac. You can download it on GitHub if you'd like. There's one point that I want to make here before going further to the worms. This fit is done with the S-systems, with the power-law activation functions, right? If I try to do the same with the sigmoidals, this is what I get. We always train on the first 100 units of time and try to predict what's going to happen for the next 100 units. What you see here is that the neural network is effectively very good at interpolating, right? Even on the systems that we have not trained on, for the first 100 units of time — the range on which we trained, but from different initial conditions — we are doing pretty well. But then extrapolation becomes pretty bad. The reason is that the actual force, one over r squared, diverges near r equal to zero, and it takes a whole lot of sigmoids to create an infinity, right? So, yes, the S-systems are better, and the score that we get — the marginal likelihood or whatever — is better for this system than for that one. But I think what's even more interesting is that if I look at these two cases, I clearly see that I've learned something about the underlying dynamics: that the dynamics is not saturating, that it has divergences at infinity and at zero. And that, to me, is maybe even more important than learning the actual equations, right? Because I've actually understood something about the structure of the forces. Equations depend on parameterization; divergences don't.
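For concreteness, here is a sketch of the kind of training data this planetary test starts from, generated from the effective one-dimensional radial equation (the 1/r³ centrifugal term is the standard reduction once angular momentum is folded in; units with GM = 1, L = 1 and the sampling ranges are my choices, not the paper's):

```python
# Sketch of the planetary training data described above: a couple of
# hundred (initial state, state at a random later time) pairs from the
# effective radial Kepler dynamics  d2r/dt2 = L^2/r^3 - GM/r^2.
import numpy as np
from scipy.integrate import solve_ivp

def radial_kepler(t, y, L=1.0, GM=1.0):
    r, v = y                        # radius and radial velocity
    return [v, L**2 / r**3 - GM / r**2]

rng = np.random.default_rng(0)
pairs = []
for _ in range(200):                # "a couple of hundred" pairs
    y0 = [rng.uniform(0.5, 2.0), rng.uniform(-0.3, 0.3)]
    T = rng.uniform(0.5, 5.0)       # random later time
    sol = solve_ivp(radial_kepler, (0.0, T), y0, rtol=1e-8)
    pairs.append((np.array(y0), sol.y[:, -1]))
# `pairs` is the only input to the inference: the engine never sees the
# equations above and must rediscover two-variable dynamics from them.
```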
You had a question. [Audience: Do you learn conservation of momentum, or conservation of energy — conservation laws, I mean?] Okay, in this system we did not, right? I skipped those slides because, again, I wanted to cram too much stuff into the talk, but we have of course applied this method to other dynamical systems. One specifically that we started with is the glycolytic oscillations in yeast. That's a seven-dimensional dynamical system, but in fact it has a conservation law. And what we were surprised to find — maybe it's a coincidence, maybe we got lucky — is the following. In that case the S-systems didn't work well, because biological dynamics doesn't have infinities; sigmoids are a lot better for describing biochemistry. And so, fitting with sigmoids, we produced the fit there, and the system said that six variables are enough, right? And it turns out that there is a conservation law: the total number of carbon atoms is conserved, so even though there are seven chemical species, there's a stoichiometry which means that only six are independent. And so, in that sense, we have discovered the conservation law. We do not discover it in the form "energy is equal to a constant". In the planetary case over here, we're only looking at the radius — not x, y, z, just the distance away from the sun — and so the angular momentum conservation goes into the fact that you have an effectively repulsive force near zero, a one over r cubed repulsive force near zero. So there is no conservation law left as such; it's already absorbed into the effective force.

So now let's go back to the worm. This is how the worm behaves; this is the same worm that I showed you before. If I apply a very small current, the worm, which usually goes forward (blue), pauses (black), just stops more or less. And then it either can start going forward again, or it may make a turn, or it can reverse. And as I increase the strength of the pain that the worm experiences, its behavior becomes more and more deterministic. It goes through these very specific epochs: you are going forward, you are reversing, you are turning, and then you are going forward again. So what are the questions that we're asking here? Again, I don't care that much about fitting the dynamics; I'm interested in understanding the general properties of the dynamics. The interesting question in the field of animal behavior is how behaviors get generated. Are there multiple distinct behaviors — say, the reverse motion and the forward motion — that the worm can select between? In that picture you can view them as two different attractors, and the worm, due to a perturbation, goes from one attractor to another attractor. Or is there always a single dominant attractor, let's say going forward, right? And then, as you apply the stimulus, it's not that you are switched from one attractor to another, but the whole attractor itself moves, right? So the question is: when you apply a stimulus, does it change the landscape of all possible behaviors, or does it just move the worm from one point on that landscape to another point on that landscape? That's a common question in the field of studying behavior of small animals, and that's what we try to answer. This is the type of data that we get. We apply the current; the worm goes forward; we apply the current here; the worm immediately recoils back, starts slowing down, and eventually, at about four or five seconds, it's going to stop, cross over itself, and start going forward again. The solid lines with ranges around them are the mean over a set of worms, plus or minus the standard error of the mean. The dotted lines are our fits. By all measures, the fits are very good. We get a chi-squared per degree of freedom of less than one, which is even worrisome — are we overfitting? It turns out we are not. The reason why we started working on this system is that my previous graduate student spent three or four years trying to build a detailed, biologically plausible model of the system, accounting for reasonable knowledge of the physiology, right? And in that case — so this is a curated model — we did quite well.
We got an R-squared between our model and the data of about 0.8, 0.85, right? So we explained about 85% of the data that was available for explanation. And then we ran this inference engine on the same data, and we got something that looks like that, where the unexplainable variance almost kisses zero here, right? On average we explain about 0.92 — I think an R-squared of 0.92 — which for biological data is sort of unheard of. You don't usually model the behavior of whole organisms with an R-squared larger than 0.9. So that's interesting, right? We fit the data quite well. Can we make predictions? Why should you believe our fits, right? In that data set, we only fitted the model to the first two seconds of activity after the worm is hit with the laser. And then, after about four or five seconds, the worm usually stops, crosses over itself, as I already said, and goes forward. Our model doesn't know that this is what's going to happen, right? We do not use that data; we only use the first couple of seconds of behavior. But we can ask our model at which point the worm will stop, right? And this is a prediction of the model — there are no free parameters here anymore. Depending on the amount of laser current that you provide, when will the worm stop going backwards and start going forwards again? Again, this is not a fit; this is a prediction. This curve is our prediction, and these are means over different worms that experienced the same current. This is a similar prediction, where we now try to predict when the worm will stop based on its fastest backwards speed, the peak reverse speed. And again, there are no free parameters in the model: this is the prediction, this is the data, right? So it seems like we are not just interpolating the data, not just fitting the data; we are making predictions, based on the first two seconds of behavior, about what's going to happen at about four to five seconds after the laser is applied, which is interesting. And this is usually my test for whether a model that somebody builds is in some sense a pure statistical model — those are always good for interpolation but rarely good for extrapolation — versus a model that has recovered the underlying physical mechanisms, whatever they are. Because if you got the physical mechanisms correctly, if you got the actual system correctly, then you can not just interpolate, you can extrapolate, at least for a while. Okay, I keep on moving away from my laptop, and that's not good. So this is how the dynamics looks, right? Coming back to the question: is this a single attractor or multiple attractors? If the worm doesn't have a laser hitting it, it has just a single attractor, according to our model. Here it is. By the way, the system is two-dimensional, even though the velocity is just a single variable that we're trying to fit; our engine again tells us that you cannot explain this data unless you have an additional hidden unit in the system. That hidden unit is at zero at rest, the worm is going forward at some constant speed, and then you apply the laser current, and what happens in our model is that the attractor moves, and the worm starts chasing its own attractor. So it seems like the external stimulus rearranges the landscape of possible behaviors, and then the worm evolves on an evolving landscape.
But there is always just a single attractor, at least within our model, which is an interesting observation. Whether we fully believe it or not, we need to test additional, more interesting stimulation patterns, but this is where we stand right now. And now, this is finally the model that we have derived, right? It's of the sigmoidal form that I promised you it would be: dv/dt relaxes to zero, is driven by the heat that the worm experiences, and has sigmoidal interactions with itself and sigmoidal interactions with a hidden unit, x2. And what you will notice is that this hidden unit is just linear: it's a low-pass filter of the velocity. The interesting part here is that for each one of these terms, I can tell you — or rather not I, but my experimentalist collaborator can tell you — which neuron is doing this. We probably got incredibly lucky, but we can assign a particular neuron, or a particular known feature of the behavior of certain neurons, to each one of these terms, right? We know that there is a forward-speed generator in this worm. We know that the heat, when applied to the worm, just shuts it off. We know that when this forward-motion generator, this pacer, stops, it immediately turns on the backward pacer. We know that when the backward pacer runs for a bit, it turns itself off and turns on the forward pacer, and so on. So for each one of these terms, if you look carefully enough, you can assign which of the neurons is responsible for it: for W11, for W12, and so on. The model is extremely interpretable. There is one thing that we didn't find in the literature, and that's this term, right? The integration. It turns out that nobody had seen that the worm is integrating its temperature; this was not a known function of any of the neurons in the literature on the worm. But since the model works so well, we are willing to claim that this is, in fact, what's going on. And again, the beauty of working with a good experimentalist is that they can go and check. And they did. And it turns out that the AFD neuron is, in fact, essentially a linear filter — at least one of its functions is effectively a linear filter of the temperature that the worm experiences, with a timescale that just happens to be about one second. So not only are we able to predict, but we are able to relate things to the known biology, and to predict the existence of yet unknown biology in this system.
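Put together, the structure of the model reads roughly as below. This is a sketch from the verbal description only; the structure follows the talk, but every numerical parameter value is invented for illustration.

```python
# Sketch of the inferred worm model from the description above: velocity v
# with a sigmoidal self-interaction and a sigmoidal interaction with a
# hidden unit x2 (itself a low-pass filter of v), driven by the sensed
# heat h, a ~1 s low-pass filter of the laser-driven temperature T(t).
# Only the structure follows the talk; every number below is invented.
import numpy as np
from scipy.integrate import solve_ivp

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def worm_rhs(t, y, T_of_t):
    v, x2, h = y                      # velocity, hidden unit, filtered heat
    dv = (-v                          # relaxation of v toward zero
          + 1.5 * sigmoid(v)          # W11: self-interaction ("pacer")
          - 2.0 * sigmoid(x2)         # W12: interaction with hidden unit
          - 3.0 * h)                  # drive by the integrated heat
    dx2 = (v - x2) / 2.0              # hidden unit: low-pass filter of v
    dh = (T_of_t(t) - h) / 1.0        # AFD-like ~1 s filter of temperature
    return [dv, dx2, dh]

# A brief laser pulse: temperature is on between t = 1 s and t = 1.5 s.
T_of_t = lambda t: 1.0 if 1.0 <= t < 1.5 else 0.0
sol = solve_ivp(worm_rhs, (0.0, 10.0), [0.5, 0.5, 0.0],
                args=(T_of_t,), max_step=0.01)
velocity_trace = sol.y[0]             # the model's response to the pulse
```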
And so that's about the end of this story; I think I have about 15 minutes to move to the second part. What I want to finish with here is that what I've been trying to do is not to build the true model — the interactions of every molecule with every molecule, every neuron with every neuron — but the simplest possible model which recovers the behavior that I'm interested in. We do this by effectively building a set of recurrent neural networks and finding the smallest possible network that is able to explain the data that I have. It's the totally opposite direction from what the modern machine learning literature would tell us to do — build the biggest thing and then try to make sure that it doesn't overfit, right? We're going from a different direction. And the interesting part is that, because of some coincidence, or maybe because of some deep, profound reasons that we don't understand yet, the model turns out to be predictive and interpretable. And that's very surprising. So I'm gonna go — mm-hmm, yeah. [Audience: I'm curious about the fact that you started from this model to try to capture the dynamics, but you eventually say that you landed on finding, even predicting, the existence of specific neurons performing specific functions, right? So I wonder why the other way around would not work. If you actually start from the biology, from these neurons that you know exist, you put them together, you wire them together — why then are you not able to model the behavior of the worm?] I think it's a great question, right? There are two reasons. The first one: I would have no idea which parts of the known biology to actually include in the model. There are some 30 different neurons which are known to affect locomotion, and apparently not all of them are really important for this type of behavior. But maybe there's an even deeper issue. We know that AFD is a thermosensory neuron; it works in thermotaxis, works in nociceptive taxis, and so on. This neuron is very well characterized. But nobody had seen that this neuron is a linear low-pass filter with a one-second timescale, right? People had never applied fast enough stimuli to this worm to see that that neuron has this function. So even if I had tried to build a model of this escape circuitry from the biology, that bit of knowledge was simply not in the available literature, and I would not have gotten that behavior.

So I'll really fly over the other part. Here, what we're interested in is an old question posed by Chuck Stevens many, many years ago: decoding the neural code — the neural code as the enigma of the brain. Can we understand which specific spike patterns carry information? In our case, we're interested in the patterns that carry information between the brain and the muscles: how do brains control their own muscles? And this is a problem which we have already seen at this conference. Simona Cocco was talking about essentially the same problem in a different disguise, right? The problem looks like this. You have this huge matrix — maybe not huge, but large — where you're recording from multiple different units at multiple different times. In her case, these were protein families, and the values were amino acids, and you were interested in understanding which combinations are over-represented. In our case, we're going to be looking at time slices over here, and then at the many renditions of the motion that the bird produces — in this case, the bird is trying to sing. For every rendition of the song we record whether, at each point before that song, a certain neuron in the brain has spiked or not spiked, right? So this axis over here is zero or one, spiked or not spiked, and then there are many, many different renditions of the same song. What we're interested in is understanding which of the spike patterns are over- or under-represented. Those patterns which are over-represented are possibly the keywords, in some sense, right? — the code words which the animal is using to control its muscles.
And if they are over-represented jointly with a certain feature of the behavior — which in our case is going to be the fundamental frequency, the pitch — then they are, at least correlatively, code words that encode the subsequent behavior, right? Exactly the same problem shows up in predicting the structure of proteins, in trying to understand the 3D structure of chromosomes, in trying to predict gene expression — you name it. There is a whole big field of very, very related problems across all domains of biophysics. The problem is that, at least in our hands, for the type of data that we're interested in, we cannot use the standard approximators for distributions, like restricted Boltzmann machines or anything like that, to find those over-represented patterns. The reason is that we just have too few data points. With the 40 or 50 units that we're recording from, there are 2^40 possible combinations, possible different code words. If we're lucky, we have 1,000 recordings, and 2^40 is much larger than 1,000: almost no patterns repeat more than a handful of times. The signal-to-noise ratio is tiny. You don't have a situation where a certain code word is massively over-represented; they are all just a bit above the expected noise level, right? And on top of that, there are no symmetries that you can use — no convolutional symmetries, translational symmetries, rotational symmetries. Those things don't exist in this type of system. So you are in a regime where you are massively under-sampled, and, at least in our hands, none of the black-box methods work. So what we are trying to do is write down this joint probability distribution as a standard energy-based-style model, and then try to identify which of its coefficients are non-zero with some reasonable probability, right? We're not even going to ask what these coefficients are. We're going to be very happy if we can say that this coefficient is non-zero with probability 80%, so that it's quite likely that this word actually contributes to the neural code. And this is enough for us, because then I can go to my collaborator and tell him: look, what you can do now is stimulate this neuron at this particular moment in time, and we can see whether the animal is going to respond in a certain way — is it going to sing the song that you wanted it to sing, right? We don't need to do everything with the machine; we just need putative things that can later be tested. And again, as I mentioned, the big problem here is that every neuron in this animal's brain is different from every other neuron, and every brain, every animal, is different from every other animal. We cannot pool data sets together across neurons or across animals, and we're horribly, horribly under-sampled, right? So how do we solve this problem? Our idea is that, again, we do not want to actually fit the distribution. We don't want to build a model of the probability distribution; we just want to understand which terms in that distribution are likely non-zero. What are the code words — not exactly what they mean, or how strongly they contribute to describing the distribution.
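As a sketch of what "energy-based-style model" means here — the exact functional form, and in particular how the behavior bit enters, are my reconstruction, not the paper's notation:

```python
# Sketch of the energy-based expansion described above: an unnormalized
# log-probability for a binary spike word sigma (N time bins) and a binary
# behavior bit b, with one coefficient theta_mu per candidate pattern mu
# (a subset of bins that all spike). The exact form and the coupling to b
# are my reconstruction, not the paper's notation.
import itertools
import numpy as np

N = 4                                  # tiny toy; the real data has ~40 bins
patterns = [mu for r in range(1, N + 1)
            for mu in itertools.combinations(range(N), r)]

def log_weight(sigma, b, theta):
    """Unnormalized log P(sigma, b): theta_mu contributes whenever all
    bins in pattern mu spike together with behavior bit b."""
    return sum(th * b * np.prod(sigma[list(mu)])
               for th, mu in zip(theta, patterns))

theta = np.zeros(len(patterns))        # 2^N - 1 candidate coefficients
sigma = np.array([1, 0, 1, 1])         # an example spike word
print(log_weight(sigma, b=1, theta=theta))
# With N ~ 40 there are ~2^40 candidate thetas and only ~1000 samples,
# so the question becomes which thetas are non-zero, not what they are.
```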
And the way that we do it is the following. First of all, we discretize the spike train: zero, there is no spike; one, there is a spike at a certain time point. The behavior we also discretize: singing above the median pitch or below the median pitch becomes zero or one. So we are not trying to fit this probability distribution. Instead, what we are going to do is multiply each term in this probability distribution by a binary variable s, an indicator variable. If this variable is zero, that term doesn't contribute; if the variable is one, that term contributes, right? And what we will try to do is estimate, a posteriori, given the data that we have, which of these s's are zeros or ones — and really not care what the thetas, the actual coupling constants, are. So I can put in a regularizing prior saying that the thetas are all very small, which is reasonable for this data, because none of the patterns occur at very large frequencies; they're all barely above chance, so there are no strong terms there. Every contribution is very, very weak. And so I put this very strong regularizing prior on the thetas, so that all the thetas are very small, and this gives me a small parameter — the typical scale of the terms that I expect there to be — which I'm going to call epsilon. I can then integrate out all of the thetas in this probability distribution to a couple of leading orders in epsilon, and this is what I get: the posterior distribution of the indicator variables — should I or should I not include a particular term, given the data I observed? It's an Ising model. But it's now an Ising model not on 20 or 40 or 50 spins; it's an Ising model on 2^20 spins, right? Because each particular pattern has its own indicator variable associated with it: pattern 0,0,0,0 has one indicator, 0,0,1 has another indicator, and so on. So it's nominally a much harder Ising model than I had before, except that the couplings, the epsilons here, are small, and so it becomes a problem which I can solve perturbatively. We have code for this which you can read, and we can of course calculate the values of these H's and J's, and they make sense, right? The H's are basically how over- or under-represented a particular pattern is compared to a null model. And the J's are actually kind of interesting: they tell us how patterns compete with each other. If, for example, I have a pattern of three spikes which can explain a certain over-represented structure in the data, then maybe this other pattern is also over-represented, or maybe this pattern with four spikes is also over-represented, right? Because they're all correlated with each other. And if I'm trying to assign some feature in the data to a certain pattern, maybe I should assign it to just one — to the most irreducible one, to the strongest one. And it turns out that the interactions between the patterns have a negative sign: minus the strength of the correlation between those patterns. So patterns that could co-occur suppress each other, and patterns that cannot co-occur actually co-activate each other.
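A sketch of the two leading-order ingredients of that Ising model on indicator variables, in the spirit just described — the exact prefactors from the perturbative expansion are omitted, and only the qualitative structure (fields from over-representation, couplings from minus the pattern correlation) follows the talk:

```python
# Sketch of the leading-order ingredients: a field H_mu measuring how
# over- or under-represented pattern mu is relative to an independent-bins
# null model, and a coupling J_munu set by minus the correlation between
# the two patterns' occurrences. Prefactors are omitted; the qualitative
# structure is what the talk describes.
import numpy as np

def occurrence(data, mu):
    """data: (n_samples, N) binary spike words. Returns a 0/1 vector:
    did all bins in pattern mu spike in each sample?"""
    return np.all(data[:, list(mu)] == 1, axis=1).astype(float)

def field_H(data, mu):
    """Log over-representation of mu vs the product of single-bin rates."""
    observed = occurrence(data, mu).mean()
    null = np.prod(data[:, list(mu)].mean(axis=0))
    return np.log((observed + 1e-9) / (null + 1e-9))

def coupling_J(data, mu, nu):
    """Patterns that tend to co-occur suppress each other's indicators."""
    return -np.corrcoef(occurrence(data, mu), occurrence(data, nu))[0, 1]

rng = np.random.default_rng(1)
data = (rng.random((1000, 5)) < 0.3).astype(int)   # fake spike words
print(field_H(data, (0, 2)), coupling_J(data, (0, 2), (0, 2, 4)))
```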
And so — I'm flying over this very quickly; this piece of work was just finished a couple of weeks ago — this is the type of thing that we observe if we do the analysis in this bird. This, for example, is a single neuron, and we try to understand what this neuron codes: what are the patterns that predict high pitch or low pitch? The blue ones are the patterns predicting high pitch; the red ones are predicting low pitch. This is the same for neuron number two. And this is again the same neuron: can this neuron predict, let's say, the volume, the amplitude at which the bird sings? Or the complexity of the song, the spectral structure of the song? What you will see — and this is kind of interesting biologically — is that these neurons are tuned. Some of the neurons predict pitch but do not predict the amplitude, but maybe predict the spectral complexity; some other neurons do the opposite. Another interesting story is that none, or very few, of the patterns here have just a single spike in them. Most of the patterns are combinatorial: a spike here, a spike here, and a spike here — and that combination means something, while an individual spike sitting here actually has no predictive power. So it turns out that this code we're dealing with is not a simple code where you just linearly add spikes; rather, precise spike patterns, down to about two-millisecond resolution — spikes happening at exactly the right points — are the language, the orthography, that these neurons use to communicate with the muscles. And we can take it further. I'm going to skip this part and just focus on one last slide. We can do the same analysis in a young bird and in an older bird. And it turns out that in the younger bird, all of the keywords that we observe have individual spikes in them, while in the old bird, they're all multi-spike patterns. So it seems that young birds, when they have just learned the song, are first trying to figure out how many spikes to put at the appropriate moment in time; but when they become older, knowing the number of spikes, they start figuring out where exactly to put them, right? And that's a very different mechanism for learning. When they're young, they're figuring out what the rate of firing should be; when they're older, they're figuring out when exactly to fire. And coming back to machine learning: it would maybe be nice to introduce this idea of timing into artificial neural networks and see if we can take them beyond learning rates, to learning the timing of action potentials, to computing with timing. Maybe that's going to make them somewhat more powerful — just as birds are still a lot more powerful at controlling their vocal cords than any of the machines that we can create. And I'm ending right here. So timing is the right keyword. Thank you very much.