Can you hear me? Is that on? Great. OK, well, thank you so much for inviting me, Sonya. It's an honor to be here. I'd like to talk about some relatively new work from my group. I'm really glad that Sonya organized this workshop, because it's a topic we've really just started working on in the last year, and it's the first time that I've had a chance to talk about some of this work. So you all are my guinea pigs as I try to give a coherent explanation of this basic approach. Let me start by just setting up the problem that I want to talk about. This is the same kind of data that Nikos and Rob both just talked about. Suppose we have some neurons in the brain; here I'm just depicting three neurons that you can record from simultaneously. Each of those neurons has a spike train that reflects how it responds, say, to some particular stimulus condition. For this talk, I just want to concentrate on simultaneous spike patterns. So, for example, I'm going to specify a bin width delta, and then I'm going to look at the pattern of spiking that occurs across all the neurons in the population within each bin. I'm going to represent these patterns as binary vectors. So in this last bin here, I have no spike, spike, spike, so that's 0, 1, 1. In the previous time bin, I have spike, spike, no spike, so that's 1, 1, 0, et cetera. I'm taking the raw data and reducing it to a sequence of binary patterns. For this talk, we're going to ignore the time structure of this data; we're really just going to look at modeling these binary patterns. So our basic goal is to build probabilistic models of such patterns. And I guess the high-level goal for my talk is that I'd like you to understand a little bit about why this is such a hard problem to address, and to get some intuition for the tools that we're using to solve it. The tools that I'll talk about mainly come from Bayesian nonparametric machine learning. OK, so why is this hard to do? Well, first of all, I'll show you that it's pretty easy for this example, where I only have three neurons, each of which can emit 0 or 1 spikes in a bin. I'll just say, by convention, that if we had more than one spike in a bin, we'll declare that to still be a 1. The object of interest here is the distribution over spike patterns. With three neurons, of course, there are two to the third, or eight, possible binary patterns, from 0, 0, 0 all the way to 1, 1, 1, and we're interested in the distribution over these patterns emitted by the three neurons. This is a group that's mostly silent, so the most common response is all 0s. Only rarely do all three spike at once, but you can see there's some variable distribution over the different patterns. And of course, this is really just a model of the marginal distribution of these spike patterns, p of x. For the problem that Nikos talked about, for example, decoding what an animal is representing, we might measure these distributions in different stimulus conditions: we're interested in understanding how these three neurons respond under one stimulus condition versus another. We need this fundamental object to do any kind of information-theoretic or decoding analysis. But I won't talk too much about the neuroscientific applications beyond that. OK, so for three neurons, as I said, this is relatively easy.
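To make that binning step concrete, here is a minimal sketch (my own illustration, not code from the talk) of turning per-neuron spike times into the binary words just described; the function name, the example spike times, and the 10 ms bin width are all assumptions.

```python
import numpy as np

def bin_spike_trains(spike_times, t_start, t_stop, delta):
    """Turn a list of spike-time arrays (one per neuron) into a
    (n_bins x n_neurons) matrix of binary spike words."""
    edges = np.arange(t_start, t_stop + delta, delta)
    words = np.zeros((len(edges) - 1, len(spike_times)), dtype=int)
    for j, st in enumerate(spike_times):
        counts, _ = np.histogram(st, bins=edges)
        words[:, j] = (counts > 0).astype(int)  # >1 spike in a bin still counts as 1
    return words

# Example: three neurons, 10 ms bins over a 30 ms window
spikes = [np.array([0.004, 0.021]), np.array([0.007]), np.array([0.006, 0.028])]
print(bin_spike_trains(spikes, 0.0, 0.03, 0.01))
```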
I can just go measure my population and make a histogram of how often I observe each of these possible words. The problem, of course, is that if I have M neurons recorded simultaneously, there are 2 to the M patterns, and 2 to the M minus 1 degrees of freedom in p of x: all of these numbers have to sum to 1, so the last one is just 1 minus the sum of all the others. There's one less degree of freedom in the distribution than the total number of patterns. The problem is that if I get up to large populations, say 100 neurons, I'm now talking about 2 to the 100th possible patterns, which is larger than 10 to the 30th. Even if you observed one of these patterns every millisecond for the entire history of the universe, you'd never get enough data to observe each pattern even once. So this is a giant problem for the histogram-based approach: observe your patterns, count how many times you see each one, put those counts into a histogram, and you're done. That approach won't work. So I want to talk about how we can make a tractable attack on this problem. One approach that's been followed in the literature is to look for restricted parametric models. The simplest kind of model you could think of would be an independent Bernoulli model. This says that each neuron spikes independently with some probability: it's like you've got a bunch of biased coins, and you flip each coin to determine what each neuron does. Each neuron has a single parameter that determines its probability of spiking. This is a very simple model to fit: all I have to do, for each neuron, is compute the fraction of bins in which it spiked. And there are only M parameters, one per neuron, so it scales very attractively to large populations. The con, of course, is that this model has very limited flexibility. In particular, it ignores the tendency of neurons to spike together. Arguably, this throws away the most interesting thing that we set out to study, which is the dependencies between these neurons. To go beyond this independent model, one popular approach in the literature is the so-called maximum entropy model. These are also called energy-based models. The basic idea is that we can define an energy function, which is often expanded in terms of the order of the interactions. We have first-order interactions, which are like the Bernoulli model: each neuron's tendency to spike or not spike. Then we can add second-order terms, which capture the tendency of pairs of neurons to either spike together or not spike together. If I stop at second order, this is a model that's often called the Ising model. Once I define this energy, the probability over patterns is just 1 over Z times e to the minus energy. One problem, of course, is that this denominator requires a sum over 2 to the M patterns. The advantage of this model is that it's very parsimonious: it can explain large-scale patterns in terms of pairwise interactions, if I stop at second order. And the literature has shown that this is a very good model for medium-sized populations. The con, of course, is that it's difficult to fit.
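As a rough sketch of the two parametric models just described (again my own illustration, with an assumed sign convention for the energy and assumed parameters h and J), here is how the independent Bernoulli fit and the Ising probability might look; the brute-force normalization in `ising_prob` is exactly the part that becomes intractable for large populations.

```python
import itertools
import numpy as np

def fit_bernoulli(words):
    """words: (n_samples, m) binary array -> per-neuron spike probabilities."""
    return words.mean(axis=0)

def bernoulli_prob(x, p):
    """Probability of one pattern x under the independent Bernoulli model."""
    x = np.asarray(x)
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))

def ising_energy(x, h, J):
    """Second-order energy: first-order terms h plus pairwise couplings J
    (J symmetric with zero diagonal); the sign convention here is one common choice."""
    x = np.asarray(x, dtype=float)
    return float(-(h @ x) - 0.5 * x @ J @ x)

def ising_prob(x, h, J):
    """Exact p(x) = exp(-E(x)) / Z, normalizing by brute force over all
    2**m patterns -- feasible only for small populations."""
    m = len(h)
    Z = sum(np.exp(-ising_energy(pat, h, J))
            for pat in itertools.product([0, 1], repeat=m))
    return np.exp(-ising_energy(x, h, J)) / Z
```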
Computing the normalizing constant involves summing over 2 to the M patterns, which is computationally intractable. And recent work from both Jonathan Victor's and Elad Schneidman's groups has shown that this model, at least the second-order model, is not adequate for describing large-scale patterns: we start to see systematic deviations from it. OK. So I want to start by briefly introducing one other parametric model that we can use as a replacement for the Ising model. This is a model that also has only second-order interactions; I'll call it the cascaded logistic model. It was introduced in a Cosyne poster from Maneesh Sahani's group. Basically, you impose an ordering on the neurons, and then each neuron is modeled with a logistic regression against the neurons that came before it. This model is very easy to fit, it has low computational cost, and we've actually shown that it's exactly equivalent to the Ising model in some cases. Take that fact and set it aside; we'll come back to it in a little bit. OK. So the big picture I'd like to paint for you is that we have a sort of parametric continuum of models. At one extreme, for M neurons, I have 2 to the M, or really 2 to the M minus 1, parameters to represent the distribution with a histogram. At the other extreme, I have an independent model with only M parameters. Both the Ising and the cascaded logistic model sit in between, with order M squared parameters. And at this other end down here, these are massively parametric models: 2 to the M parameters is so huge that I might as well call it infinite. So I'm going to call these practically non-parametric models. OK, so what I want to do now is think about a particularly effective way of introducing non-parametric models for binary patterns. First, let me just introduce the idea of a non-parametric model. The idea, and this is a big idea in statistics, is that instead of fixing our parameters in advance, we would like the complexity of the model to grow with the data; in other words, we'll add more parameters to the model as we see more data. If I give you a small data set, you can't specify very many degrees of freedom in your model, so you should use a relatively simple model. If you get a larger data set, you can start to use a more flexible model with more parameters. And Bayesian non-parametrics just give you a nice way of setting this up so that the model automatically invokes more degrees of freedom as I get more data. OK, so let me tell you about the particular Bayesian non-parametric model that we'll use. It's called a Dirichlet process, and basically it's just a way of defining a prior over those histograms in a particularly nice way that I'll describe. The Dirichlet process has two components: this alpha, and what I'll call G theta here. G theta is called the base measure; that's the technical name for it. This is just a parametric distribution of the kind we've already seen: think of the Bernoulli, Ising, or cascaded logistic model. These are simple models for binary spike patterns. The standard approach in the literature, of course, is for this base measure to be uniform over all patterns, so we assume a priori that all patterns are equally likely.
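Here is a rough sketch of the cascaded logistic idea mentioned a moment ago: impose an ordering on the neurons and fit a logistic regression of each neuron on the neurons that precede it. This sketch leans on scikit-learn's LogisticRegression (whose default regularization differs from a plain maximum-likelihood fit), and the function names are my own; the exact parameterization in the Cosyne poster may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cascaded_logistic(words):
    """words: (n_samples, m) binary array. Returns one model per neuron."""
    n, m = words.shape
    models = [float(words[:, 0].mean())]        # neuron 0: plain Bernoulli rate
    for i in range(1, m):
        X, y = words[:, :i], words[:, i]
        if y.min() == y.max():                  # degenerate: neuron never/always spikes
            models.append(float(y[0]))
        else:
            models.append(LogisticRegression().fit(X, y))
    return models

def cascaded_prob(x, models):
    """p(x) = prod_i p(x_i | x_1..x_{i-1}) under the fitted cascade."""
    x = np.asarray(x)
    p = models[0] if x[0] == 1 else 1.0 - models[0]
    for i in range(1, len(x)):
        mi = models[i]
        pi = mi if isinstance(mi, float) else mi.predict_proba(x[:i][None, :])[0, 1]
        p *= pi if x[i] == 1 else 1.0 - pi
    return p
```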
But what's new about our approach is that, instead of that uniform base measure, we're going to use these simple parametric models, exploiting the fact that some of them can at least approximate the distributions we're interested in reasonably well. I'll tell you more about that in a second. The other piece of the Dirichlet process is the concentration parameter, which basically controls how far our distributions are allowed to be from this parametric model. I think it's probably easiest to explain this with a picture. Think of the space of all possible distributions p of x; remember, I said there are 2 to the M minus 1 degrees of freedom in that. Within that space, I can think of the manifold of distributions defined by a simple parametric model. Now, if someone gives me a true distribution, this black point here, it may not be exactly described by one of the simple parametric models that I've defined. However, with a relatively small amount of data, I can fit the parameters theta that move me along this manifold and get a close approximation to the true distribution. So that's the parametric piece of the Dirichlet process. The other piece, of course, is this parameter alpha, which allows me to move off of the manifold, and alpha controls how easily I can do that. Small values of alpha allow me to move off the manifold very quickly; large values of alpha mean that I stay very close to the manifold. I'll show you one other way of thinking about the Dirichlet process. It's this very simple formula here, which is called the predictive distribution. This is the distribution over new binary patterns given all the binary patterns I've seen so far. What it says is that, given that I've already observed n binary patterns, the probability that the n-plus-first pattern will be the k-th pattern is proportional to the number of times I've already observed that pattern, n sub k, plus alpha times the probability of that pattern under the base model. In other words, the parametric model assigns a certain probability to that pattern, that's this G of k, and I've also already seen in my dataset how many times that pattern came up; these two things together determine the probability of seeing that pattern again. So there's a tension between these two pieces: one is how much probability mass my parametric model assigns to that pattern, and the other is how many times I've already seen it. Pay attention to the data I've already seen, versus pay attention to the parametric model. And so we have two extremes of the Dirichlet process. If alpha goes to zero, we basically ignore the parametric component and just look at the past data; this is just a pure histogram model. At the other extreme, if alpha goes to infinity, I effectively ignore how many times I already saw that pattern and pay attention only to the parametric model. So if we learn alpha, it will tell us, if I go back a slide here, how far off the manifold of parametric models we are. OK, here's a final attempt to give you some intuition for why this is a smart way to proceed, with a simple example. I've recorded from five simultaneous neurons, so there are 32 possible patterns, and let's just say I've observed 15 samples so far.
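As a small sketch of the predictive rule described a moment ago (the function and argument names are my own assumptions): the probability of the next pattern mixes the empirical count n_k with the base-measure probability, weighted by alpha.

```python
import numpy as np

def dp_predictive(pattern, observed_words, alpha, base_prob):
    """Predictive probability that the next word equals `pattern`:
    (n_k + alpha * G(pattern)) / (n + alpha), where n_k is how often the
    pattern has occurred so far and G is the parametric base measure."""
    n = len(observed_words)
    n_k = sum(np.array_equal(w, pattern) for w in observed_words)
    return (n_k + alpha * base_prob(pattern)) / (n + alpha)

# With alpha -> 0 this reduces to the raw histogram; with very large alpha
# it ignores the counts and just returns the base-measure probability.
```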
So, coming back to that example: I haven't even observed as many samples as there are degrees of freedom in my distribution; there are 31 degrees of freedom in the distribution over simultaneous responses from five neurons. I've observed these patterns here. What do you notice about them? They're mostly zero, right? This is sparse data; spiking is relatively rare. So if I've observed just these patterns and I now ask you, what's more likely as my next pattern: is it more likely that I'll observe an all-ones pattern, or that I'll observe all zeros except for one neuron spiking? I should say that neither of these particular binary patterns ever occurred in the dataset I've already observed. So a pure histogram model would assign each of them equal probability: I've never seen either one before, so I don't know what to do. But if instead I say, well, I can tell that spiking is rare, and I use a parametric model that says individual spikes are rare, I'll correctly infer that, based on this data, it's much more likely that I would see this pattern than that one. Okay. I'm going to skip over inference; basically, the model has very tractable inference properties. We can do maximum likelihood inference for alpha, which controls how close I am to the manifold, and theta, which controls where I am along the manifold. And we end up with what I'll call a universal binary model. I'm using the term universal because these models will converge to the correct distribution for any data. You can think of it as taking your histogram model and centering it on a histogram defined by a parametric model. All right, let me show you a simulated example. I created a fake data set where the true data came from an Ising model, that is, a maximum entropy model with only second-order interactions, plus some extra synchrony; so there's some tendency for large groups of neurons to spike all at once. What I'm showing you here is the Jensen-Shannon divergence between the true distribution, which is defined by that model, and my estimated distribution, as a function of how much data I've observed. If I start with a pure histogram model and I have a small amount of data, I'm far away: I make big errors if I use a pure histogram model, whereas the independent Bernoulli and cascaded logistic models do much better. However, as I gain more data, the histogram model converges exactly to the right distribution, given infinite data, whereas these other parametric models are not capable of flexibly representing this distribution, so they saturate at some level; they never get down to zero Jensen-Shannon divergence. What we'd like is to capture both of these regimes with the universal model. By fitting the universal model that I told you about, we get this curve here; this is the universal model with the Bernoulli base measure, and here's the universal model with the logistic base measure. You can see that with small amounts of data they do much better than any of these models, and then they also converge to the correct distribution as they gather more data, okay? Here are the same kinds of plots for some actual data collected by Arnulf Graf in Tony Movshon's lab: recordings from 10 V1 neurons with a Utah array.
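For reference, the distance measure used in these comparisons, both for the simulated data and for the V1 data that follows, is the Jensen-Shannon divergence between two pattern distributions; a minimal sketch of computing it from length-2^m probability vectors (reported in bits) might look like this.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Example: a 'true' distribution over 8 patterns vs. a crude estimate
p_true = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01])
p_est = np.ones(8) / 8.0
print(js_divergence(p_true, p_est))
```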
We're showing again the distance from the test distribution as a function of how much data I've observed. The histogram model is bad for small amounts of data but good for large amounts of data. The Bernoulli model starts out better but saturates; it doesn't converge. If we make a universal model using a Dirichlet process, with either the Bernoulli or the cascaded logistic base measure, we do much better. And here you can see the log likelihood ratio of patterns that I saw in the actual data versus how well they were predicted under the model. So for the universal model, after we get to this last point here, there's on average no discrepancy between the frequency with which those patterns are predicted and the frequency with which we actually observe them in test data. How am I doing on time? Five minutes? So that's part one, and the second part we could skip entirely. Wait, is that five minutes left, or 25? Oh, this is 25 minutes total, good. Okay, so I don't have to rush quite so much, but let's move to the next example. So for the second piece, I'll talk about applying these same universal binary models, this idea of centering a flexible nonparametric histogram model on top of a constrained parametric model, to something else they're very useful for: entropy estimation. So entropy is this quantity here. Let P be a distribution over binary patterns; it's got 2 to the M elements in this vector, which sum to one. The entropy of that vector is defined in the following way, where p sub i is the probability of the i-th pattern: it's the negative sum of p sub i log p sub i, summed across all 2 to the M patterns. Most of you are probably familiar with entropy. It's a very important information-theoretic quantity, and it measures the population's representational capacity: it gives an upper bound on how much information could be carried by these binary patterns, by the whole population. Now, of course, it's difficult to estimate this quantity, because for large neural populations we never observe most patterns. We'll observe some patterns more than once, but many patterns, like all ones, we may never observe in our data set. And it turns out that you can show, analytically, that all entropy estimators are biased. Generic entropy estimators tend to regularize the estimate of entropy in the following way: they assume that all unobserved patterns are equally likely. In other words, they make a histogram of the observed patterns, and they look at all the patterns you didn't observe and treat them all the same. Obviously we don't have to do that, and we're going to use the same idea from the universal binary model: we can use a parametric model to smooth our histogram across the space of unobserved patterns. So the idea is to use the universal model for entropy estimation. What we can exploit is the fact that the posterior distribution over P belongs to the Dirichlet family; if that doesn't mean anything to you, you can just ignore it. We get this nice formula for the expected entropy. This is the Bayes least squares estimate of entropy given a data set. It involves basically a giant sum over these digamma functions, and if you can do this in your head, you win extra points. Three minutes? Okay, good. Basically, ignore these equations; this is just to show you that there is some content there, not just words. This sum is giant, okay?
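For contrast with that Bayesian estimate, here is a minimal sketch of the naive histogram ("plugin") entropy estimate that these methods are compared against; unobserved patterns implicitly get probability zero, which is one reason the raw estimate needs so much data. The function name and example words are my own.

```python
import numpy as np
from collections import Counter

def plugin_entropy(words):
    """Naive 'plugin' entropy estimate in bits from observed binary words
    (rows of a binary array): count each word, normalize, and evaluate
    -sum p log2 p over the observed patterns only."""
    counts = np.array(list(Counter(map(tuple, words)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Example usage with a small set of observed words
words = np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
print(plugin_entropy(words))
```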
That sum in the expected-entropy formula has 2 to the M terms, but it's tractable for the appropriate choice of parametric model. And I'll just tell you about one new parametric model here; this actually comes from a model that Mateo has worked on. We call it the total count model: the probability of any pattern a priori, before seeing any data, depends only on how many spikes it contains. This captures the same kind of tendency, that spikes are rare: I'm unlikely to see all ones, but I'm more likely to see a word with only one or two spikes in it. We can look at this for simulated data now, where the true data had basically two modes in it. The x-axis here is the number of spikes per word. There were many words with only one or two spikes in them, and there were also many words with around 10 spikes; so there was a synchrony component with around 10 neurons spiking at once. This is for a population of 30 neurons. And you can see that the entropy estimator using this count model inside a Dirichlet process converges very rapidly to the true entropy, whereas these other state-of-the-art methods, the estimator from Liam Paninski or the Nemenman-Shafee-Bialek estimator, take a lot more data before they converge. We can also do this for power-law data, and again we get good convergence for both of these universal models. As for the theoretical properties of this entropy estimator: it's consistent, meaning it's guaranteed to converge for any data, but moreover it converges much faster when the data exhibit statistical regularities. The intuition I want you to have is that when the data look like the parametric model, we can get close with a relatively small amount of data, and then we can get the rest of the way using the slack that the Dirichlet process allows us. We can capture ways in which the histogram of the actual data differs from the one described by the parametric model, and we can do so in a way that guarantees we'll always get the answer right in the limit of infinite data. Okay, so here's an application to real data. This is data from E.J. Chichilnisky's lab: 27 ON and OFF parasol cells. Here's the histogram of the number of spikes, and you can see that even with very small amounts of data, these universal estimators converge very quickly, whereas the plugin estimator requires several orders of magnitude more data before you get an accurate estimate. Okay, so just to sum up: I've talked about one particular way to build a flexible model that can scale to large data sets. I've called it a universal binary model. The basic idea is a Dirichlet process centered on a parametric model, where the parametric model reflects something that we believe about spike trains. What makes this unique, I think, is that we're exploiting some domain-specific information about neuroscience: the fact that we have some good parametric models that are adequate, but not perfect, for actual data sets. These models combine the parsimony of a parametric model with the flexibility of a histogram, or non-parametric, model. Of course, we still have to worry about the choice of parametric base model. I've talked about three possibilities here: the Bernoulli, the cascaded logistic, and this total count model, which is very useful for entropy estimation.
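Finally, a hedged sketch of how the total count base measure mentioned above could be written down: a pattern's probability depends only on its spike count K, with the probability of "K spikes" spread uniformly over the C(m, K) patterns that have that count. The uniform spreading, the add-one smoothing, and the function names are my assumptions, not necessarily the exact construction used in the talk.

```python
import numpy as np
from math import comb

def fit_count_distribution(words):
    """Empirical distribution over spike counts K = 0..m, with add-one
    smoothing so that no count is assigned exactly zero probability."""
    n, m = words.shape
    q = np.bincount(words.sum(axis=1).astype(int), minlength=m + 1) + 1.0
    return q / q.sum()

def total_count_prob(pattern, q):
    """Probability of a pattern that depends only on its spike count K:
    q[K] spread uniformly over the C(m, K) patterns with that count."""
    pattern = np.asarray(pattern)
    m, K = len(pattern), int(pattern.sum())
    return q[K] / comb(m, K)
```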
Okay, that means I have to stop, but I have just two more slides. I wanted to zoom back out for a little bit. I talked about one particular way to scale up, but for those of you who are interested in statistics and how to build models that scale, I have a short list of ideas that I think are very useful; various other pieces of work in my lab try to exploit these other tools, and I'll just list them briefly. These are statistical ideas to exploit for building large-scale models. The one I focused on today was hierarchy, the idea that we can let complex models borrow from simple ones: we start with a parametric model and use it to inform a non-parametric model, a larger model that's more complex; they have a hierarchical relationship to each other. The other ideas you've heard of many of: sparsity, that many of the parameters should be close to zero; smoothness, that nearby parameters should be close to each other; locality, that dependencies are local; and low-rank structure, the fact that spatial and temporal structure can often be separated out. So there's a lot of work to be done here, and it's a really exciting time to be in computational and statistical neuroscience, because a lot of these ideas are coming up in machine learning and pure statistics, but I think there's domain-specific information about neural data that we can use to make these ideas even more powerful. So thank you very much. And I just want to thank my collaborators: my postdoc, Memming, and my students, Evan and Kenneth, who have collaborated with me on all this work. Thanks. Hi, thanks for this interesting talk. I'm just wondering: you showed one slide very quickly, and if I remember correctly it had 30 neurons or so of real data, right? How do you test your models on real data in that case? Sorry, which one do you mean? This one? Yeah, 27 ON and OFF cells. My question is, how do we know what the real distribution is? Can you really test against the real histogram? We don't know the real one, yeah; so there's no line for true entropy here. There's no line, okay. But basically we have enough data here that the estimators are all converging to each other. So we think that out here our residual uncertainty is only about that wide. But you're right: in the previous figure, I showed the true entropy, because I simulated the data, but here we don't know it. So it would be difficult to do this for 100 neurons, because the estimators actually won't converge given the amount of data. So there would still be no ground truth, in other words; you'd have to trust that the other models are converging at some point. Yeah, I think we'd have to make arguments based on simulations. In some sense, you're still going to have big error bars, because with 100 neurons you can't really have high certainty about the value. And just a very brief question, maybe a trivial one. What if you add some extra information: let's say you have 100 neurons, and somebody tells you that 30 neurons are firing. Adding that information, can you reduce the number of parameters? You know what I mean? So let's say X neurons are firing; if I know this, what's the probability under the whole model?
If I add that information, can I decrease the number of parameters needed to build the model? I guess it seems like you're saying you've got some additional information. The way I'm thinking about it is this: I didn't talk very much about how we design a base measure, but we could design base measures by combining different kinds of models. So say someone tells me these neurons are independent except when 30 neurons fire together. We could form a mixture of two base measures: one is the Bernoulli model, so sometimes the words come from that model, and other times we see 30-neuron synchrony. We can mix together those two distributions in the same framework very tractably; we would have two alpha parameters that control how much each of these distributions governs our real data. So I think this is an area where there's a lot of potential for new exploration: what are other tractable models, or I should say, what other statistical regularities in neural data might you want to capture with the parametric component of the model? Basically, the more work you can do with the parametric component, the better off you are; you don't need to rely as much on the slack from the nonparametric part. I think we should move on to the discussion. Oh, sorry. Sorry. Thank you. One more question, a short one. Is there any chance to apply this analysis in a time-dependent manner? Not as we've done it so far, but I think that's a great question. We've mostly ignored temporal structure, though I should say we have done one thing with temporal dependencies, which uses this same idea of hierarchy. If you think about temporal dependencies over a long period of time, you've got too many possible histories, so we can use the same idea of smoothing with a simpler model: a simple model would be to just look one time step back, fit that model, and use it as a prior for the bigger model. That's called a hierarchical Dirichlet process. We've looked at that for single neurons, but we haven't looked at it for populations. So I think there are two big unanswered questions here. One is building in time dependence; what I'm describing is a slightly crude way to do it. The other is that there's no stimulus dependence, nothing explicit: I'm basically imagining there are two stimulus conditions, and we measure p of x under one stimulus and then under the other. But something like the GLM, with a time-varying stimulus, couldn't be applied very easily here. Okay. Okay, thank you. Thank you very much again. Thank you.