As Sonia said, I was trained in statistics. But I'm also part of a very large collection of people at Carnegie Mellon working, generally speaking, in machine learning. And we started mostly through statistics in the School of Computer Science. I actually joined the School of Computer Science to help create a machine learning department some years ago. And when we did that, we had a retreat to discuss different perspectives. And one of my computer science colleagues during this retreat got up and he said, you know, I finally figured out the difference between statisticians and computer scientists. Statisticians want to solve problems with 10 parameters and get it right. Computer scientists want to solve problems with 10 million parameters and get an answer. So these two disciplines have kind of merged together. And now what I think we're all after is, really, how can we get it right while also scaling up? I'm going to talk about this a little. But I have to warn you, I don't have deep or insightful things to say. It's really just kind of the emerging frontier now. But maybe my perspective is a little different from some others you might hear because I'm a statistician. So we are interested in, and by the way, at the beginning, I'm going to repeat a lot of what Nico said in different words. So we're interested in the problem of what's often called neural coding, which is to elucidate the representation and transmission of information in the nervous system. And the first answer to how neurons represent and transmit information came from a series of results. Here I've listed three different sets of investigators, all of whom ended up getting Nobel Prizes for work on different systems, partly articulating this general fact that neurons transmit information by modulating firing rate. And I'm just going to show you some data from Hartline in the 1930s. 
He recorded from the horseshoe crab, or Limulus, which has a large optic nerve coming from each of its eyes. And he recorded from a single neuron while shining a light on the retina. And here's a bright light. You get rapid discharges. These are the spikes that Nico talked about. If you let the light get dimmer, you get less rapid discharge. And dimmer still, even less rapid. And the point is that the firing rate seems to encode the strength of the stimulus. Now these data were interesting, but they're a little deceptive. Oh, and by the way, he called the place where you have to shine the light in order to get that particular neuron to respond the receptive field, which is a term I'll come back to in a second. These data are deceptive because if you look at most data in cortex, the outer layer of the brain, the part that most people, I think, are interested in, you see a lot of variability. So here's one experimental trial. And here these crosses are the spikes as they occur across time. This is when a particular cue emerges. And this is time in milliseconds after the cue. Then the second line is the second trial, which is as nearly identical as the experimenter can make it to the first trial, and so forth. There are 15 repetitions. There's a lot of variability. This is one of the things that I wanted to impress on all of you. There's a way you can summarize this that's very popular. It's a nice histogram where you just average across the trials and normalize so that you get spikes per second here. And it shows the general tendency over time for the neuron to fire. But it's still, as you can see, very noisy. This is why there are interesting statistical problems. So the neural coding question, first of all, is what stimuli or actions will drive neurons in a particular part of the brain to fire rapidly? 
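That trial-averaged histogram, usually called the peri-stimulus time histogram (PSTH), is simple to compute: count spikes per time bin, average over trials, and convert to spikes per second. Here's a minimal sketch in Python; the spike times and bin width are made up for illustration:

```python
import numpy as np

def psth(trials, t_max_ms, bin_ms=50):
    """Peri-stimulus time histogram: average spike counts across trials,
    normalized to spikes per second."""
    edges = np.arange(0, t_max_ms + bin_ms, bin_ms)
    counts = np.zeros(edges.size - 1)
    for spike_times in trials:
        counts += np.histogram(spike_times, bins=edges)[0]
    # Average over trials, then convert counts per bin to spikes/s.
    return counts / len(trials) / (bin_ms / 1000.0)

# Three hypothetical trials (spike times in ms after the cue).
trials = [[12, 55, 58, 140], [10, 60, 150], [15, 52, 57]]
rate = psth(trials, t_max_ms=200, bin_ms=50)
print(rate)   # spikes/s in each 50 ms bin
```

Even after this averaging, the estimate in each bin remains noisy, which is exactly the statistical problem the talk describes.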
But then as time went on, as Nico was saying, as we get into, say, the 1980s or so, people started asking more complicated questions. How is information represented in ways that go beyond the gross changes in firing rate? And in particular, how does the network of neural responses vary with experimental conditions? So throughout, the statistical aspect is that we have to identify the signal together with some extraneous noise. And that's why there are statistics. So here's again the Utah array, which was used for the data I'm going to show you by a former computer science student, Ryan Kelly, a very talented guy who did a postdoc with me for a short period before going on to Google. And here are these raw tracings coming off the electrodes that Nico talked about, filtered either for a local field potential or for spikes. These are a bunch of waveforms that repeat. And when they do, they're considered to be spikes from the same neuron. And then the next step is to show where they occur in time. The blue is another set of spike waveforms from a different neuron. So you get a second spike train off that electrode. And here's a nice movie that Ryan put together to show the nature of the data that you can get. This is primary visual cortex from an anesthetized animal who's being shown a movie. And you're going to see the movie flash up in front of you here. And each circle is the receptive field for a different neuron. When it's red, that neuron is firing. When it's black, it's not firing. And down below, you'll see the spike trains as the movie evolves. So the point, really, the reason I'm showing you this, is to give you the sense that there's a very rich set of data that can be collected on these arrays. It's complicated: large numbers of neurons. And it allows people to ask more complicated questions about the structure of the network. 
I also wanted to mention a nice graph that was made by Ian Stevenson and Konrad Kording a couple of years ago, where they looked at the number of neurons being recorded versus the date of publication. This is a log scale. And there's exponential growth. So now we have a couple hundred neurons. This number is going to grow. We can expect it to grow substantially over the coming years. So there's more and more interest in methods for analyzing this kind of data. So how does the network of neural responses vary with experimental conditions? The complications here: we have dozens to hundreds of neurons. There are unknown interactions. There are multiple timescales involved. And many possible measured, or often unmeasured, drivers of activity. Now, the data itself, these spike trains, actually, the data set is not very big by today's standards. Here it's around 3 times 10 to the 8th data values. The big challenge is the complexity of the relationships. So we don't yet have comprehensive analytical methods. We do have, I think, a solid statistical framework that can produce some simple network descriptions. Nico showed some. I'll show you some more. Sonia's worked on this quite a bit, Jonathan too; there are a lot of people doing a variety of things. I think they're all fairly closely related. And I think we are at a stage where we can articulate the problems that need to be solved in order to scale up. So in the remainder of the talk, I'll give you a quick example and just recapitulate. So this is just an example. Not something we've done already, but something we're trying to do with one of the neurophysiologists in Pittsburgh. And it's similar to something, again, that Nico showed. There's recording in two brain areas here, the frontal eye field and V4. And the general question is what happens in the brain during a visual attention task, when the monkey has to concentrate on one particular location in space. 
And one of the appealing notions, about which there's, I would say, fairly strong circumstantial evidence, is that the two areas will communicate through these oscillations of the type that Nico showed. And the notion is that one area will oscillate, that the other area will tend to oscillate along with it, and that this will produce synchronous firing in one of the areas, say, here. And that will effectively increase the firing rate of neurons in that area. So the statistical step that I wanted to take was to establish the relationship of the synchrony to the oscillation. Or, for that matter, to other possible variables, the question being how much of the synchronous spiking is due to the oscillation. So now let me show you a little bit more about the way we go about this. You can get a feeling for the nature of the computation. Spike trains occur, in theoretical contexts, meaning the math, in continuous time, shown here. And here the natural framework in probability theory for stochastic events in time is what we call point processes. But the data occur in discrete time, typically at the one millisecond level, as a binary time series, a series of zeros and ones: one when there's a spike, zero when there isn't. If we're looking at synchronous firing, then we're talking about the case where, let's say, two different neurons fire in the same time bin. This time bin, in our work, was five milliseconds. And here are two different pairs. One pair over here, in a raster plot again like the one I showed you earlier, is two different neurons. Here's a different set of two neurons. And the circles indicate the times at which both neurons fired in the same time bin. And it turns out that these two pairs were picked out to show very different behavior with regard to synchrony, though you cannot see it with the naked eye. One of the sources of this difference is illustrated in this picture. So here's one neuron on many, many repeated trials, 125 repeated trials. 
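The discretization just described, spike times turned into a binary series and synchrony defined as joint firing in the same 5 ms bin, can be sketched in a few lines of Python. The spike times and helper name here are invented for illustration, not taken from the actual analysis:

```python
import numpy as np

def bin_spikes(spike_times_ms, t_max_ms, bin_ms=5):
    """Convert spike times (ms) into a binary vector: 1 if the bin
    contains at least one spike, 0 otherwise."""
    n_bins = int(np.ceil(t_max_ms / bin_ms))
    binned = np.zeros(n_bins, dtype=int)
    idx = (np.asarray(spike_times_ms) // bin_ms).astype(int)
    binned[idx] = 1
    return binned

# Two hypothetical spike trains over a 100 ms window.
a = bin_spikes([3, 22, 47, 48, 90], t_max_ms=100)
b = bin_spikes([4, 30, 49, 91], t_max_ms=100)

# Synchrony: bins in which BOTH neurons fired.
sync_bins = np.flatnonzero(a & b)
print(sync_bins * 5)   # onset times (ms) of the synchronous bins
```

The statistical question is then whether the count of such joint bins exceeds what independent firing at each neuron's own rate would predict.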
And you'll see these bands of activity which indicate that, and I should say, the stimulus was changing across time. So these bands of activity indicate stimulus-related firing. So here's a place where the stimulus was driving this neuron to fire. Down here, you'll see, maybe not a vivid representation, but nonetheless, there are bands of activity where this neuron tended to fire for certain stimuli. Over here is a different representation. On every line, we have a different neuron's activity. So this is the first trial for 128 neurons. This is the second trial, again, as nearly as possible, exactly the same experimental conditions. Second trial for the same 128 neurons. And now you see bands of activity that seem to run across all of the neurons. But interestingly, that activity occurs at different times on different trials. So this appears to be not stimulus-related. And in fact, it presumably has to do with the fact that an anesthetic was used. There's a well-known phenomenon under anesthesia that you get these slow waves of activity going across large numbers of neurons, and that seems to be what's happening. But if you're gonna look for synchrony and you wanna see whether the synchrony is related to the stimulus, the problem is you're gonna have a lot of neurons firing synchronously during these slow waves. So one of the things that Ryan set about doing was to really untangle that, to remove that effect from the assessment of synchrony. So the models that we use are based on firing rate. As I indicated, this is the place you start. One way to think about what firing rate is: you just take some window of time, which could be broad, a big window, and you just count the number of spikes and divide by the length of time. 
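That window-count definition of firing rate is worth making concrete. A minimal sketch, with made-up spike times:

```python
# Hypothetical spike times (in ms) from one trial.
spike_times_ms = [12, 55, 58, 140, 310, 344, 512]

# Naive firing-rate estimate: count the spikes in a window and
# divide by the window's length.
window_start_ms, window_end_ms = 0, 600
count = sum(window_start_ms <= t < window_end_ms for t in spike_times_ms)
rate_hz = count / ((window_end_ms - window_start_ms) / 1000.0)
print(rate_hz)   # spikes per second averaged over the whole window
```

As the next part of the talk explains, this single-number summary throws away the temporal structure, which is why the instantaneous-intensity view is preferred.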
That answer is useful for some purposes, but it suffers from a number of statistical defects, and what has evolved, which really began in the 1960s and has gotten more sophisticated recently, is what I would call the statistical signal processing answer. Firing rate is really represented by an instantaneous intensity of firing. It's a smooth function, and this is now within a point process framework. And what we say is that the probability of spiking in a small window of time, based on some driving variables, which could just characterize the experiment, is given by what's called the intensity, or conditional intensity. The probability is the intensity times the length of time, delta. And this is now what we call the firing rate function. So we now model that in various ways. So here's the variable that would determine the firing rate, and it involves the spiking history, that is, when this neuron has spiked in the past; this allows for non-Poisson spiking. And then there are other more interesting variables involving the stimulus, or this field potential that Nico talked about, or possibly the activity of other neurons. So statistical models of spike trains involve two things. There's a universal formula, it's rather simple, that characterizes the noise in the spike train as a function of the intensity function, and then there's also a model for the intensity. So here's the formula. The formula itself doesn't matter, but what matters is that we've got the probability of spiking in terms of this intensity function, and then we've got the model for the intensity, which has some effects for stimulus, history, and other variables. And this will now have anywhere from dozens to hundreds, occasionally thousands, of parameters. 
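In symbols, the setup just described is standard point-process notation (my notation here, not copied from the slides): the conditional intensity λ defines the spiking probability in a small interval of length Δ, given the spiking history H_t and covariates x_t, and the "universal formula" for the noise is the point-process log-likelihood for observed spike times t_1 < … < t_n on [0, T]:

```latex
P\bigl(\text{spike in } (t, t+\Delta] \mid H_t, x_t\bigr) \;\approx\; \lambda(t \mid H_t, x_t)\,\Delta,
\qquad
\log L(\theta) \;=\; \sum_{i=1}^{n} \log \lambda(t_i \mid H_{t_i}, x_{t_i})
\;-\; \int_0^T \lambda(t \mid H_t, x_t)\,dt .
```

The first expression is the firing rate function; the second is what gets maximized over the model's parameters θ.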
The model, in statistics, generates what we call a likelihood function, and that is maximized in order to get the estimates that we need in order to fit the model. So there's an optimization problem over anywhere from dozens to hundreds or thousands of parameters. This fits in the framework of modern regression, which is just like least squares, except it's more general. We have a response variable that's observed, which would be the spikes in our case. There's some function of variables that characterize the experiment, and then there's some notion of noise that gets combined with the deterministic part of the representation. So here we would call this point process regression, or it sometimes comes under the GLM, generalized linear model, technology. Again, in our case, the noise comes from this part of the model, and the deterministic part is the way the stimulus or spiking history or other variables affect the firing rate function. So we have a model for the intensity, and then we have a data-fitting method, which is typically maximum likelihood, again involving optimization over a fairly large number of parameters, and then the statistical inference. And I wanted to mention that this is most commonly based on resampling methods, which means that having done this fitting, which could take some time computationally, you then have to repeat the process hundreds or thousands of times in order to do the statistical inferences: significance tests or confidence intervals. When students do this on their own laptops, they often have to let the computer run overnight, and one of the efforts that people have started to make is to speed this process up. 
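As a concrete sketch of point-process (GLM) regression in discrete time: the snippet below simulates a binary spike train whose firing probability depends on a stimulus and on one bin of spiking history, then recovers the coefficients by maximizing the Bernoulli likelihood with Newton's method. The stimulus, coefficient values, and one-bin history term are all invented for illustration; real analyses use many more history and covariate terms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a spike train in 1 ms bins. Firing probability depends on a
# (hypothetical) stimulus covariate and on whether the neuron spiked in
# the previous bin -- a crude one-bin stand-in for spiking history.
T = 5000
stim = np.sin(2 * np.pi * np.arange(T) / 500.0)
beta_true = np.array([-3.0, 1.0, 1.0])   # intercept, stimulus, history

spikes = np.zeros(T)
for t in range(1, T):
    x = np.array([1.0, stim[t], spikes[t - 1]])
    p = 1.0 / (1.0 + np.exp(-(x @ beta_true)))
    spikes[t] = float(rng.random() < p)

# Design matrix for the Bernoulli GLM with a logistic link.
X = np.column_stack([np.ones(T - 1), stim[1:], spikes[:-1]])
y = spikes[1:]

# Maximize the likelihood by Newton's method (iteratively reweighted
# least squares) -- the optimization step mentioned in the talk.
beta = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1.0 - p))[:, None])
    beta = beta + np.linalg.solve(hess, grad)

print(beta)   # maximum likelihood estimates
```

Confidence intervals and significance tests for these coefficients would then typically come from resampling, refitting the model hundreds or thousands of times, which is the computational burden just described.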
So for synchrony, we want to look at covariate effects, what it is that drives synchrony, and I'm going to skip the methods here that we used; it actually took a fair amount of mathematical and conceptual work to come up with a way of fitting synchrony into this framework. But I'm just going to go back to these two pairs of neurons. It turns out that when you analyze these two pairs, in both cases, you have very strong evidence statistically for excess synchrony above what would be predicted by chance. When we then introduce a covariate for the slow wave activity, this is where the two pairs are different. In the left pair, it turns out from this regression analysis, you can see that the synchrony is due entirely to the slow wave activity, but in the right pair, it's not due to the slow wave activity, at least not solely to the slow wave activity. So this is a case where a pair of neurons shows stimulus-related synchronous activity. And we can then get the proportion of spikes that are explained by a given covariate. So we have the firing rate, or the time-varying firing rate. For the first pair, half the synchronous spikes are explained by the time-varying firing rate, and when you include the network history, you get all of them, at least up to statistical error. Whereas in that second pair, you never get more than about 50% explained, and the rest presumably is from the stimulus; that is, the stimulus is actually driving the synchronous spikes. And now what we're after really is not so much the slow wave activity, but the beta oscillations or gamma oscillations. Beta is what Nico talked about; gamma is a little faster. And currently we're looking at data to try to see how much is explained by those kinds of rhythms. So the challenges in getting it right while scaling up are, as I mentioned, not so much about the size of the data. 
However, I did want to mention that if we look at the spike trains, we have small data sets, but if we look at the original recordings, which Nico showed, the actual voltage tracings, that's actually big data. That's roughly 50 gigabytes for an hour of recording. And now you have the strategic question, which I think runs through all of big data. You can certainly see it in genomics, where you start with the alignment files, and then, to do almost all statistical analyses, for things like haplotype analysis, you go to variant call format, which is a drastically reduced form of the data. Here again, typically, you start off with the raw data and a drastically reduced format, and I think it's a strategic question: how far do we want to reduce the data, and exactly when? And I think it's an interdisciplinary question. I don't think it's something that should be decided upon merely for computational convenience. I trust that you all agree, but this is where you need interdisciplinary teams. Complicated relationships involve complicated statistical models and difficult computations. The hard part is getting it right. Now in our case of synchrony, there were more than 8,000 pairs of neurons. I just told you about two pairs, and one of the hard things in the statistics is that you're going to see synchronous activity purely by chance. You have to account for the 8,000 pairs, and this is what's sometimes called multiple comparisons, and we had to develop a new method for dealing with that in this particular context. Still comparatively easy, but it took about a year of work, at least part-time, between two professors. My point being that even the relatively easy problems, when you're trying to get it right, can take some real effort. The harder part is that the data can be sparse relative to the model complexity. 
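The speakers developed a new method for their particular setting; as a baseline illustration of the multiple-comparisons idea, here is the standard Benjamini-Hochberg procedure, which controls the false discovery rate when screening thousands of pairs at once. The p-values are made up:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling the
    false discovery rate at level alpha (BH step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject

# Ten hypothetical synchrony p-values, one per neuron pair.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected)
```

With 8,000 pairs, a naive 0.05 threshold would flag around 400 pairs by chance alone; procedures like this (or the custom method the talk alludes to) are what make the screening honest.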
So for example, if you look at triplets, rather than two neurons you look at three. Here, from our data, you see very few triplets, and if you wanted to study something about the way triplets are related to the stimulus, you would have a hard time because the data are sparse. So currently, there's lots of interesting analysis being done with parallel spike train data. Please keep in mind that these data are highly noisy, and so we need statistics. I would say that in the next five years there will be substantial progress, but it won't be through big breakthroughs, I would guess; those are rare. I think more will come from building on existing ideas in statistical machine learning, which are things like parallel optimization, online learning, and so forth. And especially this will involve using models with unmeasured drivers, where we can characterize some of the activity using variables that we introduce into the model, again, to try to identify things that we haven't already measured. So again, we need to be clever with the experimentation. Nico showed an interesting graphic that talks about the iteration of analysis and experimentation. This is also something I think is very important and will involve new ways of thinking. We also need to be clever with the computation and with the statistical analyses, but I think, importantly, as I've tried to emphasize already, these three areas kind of have to merge, and we all have to work together in order to be sufficiently clever to tackle the big data problems. Thank you. Questions? Go ahead. So, thanks for a very nice talk. One thing you didn't talk about was this problem of spike sorting. Yes. So is that something which is less of an issue, you think, for the Utah arrays, or how would... No. So this is, yeah. So why... So it's the problem that we all like to sweep under the rug. Ryan, as it happens, was very, very careful with that data set. He did it by hand. 
That is, he went back very carefully over all of the units. And actually only about 50% of them were high quality, where he was confident that they were well sorted. So is it a big problem? Absolutely. For those who are not aware, the problem is identifying those distinct waveforms; I showed you a picture that Ryan made where there were the blue waveforms and the red waveforms. That was the nice case. Equally commonly, there are very difficult problems in trying to distinguish them. And yes, Jonathan has worked on this recently. Many people work on it. It is a very important problem and one that definitely needs to be solved, especially for things like synchrony, as we're interested in. Yeah. Okay. Yes. Srijoy. Do you think your analysis would scale up to other data sets? Like, could you apply your analysis to Nico's data sets or the data sets that Jonathan's working on, or are they very specific to the data set that Ryan had in hand? No, no, I'm anxious to apply it to Nico's data. Absolutely. That is, I don't see a scale issue in what we have done so far, because we've tackled what I'd say is relatively easy. You can work in pairs of neurons. That's not hard in terms of scale. What's harder is when you want to combine larger numbers, and you have to figure out how you're going to do it because, as I said, the first issue is that the data are sparse. So you have to come up with a different paradigm. In fact, Jonathan's going to talk about one kind of paradigm to deal, I think, with larger numbers of neurons, but it requires different ways of thinking. Do you see a chance to team up, so to say, with computer scientists who do this mining of large sets of data, which then can be transformed into neuroscientific questions? Are you talking about if we had all the data, kind of? No, I think, for example, we heard this morning this talk about text mining. Yes. 
I think that these kinds of methods could in some cases be transferred to our problems as well, as we, for example, did with frequent itemset mining. Did you look into such kinds of... Well, I'm not, are you still, you're talking about text, you're going into journal articles and so forth still? No, no, no. It's about data. I'm talking about your scientific questions, your problems. Yes. Spike trains and so on. And I think people here have a lot of tools available which, so to say, need to be transformed to our kind of problems. Right, so yes, I'm certainly very anxious to see what the possibilities are. And I think, to me, the important thing is to build into any of those efforts, actually, maybe "build into" is not the right word, to take account of the analytical tools that people will want to apply. So if you're going to, let's say, put a lot of data sets up in a repository and you want to have relevant ontologies and so forth, you need to take account, I think, of the analytical methods that are likely to be used. So in that sense, I think it's absolutely essential to collaborate. Certainly there will be great opportunities for analysts to use those kinds of repositories. But only if they're useful to the analysts. Okay, thank you very much.