My name is John Victor, and as the program indicates, I'm going to try in an hour and a half to give an introduction to data analysis and neural coding, a really small subject. So I'll try. I hope that there's some interaction; I'd rather have a conversation and not get through all of my material. And in part because of that, I want to make sure I begin by thanking the people I've been talking with and working with over the years, developing some of these ideas with, so I don't fail to do that at the end because we're out of time. I will show you some data, but I will also try to communicate some ideas and some themes that might be harder to get from individual papers. In outline, I really wanted to talk about two main things. One is all the different tools that one has for describing an input-output relationship, of course in the context of a neural system, though many of them were not even developed there. And second, not so much describing an input-output relationship as characterizing one, which has a very different flavor, using information-theoretic methods. Each of these is a fairly large topic, so we'll have to content ourselves with just a framework, but hopefully that will be useful. There are two themes that I hope you will see as recurring. One is that there are a lot of different ways of doing each of these things, and why is that? I hope that will become apparent, so that you don't just choose the one that your advisor said or the one that you read about last. The other is that data analysis can almost always be viewed as some form of dimension reduction, usually using a model, and I think that's also just a very good way of thinking about it. So first, why so many approaches? Again, I hope you'll see these things emerge implicitly, but just to be explicit about it. And perhaps the most important is what Dr.
Lenick said this morning, that there can be many reasons that you model. One is that you might want to do something relatively classical, a structure-function correlation: what part of the nervous system does what thing, and the model structure helps you do that. Another is that you might take a different philosophical view, which is that you want to determine the minimal complexity of something. So you can rule models out, because models make predictions that certain things have to happen; you do an experiment, and you find that it doesn't. And a third thing, which is maybe the most relevant to some people here, is that you may want to have a compact description of an input-output relationship so that you can use it for another purpose: to design a piece of electronics that works the same way, or because you want to access one part of the nervous system. You can't do it directly, you have to go through the retina. You'd like to have a model for what comes out of the retina, and you don't really care about what happens inside. So one could have many different goals. Another axis is that the systems that we study can vary a great deal in complexity, and in many ways. They can be approximately linear, they can be highly nonlinear, they can be very broadly tuned, they can require very specific sorts of stimuli to drive them. And that influences the choice of methods. Experimentally, sometimes you can control the input; sometimes you can't control the input, but you can at least measure it; and sometimes you can't control it and you can't measure it either. So that also gives you different reasons for doing different things. And then when we get more specific about the outputs of neural systems, sometimes we're recording signals that are continuous, like intracellular voltages or the EEG.
Sometimes we're recording outputs that are discontinuous, like spikes, or events that at least we're considering to be discontinuous, just trains of events at particular times. So that is something you need to think about. And sometimes we care about what the typical response is, and that's all that matters for our purpose. Sometimes we really care about what the entire stimulus-response distribution is: we want to know about the variability, or how the variability depends on the input. It's not quite true that you can mix and match all of these things, but even if you can only mix and match a few of them, I think you can see right away that there will be lots of different strategies that one might want to use. So I'm going to mostly use the visual system as a model. But even before that, we can be a little bit more formal about what sorts of things we're going to try to do. The idea is that there's an input, which in the case of the visual system is spatiotemporal, and there's an output, which is, let's say, the response of a single neuron measured in some way. And we want to characterize that. We want to find the F. For formal purposes, it doesn't much matter whether the stimulus is a current or a voltage injection or a sensory stimulus such as light. Obviously if it's light, it might well be an image, so it's a function of space and time. And the response, as I suggested before, could be a large number of things: it could be a voltage, it could be a current, it could be a local field potential, it could be a firing rate. And when I say that we want to find F, what one usually means is that you have in mind some sort of a cost function that you're trying to minimize. Namely, you have an F, which is formally your model, that predicts the response, and you want to ask how different the response that I actually measure in the data is from the model. So you want to minimize that. But you may want to add some other terms when you find your F.
So for example, you may have some prior notions of what sorts of models are likely, and you could build that into the cost function; or you might decide that you really want to have simple models even if they're not so perfect, and you might want to build that into the cost function. So although one says we want to find an F, and often by reflex finds it by some sort of least-squares approach, what we really mean is that we're minimizing a cost function which could include other things besides the difference between the response and the model. If we want to talk about situations in which we care about response variability, we're not thinking of a stimulus which is just mapped by some function into a response, but we're thinking that there's a whole joint probability distribution, or conditional probability distribution, of the response given the stimulus. So our task is to find P, which is not quite the same thing. But there are important situations in which this problem reduces to the one before. For example, if our model of the probability distribution is that the response is something deterministic plus some additive noise, then you can probably see that this maps into the previous problem. We can also think about situations in which the noise is additive, but the noise depends somewhat on the stimulus. That's a slightly harder problem. The usual crank that one turns is called maximum likelihood. In this situation, what you want to do is find the probability distribution that maximizes the likelihood that you got the data that you did. The reason that this maps onto the previous problem is that the probability of getting a response given the stimulus is determined by how far that response is from the most likely response. That in turn means that what we want to do is minimize how big the observed response minus the deterministic prediction is, given that those differences are drawn from some noise distribution.
So often one can map maximum likelihood and probabilistic formulations into the deterministic case. There are two important cases of this. One is that if you're willing to think of the noise not just as being additive but as being additive and Gaussian, then in fact going through all this maneuvering leads you to a situation in which you're just doing a least-squares fit of the response to your model. So that's a common thing that one does. But another situation, which is actually quite different, is if you're dealing with a neuron output which spikes, so you can think of that as zeros and ones. Now the noise is the thing that maps the probability of a spike to either there was a spike or there wasn't. So that's not additive anymore. And that's a very different sort of situation, in which the least-squares fitting doesn't work. I'm not actually going to talk too much about that. But just keep in mind that if you really want to pay attention to the fact that you're dealing with a spiking neuron, you're not doing a least-squares problem anymore. Okay, so non-additive noise. There are more ways to go from here, just to get an idea of what you can jump into. It's possible that the noise might even depend on the response. And that adds layers of complexity to what you do. And it's possible, or maybe even likely, that you'd be recording from multiple neurons at the same time. If you don't care about noise, that's no big deal, because then you can just model each neuron separately. If you do care about variability, then it might be that fluctuations in one neuron are somehow coupled to fluctuations in another. So you have to model that. Things get very complicated, and one could start out perhaps by assuming that each neuron's fluctuation away from what it typically does is independent of any of the others.
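The first case above, additive Gaussian noise turning maximum likelihood into least squares, can be checked numerically. This is a minimal sketch of my own, using NumPy and a made-up one-parameter gain model (none of the names or numbers come from the talk): the gain that maximizes the Gaussian likelihood is the same gain that least squares finds.

```python
import numpy as np

# Toy model: response = gain * stimulus + additive Gaussian noise.
rng = np.random.default_rng(0)
s = rng.normal(size=500)                       # stimulus
r = 2.0 * s + rng.normal(scale=0.5, size=500)  # response, true gain = 2

# Least-squares estimate of the gain: argmin over g of sum (r - g*s)^2.
g_ls = np.dot(s, r) / np.dot(s, s)

# Gaussian log-likelihood of the residuals as a function of the gain g
# (up to an additive constant that does not depend on g).
def loglik(g, sigma=0.5):
    resid = r - g * s
    return -0.5 * np.sum(resid**2) / sigma**2

# Maximize the likelihood by brute force over a fine grid.
grid = np.linspace(0.0, 4.0, 4001)
g_ml = grid[np.argmax([loglik(g) for g in grid])]

# The two estimates coincide (to the grid resolution).
assert abs(g_ls - g_ml) < 1e-3
```

The point is only that maximizing a Gaussian likelihood and minimizing squared error are the same optimization; with a different noise model the two would diverge.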
It's a very strong assumption, but it has the advantage that this joint distribution now factors into problems in which you only have to model each neuron and its variability separately. So there's a tension: a strong hypothesis, but a hypothesis that at least allows you to get off the ground. If you don't want to assume conditional independence, then there's a natural sequence, which I'll probably get to talk about towards the end, of things that one can do to loosen that assumption. A couple of final things before we get to actual specifics. One often wants to study things like adaptation and learning, that is, how the response now depends on the recent history. That fits into everything we say; we don't have to add anything new. We don't have to think that this P or the F changes over time. We just have to remember that, the way we formulate it, it actually depends on stimulus history. So that doesn't add anything. The reason for stressing that is that we are making an assumption which sounds a little opposite, namely that we're assuming that this P or the F that characterizes the system is what's called stationary. Stationary means it doesn't change over time, but in the sense that it doesn't care about what the absolute time is. It doesn't care about whether you did your experiment on Tuesday or Wednesday. Of course it can care, on Wednesday, what you did an hour before. So our general black-box models are allowed to depend on past history; they're just not allowed to know what the clock on the wall says. And that's kind of an important distinction, and it allows one to do things that you couldn't otherwise do. Yeah? Could you maybe go back to Poisson spiking? I'm sorry? The Poisson spiking slide from before? Yeah. About the Poisson spiking, you said it's non-additive. Could you maybe explain that again? Yeah, yeah. Okay.
So let's say that your model predicts that the probability of a spike at a given time is 0.3. There can't be 0.3 of a spike at that time; there really will be a spike or there won't. So at that time the noise, namely the deviation from the prediction, will either be plus 0.7, because the 0.3 may go to a spike, or minus 0.3. If, on the other hand, the probability of spiking was, let's say, 0.9, then the noise would either be plus 0.1 or minus 0.9. So the noise has to force the output to 0 or 1 no matter what the probability was. It's simpler than it sounds. Okay. So I did want to talk mostly about the visual system, and to try to follow my own rules, I want to say something about what the goals are of the examples that I'm going to show. They fall into the categories of determining minimal complexity and phenomenological description, more so than anything else. One can speculate in a discussion about how things connect to structure, but unless we actually put the electrodes there, you don't know. So it's the second and the third bullet points that I want to emphasize. I'll do a whistle-stop tour through the visual system, and we will go from things that are approximately linear to highly nonlinear, but we'll stay in the range of things that are broadly tuned. If the talk went for three hours, we'd get to the narrowly tuned things. And since this is a system in which you can easily control the input, which is one of the selling points of studying vision, I'm only going to be talking about situations in which the stimulus is under the experimenter's control. I'll be talking both about continuous and spiking neurons, and mostly about situations in which we care about characterizing the average response, not the entire distribution. So yes, variability is important, but not so much for the examples that I'm going to be showing.
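The 0.3-versus-0.7 argument above can be made concrete with a small simulation. This is my own sketch, not anything from the talk: simulate a spiking output with a fixed predicted probability and look at what the "noise" (the deviation from the prediction) actually does.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.3                             # model's predicted spike probability
spikes = rng.random(100_000) < p    # actual 0/1 outcomes
noise = spikes.astype(float) - p    # deviation from the prediction

# The noise only ever takes two values: +0.7 (the 0.3 went to a spike)
# or -0.3 (it went to no spike). It is forced to push the output to 0 or 1,
# so it cannot be an additive term independent of the prediction.
assert set(np.round(np.unique(noise), 10)) == {0.7, -0.3}

# It does average to zero, but its whole distribution depends on p,
# which is exactly why a least-squares (Gaussian) treatment breaks down.
assert abs(noise.mean()) < 0.01
```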
You might care a little bit about the biology itself, and there will be a couple of themes that hopefully will emerge. The examples come from the retina, the thalamus, where the retina projects to, and visual cortex, and there's going to be a trend of increasing complexity. Signals will go from graded to spikes, and as was just mentioned, that forces us to go from additive noise to non-additive noise. The responses will go from approximately linear to very nonlinear, and variability will go from probably interesting but very well understood to even more interesting and not so well understood. Of course the talk is serial, and the anatomy of the visual system often is thought of as serial, but just in case people really care about the biology, I want to emphasize that visual processing itself is not serial. One sees this in the anatomy: nearly every retinal synapse is reciprocal. The lateral geniculate in the thalamus is referred to as a relay nucleus, but only about 10% of its inputs come from the retina; the rest come from brainstem and feedback from the cortex. So it's not really just a passage. In the cortex itself we talk about V1, then V2, then V4, and so on, and there's obviously a hierarchy based on the names if nothing else, but there are patterns of... I have a question related to the 10%. Is it an approximate number, or did you somehow calculate it? It's a number in the literature, not calculated by me. But I think it's a good ballpark number. It's more than 1%, much less than 30%. So it's really an oversimplification to think that most of the inputs to the lateral geniculate come from the retina. They don't. They are feedback or non-visual. I guess you're talking about visual models, in which case it would be more than... Yes, yes. I'm sorry to tell the modelers that things are actually fairly complicated. Okay, and just to mention the last point, that cortical areas are hierarchical in name.
There are differences in the structure of the laminar organization of feedforward and feedback projections, but whenever there's a feedforward projection there's a feedback projection too. I'm not talking about any of that, but I just want to make sure that I'm not misleading people. Okay, so we start in the retina, and there won't be a quiz on this, but just to mention the big things that we need to know. In this slide, light comes in from the bottom. There's a layer of photoreceptors here. The output to the brain is the ganglion cells, which are here, and the fastest way from the photoreceptors to the ganglion cells is via bipolar cells, which form this layer. Between photoreceptor and bipolar, or between bipolar and ganglion cell, there are synapses, and there are also horizontal connections, so there's the opportunity for lateral interactions. It's a really beautiful system, and each of these synapses is reciprocal. What's nice about it, especially for the purpose of this morning, is that as we walk through, we can actually add some of these different layers of the complexities of modeling. So, photoreceptors. We can think of a photoreceptor as really just getting a single point input. Not exactly, but we can think of it that way. So we only have to think of measuring and characterizing the response, which is a function of time, to the light input, which is, let's say, intensity as a function of time. And what we'd like to do is something better than simply give somebody a dictionary: this stimulus gives this response, that stimulus gives that response. We'd like to have a much more compact way of doing it. And one way to get off the ground is to make a strong assumption, which in turn will allow us to have a very compact, and it will turn out pretty accurate, model. That's the assumption that this transformation F is linear.
So linear is a word that people use in many, many contexts, and it has a very specific meaning here. Probably a lot of you know the meaning, but I just want to make sure I've said it so we all agree. Linear means two specific things. First, that this transformation obeys superposition. That means that if I look at the response to one input and I add to that the response to another input, I could also add the inputs and look at the response, and I get the same thing. So I can superimpose two different patterns of light and time, measure the response to the sum of them, and I should get the same thing as what I got if I measured the response to each of them separately and then added them up, okay? And the other is, and the mathematicians here might say this is really a consequence of that in some limiting case, but nevertheless it's worth saying separately, that linearity implies scaling, which says that if I take the input and I multiply it by some number, I double it or I triple it, then it's just the same as taking the response to that input and doubling or tripling the response, okay? I double the input, I double the output. Why does this do so much work for us? Well, it does so much work because we can take any stimulus at all, anything that we want to model or predict the response to, and divide it up into very narrow time slices. We could imagine that we presented what was happening in each of those time slices separately, okay? So this stimulus is actually a superposition of lots of pulse inputs at different times and different sizes. Superposition says we only have to know about the response to each of them. Scaling says that they're all the same other than a scale factor. And then translation invariance, which we talked about before, says you only have to measure the response to one of them: it doesn't matter whether it was here or here or here, because the system doesn't know about absolute time.
So I think I just said that we only have to measure the response to one pulse. So we give a pulse, we measure the response, and let's call that thing K of tau, okay? That's known as the impulse response, because it's the response to an impulse. Now we can formalize this idea of superposition, and that says that the response is the superposition of impulse responses at different times, weighted by the stimulus according to how long in the past it was, okay? So now we have a very compact description of how any input relates to its output, through this thing called the first-order kernel, or the impulse response. And we even have a recipe for measuring it: we present an impulse and we measure the response. So the question is, does it work? I mean, it's a nice model, but does it actually predict responses? We measure a response, or other people measured responses, to an input which is a narrow delta function. And first we can test whether scaling holds. So we double the impulse, we double the response; we quadruple the impulse, we quadruple the response; multiply the impulse by 8, multiply the response by 8. So that looks pretty good. And here's another experiment in which there's an impulse, and now we see whether that will predict the response to a step. So we take this impulse response and we just copy it many, many times and add it up, and it predicts the response to a step. So that's pretty good. Did you have your hand up? Somebody's hand in the back? No, okay. So at this point we made a strong assumption, but it turned out to be experimentally valuable, and it gives you a very compact description of the input-output relationships of a photoreceptor.
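The "copy the impulse response many times and add it up" prediction is just a discrete convolution. Here is a minimal sketch of my own with a made-up toy kernel (not a real photoreceptor measurement) showing that the step response predicted by convolution is exactly the running sum of the impulse response, and that scaling holds.

```python
import numpy as np

# Toy impulse response K(tau): my own choice, a smooth transient.
tau = np.arange(50)
K = tau * np.exp(-tau / 5.0)

# Superposition: r(t) = sum over tau of K(tau) * s(t - tau), i.e. convolution.
# A step input is a stack of unit impulses, one per time bin...
step = np.ones(200)
r_step = np.convolve(step, K)[:200]

# ...so its predicted response is the running sum of K: exactly the
# "copy the impulse response many times and add it up" argument.
assert np.allclose(r_step, np.cumsum(np.concatenate([K, np.zeros(150)])))

# Scaling: doubling the input doubles the output.
assert np.allclose(np.convolve(2 * step, K)[:200], 2 * r_step)
```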
So there's always fine print, and the fine print is that we used a restricted range of inputs. If we use a much less restricted range of inputs, so for example we change the background intensity that we're essentially calling our zero by a factor of 10 to the fifth or 10 to the fourth, and we change the flash intensity by factors of 10 to the third, then we find that the impulse response itself now changes. It gets narrower, it gets faster, and because of some plotting conventions it actually gets smaller, although it's not plotted that way. So what we can say is: sure, the photoreceptor is not linear if you look over a very wide range, but at any range that you choose (you might have to define that more formally), a linear model is approximately good. That's a nice situation too. But then we leave the photoreceptor and see what happens. So now we move to bipolar cells. Photoreceptors are pretty simple: for 30% or so changes in modulation depth, they're really very closely linear. Bipolar cells are still non-spiking, but they differ from photoreceptors in a lot of ways. One of them is that some bipolar cells hyperpolarize to light and others depolarize, but the main thing that we care about here is that the range in which they're linear is much smaller, typically less than 10%. So is there any way that we can bootstrap the technology that worked so well for photoreceptors and use it for something which is linear in a much smaller range? And it's not just a quantitative difference, 30% versus 10%. The linear range for photoreceptors is good enough to account for what happens when you look around the room: when you look at different surfaces that have different reflectances, they might differ by 10, 20, or 30%.
So within the range of normal vision, looking around the room (not the range of going from a dark room into the bright outdoors), the linear range is fine. For bipolar cells, you're exceeding the linear range all the time. So it's really a big difference. We could try to do exactly what we did with photoreceptors, namely measure the response to an impulse. But here, and in lots of other circumstances, even though the direct application of the formalism says that the way you measure an impulse response is to measure the response to an impulse, it may not be a good idea, and there are several reasons for this. The first is that it might be that we can find a linear range, but that linear range is so small, and the responses are so small, that we're now dominated by instrument noise or some other sort of noise. So the linear range, whether or not it's actually relevant, is unworkable experimentally. That's one reason. Another reason is that if we try to go the other direction and use tall pulses, big pulses, so that we can measure a response with a decent signal-to-noise ratio, then we may be pushing the system into a range of stimuli that it almost never sees. So sure, we might be able to characterize things, but it's not in the range that's relevant. And of course you could also be dealing with situations in which you don't control the inputs, and then you can't do any of these things anyway. But if you can control the input, let's talk about strategies that you can use to measure this impulse response in a better way. The basic observation is that the response is a linear function of the stimulus. Not only that: if you have a data set which consists of stimuli and responses, and you're trying to find the best linear fit, you're trying to do a regression problem. The parameters that characterize K can be determined from a linear regression of the response on the stimulus.
So we're trying to find the optimal linear prediction, optimal meaning, in the Gaussian setting, using a least-squares cost function, and as mentioned before, that's equivalent to saying that your stochastic model is "my prediction plus Gaussian noise." That's what we're trying to do. I think I just said this. Because this is a linear problem, one can write out a formal solution. Given your data, which is your stimuli and your measured responses, you can write down the correlation between the response and the stimulus at a prior time. You can also write down a matrix which consists of the correlation of the stimulus at one prior time with the stimulus at another prior time. And the first quantity times the matrix inverse of that matrix is a numerical approximation for the impulse response, which is what you want to get. So the formalism tells us that we can take a simple measurement, the correlation of the response now and the stimulus at a prior time, divided by the autocorrelation of the stimulus, and get our answer. But "divide" means matrix divide. So when you matrix divide, what would you like the matrix to be? You'd like the matrix to be as far from singular as possible, to make it easy to invert; basically, to maximize your signal-to-noise. So our challenge is to look for stimuli for which this matrix is as close to the identity as possible, to make the linear problem easy to do. That's what we'd like. Translating that into other words, it says that we want to find a stimulus which is uncorrelated with itself if you look at two different times. There are many ways to do that. How can you make a stimulus for which the stimulus value at one prior time is uncorrelated with the stimulus value at another prior time? One answer is that you can use uncorrelated noise. So you can use Gaussian uncorrelated noise, for example. When you do that, that achieves the goal of making this thing which you have to matrix divide by equal to the identity.
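The formal solution just described can be sketched numerically. This is my own toy construction (the kernel, noise level, and names are all my choices, not from the talk): build a lagged-stimulus matrix, form the cross-correlation and the stimulus autocorrelation matrix, and "matrix divide" to recover the kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 20                                    # kernel length in lags
K_true = np.exp(-np.arange(L) / 4.0)      # toy impulse response

# Gaussian white-noise stimulus, so the autocorrelation matrix is near identity.
s = rng.normal(size=50_000)

# Design matrix of lagged stimuli: column j holds s(t - j).
S = np.column_stack([np.roll(s, j) for j in range(L)])
S[:L] = 0                                 # discard the wrap-around at the start

# Simulated linear response plus additive Gaussian noise.
r = S @ K_true + rng.normal(scale=0.1, size=s.size)

c_rs = S.T @ r / s.size                   # response-stimulus cross-correlation
C_ss = S.T @ S / s.size                   # stimulus autocorrelation matrix
K_hat = np.linalg.solve(C_ss, c_rs)       # the "matrix divide"

assert np.max(np.abs(K_hat - K_true)) < 0.05
```

With white noise, `C_ss` is close to the identity and the solve is well conditioned; with a strongly correlated stimulus it would be near singular, which is exactly the point made above.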
Once you've done that, instead of this kind of ugly thing, we see that the impulse response, which is what we want to solve for, is simply the reverse correlation of the stimulus with the response, normalized by the stimulus power. This is probably now recognizable as the familiar reverse-correlation method. A long time ago it was well recognized that one can use Gaussian noise: Gaussian noise at one time and Gaussian noise at a prior time are uncorrelated, so it makes this matrix a multiple of the identity. So that is the rationale for using noise stimuli and measuring input-output correlations as a fancy way, with advantages, of measuring the impulse response. The advantage is using a kind of continuous signal rather than blasting the system with large shocks. The drawback is that this orthogonality, the fact that these correlations are zero at different times, is only orthogonality in the statistical sense. You have to wait a long time for the fluctuations in the noise to average out; in fact, the stimulus becomes more and more orthogonal only in proportion to the square root of the experimental duration. So if you want to make better and better measurements, you have to wait longer and longer and longer, in a way that scales unpleasantly. There's an alternative, and the alternative is that you can design stimulus sequences built of minus ones and ones which are exactly orthogonal to themselves after a shift in time. Those are things known as M-sequences; there are a few others, but the famous ones are M-sequences. This has the advantage that you can do a finite-duration experiment, which is always an advantage, and have exact orthogonality, so that the linear regression problem becomes easy.
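An M-sequence can be generated with a linear feedback shift register. This is a minimal sketch of my own, assuming a standard degree-5 primitive polynomial for the taps; it checks the property claimed above: after mapping to minus ones and ones, the circular autocorrelation is exactly -1 at every nonzero shift, which is as orthogonal as an odd-length binary sequence can be.

```python
import numpy as np

# Fibonacci linear feedback shift register. The taps (5, 2) correspond to a
# degree-5 primitive polynomial, giving a maximal-length sequence of 2^5 - 1.
def m_sequence(n=5, taps=(5, 2)):
    state = [1] * n                 # any nonzero start state works
    out = []
    for _ in range(2**n - 1):
        out.append(state[-1])       # output the last register bit
        fb = 0
        for t in taps:
            fb ^= state[t - 1]      # XOR the tapped bits
        state = [fb] + state[:-1]   # shift, feeding the XOR back in
    return np.array(out)

seq = 2 * m_sequence() - 1          # map {0, 1} to {-1, +1}

# Exact (deterministic) near-orthogonality: 31 at zero lag, -1 everywhere else.
ac = np.array([np.dot(seq, np.roll(seq, k)) for k in range(len(seq))])
assert ac[0] == len(seq)
assert np.all(ac[1:] == -1)
```

Compare this with Gaussian noise, whose correlations at nonzero lags only shrink like one over the square root of the experiment duration; the M-sequence gets its (near-)orthogonality exactly, in a finite-length experiment.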
Everything has fine print, but in this case the fine print is that there are major technical problems that come up when the system is very nonlinear, and that leads to a whole other branch of things, because in fact some of those problems can be solved, but with difficulty. So how does it work? Here's an example showing that it works. This is an intracellular voltage tracing of a bipolar cell when a designed sequence of minus ones and ones of current is injected into that cell. So current is the input, voltage is the output. This is a blow-up of what that random minus-one-and-one sequence looks like, and the analysis is just this: reverse correlation of the response with the stimulus. Now, you might notice that there's something extra going on here, that there are two input traces. One of them is a sequence of minus ones and ones. The other is the same sequence with the roles of minus one and one reversed, and the reason for that (I'm not going to go into detail, but if people want to ask, I'd be very happy to talk about it) is that this is one of the ways of dealing with nonlinear systems: it allows one to cancel out some of the main nonlinearities. But the point is that once you do that, you now have a way of measuring the impulse response accurately but with relatively small signals. So you stay within the linear range and you don't force non-physiologic behavior onto the system. And this is what the data look like: depending on the resting current and what the mean current is, the impulse response changes a lot, and you can see it changes from this rather slow thing with a big undershoot to something whose time course is about 10 times faster, over a fairly moderate range. So that characterizes bipolar cells, and the whole idea can be used in other situations where there's linearity, but only approximately and only in a small range. Okay, bipolar cells. Horizontal cells, well, I skipped those because they're interesting but they're not that interesting.
The next stop is amacrine cells, and the reason for talking about amacrine cells, for us, is that in amacrine cells, at least for some kinds of amacrine cells, there is no linear range at all. These are still cells that have continuous outputs, they're not spiking neurons, but for some of them there's just no linear range. Let me quickly skip by this because we don't care. Okay, right. To show what I mean by that, here's a situation in which at the time of the green arrow the light goes on, and at the time of the red arrow the light goes off. So these are equal and opposite stimuli, but the cell gives the same response, or rather three different cells give the same response, at the on and the off. Remember, linearity would say that whatever happens at on, since off is multiplying the on by minus one, the off response should be the opposite of the on response. It's not even close; it's the same. So a linear strategy, or something that tries to build on a linear strategy in an obvious way, just won't work at all. So what do we do? One thing that one can do is see whether there's a relatively mechanical way of extending the idea of an impulse response to something that could possibly give you a response in the same direction whether the input was positive or negative. And yes, this was developed in the 1930s, long before people were recording from retinas, and the idea is to use something called a Volterra expansion. The idea is that instead of looking at the response as a linear function of the stimulus, let's add a term which is a linear function of the stimulus times itself at another time. So the stimulus influences the response directly, but also the stimulus at one time and the stimulus at another time interact via this other object, this K2, and give you a contribution to the response. Why does this work? Well, you can see that if T1 and T2 are the same, this is essentially a squared term, so whether s is positive or negative, this thing will contribute in the same way.
Just to make some nomenclature clear: the sequence of Ks are called kernels. The K0 we don't really need to talk about; it's just whatever you're calling your steady output. K1 is the impulse response that fits best. K2 is these pairwise interactions, which is our reason for doing this, and then one can continue the series with third-order, fourth-order interactions, etc. You might recognize that this is a kind of integral extension of a Taylor series: if you think of s as just a scalar, there's a zeroth-order term, a first-order term, a quadratic term, etc. So this is maybe a nice way to look at it, but it also shows you that there's a problem, which is that if you measure up to a particular order and then stop, what you're left with is a polynomial, and no biological input-output system is going to behave like a polynomial for large inputs. So this is a characterization within a range; one has to do better than that to have a global model. Again, there are a lot of different strategies one could talk about, but I don't want to go into them right now. The other point, which is mostly terminology and not of a lot of practical relevance, is that perhaps you've heard the term Wiener series versus Volterra series. A Wiener series is the same thing expressed in terms of orthogonal functionals, in much the same way that one can take a series of polynomials and orthogonalize them. I hope that's meaningful, but if not, the takeaway is that there's not a lot of practical difference between a Wiener and a Volterra series, except 1932 versus 1957, roughly. So can we measure it? I've been hinting at all the various drawbacks, but an advantage of this approach is that you can actually make the measurements: you can measure these Ks and characterize the system. The basic reason is that if you know what s is, the response is a linear function of things that you know.
It's not necessarily linear in s, but it's linear in s times a lag of s, which is also something you know. So finding the Ks is again a linear regression problem, even though the system you're trying to characterize is nonlinear, and that's a big help. And we can go through the same steps of thinking about what kind of stimulus helps us solve that regression problem; Gaussian noise helps, but other sequences help too. When you use Gaussian noise and you have an experiment of infinite duration, the linear regression problem simplifies, which, perhaps not surprisingly, means that you can write down these kernels in terms of correlations of the output with the input at prior times. The zeroth kernel is just the average output; not very interesting, it's where you set your zero. The first-order kernel, as before, is the reverse correlation of the response now with the stimulus at a prior time. The second-order kernel is a little more interesting: it's the reverse correlation between the response now and the stimulus at two prior times. But maybe that's not surprising, because that's how it enters into the equation anyway. One thing to mention: I've been loosely using the terms first-order kernel and impulse response as if they were interchangeable. They're not. I really should only use the term impulse response for a linear system, and in that context it makes sense, because the impulse response is the global, universal characterization of the system. When you have a nonlinear system, the first-order kernel is simply the best-fitting impulse response for the operating range you happen to be in, and the apparent impulse response, the first-order kernel, will change depending on where you set the mean of the input, where you set the power, and everything else. It's a least-squares best fit for the context. So again, can one do this? Yes, one can.
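The correlation formulas just described (essentially the Lee-Schetzen recipe for Gaussian noise) can be sketched in a few lines. The system here, a filter followed by a squared term, is my own illustration, not data from the talk; with Gaussian input, the kernels drop out of simple cross-correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 16
k1_true = np.exp(-np.arange(T) / 4.0)

def system(s):
    """Hypothetical cell: linear filter plus a squared (2nd-order) term."""
    lin = np.convolve(s, k1_true)[:len(s)]
    return lin + 0.3 * lin**2

N = 200_000
sigma = 1.0
s = rng.normal(0.0, sigma, N)            # Gaussian white-noise input
r = system(s)
r0 = r.mean()                            # zeroth kernel: the average output

# First-order kernel: <r(t) s(t - tau)> / sigma^2
k1 = np.array([np.mean(r[t:] * s[:N - t]) for t in range(T)]) / sigma**2

# Second-order kernel: <(r(t) - r0) s(t - t1) s(t - t2)> / (2 sigma^4)
k2 = np.empty((T, T))
for a in range(T):
    for b in range(T):
        m = max(a, b)
        k2[a, b] = np.mean((r[m:] - r0) * s[m - a:N - a] * s[m - b:N - b])
k2 /= 2 * sigma**4
```

For this system the true second-order kernel is 0.3 times the outer product of the filter with itself, and that is what the cross-correlation recovers; subtracting the mean output r0 takes care of the diagonal terms.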
And here's a second-order kernel, a function of two times, as measured from an amacrine cell by Naka and colleagues some time ago. It's necessarily a symmetric function of its two times, but the main point is that if you truncate the series and only look at the first-order response, you get nothing; you have to look at the second-order response. You mentioned that the kernels we get depend on the stimulus, but is it possible, given a new stimulus, to predict the kernel in advance? Sometimes. In systems whose nonlinearities aren't very fancy, if you have measured the kernel for a particular mean and variance, then that should work pretty well for other inputs that have the same mean and variance. If you change them a lot, the prediction is not expected to work. This has a formal reflection in the fact that the prediction of how an nth-order kernel changes with stimulus mean is contained in the (n+1)th-order kernel, and the prediction of how the nth-order kernel changes with stimulus variance is contained in the (n+2)th-order kernel; but those things are usually hard to measure, so if you want to know how the system behaves in a different operating range, you probably have to make the measurements. This is probably a good time to say that I think it's a bad idea to think of the kernels as actually being the model. They are things that constrain models, and what you'd like is a model that accounts for how the kernels change over a range of inputs, and that works for stimuli very different from the ones you used to measure the response. It's not a big trick to predict the response to one Gaussian noise from the response to another; it's a big trick to predict the response to something else from the response to Gaussian noise.
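The context dependence of the first-order kernel is easy to demonstrate. A sketch, assuming a hypothetical cell that is a linear filter followed by a saturating (tanh) output: for Gaussian input, the measured first-order kernel keeps the filter's shape (this is Bussgang's theorem) but its gain depends on the input variance, so the "apparent impulse response" changes with the operating range.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 16
k = np.exp(-np.arange(T) / 4.0)          # hidden filter of the made-up cell

def cell(s):
    lin = np.convolve(s, k)[:len(s)]
    return np.tanh(lin)                  # fixed saturating nonlinearity

def first_order_kernel(sigma, N=200_000):
    """Reverse correlation with Gaussian noise of standard deviation sigma."""
    s = rng.normal(0.0, sigma, N)
    r = cell(s)
    return np.array([np.mean(r[t:] * s[:N - t]) for t in range(T)]) / sigma**2

k_small = first_order_kernel(0.2)        # small signals: nearly linear regime
k_large = first_order_kernel(2.0)        # large signals: deep in saturation
# Same cell, same nonlinearity, but the best-fitting "impulse response"
# has a much smaller gain when the input power is large.
```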
So we're almost done with the examples I wanted to talk about in the retina, but let me mention something. Let me ask it this way: who had heard the term second-order kernel before they got here? Okay. Who's heard the term spike-triggered covariance? That doesn't help me. So I'll just skip this slide, but spike-triggered covariance is essentially a synonym for second-order kernel when you're dealing with spiking neurons. I only mention it because I would have thought more people had heard that term, and thought that the second-order kernel was something old and antique while spike-triggered covariance, which came onto the scene about ten years ago, was a really new and interesting thing. The point is that this process of doing the reverse correlation is exactly the recipe for things like spike-triggered covariance, which is a good way to characterize cells, because it is also a second-order kernel. If I had gotten a different show of hands I would have spent a lot of time on this slide; I did not. Okay. So there are several directions one could take from here while staying within the realm of the black-box paradigm. I've talked about making the measurements in the time domain; there are many reasons one might actually want to make the measurements in the frequency domain, but I'm not going to talk about them now. I've hinted about spatial and temporal inputs, and about that I really do want to say a little more. And then there are other classes of models, called generalized linear models, which are also helpful black-box models, especially for spiking neurons; given the time constraints we're simply going to mention their existence.
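Going back to spike-triggered covariance for a moment: for spiking neurons the same reverse-correlation recipe becomes an average over spike-triggered stimulus windows. A minimal sketch under assumptions of my own (a hypothetical cell that spikes in proportion to the squared output of a hidden filter, so the spike-triggered average is near zero but the covariance carries the structure):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 12
f = np.sin(np.arange(T) / 2.0) * np.exp(-np.arange(T) / 5.0)  # hidden filter
f /= np.linalg.norm(f)

N = 100_000
s = rng.standard_normal(N)               # Gaussian white-noise stimulus

# Spike when the squared filter output is large: a symmetric nonlinearity,
# so first-order reverse correlation (the STA) gives ~nothing.
spikes = []
for t in range(T, N):
    window = s[t - T:t][::-1]            # most recent sample first
    drive = (f @ window) ** 2
    if rng.random() < min(1.0, 0.1 * drive):
        spikes.append(window)
X = np.array(spikes)

# Spike-triggered covariance minus the prior covariance of the stimulus;
# its leading eigenvector recovers the hidden filter (up to sign).
stc = np.cov(X.T) - np.eye(T)
evals, evecs = np.linalg.eigh(stc)
recovered = evecs[:, -1]
```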
Once we go outside the black-box paradigm and want to focus not on the input but just on what the ongoing activity is, there's another class of things I want to say a little about, things that focus on spontaneous activity. So: a little bit more, and then a break. Spatiotemporal inputs. We've been talking about this situation, but now we want to consider the possibility that you're not just injecting a single cell with current; there's a pattern of light in space and time. So what does one do? Well, the linear formalism maps over exactly. We just say that our stimulus, instead of being a function of time alone, is a function of time and position, and the weighting that determines the response depends not just on when the stimulus was but also on its position. That's the answer. And it turns out that the same crank works for nonlinear systems: we just add a spatial dependence for our stimulus. And this looks okay. It looks okay until you try to make the measurements in your lab, because then what you find is that you're trying to measure a function of four variables here, and if you were so bold as to think that the stimulus didn't have just one spatial dimension but two, then this would be K1 of t, x, y and this would be K2 of t1, t2, x1, x2, y1, y2. So you'd now have to measure a function of six variables. Say you took ten time points and ten points along each spatial dimension: you'd have to measure a million parameters just to get this term. So it's a method that works in principle but not in practice. Yes: you can imagine, if you're recording from a neuron, putting flashes of light in different parts of space for that same neuron. But the measurement problem doesn't get worse when you have more neurons; you just need proportionally more parameters because you're recording from more neurons.
The measurement problem gets terrible when you have a more complex model, and making things nonlinear is a very easy way to make the model complicated, because now you have a million parameters. And the application of these kernels will be very limited once again, since you took up all the...? Yes, all the limitations still apply, but now there's also the problem of making the measurements. So, basically, there are a few ways out. The most elegant is to have a non-generic model with a small number of parameters. One can try to creep up on that by assuming, for example, that this thing, which is a function of two times and two positions, has some special form. The neuron is sitting, let's say, in the retina, and the retina is looking at a particular part of space, but you're exploring that space with points of light not just where that neuron is looking but nearby; so there's some region of space that the neuron cares about, but with different sensitivities. So first, you could have a stroke of genius and come up with the right functional form, so you don't need so many parameters. Or you could creep up on it by saying: let's assume the dependence of the second-order kernel on time and space is a dependence on time multiplied by a dependence on space. Then instead of having to measure a million points you'd only have to measure two functions totaling on the order of ten thousand points; it's a reduction, and maybe it would help. I'm basically mentioning this to show you the limits of turn-the-crank methods, and where you have to start thinking about models that are really specific to the system.
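The parameter arithmetic behind those numbers can be made explicit. A sketch, using ten samples per axis as in the talk; the separable form assumed here (a temporal function times a spatial function) is one illustrative choice:

```python
# Parameter counts for a generic vs. a separable second-order kernel,
# at n = 10 samples along every time and space axis.
n = 10

full_k2_1d = n**2 * n**2            # K2(t1, t2, x1, x2): 10,000 parameters
full_k2_2d = n**2 * n**2 * n**2     # K2(t1, t2, x1, y1, x2, y2): 1,000,000

# Separable assumption K2 ~ f(t1, t2) * g(x1, y1, x2, y2):
separable_2d = n**2 + n**4          # 10,100 instead of 1,000,000
```

So the separability assumption buys roughly a hundredfold reduction here, at the cost of only being right when the system really does factor that way.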
First, if you're dealing with a linear situation, it's not such a big deal to measure this function of time and space. Just to show you an example of what that looks like: here we reverse-correlated the response to a sequence of basically zeros and ones, or minus ones and ones, in space and time, measured the reverse correlation, and got a map, basically a movie, whose snapshot in time shows you the sensitivity of the neuron at each position in space at a particular time in the past. Again, once you try to scale that up to nonlinear systems, you get into trouble.

The other direction I wanted to explore a little is what happens when you focus on spontaneous activity patterns. If you have n neurons, then at any point in time any or all of them could be firing, so there are two-to-the-n things that could happen. That means you have to measure not quite two to the n but two to the n minus one independent probabilities; the last one you get for free because they have to add up to one. But still, when n is large, and one can now record not just from three or four neurons but from many dozens, it's just not possible to characterize this distribution: too many variables. So what can you do? The first thing is, again, to make a simplifying assumption. The assumption could be that the neurons are independent; not necessarily independent outright, but that whether they choose to fire or not is independent given that they're all seeing a particular stimulus. The proper term is that they are conditionally independent. Then you only need to characterize the firing probability of each neuron separately, and you can predict the probability of each of these firing combinations because you just multiply the probabilities. That's wonderful if it works, because it reduces a problem with two-to-the-n parameters to a problem with n parameters. But that may be too much of a restriction, so are there things one can do that are somewhere between n parameters and two to the n? The answer is yes, and one approach that has shown itself to be reasonably useful is what gets called the pairwise maximum entropy model, also called the Ising model because of its analogy to physical spin systems. The idea is to model the probability distribution of all possible firing combinations by saying that to some extent the neurons are independent, but there are also pairwise interactions between them. All I have to do is measure the firing probabilities of each neuron, which are embedded in the alphas, and the pairwise interactions, which are embedded in the betas. That's a lot fewer parameters; it grows like n squared instead of like two to the n, and I can check whether it gives an accurate account of firing patterns. This works in the retina, and it works over a reasonably large and satisfying range of n. When n is a hundred, two to the hundred is humongous, but n squared is only ten thousand. So: a big win in the retina, in a situation in which the neurons are not firing independently.

Now, the problems. It's useful to think first about whether it's a surprise that pairwise interactions suffice. At first glance one would say it's not so surprising: neurons interact in pairs, they're connected to each other, so a model that incorporates pairwise interactions seems like the right way to go; there are no "triangle" connections that we know about that cause three neurons to fire at the same time, so maybe pairwise interactions are sufficient. But not really. The Ising model is symmetric, and connections between neurons are not symmetric; the model doesn't take that into account. Also, most of the time, even if you're recording from a large number of neurons, you're not recording from all of them, so there are hidden neurons interacting pairwise with the neurons you are recording from, and they're not in the model either. And probably the biggest thing: you don't just want to model a series of snapshots of which neurons are active at each instant; you need to model how that depends on what happened before, because there are dynamics in the network. So the pairwise model has to be considered a nice phenomenological model. It might give you a good description, but it can't possibly relate to a mechanism, because it neglects the facts that synapses are asymmetric, that there are hidden neurons it doesn't know about, and that there are dynamics you're ignoring. It works in the retina, but that has to be considered a gift rather than something expected, and when you go to cortex, as we showed, even recording from just three neurons, which gives you only one extra degree of freedom beyond the pairwise model, the pairwise model still fails. So we need better models; pairwise models are good places to start. There are ways of adding on to the pairwise snapshot model by including stimulus dependence, hidden-unit dependence, and dynamics, and there are other approaches that give a similarly strong reduction in the number of parameters but are descriptive, not based on the physical idea of a spin system. I just want to highlight the dichotomized Gaussian model, because it has the same number of parameters as the pairwise maximum entropy model but actually fits cortical data. One can even spin a story that there are underlying Gaussian waves of activity, and thresholds, and that this causes correlated firing among neurons. So there are interesting things going on.

To summarize: we have a large toolkit because there are a lot of different things we might want to do. We model for reasons that have different goals; the systems have a wide range of complexity; sometimes we can access the input and sometimes we can't; the output is sometimes continuous and sometimes spiky; sometimes we care about variability and sometimes we don't. So, a pause for me, and possibly for you, but this is also a good time for questions, and then we'll talk about information-theoretic things. Wait for the microphone.

Can you differentiate evoked and spontaneous recordings from the spikes themselves? If I give you a firing pattern, can you say this is evoked, this is spontaneous? I think that's a really good question. One probably should say "I intended to evoke it" versus not, because obviously you're not controlling all the variables; you're just controlling one. "Spontaneous" refers to the fact that you're generally not modulating the input, so you're looking at endogenous activity and therefore focusing on patterns of variability. "Evoked" implies that you think you're controlling an important input to the system and you're looking at what's coupled to that input. But you're right: there's always spontaneous activity buried in what you think is evoked, and ignoring it simplifies the modeling but may throw away a lot of interesting stuff.

Also, there will be a delay, right? For example, in the whisker system, or any sensory system, when I give the input it takes time to reach the particular brain area, so there's some delay; when I record and evoke, how do we handle that? Formally, this is not a problem at all. All it means is that until the time of the latency, your response is uncorrelated with the input. Now, if you don't know that and you waste experimental time measuring those parameters, which are all ultimately going to be zero, you're not doing as good a job as if you had enough knowledge to say: I know the response really starts now, and I can choose the start time of my kernels, or any other model parameters, to begin at the latency. It's not a problem with the formalism, but it's very useful to know
when you design your experiments and choose your parameters. I'm not sure I'm answering your question, though.

Is there any open-source database available for doing research on these amacrine cells and understanding their responses? I'm not sure. There are several; people have made available their recordings of cortical neurons under different circumstances. There's a very general question that we need to talk about, which is: when a modeler says, okay, I want a particular kind of data, how likely is it that that data exists, and how do you go about finding it? The particular data I mentioned for the triplet interactions, from our lab, is actually available, and a couple of people have re-analyzed it. In general, if you write to somebody, they're usually happy to share their data, but I think one of the problems we have is that there's no way you can just search.

What are the challenges to be addressed in this area? Challenges in this area... well, I think better models. But the only way to have better models is to have system-specific models, not so generic anymore, so it really does require domain knowledge. It's easy to say we need better models; it's hard to say how to go about getting them.

These are all typical problems from neurodynamics, so in the larger context of neuroinformatics, I guess the question is similar to the previous one: how would you integrate all of this, and how far are we from reaching a consensus in terms of the models that we use? You talked about domain-specific models; what is the number of different models that we're talking about in the end, ten years from now, or twenty?
I wish I could answer that, but I think that even if I could, I would have to say that it depends what purpose you're using the model for. Just as a very specific example: our lab developed a model for retinal ganglion cells that was a good model from the point of view of showing that simpler models can't be right, namely a linear filter followed by another linear filter. It was very good for showing that you needed all those parts, but it wasn't very good at predicting outside of the stimulus set we used to build it. Much more recently, Sheila Nirenberg's lab developed a different model which is actually simpler than ours and can't be exactly right; on the other hand, it's much, much better for predicting the response to an arbitrary input. I can't imagine a situation in which one would ever want to get rid of either of those, even though they're actually mutually contradictory.

I think at this point it's helpful to remember that basically all models are wrong, but some are useful, and the domain of usefulness might be very different for different models; that's why you might want to keep them around for different purposes. For example, when I studied physics, somebody told me Bohr's atomic model is wrong. Of course it's wrong, but it's tremendously useful, so we're still using it. And I think one of the biggest problems we have is in communication: we have very different expectations about what a model is supposed to do for us. Some people expect the all-encompassing, complete model of the brain, and then of course nobody will be able to live up to that expectation; others are content with much simpler results, with much lower expectations, so they're easily content. But they never talk about their expectations.

Okay, fair enough. Maybe if I put it this way: do you worry about cognition? Let's separate these two fields, cognition and, say, treating diseases. When you do all of this, are you trying to understand cognition, or to understand the systems at a smaller scale and then maybe use them for different purposes? I care a lot about cognition and disease, but I'm only trying to present what's in the different compartments of a large toolkit, and I feel most comfortable presenting them in the context of something that's fairly well understood. I'm sure one will need somewhat different styles of models for other things, but I think this is still a useful framework even when you're talking about disease: you need to know why you are modeling, how complicated the system is, whether you're going to be able to measure the parameters, and whether you're trying to predict outside of your measurement set or just trying to make a compact summary of it. Look, we even know Newton's laws are wrong, but they're really useful; they give a really compact description. If you're trying to track a satellite going around the earth, you aren't going to do it with Newton's laws alone, there are too many things that are wrong, but we start there because we think we understand a lot because of them. It's a compact description, and it gives very good intuitions about what else you might need to do. No one would say they're right; they're a theoretical construct, by no means a complete account of what goes on. So it depends entirely on what utility you expect from a model, and the convenience of Newton's laws is all the intuitions you get from them.

That's actually why I asked about cognition, because when you worry only about cognition and you expect things to work, things get much simpler. The people who do artificial intelligence don't worry about any of these things, many of them, not everybody, and they model with much simpler elements, and everything works fine; they can do many of the same things. I mean, at the level of behavior, which is what I do a lot of, I would say that reinforcement learning theory is a very powerful model of behavior, but it never quite works. It almost works a lot of the time, but it never really works. It's like Newton's laws: to say that all of our behavior is driven by trying to seek goals is a really good description of how behavior works, but if I give you a set of data from some behavioral task and say, fit it with a vanilla reinforcement learning model, it won't work.

Maybe one comment on the artificial intelligence example. I think this is precisely the expectation point of view, because what has been done in that field skips a step: you're trying to distill general processing principles out of the behavior of a group of people, but if you try to reverse that process you're not going to get anywhere, because you've lost the information. For example, if I have a patient here who's autistic, you can have all the artificial intelligence in the world and you will not be able to explain why this particular person has this problem; you've lost any relation to his particular brain and its configuration. You might get a general statement out of a system. It's like observing a train station and saying, on average I have 25 trains per hour going from track number 5; but if you're actually a passenger who wants to go from Amsterdam to Leiden, these theories are not going to get you anywhere.

On your topic: you've been presenting mostly signal-processing methods, and I wonder, in terms of modeling, what's the importance of data visualization? Because I would say here we do mostly data-driven analysis, but the truth is we still need hypothesis-driven analysis. That's a really interesting point. Sometimes the modeling tells you what to plot against what, so it's helpful. Often a model, especially a simplified model, will tell you that things have to fall along a line, and then it tells you to plot it that way. You see that it doesn't fall along a line, and then, from the way it doesn't fall along a line, which could depend on all sorts of other parameters, across experiments and so on, you use your brain, which is a really wonderful pattern recognizer, to form hypotheses about how to assess the model. In part, models are good because they give you variables that are useful to visualize, rather than just raw data. If I give you a pile of raw data, there are countless things you could do with it, and even the best visualization engine wouldn't be very helpful. But if you knew that you should, say, plot this prediction against this response, and you saw that it was saturating: gee, I got it almost right, and now I see that there's an output nonlinearity. That's just a trivial example, but I think the visualization step, while not part of the modeling itself, is part of the process in which you go back and say, okay, how do I make a better model?

So we'll continue and change tracks a little. In the last part of the talk I wanted to discuss information-theoretic methods, which have a rather different flavor, although you still might apply them to situations where you have an input and an output and regard the system as some sort of black box. There is something called information as a formal quantity, not just an English word. So: why would one want to calculate it, and why is it challenging to estimate? One could say the most interesting things you want to calculate are always the most challenging to estimate, and you'll see why. As with dynamic models, it's difficult, but there are things one can do, and there are also a very large number of different methods. If I have time, I'll present an example, this time actually in the taste system. So, why calculate information?
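As a concrete anchor for "information as a formal quantity", here is a minimal sketch, my own illustration rather than anything from the talk, of the plug-in estimate of mutual information between stimulus and response, computed from a joint count table. With limited data this estimator is biased upward, which is one root of the estimation difficulty discussed in a moment.

```python
import numpy as np

def mutual_information(joint):
    """Plug-in estimate of I(S;R) in bits from a joint count table
    (rows: stimuli, columns: response bins)."""
    p = joint / joint.sum()
    ps = p.sum(axis=1, keepdims=True)     # marginal over stimuli
    pr = p.sum(axis=0, keepdims=True)     # marginal over responses
    nz = p > 0                            # skip zero-probability cells
    return float(np.sum(p[nz] * np.log2(p[nz] / (ps @ pr)[nz])))

# Toy example: two stimuli whose responses mostly, but not perfectly,
# distinguish them, so the code carries well under one bit.
counts = np.array([[40, 10],
                   [10, 40]])
bits = mutual_information(counts)
```

Note that nothing here postulates a functional form relating stimulus to response; that is exactly the nonparametric-association property the talk emphasizes.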
okay so okay great so there are a few different reasons that one might want to do this one is that as you'll see or perhaps you already know information formally defined is a very natural quantity and you can use it to compare very different systems and see whether they're general principles that emerge like you know you get one spike is required for each bit of information not true but it might be something that you might find determining the metabolic constraints on information transfer see where information is lost in a neuron if you can find a set of stimuli that a system transmits a lot of information then you might conclude that that's what the system is designed to do so it's it's a nice number the reason I want to talk about in case I have time to actually show you an example is that one can use information to evaluate whether neural code different candidates for neural codes and I'll also try to be a bit more definite about what I mean by a neural code so what that means at least at first pass is what aspects of the spike train of action potential is used by the nervous system to convey information and particular whether the exact time of a spike as opposed to how many spikes is something that's relevant so we'll actually come back I hope we'll get to that and it may also be possible to use information theory to rule codes out rather than say this one is plausible to say this one can actually be excluded and you'll see why that is and the final reason which is I think actually fairly important is that it's a very natural way to describe how closely associated an input and an output are without having to postulate what functional form it is that's associating them so it's a nonparametric measure of association and in many senses the best there is and the flip side of that is because of its advantages it's really hard to estimate the application I want to focus on is trying to understand something about the neural code and I think it's useful just to 
describe what the question is. This is a very famous slide; it's hard to give a talk about anything in the visual system and not show it. It's from the work of Hubel and Wiesel, who recorded from primary visual cortex. These are actual raw data, not processed in any way: extracellular recordings from a neuron while a dark bar is moved across some area of visual space. What it shows is that as the bar moves over a particular area of space, the neuron fires a lot in one direction and not in the other; slightly off that best angle it fires a little less, and further off it fires less still. So this neuron is selective for orientation, and that's a beautiful thing, because it only happens in visual cortex; it doesn't happen in the retina. One can spin a lot of tales about this, and in particular, with some modeling, you can convince yourself that the ability of these neurons to discriminate one orientation from another, firing more for this one than for that one, corresponds to a human's ability to tell one orientation from another. To do that you have to make a lot of assumptions about whether the neurons are independent, about how the little man inside the head reads out that activity, and so on; but nevertheless, we like to think that these tuning properties account for behavior.

What physiologists know but don't say is that if you do the same experiment twice, presenting the same stimulus to the same neuron twice, you won't get exactly the same response; in fact you may get something that's actually pretty different. The pattern of variability is quite dramatic in cortex, and that's one problem with interpreting this kind of figure as a measurement of what a single neuron can do. The other, which I think is probably more of an issue, is this: if you were to change something that wasn't the orientation, the bar's color, its width, its velocity, the neuron would also fire a different amount. So the neuron's activity can't just be a signal of orientation; it depends on lots and lots of things. How that activity is decoded, and how it represents what's out in the world, is what one means by "the neural code."

Just to think about some of the possibilities and their implications: you can put neural codes into maybe two very large categories. In one, you just count up the number of spikes in some window, and that number says something about the stimulus; in the other, the whole pattern of activity matters in some sense. What are the implications? Say we're not talking about a visual system anymore, but just a touch receptor, and at some particular time the skin is indented. The neuron has been firing in some irregular way; the touch occurs; the neuron is now firing faster. It's a good idea to know as quickly as you can that something touched you. So at this point, do you know? Well, you don't, because this short interval could have come from the background distribution, or maybe not; you have to average over a few spikes to be sure that the neuron is now firing faster than it was before. If, on the other hand, the neuron was regular, and whoever was interpreting this firing pattern knew it was regular, then it could use the presence of one short interval to say: aha, something happened. So if all you know about is rate, you have to average; if intervals, or precise timing, are meaningful, then one spike tells you. There can be a qualitative advantage to codes based on spike timing.

There's another sort of advantage. If the time course matters, then the meaning of a spike train can't just be considered a scalar; it's actually a time series. You can, for example, change one response into another by adding spikes, but you can also change one response into another by making it more transient. So you could imagine representing two attributes, one in terms of the number of spikes and the other in terms of whether the response is transient or not. That's qualitatively different: if you use the dynamics of the response, then a neuron's output doesn't have to represent just one attribute; it could represent a mixture of several. It gives you more variables to play with.

So what one would like to know is: are neurons basically intending to represent a rate, so that the fact that they have to fire in spikes is a bug, and the nervous system averages over them because that's the way to transmit signals over long distances? Or are spikes a feature: can the actual timing represent things? What would be the way to answer this experimentally? You'd like to take an organism that everyone would agree has fairly complex behavior, like a primate, reach inside, move a few spikes around, and see whether behavior changes or not. Maybe genetics will someday allow us to do that, but even then you'd have the problem that if you move the spikes around, you're also changing the rates in different windows. So even though one knows there's going to be a big difference between simple codes that depend on counts and complicated codes that depend on detailed timing, it's not clear how to formulate the problem, and it's not even clear how you'd approach it experimentally. I think this is one of the appeals of information theory: it at least allows one to formulate the problem well, and possibly to make some suggestions about what the answer is.
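Going back to the touch-receptor example, here is a small numerical sketch of the rate-versus-interval argument. The rates and the 25 ms interval are my own illustrative numbers, not from the talk; the point is only that for an irregular (Poisson) neuron a single short interval is weak evidence, while for a perfectly regular neuron it is decisive:

```python
import math

# Hypothetical numbers: background rate 10 Hz, rate after the touch 40 Hz.
r0, r1 = 10.0, 40.0
t = 0.025  # we just observed a single 25 ms interspike interval

# Irregular (Poisson) firing: intervals are exponentially distributed, so a
# 25 ms interval is possible at either rate.  The likelihood ratio for
# "touch" vs "no touch" from this one interval is modest:
def exp_pdf(rate, x):
    return rate * math.exp(-rate * x)

lr = exp_pdf(r1, t) / exp_pdf(r0, t)
print(round(lr, 2))        # ~1.89: one interval barely shifts the odds

# Roughly how many such intervals before the odds reach 100:1?  About eight:
print(math.ceil(math.log(100) / math.log(lr)))

# A perfectly regular neuron at 10 Hz produces only 100 ms intervals, so a
# single 25 ms interval has zero probability under "no touch": one interval
# decides, which is the qualitative advantage of interval/timing codes.
```

The exact numbers don't matter; what matters is that the averaging requirement falls out of the irregularity of the background firing, not out of the touch itself.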
To recap the difficulty of attacking this directly: first, the experiment of moving the spikes around can't really be done; and second, there's always going to be some relationship between detailed timing and the number of spikes over a window. Information theory allows us to ask how much information a spike train could have, decoded in different ways, and whether that can account for behavior.

The problem is that information theory wasn't built for this purpose; it was built for the opposite purpose. For Shannon, circa 1948, the question was: here is a man-made communication system; we know the way it codes, we know how it carries voice sounds; how much information can it carry? That is, if we know everything about a system, we can calculate the information. Here we're doing the opposite: we have some experimental measure we can make that tells us something about how the system works. So it's a very different question, even though it's related, and that's the reason the application of information theory to experimental data is actually a topic; it's not an obvious thing. Let me skip a couple of slides, unless you want to read them: "Time is the great gift of nature that keeps everything from happening at once," and that's really the reason we can't immediately disentangle neural codes. To emphasize: information theory is the framework for posing the problem, but that wasn't why the theory was developed.

So what is information? Information is a reduction in uncertainty, and that allows us to give it a formal definition. For example, say that somewhere in the outside world any of six colors could be presented, and your job as an observer is to figure out which one. You get to look at the response of a neuron, and after you see that response, you're sure that only one of two possibilities was present. We could say there was a reduction of uncertainty from six possibilities to two, and Shannon's theory says
that we will quantify that as the log of the ratio of the number of possibilities before to the number of possibilities after. Why does that make sense? Well, first, things are rarely that simple. It could be that all six colors were equally probable a priori, but it's very unlikely that when you observe a response it tells you exactly which stimulus occurred; it probably just reweights the probabilities. So we need to extend the definition, from the log of the number of possibilities before over the number after, to a situation in which no response is definitive and each one just reweights the probabilities. We need some way of taking our numbers, the probabilities that a particular stimulus was associated with a particular response, and turning them into a quantity called information. It turns out that there's really only one way to do that, which is what makes the theory so nice, provided we're willing to require two properties of this number.

The first property is that it should satisfy the "data processing inequality," which is a lot more straightforward than it sounds. The idea is this: say you're the observer, trying to figure out what the stimulus is, and you know that if there are no spikes, these are the probabilities; if there's one spike, these are the probabilities; if two spikes, these; and you do the best you can to guess the stimulus. The next day you could say, well, maybe when I see that one spike I was wrong, and instead of using those probabilities I'll use some mixture of the others. That shouldn't help you: you shouldn't be able to do a better job by pretending your observation was incorrect and really should have been something else. So we want to make sure that if you use that weird rule, pretending the observation was something else and then calculating information, you get a lower number, reflecting the fact that you're not doing as well. In other words, if instead of looking at the actual output you look at some corrupted output, the amount of information should be less. That's one property any reasonable definition of information should have, and it goes by the formal name of the data processing inequality.

The second property is that if you have independent channels (you might think of them as independent pipelines), the amounts of information in the two channels should add. For example, say the six stimuli fell into two shapes and three colors, and there's a color neuron that tells you which color and a shape neuron that tells you which shape, each of them definite. From the two together you can figure out the stimulus. The color neuron gives a reduction of uncertainty from six to two; the shape neuron, from six to three; together, they determine the stimulus. What works out nicely if you use the log is that one log ratio is a ratio of three, the other a ratio of two, and together you get a ratio of six: the logs add up. The basic idea, and maybe you can see it at this point, is that if you want reductions of uncertainty from independent channels to add, there has to be a log. The implications of the data processing inequality are not so straightforward, but the two requirements together lead to exactly one way of taking the input-output probabilities and calculating a quantity that satisfies both. So that's where information comes from. I'm not going to go through the details, except to say that the key quantity is something called the entropy, which is a quantification of the uncertainty about the stimulus, either before or after you see a response. The entropy has a very special form: the sum, over all the different kinds of events that could be present (the events here meaning which object was present), of each probability times the log of that probability. So it seems straightforward: you measure the probabilities, take their logs, multiply each by its probability, and add them up, and that's your description of the uncertainty before there's a response. You do the same thing after the response, and the difference of the two is the information. (Why the minus sign? Without it you'd have negative information, "negentropy"; the sign is just a convention. The log is the problem, not forgetting the minus sign.)

Now, when you start to get into the fine print, you realize there's a problem. It seemed straightforward: just sum, over the different kinds of events, their probabilities times their logs. But what do you mean by a different kind of event? Sometimes we can control what the stimulus is, so on that side the different kinds of events are pretty straightforward. The problem is on the response side. We have a spike train; we know the time of occurrence of each spike, measured with our analog-to-digital converter or timer or whatever. But now we have two responses in which one spike differs by a millisecond. Should we call that a different kind of event, and use it to establish a different bin for the probabilities, or not? This is where the data processing inequality, which made information such a beautiful thing, comes back and bites you, because it says that if you decide to consider two different responses the same, you're necessarily shortchanging yourself in terms of the amount of information. So it tells you to divide the responses into the finest categories you possibly can; ideally, if two spikes differ by a microsecond, you should consider the possibility that they mean different things.
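To make the two defining properties concrete, here is a minimal plug-in calculation. The function and the two-shapes-by-three-colors table are my own sketch, not from the talk: information computed from a joint stimulus-response table is additive across the independent shape and color channels, and lumping two responses together can only lose information, as the data processing inequality demands.

```python
import numpy as np

def mutual_info(joint):
    """Mutual information in bits from a joint stimulus-response table:
    sum of p(s,r) * log2( p(s,r) / (p(s) p(r)) ) over nonzero cells."""
    joint = joint / joint.sum()
    ps = joint.sum(axis=1, keepdims=True)   # stimulus marginals
    pr = joint.sum(axis=0, keepdims=True)   # response marginals
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (ps * pr)[nz])).sum())

# Six equally likely stimuli (2 shapes x 3 colors); a definite color neuron
# plus a definite shape neuron jointly identify the stimulus, so there are
# six distinguishable responses and the joint table is diagonal.
joint = np.eye(6) / 6.0
print(mutual_info(joint))     # log2(6) = log2(2) + log2(3): the logs add

# Data processing inequality: decide to treat responses 4 and 5 as "the
# same kind of event," and the information can only go down.
merged = np.hstack([joint[:, :4], joint[:, 4:].sum(axis=1, keepdims=True)])
print(mutual_info(merged))    # ~2.25 bits, less than log2(6) ~ 2.58
```

Here the merge costs exactly the 1 bit needed to distinguish the two stimuli that now share a response, weighted by how often that response occurs.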
This will get you into trouble, because you will almost never observe enough responses to actually measure the stimulus-response probabilities. So, in a way, calculating information presumes that you have agreed on a code, and that's where the code comes in: you basically measure how much the information changes as you change your assumed code. I'm skipping a slide ahead here, but that's probably a good idea given the time. This is exactly the idea: for a variety of reasons you can't consider all responses, no matter how slightly different, to be different, so you have to hypothesize which ones you think are functionally the same. Once you've done that, by saying, for instance, that if a spike doesn't move by more than three milliseconds it's basically the same spike, you might be able to make measurements and calculate information. You can then ask: what if the spike moves by ten milliseconds and I use larger bins? Does that reduce the amount of information? If it doesn't, then three-millisecond resolution is irrelevant to the code; if it does, it's relevant. That's actually one of the big bottom lines I wanted to get to, and I see that it's late, so I'll say just a little more, show you a framework slide, and then I think we'll wind up.

This next point sounds technical, but it's not; still, the problem with slicing the bins into finer and finer parts is really insidious, and let me explain why with a typical example. Say you had a student, and you wanted the student to tell you the average height of the males and the average height of the females in the university, and you sent the student out to make the measurements. Since it wasn't a very large place, he measured 100 heights of males and 100 heights of females, and instead of just writing down the heights and averaging them, he made a histogram of each. If he binned the histograms in centimeters, no problem: you could measure the average height from the histogram. But say I had a really weird student who measured height to the nearest nanometer and made a histogram at nanometer resolution, so every bin has either zero or one in it. You would then have a very bad estimate of the probabilities of the different heights. More importantly, though, you could still use that histogram, even binned in a really stupid way, to measure the average height exactly. So when you're trying to measure an average, binning things into finer and finer bins doesn't cause a problem. If instead you're trying to find the logs of the probabilities, things are really problematic, and the reason is that the log of a probability is not a linear function of the probability, so a positive and a negative error don't cancel out, whereas a positive and a negative error in exactly where you binned a height do cancel when you're averaging. So there is a bias in estimating p log p when the probabilities are small, and unfortunately that bias is in the opposite direction from the bias the data processing inequality imposes, which says you have to make the bins small. As you make them smaller and smaller, you don't know whether your net bias is positive or negative, but you know it's big. That's a really bad situation, and yes, there are ways out of it, but the ways out all involve building a model, making some explicit statement about the relationship among different spike trains, which is a way of formalizing the idea of what the neural code is.

So I'm going to skip to my final slide, which shows that framework. The good news and the bad news: you can bin spike trains into millisecond segments if they're not too long.
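The height-histogram story can be run as a simulation (a sketch with made-up uniform "heights"; the sample and bin counts are my own choices): the mean survives absurdly fine binning, but the plug-in estimate of the entropy collapses once the bins outnumber the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(samples, n_bins):
    """Plug-in estimate of entropy in bits: -sum p log2 p over occupied bins."""
    counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

n = 100
heights = rng.random(n)        # 100 "heights", uniform on [0, 1)

# The average is fine no matter how finely you would have binned it:
print(heights.mean())          # close to 0.5

# The entropy estimate is not.  For uniform data the true entropy is
# log2(n_bins), but with n samples at most n bins can be occupied, so the
# estimate saturates near log2(n): a bias that grows with the bin count.
for n_bins in (10, 1_000, 1_000_000):
    print(n_bins, np.log2(n_bins), plugin_entropy(heights, n_bins))
```

With a million bins the estimate can never exceed log2(100), about 6.6 bits, however the data actually fall; the nonlinearity of p log p is doing the damage, exactly as in the lecture's argument.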
But if the trains are long enough to be relevant to primate behavior, then you need more bins than there are atoms in the universe, so you have to do something else. Okay, so here is the framework I wanted to get to, which is basically a way of thinking about models for neural codes. You have your data, which is just a sequence of spike times over different responses. First you have a choice: do you want to consider those as events in continuous time, namely a point process, or as a sequence of symbols, like zeros and ones, or long, medium, and short intervals? That takes you down two branches: are you conceptualizing spike trains as happening in continuous time, or as a sequence of discrete symbols?

Once you've done that, on either branch, you can explicitly postulate a relationship between the symbol sequences. The purest thing to do is to say that every symbol sequence is different from every other and they have no relationship to each other. That gets you into what's called the "direct" category of methods, which requires an enormous amount of data and is basically impractical when response durations are much longer than the temporal resolution of the system: useful for insect systems with very short latencies, not useful for vertebrate systems, but nevertheless a reasonable approach, with a very weak model, namely symbols at discrete times with no relationship between different symbol sequences. Alternatively, you could work with symbol sequences and say: but symbol sequences might be related; for example, if one symbol sequence contains another, it might be saying something related. That gets you into a bunch of methods that are not nearly as data-hungry, because you've made a stronger model; there are at least four of them that I'm aware of.
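As a concrete version of the discrete-symbol branch (a sketch; the spike times and bin widths are invented for illustration), one choice of code is to slice a response into bins of width dt and write it as a binary word. The direct approach then estimates word probabilities, and comparing results across values of dt asks whether fine timing carries information:

```python
import numpy as np

def spikes_to_word(spike_times, t_start, t_stop, dt):
    """Discretize a spike train into a 0/1 word at temporal resolution dt."""
    edges = np.arange(t_start, t_stop + dt / 2, dt)
    counts, _ = np.histogram(spike_times, bins=edges)
    return tuple(int(c > 0) for c in counts)   # one symbol per bin

spikes = [0.012, 0.019, 0.047, 0.081]          # one trial, times in seconds

# The same response is a different symbol sequence at each resolution:
print(spikes_to_word(spikes, 0.0, 0.1, 0.010))   # (0, 1, 0, 0, 1, 0, 0, 0, 1, 0)
print(spikes_to_word(spikes, 0.0, 0.1, 0.025))   # (1, 1, 0, 1)

# The catch from the slide: a T-second response at resolution dt allows
# 2**(T/dt) words, e.g. 300 ms at 1 ms is 2**300, around 10**90 possibilities.
```

Note that the 10 ms word already merges the first two spikes into one symbol; that is exactly the kind of hypothesis about "functionally the same" responses the lecture describes.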
On the other side of the major branch, you could treat spike trains as events in continuous time. Now some event sequences contain one spike, some two, some three, and you can either postulate that the dependence on the number of spikes is also systematic, or decide to treat the one-spike trains separately from the two-spike trains, separately from the three-spike trains, and so on. Each of those choices, in turn, gives you a cluster of methods. There's obviously no point in trying to explain all of these methods here, but there's a lot of point in emphasizing that each of these sets of hypotheses about the relationships between spike trains gets formalized as a method for calculating information from spike trains, and, as mentioned, you should expect that your result will depend on that choice: if your model ignores something that's actually relevant, you'll underestimate the information, assuming you have enough data. I hope this is a good starting framework for anybody who wants to do this kind of thing. Thank you for your attention, and thank you for your questions.