This is our seventh lecture. Last time we derived what I think is a very important result: if neural responses are given by a logistic function of a function of the stimulus, then the model has an information-preserving population vector, or readout, given by this expression. It is the average, over neurons, of whether each neuron produced a spike or not, times its preferred feature, a receptive field or a place field. Here I have specifically highlighted the scaling factor beta_i; last time it was incorporated into the definition of the vector w. In the first part of today's lecture I will go over the implications of this derivation, how it differs from the standard population vector, and what the standard population vector is. So let's think about what happens with this factor beta. If beta is large, then we have a very steep nonlinearity. Maybe, Angela, would you mind drawing this? Sorry, the line? Yes. Somehow I cannot annotate on my screen, so I will trace it with the pointer, and if you can reproduce it on the blackboard, that would be good. As a function of the product of w_i and s, where w_i is normalized to unit length, this function is a sigmoid. Should I draw the sigmoid? Yes, thank you. It can go between zero and one. Yes, thank you so much. That was for some standard value of beta; if beta is very large, the function becomes a step function, and if beta is smaller, it becomes more shallow. In last time's presentation, the w_i were not normalized, and this beta factor was incorporated into the definition of w.
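As a minimal sketch of the sigmoid just described (the function and variable names here are my own illustration, not from the lecture slides): the spike probability is a logistic function of the projection of the stimulus onto the preferred direction, with beta controlling the steepness.

```python
import numpy as np

def spike_prob(s, w, beta=1.0, alpha=0.0):
    """Probability of a spike: logistic function of the projection w . s.

    beta is the gain (like 1/kT: large beta = steep, low-noise neuron),
    alpha is the threshold. Names are illustrative assumptions.
    """
    return 1.0 / (1.0 + np.exp(-beta * (np.dot(w, s) - alpha)))

w = np.array([1.0, 0.0])              # unit-length preferred direction
s = np.array([0.5, 0.0])              # stimulus
shallow = spike_prob(s, w, beta=1.0)  # gentle sigmoid
steep = spike_prob(s, w, beta=50.0)   # approaches a step function
```

Increasing beta pushes the curve toward a step; at the threshold (w . s = alpha) the probability is exactly one half regardless of beta.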
And the reason I am separating them is that you can think of beta_i as reflecting the noise in a neuron: a neuron with large beta has small noise, and a neuron with small beta has large noise. So beta is, as in physics, like 1/(kT). Maybe, Angel, if you can write down beta = 1/(kT), then for the physics students it will make the connection with statistical physics. The reason we go over this in more detail is that in the neuroscience literature this factor was not taken into account, even though the expression that was used is otherwise the same, and we will go over the consequences of this. Way back in 1986, Georgopoulos and colleagues empirically observed that this procedure works. It was an empirical observation; they did not have a derivation, so they missed the beta factor. But that was their procedure, and empirically it was observed to work very well as a readout of neural activity. So how is it done in practice? In their case they were recording motor cortical neurons, neurons whose spikes encode movement, movement of the hand. The example here is just one neuron, and its characterization is shown on the right. The monkey makes various movements, and if it makes a movement down and to the left on the screen, there are lots of spikes from that neuron; when it makes a movement in the opposite direction, there are much fewer spikes. Each little tick is a spike, and each movement is repeated 10 times, so you can see the variability across trials. As a result of the experiment, for one neuron you can draw a preferred vector w for that neuron, which here points somewhere down and to the left.
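As a rough sketch of this Georgopoulos-style characterization (the spike model, parameter values, and all names here are assumptions for illustration, not the actual experimental analysis): simulate trials with random movement directions, record +1/-1 responses from a logistic neuron, and estimate the preferred direction as the spike-weighted average of the movement directions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([0.0, -1.0])   # hypothetical preferred direction ("down")

# Many trials: random unit-length movement directions.
n_trials = 2000
d = rng.standard_normal((n_trials, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)

# Logistic spike model from the lecture (gain beta = 4 assumed here);
# the response is +1 for a spike, -1 for no spike.
p = 1.0 / (1.0 + np.exp(-4.0 * d @ true_w))
r = np.where(rng.random(n_trials) < p, 1.0, -1.0)

# Spike-weighted average of movement directions, normalized to unit length:
# this recovers the neuron's preferred direction.
w_hat = (r[:, None] * d).mean(axis=0)
w_hat /= np.linalg.norm(w_hat)
```

With enough trials, w_hat aligns closely with the true preferred direction, which is the sense in which many repeated experiments characterize a single neuron.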
And so, by studying that neuron in isolation, we can assign to it a single vector w, which characterizes its preferred direction, the direction of movement for which it produces the most spikes. So on the right we have a picture for one neuron, and many, many experiments go into characterizing just that one neuron. But once we have obtained this characterization, we can decode the whole movement the animal makes by monitoring many neurons. For each neuron, after the detailed study of that neuron, we get its w_i. This is obtained for each neuron across many separate experiments. But now we monitor in real time: the w_i are known, and what we observe across the neurons is whether or not each one produced a spike. A spike is denoted here as +1 and the absence of a spike as -1, so both spikes and silences matter. When a neuron is not producing a spike, it is effectively voting for the direction opposite to its preferred direction. Any questions about the procedure? Looks like there is a question, no? No, it seems there is no question. So, do you think we can think of this as a democracy? Does everybody see the connection with democratic voting? No? So each neuron has a preferred vector w, and it has only one vote, +1 or -1. We add the contribution w_i from each neuron, weighted by its spikes, and as a result we get a vector T. That is my connection with democratic voting: a neuron produces a spike, and by producing a spike it pulls the average, the final outcome of the whole population, toward its preferred direction. Each neuron is given its preferred w, it has a choice, to produce a spike or not, and then we add them up and get the final vector.
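The voting procedure can be sketched as follows (a toy population, not the monkey data; r_i is +1 for a spike and -1 for no spike):

```python
import numpy as np

def population_vector(W, r):
    """Classic population vector: T = sum_i r_i * w_i.

    W has one unit-length preferred direction per row; r holds the
    responses, +1 (spike) or -1 (no spike).
    """
    return W.T @ r

# Four neurons with preferred directions at 0, 90, 180, 270 degrees.
angles = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Only the rightward-preferring neuron spikes; the other three vote
# against their own preferred directions.
r = np.array([1.0, -1.0, -1.0, -1.0])
T = population_vector(W, r)
```

Here the silent leftward neuron also pulls T to the right, which is the sense in which "no spike" carries a vote too.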
Well, calling it a democracy is not a mathematical statement, but do you agree or disagree? Or at least, any questions about the procedure? I think there is a question. Yes. Hi, one question: this T is just for one neuron, right? No, it is a joint result across many neurons. This is the population response, and it is one vector; you can think of it as the decoded movement. Suppose we have a brain-machine interface. I am monitoring the r_i, and the person physically cannot move the hand, but I have a robot, and I read out the r_i because I want to know where the person wants to move the hand. For each neuron, once I have assigned its w_i, I weight the spikes at a given moment in time, and that gives me the direction in which I think the person wants to move the hand. Brain-machine interfaces are useful for paralyzed people, where the brain is still working but the neural connection between the brain and the limb has been cut. In that case, they work by monitoring the neurons, and you can even fit the w_i by asking the person, for example, where they want to move. Once the device is calibrated, then when the person produces various spikes across the population, the weights are applied, the sum is taken across neurons, and you get a single vector which represents the intended movement of the hand. Go ahead; I think there was another question. Yes, it was about the w_i: if I got it right, they are fixed by the experiment, and they represent something like the preferred direction of movement? Yes, the w_i are fixed per neuron. Well, they can change slowly through adaptation, but the w_i are fixed per neuron, and then you add them up, weighted by the spikes from each neuron, to get the population readout.
And so, in other words, the relationship between the left- and right-hand sides of the slide is as follows. On the right-hand side, we use many observations across time on a single neuron to get its w_i. Once we have this information for many neurons, we can make a prediction at a single time point by averaging across the population. So on the left we average across the population at one moment in time; on the right we average across time to get the estimate for one neuron. In other words, this vector T is the expected movement. You can also think of it as using the same equation on the right-hand side, but with the actual T and the actual r_i, where the index runs over time, and then averaging to get w_i. That may be more confusing than clarifying, but please ask, so that we figure it out. Okay, thanks. Every time you ask a question, it is helpful; it is like shining light on an unknown part of the derivation. So anyway, this is the procedure. It was developed in the context of movement, but it is actually very ubiquitous, across many species and different modalities. Instead of motor neurons, you can think of visual neurons; in that case w_i is the preferred visual pattern of a neuron. I monitor a neuron for a long time to get its preferred visual pattern. Then, once I repeat this experiment across neurons, I have a set of, say, 100 neurons, and for each of them I know the preferred visual pattern. When I then monitor their responses in real time across the 100 neurons, I can add up their preferred patterns and get an estimate of the stimulus at one moment in time. Okay. Let's try to move on; if there are questions, please ask now. If not, maybe it will become clear from the next slide. As you see, this procedure is similar to what we derived. So now a few comments.
What we derived was similar to this procedure, except we had the extra factor beta_i here. This factor scales the magnitude of the receptive field vector; in the experiments it was simply normalized to one, because all the experimentalists cared about was the direction of movement. But the mathematical derivation says that the magnitude of the vector has to be scaled by a factor that represents the noise. So if you accept that this is a kind of democratic voting, what we obtain from information theory is not purely democratic voting; it is voting biased by reliability, or experience. A neuron that is less noisy gets a stronger weight. We will go over this in more detail. And remember, the formula that we derived from information theory is very general: it just assumes a logistic function, so it should be applicable as widely as this procedure. It does not matter whether the neuron is visual or auditory, whether it is a motor neuron or represents some complex decision-making quantity. This is a general readout, and you can see that empirically a very similar procedure based on the standard population vector is ubiquitous across many species. But the question was raised whether this is the correct procedure, because it makes the assumption that if neurons have the same preferred direction or preferred pattern w_i, then their spikes can simply be averaged. And this goes against the notion that the neural code is very complex. If we have N neurons, there are 2^N different response patterns, and these patterns, if we use them, have the capacity to convey a lot of information; if we are just adding spikes, we will be losing information.
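The derivation's readout scales each neuron's vote by its reliability. A minimal sketch (names assumed): T = sum_i beta_i r_i w_i, so two neurons that share a preferred direction but differ in beta now produce distinguishable readouts.

```python
import numpy as np

def info_preserving_readout(W, beta, r):
    """Information-preserving readout: T = sum_i beta_i * r_i * w_i.

    Like the classic population vector, but each neuron's vote is
    scaled by its reliability beta_i (less noisy neurons weigh more).
    """
    return W.T @ (beta * r)

# Two neurons with the same preferred direction but different reliability.
W = np.array([[1.0, 0.0],
              [1.0, 0.0]])
beta = np.array([3.0, 1.0])

# The patterns (spike, no-spike) and (no-spike, spike) now give
# different readouts, so they are not merged into one code word.
T_a = info_preserving_readout(W, beta, np.array([1.0, -1.0]))
T_b = info_preserving_readout(W, beta, np.array([-1.0, 1.0]))
```

With equal betas the two patterns above would collapse onto the same value of T, which is exactly the information loss of the standard population vector discussed next.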
So people were doubtful about this population vector procedure, and they decided to test its key assumption: that if neurons are tuned to the same stimuli, then their responses add. Does everybody see this assumption in the equation? Where does it show up? Suppose among the w_i there are two neurons with the same preferred vector, so w_1 equals w_2. I think, Tanya, sorry, there is a reply here. Does it work? I'm sorry, I lost what assumption we are making. So consider this sum over neurons, and imagine there are two neurons whose w_1 and w_2 are the same. In that case, according to this procedure, I can just add the spikes r_1 and r_2, and the contribution of these neurons to the overall average stays the same. In other words, the procedure assumes I can add the spikes of neurons with the same w_i and not lose any information. We get r_1 + r_2, and it no longer matters that, in a combinatorial code, (0, 1) and (1, 0) are two different patterns. Here it says that only r_1 + r_2 matters, but the patterns (0, 1) and (1, 0) have the same total number of spikes across the two neurons, so we are losing a channel: (0, 1) and (1, 0) used to be two separate code words, and now we are saying they are the same code word. With two neurons we technically have four code words: (0, 0), (1, 0), (0, 1), and (1, 1). So, Angel, maybe it is useful to write this on the board for two neurons. You said two neurons? Yes, let's consider two neurons. In this sum we consider two neurons, and in principle we have four different response patterns. Maybe you can just draw a table, Angel: one row per neuron, and then you sum them up. So minus one... Yes. Ah, okay, let me see, maybe a table like this.
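The table being drawn can be sketched in code: with two neurons there are four response patterns, and keeping only the spike count merges (0, 1) with (1, 0).

```python
from itertools import product

# All response patterns of two neurons (1 = spike, 0 = no spike).
patterns = list(product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)

# The population count keeps only the total number of spikes per pattern.
counts = {p: sum(p) for p in patterns}

# Patterns that the count can no longer tell apart:
merged = [p for p in patterns if counts[p] == 1]
```

Four code words shrink to three count values, so one distinction, the identity of which neuron fired, is thrown away.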
So we can talk about this table; it is for three neurons. It is a hypothetical table: you have three stimuli, neuron one responds to stimulus one, neuron two responds to stimulus two, and neuron three to stimulus three. Therefore, if you know which neuron produced a spike, you know exactly which stimulus occurred. But if I sum spikes across neurons, the total number of spikes is one for every stimulus, so I have lost information; that is how much information is conveyed by the population count. This example is a bit of a simplification, because to make the connection with our formula, all the stimuli would have to lie along the same axis, but I think it illustrates the general point. Going back to our expression: if in the neural population you have two neurons tuned to the same, say, direction of motion w_i, then in this population vector you no longer distinguish the identities of these neurons, because in the sum you just have (r_1 + r_2) times w_1, since w_1 and w_2 are the same. I add the spikes from these two neurons and no longer keep track of which neuron fired and which did not. Yes? Are there questions? Yes, maybe one. Tatiana? Okay. Sorry, just confirming: is T normalized or not? Thank you for your question.
So T does not have to be normalized, but it is true that it has a maximum value, and that is an important question once we go to the more general procedure with beta_i. Imagine that the stimuli can range between minus infinity and plus infinity, but in this procedure the w_i have unit length, so even if all the w_i point in the same direction, the maximum magnitude of T is the number of neurons. This T ranges between -N and N, so I have compressed a stimulus that ranges from minus infinity to plus infinity into a finite range. Part of that compression is the consequence of the sigmoid nonlinearity that Angela drew on the board, but now summed across neurons. So it is true that from the stimulus to the readout T there is a nonlinear compression, but what is interesting is that T is a linear function of the neural responses. It is a linear readout, but it gives you a nonlinear version of the stimulus, because the neural responses are a nonlinear function of the stimulus. Okay? Sorry, we have a question online. Do you see it, or should I read it? Yes, I see: why was the information in the table zero? Okay, in the table that is here you have three different stimuli, and the count was the same for all of them, right? Since the count is the same across all the different stimuli, its entropy is zero, and so the information between the stimulus and the count is zero. Okay? Thank you. All right. So now we can talk about testing this assumption, the assumption that neurons tuned to the same stimulus can have their spikes averaged. Here is the first study, from almost 20 years ago, from Jonathan Victor and colleagues. This is in the primary visual cortex, and the neurons are similarly tuned; their w's are not identical but similar, because they were recorded with tetrodes from a nearby part of the cortex. What these investigators showed is that the information in the population count was significantly less than the information when you pay attention to which neuron produced which spike, that is, in the labeled-line code. In their discussion they say that this is an argument that the nervous system takes advantage of the complexity of the neural code, of this combinatorial power, but how it does so is not clear, because you would have to keep track of neuronal identities and potentially 2^N different response patterns. Now we have a solution that we will be testing, in terms of this information-preserving readout. So that is one study, from Jonathan Victor; another study is from Bill Bialek and collaborators. They also wanted to test this idea in the population vector, that if neurons have the same tuning you can average their spikes. These are earlier experiments, so they construct a synthetic population by aligning all the neuronal tunings to have the same preferred direction, creating a hypothetical representation. The logic, the experimental design, has several steps. In reality these neurons are tuned to various directions, but we can measure how a neuron responds to stimuli offset by a certain amount from its preferred direction, which gives the tuning curve, and then we can create a synthetic population of how these neurons would respond to a given real stimulus. It says: I recorded this neuron separately, but I imagine how this population would respond to the actual motion, displaced relative to each neuron's preferred direction. In this case one can lay out different neurons along one axis and time along the other, and compare the information transmitted if we just monitor the summed activity of this neural population, that is, the counts, versus keeping track of which neuron produced which spike. You can see that you get much more information in the words, which is the combinatorial code, compared to the counts. All right, we have a question... that is answered, all right. So this is, I would say, a more direct test, but it also had several steps, because the population is synthetic; in those days it was hard to record many neurons at the same time. So they imagine that for each real neuron with a given preferred direction, there is another neuron with the preferred direction that we need, and we shift the stimuli and assign the responses relative to that preferred direction. But now we have a solution, and even in the discussion of Leslie Osborne and Bill Bialek's paper they asked how this could be. One possibility is that these neurons have the same peak response rate and the same preferred stimulus, and yet the identities of the neurons matter for the purpose of characterizing the stimulus. You can imagine one way identity could matter: if the stimulus is somewhere on the flank of the tuning curve, and the narrowly tuned neuron did not respond but the broadly tuned neuron did, then that helps us narrow down the range where the stimulus could have been. And so,
even though these neurons have the same preferred direction and the same peak magnitude, they have different tuning widths, and because they have different tuning widths, taking that into account carries information about the stimulus which is ignored if we just add the spikes. But now the claim, from our information-preserving procedure, is that there is actually a linear way to take this into account. So how can we get there? Most studies of population decoding are based on tuning curves: you plot the spike probability as a function of the deviation of the stimulus from the preferred direction. Now, there is a one-to-one correspondence between this tuning curve, which is shown here, and the parameters of our logistic nonlinearity. This is called the linear-nonlinear (LN) model. The reason it is called linear-nonlinear is that you take the actual stimulus, a pattern, and project it onto the dominant preferred pattern, our vector w; that projection goes on the x-axis, and the probability of a spike is a nonlinear function of it with two parameters: the threshold, and how fast it rises, with width one over beta. When you vary these two parameters, you vary the orientation tuning curve. For some reason, when people talk about encoding, meaning how stimuli are converted into spikes, they usually work with this LN formulation, because stimuli can be complex and are not necessarily described by one parameter such as orientation, so the LN model is the more useful description. But for decoding, people said, well, let's simplify the picture and work in the context of the tuning curve. That is fine, because there is a one-to-one relationship between the parameters, but the claim is that the mathematics is much simpler in the LN formulation, in terms of the parameters alpha and beta, than in terms of the parameters of the tuning curves. For example, to illustrate this relationship between the parameters of the LN model and tuning curves: when beta is decreased, so the nonlinearity becomes more shallow, the tuning curve is simultaneously squashed and broadened; and if you change the threshold, the peak decreases while the tuning changes somewhat, but not as strongly. In other words, as I change the threshold and the gain in this formulation, there is a family of tuning curves on the right that one can obtain. The advantage of this formulation is that I have a simple information-preserving population vector. The beta and alpha are related to the difference between the peak and the trough and to the width, but to extract them from a tuning curve you have to take the logarithm of the curve near the peak and take a second derivative, and that gives you something similar to beta. The mathematical formula is more complicated, and therefore, if I wanted to derive the information-preserving readout from tuning curves, I would say it is nearly impossible, whereas it is very easy in this LN formulation. So now some intuition for how this works. The claim is that we can take a linear function of the neural responses and capture all the information there is in the neural population, because that is the definition of a sufficient statistic. So let's see how it works with two neurons. Suppose I am trying to encode stimuli in a two-dimensional plane, and the two neurons have their preferred directions w_1 and w_2. Yes? Is that a question?
I don't think so. So the two neurons have four different patterns: (0,0), (1,0), (0,1), and (1,1). On this two-dimensional plane they are represented as four dots, four values of a single two-dimensional vector. So far we have not simplified much, but imagine that you now have three neurons, and then many more neurons. First I add the third neuron without it spiking, so we get the same four values; then, when the third neuron spikes, the whole set of values is shifted. Sorry, there is some noise. Sorry? In the end we have the airplane today; it has to be outside, so we have to wait. Okay, it is not our fault. Yes, we can be assured that the military is working to protect us; it looks like an alien invasion. Yeah, I am from Kiev, so I am very distracted by the current events. I can imagine. So in this case, with three neurons we have eight possible patterns that the population of three neurons can produce. Technically the code could be a very complicated function of these patterns, but the claim is that there is a geometry in this space: instead of saying that this is an arbitrary function of eight arbitrary patterns, the eight patterns can be encoded as eight values of a two-dimensional vector, if the receptive fields of the neurons live in the 2D plane. So I am making a projection from eight patterns to eight values of the vector. Now imagine you have n neurons, but the job of these neurons is to encode position in a 2D plane. We will wait for the noise to pass. So with n neurons we have 2^n patterns, but they still live in the 2D plane, so these are 2^n values of a single two-dimensional vector. And you can see now, the claim is, when I monitor that vector,
various combinations of spiking and not spiking of these neurons will give you one of these values, and this is the advantage of the geometric view. We have converted what was an information-theory pattern problem into a geometry problem, where the readout is a vector with continuity and topography to it: when the stimulus changes, even if there is noise in the neural responses, this vector moves among the nearby values. So if I do not want to monitor all 2^n values, I can coarse-grain the available space down to the number of patterns that I can monitor, and in this way capture most of the information without necessarily monitoring all 2^n values. I feel it is a good time to ask for questions. Any questions? Yes, one. Tatiana? How do you place the positions of the possible values on the plane? I mean, are they random, or uniformly distributed? So it depends: these possible points are determined by the W's. In this case, if you have three neurons with these vectors, then the three basis vectors, say (0,0,1), (1,0,0), (0,1,0), determine the eight possible patterns that you can have. Now, the question of how to set the W's: you will have freedom in setting these W's so as to maximize information, in particular via the derivations we went over about how to code multi-dimensional inputs and how to filter them. I will wait a little bit. There was a question in the chat about which neurons to include in the population: include all the neurons that you measure. The more neurons you have, the greater the accuracy, because the greater the number of possible values. If I have only three neurons, I can encode stimuli in eight values, but if I have many neurons, I will be able to distinguish stimuli that are closer together, because there is a higher number of possible values with which the stimuli can be encoded. So I have another simulated illustration of the difference between the information-preserving and the classic population vector, and then maybe we will ask for questions. Imagine that you have again three neurons, but now two of these neurons have the same direction while having different noise values. In the information-preserving version, because these vectors have different lengths, the patterns of spiking give different readouts depending on whether the first neuron or the third neuron is spiking, and so we keep our eight patterns. But in the classic population vector prescription all the vectors are normalized, so when you add them you get the values 0, 1, or 2 along that axis, and the same on the other axis; instead of eight response patterns you have six response patterns. So now, I think, the questions. That is the connection from information theory to geometry: information theory speaks of patterns, and we sum over possible patterns; the claim is that if you have this minimalistic model of neural responses, where the patterns are based on a linear projection of the stimulus passed through a nonlinearity, then you can convert these patterns into a continuous, topographic mapping, where the neural responses are mapped to a discrete set of points on a continuous map. Somebody has to ask a question, so maybe I will ask one myself. So what you are saying is that, essentially, by having different vectors you allow the neural response to have a metric, a sense of which stimuli are close to which. Essentially, yes. By having different betas we are not losing that. You can see that if you ignore beta, you will still capture the qualitative features of the coding, the patterns will be similar, but some of the symbols will be merged. That explains why, when you look in detail as in Bill Bialek's paper or Jonathan Victor's paper, information is lost when we ignore the noisiness, the differences in reliability among neurons; but at the same time you can understand why this procedure is kind of
robust, and why the standard population vector generally works. But I think the theoretical advantage is that it also shows, and we will go over this with additional examples, how all the diversity in the tuning curves can be summarized with just one extra parameter. So maybe that's okay, thanks. All right, so then we can test this. Another prediction is that more reliable responses should be weighted more strongly, and no other information is necessary for the readout; for example, the thresholds do not contribute. The threshold acts just like the question that was asked about how the W should be chosen: how we choose the W determines how much information that particular population can capture, but once the information is there, it does not affect the readout. The same holds for the threshold: if I carefully position the threshold, which was the topic of a few lectures back, that determines whether the neurons convey more or less information, but once the thresholds are set, I do not need to know them to read out and capture the full information. So here is an illustration addressing the question raised in the Leslie Osborne and Bill Bialek paper: a population with the same preferred direction and the same peak response, but different tuning curve widths, where in this case the differences in tuning are differences in the beta factors. When you compute information, the black line is the full information, the red line is the information-preserving population vector, and the classic population vector, which only sums the responses, is here in gray. In this case the W is the same for all neurons, so you can actually ignore it, and we are really comparing the population count with the information-preserving weighted population count. Even though these are both scalar quantities, the information-preserving population count is weighted differently, and it takes
more values, and so it can capture more information. That explains the intuition behind Bialek's observation here: once you take into account the different widths, you should be able to capture more information, and it also tells you how, with just a single weight parameter. So that is one illustration, and then another one is here: in this case you can have tuning curves with the same preferred direction but different peaks and different widths, where these differences are induced by changes in the thresholds, not changes in the weight parameters, and in this case the classic population vector is perfectly sufficient. So I'm thinking that we can take a small break, maybe five minutes, and then continue the discussion. Would that be good? Do you agree? Okay, we will reconvene here after five minutes. Thank you. Can anyone call the others? [break] This simulation argues that you can have some variability in the tuning curves, in the peak and the width, and actually ignore it for the purpose of the readout, so not all variability in the tuning curve matters. On the other hand, suppose you have neurons with different preferred stimuli, so different vectors w, different directions. Then technically it does not matter whether they have the same betas or not, because when you add them up in different combinations they produce different possible patterns, and so there is no information loss in principle. But because the knowledge of the preferred direction itself has some uncertainty in it, there will be some loss in practice. So in this case we are modeling a situation where you have two sub-populations that are tuned to different preferred stimuli and have different tuning curve widths, and then again there is a difference between the classic
population vector and the information-preserving population vector, and you can capture the full information back just by taking the β factors into account. So we talked about this graph, and another interesting part that I didn't talk about is that the dots here are a slightly different calculation than the line. The line is the full information in whatever response patterns you get by taking different combinations; the dots come from binning these response patterns into 15 bins. It means that, just as in this case of, say, 2^n responses, technically, to capture the full information I have to keep track of which of these possible patterns the population vector landed on in each trial. But imagine you now regrid it onto a coarser grid and ignore small differences, so you merge two similar possible response patterns into one; the observation is that this leads to a minimal loss of information.

So, another point that was raised after the end of the previous lecture: what about correlations in neural responses? In this case the so-called noise correlations, meaning the direct links between neural responses: we have neurons whose responses fluctuate together for a given stimulus because of direct or indirect links between them. And it turns out that, based on our definition of the sufficient statistic, if you can write the joint distribution of neural responses given the stimulus in this form, meaning a function of the stimulus, a function of the response, and then this form which we had before, which is the coupling between stimuli and T, here has the neural-response term plus the correlation term. So if the correlation term is stimulus-independent, even though it affects the overall amount of information in the neural responses, you can read out what is there without taking it into account. The, I think, exciting aspect of this readout is that you can use the same readout when noise correlations
go up and down, but within a given ensemble they have to stay constant.

Sorry, may I ask you a question? Why in this mathematics do we have this denominator, 10 square root of n?

Oh yes, this is just for the simulation. One has to separate the general from the specific: this is the general formula for where this population vector works, and that factor was specific to this simulation; we had to choose something for our population of, say, 10 neurons or so.

What is also interesting and worth mentioning is that once you put these noise correlations in, the tuning curves take on all kinds of weird shapes. So this is another example of how the tuning-curve description kind of obscures the model: once I have these tuning curves, how do I get out the relevant parameter β that is hidden in there? If instead of plotting tuning curves I fit an LN model, then I measure those parameters directly in the experiment, and then we have a direct description. You will see papers in the literature that say that the diversity of the tuning curves matters, and it is true, but not all diversity matters, and how to extract the relevant parameters from the tuning curves is not clear there. Working with tuning curves, all we could say is that everything matters: the peak matters, the width matters, the position matters. But once you transform to the LN model, you can say that actually it is only this noise parameter β that matters.

So far it was all about model responses; now let us test some real neural responses. This is the same data that I talked about, that I recorded, from the primary visual cortex, but in addition to probing with natural scenes and white noise we also recorded responses to gratings, and so we are now computing information about gratings. You can compute the full information because this is a small number of neurons recorded simultaneously
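As a sketch of what "measuring those parameters directly" could look like: given binary responses from a single logistic LN neuron, β and w can be recovered by maximum likelihood (logistic regression) rather than read off a tuning curve. All numbers and names below are illustrative assumptions, not the lecture's simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground truth for one LN neuron: r ~ Bernoulli(sigma(beta * w . s))
beta_true, w_true = 3.0, np.array([0.8, -0.6])   # |w_true| = 1
S = rng.normal(size=(4000, 2))                   # Gaussian stimuli
p = 1.0 / (1.0 + np.exp(-beta_true * (S @ w_true)))
r = (rng.random(len(S)) < p).astype(float)       # observed binary responses

# Maximum-likelihood fit of v = beta * w by full-batch gradient ascent
v = np.zeros(2)
for _ in range(1000):
    pred = 1.0 / (1.0 + np.exp(-(S @ v)))
    v += 1.0 * (S.T @ (r - pred)) / len(S)       # gradient of mean log-likelihood

beta_hat = np.linalg.norm(v)                     # beta is the gain |v| ...
w_hat = v / beta_hat                             # ... and w the unit direction
print(f"beta ~ {beta_hat:.2f}, w ~ {w_hat.round(2)}")
```

The point mirrors the lecture: β and w are direct parameters of the LN model, whereas the peak, width, and position of a tuning curve mix them together.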
The full information is shown in black, and the classic population vector in gray; the latter loses information compared to the full information because it merges patterns together, but it is still better than the plain spike count, because the w's are not all the same. Okay. And maybe, looking under the hood of this calculation: on the left is the information-preserving calculation, on the right is the classic population vector, and each dot is the population vector for a given stimulus, with color denoting the same stimuli. You can see that the scales are different, and there is more scatter in the classic population vector, more mixing between patterns, than in the information-preserving one. Any questions about this calculation? No?

So, another question that I sometimes get here: in the LN model we changed two parameters, the threshold and the gain, and as a result, by changing the parameters, you can change the tuning curves in some way, but you are limited; you cannot create an arbitrary tuning curve, unless we start adding noise correlations. In particular, when we change these two parameters, α and β, there is a kind of constraint between the minimum firing rate and what we call the modulation of the tuning curve, which is the difference between the maximum and the minimum. As the threshold decreases I span the curves this way, and as I change β I do too, but basically, if the minimum is large, then there is less room to go in terms of the maximum. It turns out that there is some evidence for this from neural populations: in particular, neurons with strong multiplicative effects have weak additive effects, or vice versa. So that is indirect evidence that the variability in real tuning curves is consistent with the variability you would observe by changing only the two parameters α and β. So, in other words, we can
now summarize. I gave you the intuition about the claim that neural responses can be read out without information loss, and the plan now is to apply this formula to large neural populations and then compare with retinal arrays. The advantage of this approach is that we have a prescription: we derive a population vector T that has the same dimensionality as the set of receptive fields, which often equals the stimulus dimensionality, because that is how we probe them. So in a way we solve the curse of dimensionality in terms of the number of neurons, but what remains is the curse of dimensionality in terms of the stimulus. We say that it is the full information between s and T, but in practice that can still be a complicated calculation if s is itself high-dimensional. Any questions? Given the time, I will summarize some answers, and then we will go through the mathematics in more detail in the next lecture. So this is our full information between vector s and vector T.

Sorry, I have a question. In many cases, is the dimensionality of the stimulus always known? In the sense that there are some neurons that maybe respond to something that we don't know; are you only considering cases where this is known a priori?

It is not known, but it becomes known the moment I say I want to compute information about a given thing. Suppose these neurons are conveying information about a visual and an auditory stimulus, but I ignore the auditory part and only measure the visual one. Then I specify the stimulus: it is going to be a visual stimulus, and I measure the receptive fields with respect to the visual component, so the auditory component is averaged out, and then we are reading out. The neural responses will have information about the visual part, and if I am computing information about the visual part I can read out that information using this equation. In other words, it is a mutual information.
So, the moment I say I want to read out: the neural responses have information about the visual and the auditory signal; I say I want to read out the information between the visual stimuli and these neural responses, and I can do that using this procedure, with the population vector, which is a visual one, determined by the visual part of the receptive fields.

Clearly the dimensionality of the vector T is the same as the dimensionality of the stimulus, right? Because it is identified through these tuning curves, through the w's?

Yes, because I am using the stimulus to measure the receptive fields. In principle the receptive field can be more complicated, but only relative to the stimulus set with which I measured it. What I hope to have conveyed today is, I think, an interesting result: that you go from arbitrary 2^n patterns to a specific prescription for reading out these neural responses.

Yes, thanks. And then another question: if I remember well, does this depend on your assumption of a linear response of the neurons, I mean, in your equation for T?

The key assumption is that it has to be a logistic function. If it is not a logistic function, we do not have an information-preserving population vector; if it approximates the logistic function, then we will be approximately correct. And the other key assumption is that the argument is linear, w · s. You can relax that, but then, in the definition of the information-preserving population vector, if it is not a linear function, you have to put that nonlinear function there instead.

Okay, thanks.

So now we want to talk about what to do with high-dimensional stimuli. We are going to skip a lot of the math because we only have a few minutes, and we will go over it next time, but I will give an introduction. We want to compute this information between s and T. If s is one-dimensional, T will be one-dimensional; that is okay.
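The assumptions discussed in the exchange above can be made concrete with a small simulation sketch (population size, stimulus, and parameter values are our choices, not the lecture's): binary neurons with logistic responses and linear argument w · s, read out either by the classic population vector or by the β-weighted, information-preserving one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: binary neurons with p_i = sigma(beta_i * w_i . s),
# unit-length preferred directions w_i, two precision groups beta_i.
n = 40
angles = rng.uniform(0, 2 * np.pi, n)
w = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # preferred directions
beta = np.where(np.arange(n) < n // 2, 0.5, 4.0)         # low vs high precision

def responses(s):
    """One trial of binary spikes r_i ~ Bernoulli(sigma(beta_i * w_i . s))."""
    p = 1.0 / (1.0 + np.exp(-beta * (w @ s)))
    return (rng.random(n) < p).astype(float)

def classic_pv(r):
    """Classic population vector: average of r_i * w_i (no beta factors)."""
    return (r[:, None] * w).mean(axis=0)

def info_preserving_pv(r):
    """Information-preserving readout: average of r_i * beta_i * w_i."""
    return (r[:, None] * (beta[:, None] * w)).mean(axis=0)

s = np.array([1.0, 0.0])
r = responses(s)
print("classic:", classic_pv(r), " info-preserving:", info_preserving_pv(r))
```

When all β_i are equal, the two readouts differ only by an overall scale; with mixed β_i they differ in direction as well, which is where the classic vector can lose information.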
Computing information between high-dimensional variables, say s is 10-dimensional and T is 10-dimensional, is already problematic: if I use 10 bins for each dimension, that is 10^10 bins for the stimulus and 10^10 for T. So we can make some approximations. One of them uses the so-called chain rule of information, which I think the students have seen. For now we leave T as a vector and evaluate this component by component: the chain rule says I(s; T) = I(s1; T) + I(s2; T | s1) + I(s3; T | s1, s2) + ..., the information between vector T and component 1, plus the information between T and component 2 conditional on component 1, plus the information between T and component 3 conditional on components 1 and 2, and so on. This is a bit of a simplification, because these higher-order conditional informations that depend on the other components will be difficult to compute, so we will be ignoring some terms, or using coarse-graining or renormalization-group approaches to approximate them. I will go over these three approximations, and maybe this will serve as an introduction to the next lecture. The full information can be written as a sum of various conditional information terms, but they can be simplified; maybe I go in the opposite order. If the inputs are independent, then you can approximate this component by component, the information between s1 and t1, between s2 and t2, and so on; then, for more accuracy, one can add conditioning on the previous components. And if we have isotropic symmetry in the problem, then technically it is the two-dimensional information between s and T plus the magnitude of the rest of the components, and then we will evaluate these approximations. So maybe I stop here; I thank you for your attention and ask for questions, if there are any.

Yes, there is one, Tatiana. Sorry, one second. Sorry, it is not actually related to the lecture but to the exam; I just wanted to know how it is going to be done, I
mean, what method we will use for the exam.

I think it will be a multiple-choice set of questions; that is as much as I know at this point. And it will be based on the lectures.

Okay, thank you. Further questions? Maybe not... ah, no, there is one. Tatiana, sorry. Just two slides ago you mentioned the exponential property of the model, which simplifies the information; can you please explain it?

Okay, just a second. Yes: last lecture we talked about the sufficient statistic, and this was our model. Once you have this model, you can rewrite it as e to the (r_i β_i w_i · s), divided by some denominator that does not depend on r_i, and from this follows this expression here, if you can see my cursor. In order to have this exponential form we need a logistic function, and once you have this exponential form, this is the probability distribution linking stimuli and r: this part is independent of r, and we have some other part that is independent of s, so the only coupling between stimuli and responses is this term, and therefore this is the information-preserving statistic. If we do not have the logistic form, we do not have this.

By the way, yes, I know: I will upload the lecture notes as they are, but I wanted to clean up the typos that were noticed a little, and that is why I was delayed.

Okay, just on this slide: I think that if your r is binary you can always write the probability in this way, because essentially there is only one free parameter, so I think you can always write it in this way.

Okay, maybe I will think about it. I thought that if I have a sigmoidal tuning but not a logistic tuning, I would have difficulties writing it in the pure exponential form; if I take it as an integral of a Gaussian distribution, as an error function, I was worried that...

No, because essentially r takes just two values, one and minus one, so there
is only one number that you have to specify, and actually the two probabilities are related by the normalization condition, so it is only one number, and it is a function of the stimulus; so you can always write it in this form.

What about if it is a kind of error function? Do you think I can write it as the exponential of r_n times some function of the stimulus? For a binary variable I could write it as the exponential of r_n times the logarithm of that function...

Well, yes: if you have a binary variable, and the probability that the binary variable is one is p, you can always write the probability of the binary variable r as p to the r times one minus p to the one minus r.

Okay, yes, I think I agree: so it is the logarithm; if r is binary, then I can write it as e to the r times the logarithm of my function. Okay, thank you.

Okay, more questions, observations? If not, let's thank Tatiana again. Thank you. Bye-bye.
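The identity discussed in this closing exchange can be checked numerically. For any spike probability p(s) strictly between 0 and 1, the Bernoulli likelihood factorizes into an exponential form in which r couples to the stimulus only through log(p/(1-p)); that exponent is linear in s exactly when p is logistic. A minimal check, using a probit (error-function) tuning curve as in the question (the tuning curve itself is our illustrative choice):

```python
import numpy as np
from math import erf

# For a binary response r in {0, 1} with spike probability p:
#   p**r * (1-p)**(1-r) = exp(r * log(p/(1-p))) * (1-p)
def bernoulli(p, r):
    return p**r * (1 - p)**(1 - r)

def exp_form(p, r):
    return np.exp(r * np.log(p / (1 - p))) * (1 - p)

# Probit tuning, p in (0, 1): the exponential form still holds, but the
# coupling term log(p/(1-p)) is no longer linear in the stimulus s.
s = np.linspace(-2, 2, 9)
p = 0.5 * (1 + np.vectorize(erf)(s))
for r in (0.0, 1.0):
    assert np.allclose(bernoulli(p, r), exp_form(p, r))
print("exponential form holds for probit tuning as well")
```

This matches the lecturer's answer: any binary model admits the exponential form, but only the logistic model makes the sufficient statistic linear in β w · s.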