Okay, so I think we can start. Good afternoon. Thank you for coming to our final lecture on this topic. Today I will discuss, briefly in the first part of the presentation, how structuring neuronal responses in a way that allows for lossless transformation, which is what we discussed last time, leads to hyperbolic geometry. The second part is how to quantify the information conveyed by large neuronal populations. The third part is a comparison with retinal arrays. And the last lecture, which we will merge with this one depending on how many questions there are, covers evidence for hyperbolic geometry at different stages of biological circuits: in the stimuli, in perception, and by implication within the neural circuits themselves. So last time we discussed that... well, technically I said that if the response is a logistic function of the stimulus — and Matteo pointed out that for general binary responses the expression will be more complicated but will also work — then we can construct an information-preserving population vector of this kind. In order to read out neural responses without loss, we can take this construction, which is linear in the neural responses. It is not linear in terms of the stimuli, but it is linear in terms of the neural responses: you add up the binary responses times the argument of what goes on inside the logistic function. If this is not a pure logistic function, then this function will be an approximation to that argument: taking one over the probability, subtracting one, and taking a logarithm gets you to the equivalent of that argument. That leads to a vector with the same dimensionality as the stimulus. So its dimensionality doesn't grow with the number of neurons, but the number of values it takes grows exponentially with the number of neurons. And now monitoring this vector t allows us to capture whatever information is contained in the neural responses.
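Since this construction is central to what follows, here is a minimal numerical sketch of the information-preserving population vector for logistic neurons. This is my own illustration, not code from the lecture; the weights, biases, and function names are hypothetical. The `logit` function is the "one over the probability, minus one, log" step mentioned above.

```python
import math
import random

def logistic(x):
    """Logistic nonlinearity: probability of spiking given drive x."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic: recovers the argument from a probability."""
    return -math.log(1.0 / p - 1.0)

# Hypothetical population: neuron i emits binary response r_i = 1 with
# probability logistic(w_i * s - b_i) for a scalar stimulus s.
def sample_responses(s, weights, biases):
    return [1 if random.random() < logistic(w * s - b) else 0
            for w, b in zip(weights, biases)]

# Information-preserving readout: linear in the binary responses, with
# each response weighted by the slope of its logistic argument.
def population_vector(responses, weights):
    return sum(r * w for r, w in zip(responses, weights))
```

The readout is linear in the responses but not in the stimulus; for a one-dimensional stimulus it reduces to a weighted population count.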
So it's not that there is no loss during the encoding from stimuli to neural responses; rather, there is no loss at the readout. There is a compression from stimuli to neural responses, but whatever the neural responses contain, you can read out using this vector t. And this is meant as an illustration that this vector t will take many, many values if our neural population is large, but you can coarse-grain them into a smaller number of bins, and we will see that this leads to an efficient way of capturing information with a smaller number of bins — in this case, say, 9 bins: 3 along the x-axis and 3 along the y-axis. So now I would like to discuss the connection between this information-preserving population vector and hyperbolic geometry. We discussed that we have stimuli that range between minus infinity and plus infinity, but because we put things through a logistic nonlinearity, which is true for most neurons, this population vector is in general limited in its magnitude. So what's shown in panel A: you have a stimulus that technically ranges between minus infinity and plus infinity. But you have a finite number of neurons — I think in this case it's 16 neurons — and they're all aligned. In this case the stimulus is one-dimensional, so instead of a population vector we are talking about a population count. And how does the population count change as we vary the stimulus along this axis? Because of the nonlinearity in neural responses, the population count is a nonlinear function of the stimulus. And it is compressed, because the infinite range of stimuli is transferred to a finite range in terms of the neural responses. The larger the number of neurons, the bigger this range. In another view, we now have multi-dimensional stimuli. What is shown on the right is the compression that occurs when we show stimuli in different directions. In this case, this is an example input distribution, a Gaussian input distribution.
And if you show the stimuli along the axis that has larger variance, then that component of the population vector will have a larger amplitude than if you do it along the shorter, minor axis of the input distribution. But in both cases, you are compressing what is an infinite plane into a finite radius. So now, some background about hyperbolic geometry and why this compression resembles what we get there. Hyperbolic geometry is a non-Euclidean geometry, and there is no perfect model for visualizing it in the ordinary plane. However, there are two representations that are used most commonly, and they both bear Poincaré's name. The one on the left is called the Poincaré half-plane, and the one on the right is the Poincaré disk. We can start with either one of those. If we talk about panel A, it illustrates that — technically, you see this grid, and the grid is getting smaller as we approach smaller values of y. The reason is the metric: the continuous metric is given by ds² = (dx² + dy²) / y². As a result, if I want to go between two points A and B, the shortest path in terms of minimizing this distance goes up into the plane, towards larger y values. So why is this a popular metric? One reason, as you can see, is the connection with hierarchical networks: if we discretize the space into these squares and place units at the center of each square, then the geodesic that goes between two points A and B approximates the distance that we would get if we computed along the network formed by the centers of the squares. In this simulation it doesn't match exactly. But if, for example, you take this point here, A — okay, between two points on the cursor, if it is the laser pointer, it moves slower. Does anybody know how to undo the pointer options?
No, no, we're stuck with this, but it's just slow. So between two points A and B, in order to compute the distance along the network, you have to go up the tree and then back down, and technically the distance will be the number of steps. That is approximated by this trajectory, and the solution for the Poincaré half-plane is a circular arc that approaches the line y = 0 at 90 degrees. So that's one representation of the Poincaré hyperbolic space, and it emphasizes the link with hierarchical networks. The one that I think is closer to our neuronal case is the Poincaré disk. In this case, it's a spherically symmetric model: we take the infinite plane and apply a tanh transformation to the radius, so that all points land within a radius of one. Infinity, shown here as a dashed circle, is unattainable — it becomes the circle of radius one. Some sample geodesics are shown in red. In this case, instead of going straight between points, the line of shortest distance is attracted to the center of the space. For points that are separated by a large enough angle, it is faster to go towards the center of the space and then back out, so the distance between most pairs of points will be close to twice the radius. You can also see that if points are sampled uniformly in Euclidean space, then because of the compression, the density of points increases exponentially with the radius. This is now similar to our population vector transformation, because here, for example along the y-axis, there is a compression that is limited by the number of neurons, and the number of different states for this vector increases exponentially as we approach the boundaries of the space.
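To make the disk model concrete, here is a small sketch — my own, with hypothetical function names — of the tanh compression of the plane into the unit disk, together with the standard hyperbolic distance formula for the disk model (curvature −1).

```python
import math

# Map a point of the Euclidean plane into the unit (Poincare) disk by
# compressing its radius with tanh; the plane's "infinity" becomes the
# unit circle, which is never reached.
def to_poincare_disk(x, y):
    r = math.hypot(x, y)
    if r == 0.0:
        return (0.0, 0.0)
    scale = math.tanh(r) / r
    return (x * scale, y * scale)

# Hyperbolic distance between two points inside the unit disk:
# d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))).
def poincare_distance(u, v):
    duv = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    nu = 1.0 - (u[0] ** 2 + u[1] ** 2)
    nv = 1.0 - (v[0] ** 2 + v[1] ** 2)
    return math.acosh(1.0 + 2.0 * duv / (nu * nv))
```

With this map, a point at Euclidean radius r lands at hyperbolic distance 2r from the center, and for two antipodal points the geodesic runs through the center, so their distance is the sum of their two radial distances — the "go towards the center and back out" behavior described above.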
So one can see that if you have a binary neuron, the probability of spiking is limited between zero and one, so in general there will be some kind of nonlinearity like this. And then the transformation to this information-preserving population vector t will have a compression that is similar to the Poincaré representation. So now I'll stop and ask for questions. — Can you maybe repeat this last part, the link between the hyperbolic geometry and the compression into the population vector? — So what is shown here on the left is an example; maybe I can put it in a different way. In this particular case, all neurons have zero thresholds, so they're all aligned, and they all have this tanh nonlinearity. In this simplified case, we said that the neurons are sensitive to the same direction: their receptive fields are aligned, and the stimulus is also along that line. So it's a one-dimensional stimulus and a one-dimensional population vector, which becomes a population count. Now, because of this tanh nonlinearity for the individual neurons, and because they are being added, the population count also inherits the tanh nonlinearity. So there is a compression from the stimulus, which technically ranges between minus infinity and plus infinity, to a finite range. And the range depends on the number of neurons that you have, because the maximum value of the population count is reached when all neurons are on. In this case there are 16 neurons, so the maximum value will be 16. If we work in the case where neural responses take values minus 1 and 1, then the smallest value the population count can have is minus 16. Now, if you think about the possible values that the population count can take, they are not distributed uniformly along this axis.
So if we project the density of states along the y-axis, you will see that there are peaks at minus 16 and plus 16. If you recall, a few lectures back Matteo made a histogram, and for a single binary neuron the density of states is just two states, minus 1 and plus 1. If you have more neurons, this spreads out, but still most of the states are in this region, and you can show that the number of states increases exponentially away from zero. And the transformation that we have, this tanh transformation, is the same transformation as in the Poincaré construction in panel B. In other words, what we are showing here with neural responses to one-dimensional stimuli is a cross-section of the Poincaré disk along one particular chosen diameter. Let's take more questions. — So I tried to represent on the blackboard what you just said: essentially this nonlinearity transforms a PDF on the real axis into a PDF on the response axis, on the vertical axis, and it has more concentration towards the endpoints, right? This is what you wanted to... — Yes, that's right. Thank you, it looks good. So this is for one-dimensional stimuli. Now imagine that you have an isotropic distribution in, say, two dimensions, and we apply this tanh transformation, because imagine that now the receptive fields are uniformly distributed along the angular direction. Then there would be a maximum... So here you go through this nonlinearity, and then your points will be closer to the border, right? So that's the connection: the binary nature of neurons, when you have more neurons, leads to this exponential concentration of states. And we need this in order to achieve the readout — the readout that is information-preserving from the neural population takes us to this hyperbolic space.
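The concentration near the boundary can be checked in a few lines. This is my own sketch, not from the lecture: it uses the deterministic caricature count(s) = N·tanh(s) for N identical aligned neurons, and the width of the Gaussian stimulus distribution is an arbitrary choice.

```python
import math
import random

random.seed(0)
N = 16  # number of identical binary neurons, as in the lecture's example

def population_count(s):
    # Expected population count when all N neurons share one tanh
    # nonlinearity and the same preferred direction.
    return N * math.tanh(s)

# Push a wide stimulus distribution through the nonlinearity; the
# compressed values pile up near the boundaries at -N and +N.
samples = [population_count(random.gauss(0.0, 3.0)) for _ in range(10000)]
frac_near_boundary = sum(1 for c in samples if abs(c) > 0.9 * N) / len(samples)
```

Even though the stimulus density peaks at zero, well over half of the compressed values end up in the outer 10% of the range — the two peaks at ±16 described above.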
So the summary of this first part is that if we want to work with variables that allow for lossless information transmission within neural circuits, then we have to transform from stimuli to a hyperbolic representation of the stimuli. And this, I think, overlaps with previous results — not exactly information-theoretic ones, but the practical observation that if you structure the connections within a network according to a hidden hyperbolic structure, as in the internet, then this allows for efficient routing of information. There is a series of papers by Krioukov and others about the hyperbolic geometry of the internet, or the hyperbolic geometry of complex networks. And the third reason is that hyperbolic geometry, as we discussed, approximately describes hierarchical, tree-like networks. So think about a tree here: what we have at the sensory periphery is derivative information. We do not have access to the underlying tree; we receive derivative information, either in terms of pixels or, in the case of smell, in terms of molecules. And then you would like to compute the distance between them. One measure of distance that you can choose is the one that reflects this underlying tree: two molecules are more similar according to how deep into the tree one has to go to find their common ancestor, or root cause. So that's one of the connections between hyperbolic geometry and lossless information transmission. Any questions on this part? Anything that should be expanded upon, that you would like to know more about? There is a picture of a hyperboloid, a surface of constant curvature. Anything else we should expand upon? I can't quite see the audience, but... — Yes, everything seems clear, if not completely. — So now another point: what if you would like to ask what the optimal neural code is? One could make statements.
Imagine that you have a given number of neurons and a given radius of the hyperbolic space. We know that the number of states grows exponentially with the radius. So one can turn this around and say that the optimal radius of the hyperbolic space should be approximately the logarithm of the number of neurons you have. The reason is that if you have a small number of neurons and a large hyperbolic space, then you won't be able to sample it fully. So in order to achieve reasonably uniform coverage of the space, the number of neurons has to be related to the size of the space. And incidentally, this is also the amount of information that you get when observing discrete responses. There is an expression — you can look it up in the Bialek textbook — for the maximum information one can obtain by observing n binary responses; up to corrections, in the limit of large n it grows logarithmically with the number of observations. Now I would like to discuss practical strategies for quantifying the information conveyed by large neuronal populations. Even though we have this nonlinear transformation, the information can still be computed as information between just two vectors, but we will need to be careful to discretize the number of states appropriately. Other than that, it's information between just two vectors. As we discussed last time, the information-preserving population vector has the same dimensionality as the stimulus — technically, as the set of receptive fields, but we are only probing receptive fields with stimuli, so they cannot have larger dimensionality. What is interesting is that the dimensionality of this vector does not increase with the number of neurons; only the number of values this vector takes does. So now, how to compute this in practice? We will be computing it for a thousand or more neurons, and the approximation is as follows.
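The matching argument — radius of the space should scale like the logarithm of the number of units — can be checked numerically. A sketch under the assumption of curvature −1; the area formula for a hyperbolic disk is standard, but the function names and the unit-area-cell picture are my own.

```python
import math

def disk_area(r):
    # Area of a hyperbolic disk of radius r at curvature -1:
    # 2*pi*(cosh(r) - 1), which grows like e^r for large r.
    return 2.0 * math.pi * (math.cosh(r) - 1.0)

def matched_radius(n_cells):
    # Radius at which the space holds n_cells unit-area cells;
    # inverting the area formula gives r ~ log(n_cells) for large n.
    return math.acosh(1.0 + n_cells / (2.0 * math.pi))
```

Increasing the number of cells tenfold increases the matched radius by roughly log(10), which is the logarithmic scaling claimed above.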
So, one: the full information that we are interested in is the information between two vectors. They have many components, so we will leave aside the vector t for now and think about decomposing the vector s. You can write this as a sum of conditional information values: the sum over components of the information between one component of s and the vector t. For the first component we don't condition on anything — when d is equal to 1, there is no conditioning. Then we add the information between the second component and the vector t, now conditional on s1, and so on: you gradually add the other terms, each time conditioning on the previous components of the input. Any questions about this expression? Is everybody familiar with the chain rule for information, or would it be helpful to write it down on the board? — Maybe I'll just write it explicitly. So this is the mutual information between s1 and t, plus the mutual information between s2 and t conditional on s1, plus... and you go on like this by components, right? — Yeah. — The last one is the mutual information between sd and t conditional on s1 through s(d−1). And this relation is exact, as you can check from the chain rule of probability. The full information is just the expected value of the log of the joint probability of t and s divided by the product of the probability of t and the probability of s. As an exercise, you can show that you can decompose this into all these parts. For example, you can write this as the expected value of log p(s given t) divided by p(s), right? And then use the chain rule on this — the fact that p(s given t) can be written as p(s1 given t) times p(s2 given s1 and t), et cetera. Right, Tanya? — Yes, thank you. — OK, so now about the number of terms. Each of the terms one can compute, but their difficulty increases.
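Since this chain rule is the backbone of all the approximations that follow, here is a self-contained numerical check of I(S;T) = I(S1;T) + I(S2;T|S1) on a toy example of my own (two uniform binary stimulus components with T = S1 + S2); the distribution and function names are not from the lecture.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0.0)

def marginal(joint, idx):
    """Marginalize a joint over outcome tuples down to the given indices."""
    m = defaultdict(float)
    for outcome, q in joint.items():
        m[tuple(outcome[i] for i in idx)] += q
    return m

def mutual_info(joint, ix, iy):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return (entropy(marginal(joint, ix)) + entropy(marginal(joint, iy))
            - entropy(marginal(joint, ix + iy)))

def cond_mutual_info(joint, ix, iy, iz):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return (entropy(marginal(joint, ix + iz))
            + entropy(marginal(joint, iy + iz))
            - entropy(marginal(joint, ix + iy + iz))
            - entropy(marginal(joint, iz)))

# Toy joint over (s1, s2, t) with t = s1 + s2 and uniform binary inputs.
joint = {(0, 0, 0): 0.25, (0, 1, 1): 0.25,
         (1, 0, 1): 0.25, (1, 1, 2): 0.25}

full = mutual_info(joint, [0, 1], [2])
chain = mutual_info(joint, [0], [2]) + cond_mutual_info(joint, [1], [2], [0])
```

Here the unconditioned term I(S1;T) contributes 0.5 bits and the conditional term I(S2;T|S1) contributes 1 bit, and their sum matches the full information exactly, as the chain rule requires.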
So the first term, which is not conditional on anything, is relatively easy to compute. But the subsequent terms with conditional information are more difficult, because they are based on conditional probability distributions where you have to sample many different components of the stimuli together. Therefore we will make approximations. The first approximation is actually exact if the inputs are isotropic. Thinking of our Poincaré disk: if the stimuli are uniformly distributed in terms of angles and the neuronal receptive fields are also uniformly distributed in terms of angles, then one can make a simplification — instead of all the components of the vector t, what we need are two scalar values. One is the component of t along the stimulus component currently being considered; the other is the magnitude, the square root of the sum of squares, of all the components beyond it. This simplifies the calculation in terms of the number of components of t, so we call it the isotropic approximation. The next approximation is to drop this square-root term. We couldn't justify it; we just try it and see how it performs. And the third approximation is to drop all the conditioning. In this case it is exact if the input distribution factorizes — say, an exponential distribution in x and an exponential distribution in y. So these are the three approximations, and what we will see is that they are progressively easier to compute and progressively less accurate. How accurate, we can examine. I will maybe skip some slides here. — We have a question. Are they approximations, or exact when, for example, we have an isotropic distribution? They should be exact in certain cases. — For the isotropic case, they are exact. But if the stimulus distribution is not isotropic, then it will be an approximation.
Okay, so we also use them in other cases as approximations. Right — in other cases you can use it as an approximation, but it should work well for stimuli that are approximately isotropic, or, for the third approximation, if the inputs are independent. — The other question you can ask is: when they are not exact, do they provide an upper bound or a lower bound? Can one show which it is? — What do you think? I think in most cases it's a lower bound, but we will see in the simulations; it depends on the basis. For example, with natural stimuli you can have a basis of Fourier components or principal components, and the case where they are closest to independent is the independent components of natural scenes. We will see that it's a lower bound when we use the independent-component approximation, but it can be larger than the true information when the basis is not independent. Okay. In parallel, we also tried another method that is not based on this approximation but on full sampling — a kind of direct computation. Maybe I will just go over this calculation. We call it a Monte Carlo estimator of mutual information, and it only works in the case where the neurons are conditionally independent. As we discussed in previous lectures, the information is the difference of two terms: the entropy of responses, H(R), and the conditional entropy of responses given the stimulus, H(R|S). Let's see, maybe I have an expression here... No — Matteo, would you mind writing it down? We need: the information between R and S is equal to H(R) minus H(R given S). So there are these two terms.
If we assume that neurons are conditionally independent — which is one of the assumptions we have been making, although sometimes removing it — meaning that the neural response depends on the stimulus but not on the responses of other neurons conditional on the stimulus, then H(R given S) is a sum across neurons. In other words, we are saying that the noise is independent across neurons, so this term becomes a sum over all neurons, which makes things a little bit easier. So this is this slide here. When J is equal to zero — when neurons are conditionally independent — there is an exact expression for this entropy: it's a sum over neurons, and one can write it as the average firing rate for a given stimulus times the argument of the logistic function, minus a function of the stimulus. So that is now a reasonable computation. I should pause here and clarify two things. The advantage of this formula for the information between S and T is that technically it is valid even for neurons that are not conditionally independent: we can have correlations between neurons — correlated variability — even when the stimulus is fixed. But we want to compare this expression to the ground-truth information for many neurons, and that expression we can only evaluate when the neurons are conditionally independent. That is why, in parallel with these approximations, I'm discussing this Monte Carlo estimator that only works for conditionally independent neurons. I'm hoping there will be questions so that I can clarify the logic. So this equation is valid even when neurons are not conditionally independent, because the derivation is between S and T: as we said, it's an information-preserving population vector, and some noise correlations are allowed.
And to complement this, we will discuss a Monte Carlo approximation that only works when J is equal to zero. In this case there is a relatively easy expression for the conditional entropy H(R|S) — one that is not exponential in the number of neurons. As for H(R), that one has to be approximated, and it can be approximated without bias by an empirical average. And we can check this — let me just show you the sampling behavior of the Monte Carlo estimator. It is unbiased, as shown here: the estimated bias between our estimator and the true information is centered at zero, meaning there is no bias in this Monte Carlo estimator. The disadvantage is that, although it is unbiased and valid for general nonlinearities, it requires the neurons to be conditionally independent, and it also requires precise knowledge of the neuronal nonlinearities. That's why we talk about the approximations. I'll skip this slide for now — these are the approximations we talked about. So now I will show you the results of a simulation. In this case, the number of neurons ranges between, say, 100 and 1,000. This is for stimuli that are three-dimensional; the stimuli are uncorrelated and isotropically distributed on a sphere — you can have the Poincaré ball not only in two dimensions but also its generalization in higher dimensions. The information is on the y-axis. The black line is the Monte Carlo estimator, which technically should be unbiased as long as the noise correlations are zero. And then we check this against our three approximations: the isotropic approximation, which should be exact in this case because the stimuli are isotropic, and then the approximations to it, which are component-conditional and component-independent.
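A runnable sketch of this Monte Carlo estimator may help. This is my own toy version (three hypothetical logistic neurons with made-up weights, Gaussian scalar stimuli), not the lecture's code: H(R|S) is computed exactly as a sum of Bernoulli entropies because the neurons are conditionally independent, while H(R) is estimated by Monte Carlo, averaging the conditional p(r|s') over stimulus samples to get the marginal p(r).

```python
import math
import random

random.seed(1)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical population of three conditionally independent neurons:
# neuron i spikes with probability logistic(w_i * s) for scalar stimulus s.
weights = [0.5, 1.0, 2.0]

def spike_probs(s):
    return [logistic(w * s) for w in weights]

def sample_r(s):
    return tuple(1 if random.random() < p else 0 for p in spike_probs(s))

def bernoulli_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def p_r_given_s(r, s):
    # Probability of a full response pattern: a product over neurons,
    # because the neurons are conditionally independent given s.
    out = 1.0
    for ri, p in zip(r, spike_probs(s)):
        out *= p if ri else (1.0 - p)
    return out

stimuli = [random.gauss(0.0, 2.0) for _ in range(2000)]

# H(R|S): exact sum of Bernoulli entropies per stimulus, averaged over s.
h_r_given_s = sum(sum(bernoulli_entropy(p) for p in spike_probs(s))
                  for s in stimuli) / len(stimuli)

# H(R): Monte Carlo -- for each sampled (s, r) pair, estimate the marginal
# p(r) by averaging the conditional p(r|s') over the stimulus samples.
n_outer = 200
h_r = 0.0
for s in stimuli[:n_outer]:
    r = sample_r(s)
    marg = sum(p_r_given_s(r, s2) for s2 in stimuli) / len(stimuli)
    h_r -= math.log2(marg)
h_r /= n_outer

info = h_r - h_r_given_s  # estimate of I(R;S) in bits
```

Note how both disadvantages mentioned above show up directly: the product form of `p_r_given_s` is only valid for conditionally independent neurons, and the whole computation requires knowing `spike_probs` — the neuronal nonlinearities — exactly.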
So one can see in practice that these are less accurate, but they still preserve about 80% of the information for a large number of neurons. Once again, the advantage of these approximations is that they should also work for real neurons that have some noise correlations, whereas the full information estimate will be affected by the presence of noise correlations. Any questions about this simulation? — Yeah. If I remember correctly, you said that in order to have the information-preserving population vector, we need to assume that the distributions of the responses conditioned on the stimuli are uncorrelated, right? Because we say that there must be some logistic functions, and then we write the full distribution as their product. — So that was the derivation, but actually you can have some correlations. Let's go over this slide. I'm not sure I have this slide — I'm using a somewhat different computer — so I think I will ask Matteo to write on the board; or if anybody can pull it up on Slack, it's the slide from last year. So we are discussing this, right? We were saying this is the product over all neurons of the probability of R_i given S. And now you are asking: what if these are not independent? — Yeah, what I mean is that, based on what you wrote, they are independent, because we wrote the full probability as the product of the single probability of each neuron, right? — Yes. — So we are assuming that they are conditionally independent? — Yes, here we are assuming that they are conditionally independent. However, because they are conditionally independent, we can write this as a function A of S times a function F of R times e to the vector T dot vector R — something like this, Tanya? Just a second. So that's the sum over i of R_i f_i(S), minus A_i(S). Yeah, something like that.
So then imagine — the sum of R_i f_i(S) is our information-preserving population vector, this part here. — This is the T_i? — Yeah; R_i is inside the T, and then... Yes, okay, S multiplies... I'll leave it as an f_i. So this sum of R_i times f_i(S) is, or can be written as, the scalar product of vector T with vector S. So far, these are conditionally independent neurons. Now imagine that we modify our final probability distribution — yes, what you're saying is that here you can introduce correlations, interactions between the different R_i. As long as these do not depend on the stimulus, T is still a sufficient statistic. — Okay, so the only requirement is that the joint conditional probability can be written in the exponential form, as written on the board, right? — Right: you add a term sum over i, j of C_ij R_i R_j, minus A(S). As long as this C_ij does not depend on S, then T is still a sufficient statistic, and so you can also have responses that are correlated. Right, Tanya? — Yes. Okay, thanks. — So therefore we can use these approximations even in that case. And also, if you go back to the slides of the previous lecture: once you have these correlations, this C_ij in this form, then if I plot the firing rate of one neuron, because of correlations with other neurons that have different preferred stimuli, its tuning curve can have a very strange shape. If a neuron is correlated with another neuron that is tuned to a different stimulus, it can have a very strange, multimodal tuning curve.
But if you write it in this exponential form, then one doesn't need to be confused by the complicated shape of a single neuron's tuning function. So thank you for this question. Then we had a question in the chat — also an interesting observation: if you look at this slide, the information doesn't really increase with the number of neurons, right? There are two comments I can make; maybe there are more one can think of, but these are the two I have. First, the information is roughly logarithmic in the number of neurons. Second, in this simulation we took the probability distribution of stimuli as shown here — uniformly distributed on a sphere — and we took some distribution of neural thresholds; I think they were all set to the same value. So while we can read out what this neural population carries, the thresholds have not been optimized to convey the maximum possible information. This simulation checks the accuracy with which we can read out whatever information content there is in the neural population, but it does not represent the maximum these neurons could convey, because the positions of the thresholds were not optimized for the stimulus distribution. Any other questions? That's another point. Then we can check the other approximations. And here is another potentially interesting comment: in this case we have two-dimensional stimuli, and there are only nine neurons, but the distribution of receptive fields is asymmetric. As we discussed, for example with natural scenes, the optimal distribution of receptive fields is not uniform. So our isotropic approximation doesn't have to work well — although in this case it continues to work well. Here we have the Monte Carlo estimator in black.
And this is the full information between the vector sufficient statistic T and the stimulus S. Here are the two component-based approximations — there is no isotropic approximation here, but component-conditional and component-independent. This also raises an interesting question of how to order the components, because in the equation on the board for the information decomposition we haven't talked about the ordering, whether there is any preference. Intuitively, one could guess that you should start by evaluating the components that have the largest variance. I don't have analytical results supporting this, but we have empirical simulations showing that it is better to start the computation with the component of largest variance. This is shown here in the two red lines; the blue line is the component-independent calculation. In that case it doesn't matter in what order you evaluate the information between S1 and T1 and between S2 and T2 — you can order them any way. But with the component-conditional approximation, it does matter. What is shown on the x-axis is the ratio of the standard deviation of component one to that of component two. In the region where the first stimulus component has smaller variance than the second, the dashed line is better, and it corresponds, I believe, to starting with component S2 and then S1: we first evaluate the information with respect to the component of largest variance and then add the component of smallest variance, conditional on the larger-variance one. I think there are themes here similar to the renormalization-group approach, for those of you who are familiar with it: we first evaluate information about the smoothly varying components of the stimulus, and then transition to evaluating information about finer-grained details conditional on the value of the smoothly varying component.
And then it switches: when you cross this line, the standard deviation of component S1 is larger than that of S2, and then it's better to start the computation with S1, followed by S2. Any questions about that? No, it looks like no questions. So, for example, I often think about these components as Fourier components: S1 will be the lowest frequency, S2 the next frequency, and so on. So when we think about evaluating information between a natural scene and the neural response, we will first take information about the mean luminance, adding information about T, which is the average population count across the array, and then move to higher frequencies. I will show you this slide; it's like taking a Fourier mode of the population array. And this slide is now higher-dimensional stimuli: natural stimuli, 100 dimensions. Without any approximations we actually can't do anything here except the independent-component approximation, because of the high dimensionality. And this is information about real neural responses: we fit their receptive fields and fit the nonlinearity. Even though the neurons are not conditionally independent, we try this Monte Carlo approximation assuming they are. And then you can see this question about the lower bound and the upper bound. If the components along which we are evaluating are independent, in this case ICA components, then this is a lower bound. But if I evaluate in a basis where they're not independent, as with PCA components of natural scenes, then you can have more of an overestimation of the information. Any questions about this graph? So this is the summary so far, and now a practical calculation for retinal arrays, where these components will actually be the Fourier components of the stimulus.
And the goal of this computation is to show you that the results we obtained with two neurons and cell-type specialization actually hold in the case of multiple large neural populations, each of which represents a single cell type. So if you like, we can have a short break; I'll leave the decision to the group. Do we want to take a break? Yes, let's take five minutes or so. Okay, we are back in the retina, and we will be applying the formula that we derived, now optimizing the parameters of the neuronal thresholds, but using whole arrays. So the question is, as we discussed: in the retina the stimulus is separated into the on channel, for encoding increments of light intensity, and the off channel, for encoding decrements. And the experimental observation is that the off neurons split into subtypes, called adapting and sensitizing. For the purposes of our analysis, adapting is a high threshold and sensitizing is a low threshold, and they form overlapping mosaics. The question is why the on channel doesn't split while the off channel does. Previously we only talked about encoding with, say, two neurons with the same threshold versus two neurons with different thresholds. But now the computation is more complicated, because I can have the same number of neurons, but in the case of the green mosaic that encodes with a single on cell type, the neurons have slightly different response regions. So they provide higher spatial resolution, but they will not provide as fine an intensity resolution.
But the advantage of the overlapping mosaic is that although it has lower spatial resolution, it has higher intensity resolution: we can better tell apart differences in the values of light intensity. So which of them is better? That computation depends on the structure of the stimuli and on the parameters of the nonlinearities. To foreshadow some results that we will obtain: suppose the stimulus is white noise, like TV flicker, completely uncorrelated in space. In that case, you might be able to guess that the solution with the single higher-density array is better, because I'm getting independent information at different pixels. But in the case of natural scenes, which are smoother and more correlated, it's possible that I will have higher overall information with overlapping arrays, because the stimulus is correlated and there is not much variance in the spatial domain; instead I can use the two neurons allocated to the same location to better report the actual value of the light intensity. So you can think of this calculation as asking about the tradeoff between intensity coding and spatial-resolution coding. And now we will apply our prescription. In our case, the stimuli are translation invariant, and therefore the stimulus distribution factorizes in the Fourier basis. Natural stimuli also have a power spectrum of 1 over k squared, so we have a separation of scales, and based on the previous calculation we know that we should start by computing information with respect to the zeroth component and then go to higher frequencies. And because this is a large computation, with a very high-dimensional stimulus, the only approximation we can do is the independent approximation. So instead of talking about the information between s and r, we talk about the information between the stimuli s and this information-preserving vector t.
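The contrast between white noise and the 1/k² spectrum of natural scenes can be illustrated with a toy one-dimensional signal (my own construction, not the lecture's stimuli): drawing random phases with amplitude proportional to 1/k gives a power spectrum proportional to 1/k², and the resulting signal is strongly correlated between neighboring pixels, while white noise is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4096
k = np.fft.rfftfreq(n)                 # non-negative spatial frequencies
amp = np.zeros_like(k)
amp[1:] = 1.0 / k[1:]                  # amplitude ~ 1/k  =>  power ~ 1/k^2
phases = np.exp(2j * np.pi * rng.random(k.size))
natural_like = np.fft.irfft(amp * phases, n)   # smooth, spatially correlated
white = rng.standard_normal(n)                 # spatially uncorrelated

def lag1_corr(x):
    """Correlation between neighboring samples of a signal."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).mean() / x.var())

print(lag1_corr(natural_like))  # close to 1: neighbors are redundant
print(lag1_corr(white))         # close to 0: neighbors are independent
```

This is the intuition behind the tradeoff: with the 1/k² stimulus, neighboring pixels carry largely redundant spatial information, so allocating two neurons to the same location to refine intensity coding can beat spreading them out.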
But in addition, we say this is just a sum over Fourier components, between the corresponding Fourier components of the stimulus and of the population vector. So what is t_k? The claim is that it is a Fourier transform of the activity of the neuronal array. Here is the derivation. It's a sum over neurons: the response of the j-th neuron times, as we know, the receptive field of the j-th neuron. And what is that? It's the Fourier transform of a standard receptive field, which is maybe a Gaussian and is known, times, because each neuron is centered at a different position, a Fourier phase factor, e to the i k times the center of that receptive field. So the k-th component of the information-preserving population vector is, as you can see from this expression, a Fourier component of the weighted activity in the neuronal array. Any questions about this statement? So then, some results. What we are asking is, as a function of the noise level of individual neurons and their average firing rate, or average spiking threshold: what is shown in the color scale is the difference in information conveyed between overlapping arrays and a single higher-density array. There is a curve, denoted by the white line, that separates the positive values, where the overlapping arrays are better, from the negative values, where the information in the high-density array is better. You can see the echo of the same result that we had with two neurons: when the noise level is large, one neuron per location wins, compared to when the noise level is small. So that's the next step beyond the calculation with two neurons. And now one can look at different cell types in different species. These are in the salamander, these are in primates, and these are in guinea pigs. They have slightly different filtering of the incoming noise.
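As a numerical sanity check on this statement, here is a small sketch with a hypothetical one-dimensional array of Gaussian receptive fields (my own toy parameters): the closed form t_k = ŵ(k) · Σ_j r_j e^(−ik x_j) matches a brute-force Fourier integral of the summed weighted activity profile a(x) = Σ_j r_j w(x − x_j).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, L = 1.0, 32.0
centers = np.linspace(8.0, 24.0, 9)                 # neuron centers, away from edges
r = rng.integers(0, 2, centers.size).astype(float)  # binary responses r_j
k = 2 * np.pi * 3 / L                               # one Fourier mode

# Closed form: t_k = w_hat(k) * sum_j r_j exp(-i k x_j), where w_hat(k)
# is the continuous Fourier transform of a Gaussian receptive field.
w_hat = sigma * np.sqrt(2 * np.pi) * np.exp(-0.5 * (sigma * k) ** 2)
t_k = w_hat * (r * np.exp(-1j * k * centers)).sum()

# Brute force: Fourier integral of the weighted activity profile
# a(x) = sum_j r_j w(x - x_j), approximated on a fine grid.
x = np.linspace(0.0, L, 8192, endpoint=False)
dx = x[1] - x[0]
profile = sum(rj * np.exp(-0.5 * ((x - cj) / sigma) ** 2)
              for rj, cj in zip(r, centers))
t_num = (profile * np.exp(-1j * k * x)).sum() * dx

print(abs(t_k - t_num))  # small: the two agree
```

The agreement is just the shift theorem of the Fourier transform: each receptive field contributes its own transform ŵ(k) multiplied by a phase factor set by its center.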
So the shape of the separation curve is different in different species. And here are some data for various cell types. This is the correct prediction: the off cell type is here, and the on cell type is here. And one could make separate predictions that certain cell types should split into two. For example, in the primate, we think the parvocellular off neurons should split further into subtypes; same thing here, and in the case of the salamander there are actually two different cell types. On the vertical axis, a high spiking threshold means sparse coding, in the sense that the rate of spikes is lower. As we discussed a few lectures back, we had this picture of the information surface as a function of noise; it was plotted for a given mean spiking threshold for the two neurons. But as you vary that threshold, the position of the critical point, which in this case is our white line, will move around. That's why, when we compare with data, it's plotted this way: as a function of the average threshold and as a function of the noise level. All right, so this might be a good place to stop. What I showed in these simulations is that the general conclusions that were obtained with two neurons generalize to a population of neurons, as in retinal arrays. And you can predict new subtypes: as we know, there is always a debate about how many different cell types there are in the retina or in the brain, and one can suggest looking for additional subtypes within the off types based on information theory. So the overall summary for this lecture today, I guess for the past two lectures, is that there is a prescription for reading out neuronal information without loss. You can approximate it in three ways, and maybe you can come up with additional approximations; this is just the starting point.
And these approximations differ in the computing resources they take, or equivalently in the number of stimuli you need, because it's not just the computing but also the statistical sampling of the underlying probability distributions. And the last part from today was the specialization between cell types, where we take into account the structure of the stimulus and the fact that the tradeoff is also between spatial resolution and intensity coding. So just to understand: from the data, from the salamander and the guinea pig experiments, what you get are the receptive fields, right? The w_i. And then these plots that you show are essentially the result of a numerical calculation, computing the mutual information. So the whole plot is pure simulation, pure analysis of information transmission, no data. The data comes in when we plot the observed average noise level and mean spiking threshold for the various neuronal types. Okay. So in this case, these are the two circles, and these are the two different types of off neurons. But then, for example, the white triangle represents the slow on neurons, and according to this description we think this will be a rare case where the on channel has to split into two: there should be another partner cell type, because it is optimal for it to coordinate its coding. So in this case this is the on neuron. The notation is that white means an on neuronal type, and black means an off neuronal type. Some of these cell types are known and some of them are not. Okay, so for example, yeah. Oh, go ahead. So, for example, so far the analysis from Steve Baccus's lab was either in the salamander or in the mouse, and this middle panel nearby is for a primate. So we see that indeed this off neuron is in the range where they split, and that one is in the range where they're not supposed to split.
But given that they are in the region where they're supposed to split, we think that there will actually be multiple cell types corresponding to that. Okay, and what is the sigma_n? Sigma_n was the noise, the effective noise after filtering. These neurons have different temporal filters, and we thought that because of the different temporal filters, the effective noise prior to the neuronal nonlinearity is different. Okay, so essentially what changes between these three plots is not only the value of sigma_n, right? It's also something to do with the arrangement of neurons, or... The temporal filters, yes. There is a photoreceptor and it has some noise in it, but the temporal filter is different, so the effective noise that comes in from the input can be smaller or larger depending on the filtering. And also the salamander, the primate, and the guinea pig operate at different temperatures. The salamander is a cold-blooded creature, so everything is slower and has larger noise, but it compensates for that with the filtering, so the noise story is complicated. On the other hand, at higher temperature there are more spontaneous photoisomerization events, so the primate has to pay with larger rates of photon shot noise. Okay, any questions? The next lecture will be the last lecture, and I will be showing you, hopefully with less math and more pictures, hyperbolic space and the evidence for hyperbolic geometry in the natural world, in human perception, and also in gene expression and other parts of biology. That's the plan for the last lecture. Okay, so thank you very much. If there are no more questions, then we meet again on Friday. Thank you again. Bye-bye.