OK. The next lecture is on hyperbolic geometry and information maximization in neural circuits. Thank you, we can start recording. So, as a reminder, we have covered so far entropy and the axiomatics of Shannon entropy, and last lecture we started talking about maximally informative nonlinearities, primarily in neural circuits. We discussed whether a neuron can transmit more than one bit of information, and today I will discuss the different limits and the impact of noise. We will talk about entropy and information for Gaussian variables, but before that, the weak-noise limit for estimating information, and a very important result from Simon Laughlin: that the optimal nonlinearity should be a cumulative distribution. This result has many applications and many names, because it was rediscovered in different subfields of physics. Then a few applications of this result; up to that point everything concerns information transmission by a single on-off channel, a single neuron. After that we will discuss information transmission by multiple neurons in the presence of noise, and how this leads to a theory of diversification in biology. That's the plan for today.

So, let's ask the question: we have one neuron, one on-off device, with a nonlinearity, and we would like to know what the optimal nonlinearity is. There are different answers depending on the amount of noise and on various parameters of the input distribution. Also, people define "optimal" in different ways; for the purpose of today's discussion, optimal means the nonlinearity that maximizes information transmission, but often the optimal nonlinearities are similar whether we maximize information transmission, minimize decoding error, or use other measures.

The first derivation says: suppose the noise is very small. As we discussed, information is the entropy of the neuronal response minus the entropy of the response conditioned on the stimulus. If the conditional term is much smaller, then to a first approximation information is just entropy, so we are maximizing the response entropy of the neuron. It can be written as a sum over states or, in the limit of smoothly graded responses, as the integral of -p(y) log p(y). The distribution of neuronal responses is shaped by the nonlinearity: the signal and its distribution are fixed, but when we choose different nonlinearities we get different resulting distributions of the response, and our goal is to maximize this entropy. When we maximize, we have the constraint that the probability distribution has to integrate to one; that is a general property of all probability distributions.
So, we use the method of Lagrange multipliers, and our objective is to maximize -∫ dy p(y) log p(y) - λ(∫ dy p(y) - 1), that is, the response entropy under the constraint that the integral of p(y) over y equals one. When we take the variation with respect to p(y), the entropy term gives -log p(y) and, from differentiating the p inside the logarithm, p(y)/p(y) times δp(y), an extra -1; the constraint term gives just -λ. Setting the total variation to zero, we find that p(y) must be a constant. I think this is an important result. We know that entropy is maximal for uniform distributions; restated in terms of our question, the optimal nonlinearity is the one for which all output states are equally likely. It makes intuitive sense: if a neuron has, say, four different response levels, then all of these levels have to be used equally often, otherwise we are not using them effectively.

Maybe, Tanya, can I make a plot? So the idea is that you have a stimulus x, it goes through a certain function, and you have y, and essentially what you want to maximize is the mutual information, which is the entropy of the output minus the entropy of the output given the input. Now we are working in the case where the second term is zero, where there is no noise. Yes. And so when we maximize this, we find that p(y) should be a constant; so if here you have x and here the output, the function may look something like this, and the idea is that the function should be such that, when you push the distribution over x through f(x), you get a uniform distribution, right? Yes. Okay, thank you. Yeah, I'm just restating that entropy is maximal for uniform distributions.

So that is the nonlinearity Matteo drew: consider the nonlinear transformation y = g(x) (in my slides g, in principle f). Now we can derive the form of this nonlinearity using the following property, which I find very useful whenever you work with probability distributions: if y is uniquely determined by x, because there is no noise, then p(y) dy = p(x) dx. If I fall in a certain interval dx of x with some probability, then that same probability weight goes into the corresponding bin dy. This is a very useful property, for example for generating random numbers with prescribed distributions. Now our goal is to have p(y) be a constant, so instead of p(y) we write a constant, and rewriting this we get dy/dx = const × p(x), or in other words y is proportional to the integral of p(x) dx. So maybe we can write that down: y = g(x) = const × ∫ from -∞ to x of dx' p(x'). In other words, the nonlinearity has to be the cumulative distribution of the inputs in order to maximize information. In this way, whatever the probability distribution on the x-axis is, as a result of this transformation you will have a uniform use of the responses.
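A minimal numerical sketch added to these notes (not from the lecture): passing samples of x through the cumulative distribution of p(x), here approximated by the empirical CDF, produces outputs that use all response levels equally, i.e. p(y) is approximately uniform. The exponential input distribution is an arbitrary choice for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary non-uniform input distribution for the demo.
x = rng.exponential(scale=1.0, size=100_000)

# The information-maximizing noiseless nonlinearity: the cumulative
# distribution of the inputs, here estimated by the empirical CDF.
x_sorted = np.sort(x)

def g(v):
    """Empirical CDF of the inputs, used as the neuron's nonlinearity."""
    return np.searchsorted(x_sorted, v, side="right") / x_sorted.size

y = g(x)

# The responses now fill all output levels equally: p(y) ~ uniform on [0, 1].
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
print(counts / y.size)  # each entry close to 0.1
```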
We had a useful question in the chat: why are we talking about a nonlinearity, and why are we introducing it at all? Because the neuron is an on-off device; and not just in neurons but in many areas of biology, if you have a signal such as the number of spikes a neuron produces, it cannot be negative, and it also cannot be infinite, because each neuron has a maximum rate at which it can respond. As a side note, let me draw a spike. As a function of time, the voltage has this characteristic shape; this is just a basic property of neuronal transmission. A neuron is a voltage-sensitive cell: to produce a spike, channels in the membrane, pores that are mostly closed, have to open so that one sort of ion flows in, then another set of ions flows out, the pores close, and that is the end of the spike. While this is happening, the neuron cannot produce another spike, so there is a minimum time it takes to produce a spike and to recover from it, and the neuron cannot fire at an infinite rate. The maximum firing rate depends on the neuron, but it is not larger than about one kilohertz. So from a theoretical perspective, if I think of a neuron as producing a firing rate as a function of the input x, the input can in principle range from minus infinity to plus infinity, but the firing rate has to be zero somewhere and has to saturate at some level, just by basic construction. So the neuron's responses have to include a nonlinearity, and the question is what form that nonlinearity takes. The adjustable parameters, which differ from neuron to neuron, are where to put the threshold, and the width of the transition region, which is sometimes called the neuronal noise; we will talk about that. There are also more complicated neurons, and in principle one can encounter nonlinearities that go up and then come back down. So the theoretical question is: if we want to understand the nervous system, why do some neurons choose to operate with one nonlinearity and other neurons with a different one?

Do we know p(x) a priori? That is another important question. p(x) is not fully known; to some extent it is known, and to some extent it is changing. But in any case it can be learned from past history: the neuron can adapt, can learn what p(x) is. Exactly; what I was going to say is that because p(x) is important, and we see that it determines the optimal response distribution, people study, as in the next few examples, what a typical p(x) for the natural world looks like. There are many regularities, and there is a substantial body of work which says: given the statistics of signals in the natural world, here are the neural response properties we expect to find.
So, one particular example; but first, as Matteo mentioned, this p(x) is not constant in time. For example, you can go from outside, where the light intensity is very high, to inside a room where the light intensity is low. For a neuron in the retina, its inputs are changing directly, and so the neuron has to adapt; if it does not, it will no longer be efficient. In other words, when p(x) changes, the nonlinearity has to change according to this prescription, and in particular the threshold has to change. Some slides that I did not prepare for this lecture, but have available, discuss understanding adaptation using these principles: once the input distribution changes, the nonlinearity has to change. And then there is another question: how soon can the animal detect that there has been a change in the probability distribution? That is a statistical question, and there are very interesting differences there.

Let me show you the example of a nonlinearity that I prepared; there is also another comment in the chat I will come back to. So this is an example of such an analysis. I told you this is Simon Laughlin's result, although other people have done similar work since. He looked at a neuron in the fly visual system called the large monopolar cell. Following the argument that the nonlinearity should be a cumulative distribution, the black curve is the nonlinearity of this neuron, the response divided by the maximum response as a function of the contrast of natural scenes; he then measured the light intensity in the natural world, and those measurements give the points with error bars. There is no fitting here, and that is why this result is considered a classic of computational neuroscience: you put together two things that on the surface are completely unrelated, the light intensity statistics of the natural world and a neuronal nonlinearity, and they match without any fitting.

Since we brought up the question of adaptation and statistical estimation: imagine that the contrast, plotted here, increased because we went from inside the room to outside. It turns out that statistically it is easier to detect increases in contrast than decreases. For example, if I have a sequence of signals with small contrast, and then all of a sudden a large contrast, the moment I see a large value I know it is extremely unlikely under my current model, so I know I have to change the model. On the other hand, if I go from large contrast to small contrast, a small value is still consistent with the large-contrast distribution, and only after I see a number of samples can I say that the statistics have changed. This predicts, in particular, that the time it takes a neuron to adapt from low contrast to high contrast should be shorter than the time it takes to adapt from high contrast to low contrast, and that is indeed true. I am just stating this; I do not have the slides here, but I can add them to these lecture notes. This is work by Michael DeWeese and Adrienne Fairhall, and also Bill Bialek, showing that the adaptation happens very fast.
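To make the asymmetry concrete, here is a small simulation sketch added to these notes (a toy model of my own, not the published analysis): Gaussian signals whose standard deviation switches, and a simple accumulated log-likelihood-ratio test that decides when the "new contrast" model beats the "old contrast" model. The threshold and contrast values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def samples_to_detect(sigma_old, sigma_new, threshold=5.0, n_trials=2000):
    """Average number of samples until the accumulated log-likelihood ratio of
    N(0, sigma_new^2) versus N(0, sigma_old^2) exceeds `threshold`, when the
    data are actually drawn from the new (post-switch) distribution."""
    times = []
    for _ in range(n_trials):
        llr, t = 0.0, 0
        while llr < threshold and t < 10_000:
            z = rng.normal(0.0, sigma_new)  # sample observed after the switch
            llr += (np.log(sigma_old / sigma_new)
                    + 0.5 * z**2 * (1.0 / sigma_old**2 - 1.0 / sigma_new**2))
            t += 1
        times.append(t)
    return np.mean(times)

print("low -> high contrast:", samples_to_detect(1.0, 3.0))  # detected after few samples
print("high -> low contrast:", samples_to_detect(3.0, 1.0))  # takes noticeably longer
```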
Nevertheless, the adaptation from low to high variance is faster than from high to low variance, and one can account for this using statistical estimation theory for how the parameters are updated. OK, thank you for the questions. Any other questions?

One of the properties of natural scenes is that the distribution of signals is not Gaussian. In this particular case it looks approximately Gaussian, but for other variables, not just light intensity, you start seeing larger fluctuations than you would expect for a Gaussian signal, and then the nonlinearity should differ from a cumulative Gaussian distribution.

Another example, a brief one, but I want to mention it because it relates to this question of the probability distribution. This is a picture we will analyze in a moment: the expression level of a gene in a fruit fly embryo, from the work of Bill Bialek and colleagues. In general, if we think about the differences between maximally informative transmission by a transcription factor in a cell and by a neuron in the nervous system, it is this. For the neuron, especially a sensory neuron, as we discussed, the input probability distribution is usually set by the outside world, and the neuron has to adapt appropriately, choosing the appropriate threshold and nonlinearity to achieve efficient coding. For the cell, how the transcription factor activates its target is often set by the biophysics of molecular binding, but what the cell can control is the distribution of the activating factor. So in that case one can reverse the question and ask: what is the optimal input distribution of an intracellular signaling molecule for a given activation function? In other words, for transcription factors the function g is fixed by biochemistry, and what evolution, what the organism, can adjust is the probability distribution of the transcription factor input. Yes.

So far this was the derivation that maximizes response entropy, the case where noise is negligible. We will slowly build up to larger amounts of noise and see how the answer is affected. Before doing that, I will go through the case of Gaussian noise and a Gaussian variable.

By the way, there are two Laughlins: this is not Robert Laughlin of the fractional quantum Hall effect, but Simon Laughlin, who does a lot of work on neuroscience and on metabolic costs in the nervous system.

So now we go through the case with noise; let me switch to full screen. We will discuss first the entropy of a Gaussian variable, the slide we discussed. It is useful to know, a piece of the theorist's toolbox: what is the entropy of a Gaussian variable? We will apply it first to the linear case, even though we know that neurons are nonlinear, and then evaluate information transmission in a case like the one shown here, assuming the noise is Gaussian, and with a distribution p(x) that is assumed uniform.
We then need to know the information conveyed by the neural responses for a given value of x, and because the plot shows error bars, we are approximating the responses as a Gaussian variable; this is a kind of maximum-entropy approximation.

So, the entropy of a Gaussian distribution. If you have covered this in the course already we can go faster, but let us recall what the differential entropy of a Gaussian variable is. This is the general definition of a Gaussian variable: p(z) is defined by its mean, and in the denominator of the exponent you have the variance of that variable. If the variable has zero mean, it is proportional to exp(-z²/2σ²), where σ² is the variance. We want to compute the entropy of this distribution. The entropy is -∫ dz p(z) log p(z); in other words, since the integral over dz p(z) is an average, it is the average of -log p(z). So we start by evaluating log p(z), the function whose average we need. We convert the base-2 logarithm into the natural log, log2 u = ln u / ln 2, and apply it first to the prefactor and then to the exponent itself. The prefactor gives a constant: when we average it, with the overall minus sign, it contributes a constant plus a term with the log of the variance. That is important, because it says that the broader the variance, the larger the entropy of a Gaussian distribution, which is useful to remember and agrees with intuition. Then we have the average of the exponent, (z - ⟨z⟩)²/(2σ²); but the average of (z - ⟨z⟩)² is just the variance, so this cancels against the denominator and contributes one half. So you get a very simple expression: the entropy of a Gaussian variable is one half of the log of some constants, 2πe, times the variance of the signal, H = (1/2) log2(2πe σ²).

A few comments about this expression. It is useful, but beware: in reality this integral hides a subtlety, because there is an infinite term that we have omitted, related to the discretization Δz. If you can resolve the signal to infinite precision, the entropy is infinite. Another property of this expression is that the entropy is independent of the mean of the signal for a Gaussian variable; for a non-Gaussian signal, the entropy may depend on the mean as well as on the variance.
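A quick numerical check of this formula, added to the notes (σ is an arbitrary choice): integrate -p log2 p on a grid and compare with (1/2) log2(2πe σ²). This is the differential entropy, so the -log2(Δz) discretization term discussed next is deliberately left out.

```python
import numpy as np

sigma = 2.0
z = np.linspace(-12 * sigma, 12 * sigma, 200_001)
dz = z[1] - z[0]
p = np.exp(-z**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Differential entropy by direct numerical summation of -p log2(p) dz.
h_numeric = -np.sum(p * np.log2(p)) * dz

# Closed form derived in the lecture: H = 1/2 log2(2 pi e sigma^2).
h_formula = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(h_numeric, h_formula)  # both approximately 3.05 bits
```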
So now we look at information transmission for a Gaussian channel, where for now we forget about the nonlinearity: a linear system in which the response y is linearly related to the signal x, and we would like to characterize information transmission in the presence of noise. Tanya, we have a question from Gibbs. Please, could you go back to the previous slide? When we saw the entropy of a continuous variable with Matteo, there was an additional term to make sure the entropy is always positive; but in this case, I don't know what the range of ⟨δz²⟩ is, because if it is small enough, the log could be negative. Yes: the logarithm of Δz, the term that is effectively negative infinity, is omitted here. Can I comment on this? This is what is called a differential entropy. When you measure z to a precision Δ, the entropy of the discretized variable is this expression minus log Δ. But log Δ is a constant, so as we discuss the optimization problem for this channel, it does not really matter. Yes, there is an infinity there, because if Δ is very small... That's right. Thank you.

So, information transmission for a Gaussian channel. We have y equal to a linear function of x, a constant times x, plus noise, and we ask how much the observation of y tells us about the signal x. Because the noise is Gaussian, we know that p(y|x) is a Gaussian distribution whose variance is the noise variance σ_ξ². Here is one of the useful tricks for a theorist: instead of writing p(y|x) directly, I am in effect writing the probability distribution of the noise ξ, so p(y|x) = 1/√(2π σ_ξ²) exp(-(y - gx)²/(2σ_ξ²)). If the noise were not there, then once you know x you would know y exactly, a delta function; because of the noise, y is spread around gx with the noise variance. Any questions about this? Is it clear?

Just to be sure, is g a function of x, or just g times x? No, here it is g times x, because we are discussing the linear case. From here you can generalize in multiple directions, and today we will take just one. In principle g can be a function of x. Also, it is an approximation to say that x is one dimensional: x can be a vector, and then g becomes a matrix, and we can talk about the optimal filtering properties of that matrix; even when it is a linear function, what are the optimal properties of this filtering matrix? In addition, you can have many neurons encoding the same x, and ask how their nonlinearities should be coordinated and how their filtering matrices should be coordinated. So, in the first part of this lecture we discussed the optimal nonlinearity g(x); now I am analyzing the linear case, because it is a tool we will apply in other settings. Later in this lecture I plan to talk about nonlinearities with multiple neurons, but still a one-dimensional x, and in the next lectures x will be multidimensional, first with a single neuron and then with multiple neurons. So this is the starting point from which you can analyze more complicated questions: g can be nonlinear, x can be multidimensional, and y can be multidimensional. Is that OK? Any more questions?

Note that, because ξ is noise, by definition it has zero mean; otherwise we would just say there is a constant offset and absorb it into y. Then a simplification, which is actually not quite true for natural signals, is to say that the signal x itself is also Gaussian, and we write p(x) in the corresponding Gaussian form. Now we can put these together to evaluate the entropies we need.
So, if you take an input Gaussian signal and pass it through a linear transformation, it turns out the result is also Gaussian. It is an interesting fact: if you think of the covariance matrix as represented by an ellipsoid, cutting it at different angles to take different projections always gives another ellipsoid. So that is a useful fact to keep in mind: a Gaussian signal through a linear transformation gives a Gaussian signal at the end.

But I have a question for you here. Even for a one-dimensional signal: you have a Gaussian distribution for x, and now I take not a linear function but a simple threshold function for g, so that if x is positive, y equals x, and if it is negative, y equals zero. Will the distribution of the response be Gaussian? If g is just a step function? Let us take the step function first: what is the distribution of y then? It will be bimodal, it will only have values close to zero or close to the maximum, right? Yes. But I was thinking more of the case where g is threshold-linear: zero below the threshold and linear above it. Then what would the distribution of y be, if the distribution of x looks something like this? What do people think? For all values of x below the threshold you get zero, so there is a delta function at zero. And for values of x above the threshold, the distribution will be what? If I plot the distribution of y, there will be a peak at zero, and the distribution for positive values will have what shape? Imagine I move the threshold to very negative values: what would the distribution be? A Gaussian. Yes. So here it is a truncated Gaussian: a Gaussian down to the threshold and cut off below it. In particular, if you put the threshold in the middle of the distribution, you keep half of the Gaussian, and your distribution of output signals will be, I would say, strongly non-Gaussian: you only have half of a Gaussian. And there are papers where, just as in our case, there is a strong temptation to say: we know the Gaussian case and we have a solution, so the input signal is Gaussian, the response signal is not really Gaussian, but we approximate it as Gaussian. You now know that depending on where the threshold is, this may not be a good approximation: if the threshold sits out on the tail, the output is approximately Gaussian, but if it sits in the middle of the distribution, it is not. So for a threshold-linear neuron, whether this is a good approximation depends on where the threshold falls. OK, thank you, we can move forward.

In our linear case, the output is also Gaussian, so we can write p(y) in the Gaussian form; we know the form, but we don't know the variance just yet. Does the response have zero mean? Yes: if x has zero mean and the noise has zero mean, then y has zero mean, and therefore there is no ⟨y⟩ term in the exponent.
The variance of the response, which I wrote in schematic form because we know the distribution has to have the Gaussian shape, can be computed: it is g²⟨x²⟩ plus the variance of the noise, σ_y² = g²σ_x² + σ_ξ². What I like about this example is that, because we know the general expression for a Gaussian probability distribution, we never actually had to take the integral. I could have said: to get p(y), I take p(y|x), this complicated expression, multiply it by p(x), integrate over x, perhaps use a saddle-point approximation, and arrive at p(y). But we have omitted all those calculations, because we know the form a Gaussian variable has to have, and we just compute the relevant parameters. When you write your own derivations, that is a useful trick to know. So here, what goes in the denominator of the exponent is the variance of the response, g²⟨x²⟩ plus the noise variance.

The overall information is the joint integral over dx and dy of p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]. Now we have an expression for p(x), an expression for p(y), and we write p(x, y) as the product p(x) p(y|x), so all the probability distributions have been specified. We rewrite the log base 2 in terms of natural logs to help us, because we have exponentials here; p(y) stays, and p(x) divides the joint probability to make it conditional. We have the expression for p(y|x), which is set by the noise, and the expression for p(y) from above. So this is the expression we want to evaluate: as a reminder, p(y|x) has the Gaussian form with the noise variance, because the noise is the only deviation between y and gx, and p(y) has the Gaussian form with variance σ_y².

Putting these together, what we need is the ratio of these two probability distributions inside the logarithm. First we have the ratio of the prefactors; when we average over p(x, y), this is just a constant. Then we have the logarithm of the conditional distribution, which is the argument of its exponent, plus, from the p(y) in the denominator, the other exponent. Now we average over p(x, y): the prefactor term is a constant, and the last terms cancel when we take the average. (There should maybe be a 2π here? There is no π. Yes, a π is missing, but it will also cancel.) Because the variance of the first exponent's argument is the noise variance, these terms, including the missing π, cancel. And we are left with a very interesting expression, worth remembering, converted back to log base 2 because we work in bits; the square root gave the one half out front. It is one half the log of the ratio of the output variance to the noise variance, and since σ_y² contains the noise, this is I = (1/2) log2(1 + g²σ_x²/σ_ξ²): one plus the signal variance times the gain squared, over the noise variance.
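Another sketch added to the notes (the gain and variances are arbitrary choices): draw samples from the linear Gaussian channel and compare the formula with a crude plug-in mutual-information estimate from a 2D histogram.

```python
import numpy as np

rng = np.random.default_rng(2)
g, sigma_x, sigma_n = 2.0, 1.0, 0.5
n = 500_000

x = rng.normal(0.0, sigma_x, n)
y = g * x + rng.normal(0.0, sigma_n, n)

# Analytic result from the lecture: I = 1/2 log2(1 + g^2 sigma_x^2 / sigma_n^2).
i_formula = 0.5 * np.log2(1.0 + (g * sigma_x) ** 2 / sigma_n ** 2)

# Crude plug-in estimate: I = sum over bins of p(x,y) log2[ p(x,y) / (p(x) p(y)) ].
pxy, _, _ = np.histogram2d(x, y, bins=60)
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
mask = pxy > 0
i_hist = np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

print(i_formula, i_hist)  # formula ~2.04 bits; histogram estimate close to it
```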
This part here, g²σ_x²/σ_ξ², is the signal-to-noise ratio. Let us talk a little about this expression. First, suppose g is zero: the output carries no signal, and the information is one half log of one, which is zero. More generally, the information grows as log(1 + SNR), where the SNR is the signal variance in the effective units of the noise. Any questions so far?

Sometimes it is useful to rewrite this in terms of an effective input noise. The noise is additive at the output, but because the system is linear we can propagate it back to the input: even though the input technically has no noise, we ask what effective noise we would have to add to the input to produce the same noise at the output. We just rescale the noise ξ by the gain g, so the effective input noise variance is σ_ξ²/g², and the SNR becomes σ_x² over the effective noise. In other words (apologies for my typos here), the information is the difference between the entropies of two Gaussian variables: the entropy of the output, with σ_y² = σ_ξ² + g²σ_x², minus the entropy of the noise, which is the denominator here. And why is this expression so useful in various parts of engineering? Because, as we said, information from independent measurements adds. Imagine a transmitter that works across multiple frequencies, so x refers to a specific frequency and there is a gain at each frequency. With multiple frequencies you have the same expression, just integrated over frequencies. That is the advantage of the Gaussian approximation: the expression holds for multidimensional variables as long as they are independent.

And now we can go back to the case of the Hunchback transcription factor. This is the work of Bill Bialek and his experimental colleagues. We have the measured variance at each position, and assuming that p(x) is uniform, he computed the information. It turns out that the expression level of this hunchback gene carries about two bits of information, which is interesting because it is greater than one. Let me give a little context for this gene. This is in the development of the fruit fly embryo: the mother deposits the product of a different gene, bicoid, at one end of the egg, and it diffuses toward the other end; hunchback then takes bicoid as its input and is expressed at different levels. It is often thought of as an on-off gene, so the common belief is that, being an on-off channel, it conveys one bit of information. But when you do the analysis, the variability is such that the information is approximately twice what the classic on-off model predicts for this gene. It means that all the graded variations in the values of this nonlinearity matter for information transmission, and we cannot approximate it as a constant here, a transition, and a constant there. So that is an example application of this information analysis. Any questions? In this case, I guess the function g is nonlinear, right? Yes; he does the analysis at the various positions along the axis independently and then integrates over the axis, assuming a uniform distribution.
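A tiny illustration of this additivity, added to the notes (the SNR values are made up): for independent Gaussian channels, whether frequency bands of a transmitter or positions treated as independent readouts, the total information is just the sum of the per-channel terms.

```python
import numpy as np

# Hypothetical signal-to-noise ratios of independent channels
# (frequency bands, or positions treated as independent readouts).
snr = np.array([9.0, 4.0, 1.0, 0.25])

info_per_channel = 0.5 * np.log2(1.0 + snr)
print(info_per_channel)        # about [1.66, 1.16, 0.50, 0.16] bits
print(info_per_channel.sum())  # about 3.48 bits in total
```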
That statement is approximately true; the curvature of the embryo also makes a difference, so it is an approximation. And because the noise is approximated as Gaussian, the true information can only be larger: for a given variance, Gaussian noise has the maximal entropy compared to other distributions, so this estimate is a lower bound.

Now, if we are ready, I want to talk about results for multiple neurons. We are still talking about a one-dimensional signal, and about how to coordinate thresholds between two neurons. Imagine you have an analog signal, shown here as a function of time, and I cut it at a certain level: I have a simplified nonlinearity, I have a choice of where to put the threshold, and my noise level is potentially set by metabolic cost. Because this is a binary device, I will not have full information about the underlying analog signal, but I will have some: I will be able, for example, to say whether the signal is greater or less than the mean. This cartoon shows how the signal is encoded in the spike train in this model: if the signal is greater than the threshold, the response is one, and if it is less than the threshold, it is zero. Then we evaluate the information according to our equations, entropy minus noise entropy, and now we have two neurons. Actually, if that is OK, maybe I will switch to PowerPoint; I think it will be better.

A question in the meantime, from Carlos: in the absence of a signal, do the neurons behave with Gaussian noise, and how do you know experimentally whether there is actually a signal? The way I understood the question: if you keep the signal constant and look at the variability of the neural response, will it be Gaussian? If we are looking at small time intervals, where the neuron produces at most one spike, the distribution is typically binary, binomial. If we are looking at broader time windows, where the neuron can produce a larger number of spikes, integer counts, then a common model for neuronal noise is the Poisson distribution, where the variance equals the mean. But some neurons are sub-Poisson, meaning their variability is less than a Poisson model would predict, and some are super-Poisson, with variability greater than Poisson. The reason for these differences is that the neuron itself, if it is healthy, is typically fairly reliable; what people call neuronal noise is often the cumulative effect of all the variables except the one we are controlling. If we record from one neuron and see its responses go up and down with the stimulus fixed, it could be noise within the neuron, but it could also be input from other parts of the brain. That is often how you get super-Poisson variability: some modulating factor makes the variance go up and down.
For example, attention, how tired somebody is, what else they are thinking about, whether they are distracted or paying attention to this particular stimulus: these modulatory signals provide an additional source of noise, and so the variability is observed to be super-Poisson. This modulating factor is usually stronger the deeper you go into the brain. In the retina the responses will be very reliable, and the deeper we go into the brain, the further we are from the input: the neurons themselves are still reliable, but they receive many other signals in addition to the input we are providing. Is that OK? I don't hear anything... It's OK. OK, thank you.

So, in this case we have a signal x(t) and the response r(t), over multiple trials, and because of the noise there are cases where the neuron produced a spike on one trial but not on another: we have variable responses. Our information is the difference between two entropies, H[p(r)] minus the entropy of p(r|x) averaged over p(x). That is the model we have been describing for one neuron. Now we take two neurons, and with two neurons the problem becomes much more interesting. I will show you the main results in a few moments, and then maybe next lecture we will go over the details and the implications of these results.

So, with two neurons I have four parameters. (I think we have a question in the chat. OK, no.) For each neuron, assume a sigmoidal nonlinearity; there are actually reasons for that assumption. The parameters are μ1 and ν1: μ is the threshold and ν is the width of the fluctuating threshold. By the way, maybe a question for the students. Matteo, would you mind writing, or maybe one of the students can draw: if I give you the threshold μ and the noise ν in the threshold, and it is a binary on-off neuron, what would the nonlinearity of that neuron look like? Maybe somebody can go to the board and draw it. Should I start giving out exam points, like plus one exam point? Sorry, so the signal is the one in the slide, and the output is one if the signal is above the threshold, right, and zero if it is below? Yes, but let's not worry about the signal for now; we just plot the neuronal nonlinearity, given a threshold μ and variability ν in the threshold. Imagine that the signal is affected by Gaussian noise before the threshold, and we have a binary neuron. How would the nonlinearity look? I am giving out a plus-one exam point. Well, the Gaussian is the distribution of x, right, this is p(x)? Yes, but let's forget about x; we just have p(y|x), and I am asking about a threshold-like device. The simpler question first: p(y|x) for a threshold with no noise; one can draw that function, I think we discussed this previously. If you have a simple threshold, what does this function g(x) look like? Yes, please, Jakobo? Like a step function. OK, so it is zero below the threshold μ and one above it. And then the distribution of y would be what?
For all the points above the threshold y equals one, and for all the points below it y equals zero, so the distribution of y is bimodal, just two peaks. Is that OK, Tanya? Yes, it's OK.

Now, if a little noise is added to the values of x, can somebody draw p(y|x)? Now ν is not zero. So you still have the bimodal distribution here? No, let's keep the threshold μ drawn as a step function, but now y = θ(x - μ + noise), where θ is the Heaviside step function and the noise sits inside the argument. So you want to put the noise inside the theta or outside? Inside. OK, then the distribution of y remains bimodal, right? Well, the values of y are only zero and one, so whatever is inside, you either get zero or one; if you put the noise outside the theta, then the peaks themselves would broaden. Yes, but I would like it inside, and I am asking not about p(y) but about y as a function of x. Ah, OK. If you ask for y as a function of x, then it can take the value zero or one depending on the noise: if x is here, and the noise is negative enough, I get zero, and if the noise is not so negative, I get one, right? Yes. So can somebody volunteer to draw p(y|x) in the presence of the noise? I think it should still be two delta functions, one at zero and one at one. Yes, but I am hoping not for the probability distribution, rather for y as a function of x; or, more precisely, what you want to compute is the average of y given x. Exactly, yes, the average of y given x. So the average of y given x: if this is μ, it will be zero when x is very small, then it goes up, and then it is one. Yes, thank you. This is the average of y given x, and ν determines how sharp this transition is: if the neuron experiences more noise, the transition will be broader, and if it experiences less noise, it will be sharper, closer to a step function. So this curve is drawn with a lot of noise; with less noise it would approach a step. Yes, thank you.

And now I have a question: p(y) is still bimodal, still two values, because by definition it is a binary neuron, and the weights of the two peaks are determined by this function, right? Yes: because it is a binary neuron, it only has two values, and this curve is the probability that y equals one given x. So we have a bimodal distribution, and this nonlinearity is, on the one hand, the average of y for a given x, and at the same time the probability that y equals one, which sets the ratio between the two bimodal peaks. I think in the last few minutes I will give you the result, and then we will decompose it next time; it will be a to-be-continued part.
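For the notes, a minimal sketch of this mean response (added here, with arbitrary threshold and noise values): with Gaussian threshold noise of width ν, the probability of spiking, which equals the average response for a given x, is a cumulative Gaussian in (x - μ)/ν.

```python
from math import erf, sqrt

def p_spike(x, mu, nu):
    """P(y = 1 | x) = <y | x> for a binary neuron y = theta(x - mu + noise),
    with Gaussian threshold noise of standard deviation nu."""
    return 0.5 * (1.0 + erf((x - mu) / (nu * sqrt(2.0))))

mu, nu = 0.5, 0.2
for x in [-0.5, 0.0, 0.5, 1.0, 1.5]:
    print(x, round(p_spike(x, mu, nu), 3))
# Small nu: the curve approaches a sharp step at mu; large nu: a broad sigmoid.
```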
So, the interesting part is this: you have two neurons, each with this nonlinearity, and the parameters we can adjust for each neuron are its threshold and its noise. The noise is presumably fixed, but we will analyze solutions for different values of the noise, and we want to know the optimal values of the thresholds. It turns out that the average of the two thresholds is fixed by metabolic constraints, because, to a first approximation, each spike costs a lot of energy. Think first of one neuron: the information increases the closer I place the threshold (the red or the blue line in the figure) to the middle of the distribution, and if I place it exactly at the middle, one bit is conveyed. But because of the metabolic constraints I cannot spike that much, so my threshold will be placed somewhere on the outskirts of the distribution, as close to the middle as possible. That is the solution for one neuron. Now, for two neurons with the same threshold, their average threshold is still fixed by the metabolic constraint, and it sits at some distance from the mean determined by how much, on average, both of them can spike. But with two neurons you have a degree of freedom, the difference between their thresholds, which is only very weakly coupled to changes in the overall spike rate. And that turns out to be an interesting parameter, with an interesting bifurcation.

I will just tell you the result, and then we will go over it in more detail next time. In this graph, what is plotted is the information that the two neurons convey, as a function of the threshold difference μ1 - μ2, for different noise scenarios, because we think that to reduce the noise one has to invest metabolically. The surface of information as a function of threshold difference and noise has a very interesting bifurcation: it shows that when the noise is large, it is optimal to have zero threshold difference between the neurons. I often say it is like a company with two newly arrived workers: you assign them the same task and average the results. Then, as they become more competent, as the neurons become less noisy, you can partition the range of inputs between them and interpret the results accordingly, and the less noisy and more reliable they are, the bigger the optimal threshold difference between them. So this is the introduction to the next lecture, and we will talk about it then.

As Matteo drew for us, for one neuron the distribution of responses is bimodal, zero or one; with two neurons there are four possible responses: (0,0), (0,1), (1,0), (1,1). It turns out that for this picture the noise entropy does not matter much: the main impact of the threshold difference is on the balance of the bars describing the probabilities of these four response states. And then this problem has a bifurcation, and we will discuss how it has implications for diversification in biological systems. OK, for now, are there questions here? Are there questions? I think there are no questions. Very good; maybe we stop here and resume on Monday. Tanya, is this OK? Yes, yes.
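To preview the calculation behind this figure, here is a simplified numerical sketch added to the notes (a reconstruction, not the exact model from the slides): a Gaussian input, two binary neurons with cumulative-Gaussian nonlinearities and conditionally independent threshold noise of width ν, a fixed mean threshold standing in for the metabolic constraint, and a scan over the threshold difference. The mean threshold, noise values, and grid are illustrative choices; scipy is assumed to be available for the error function.

```python
import numpy as np
from scipy.special import erf

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(u / np.sqrt(2.0)))

def info_two_neurons(mu1, mu2, nu, sigma_x=1.0, n_grid=4001):
    """I(x; (y1, y2)) in bits for two binary neurons with Gaussian threshold
    noise of width nu, conditionally independent given x, and Gaussian input."""
    x = np.linspace(-6 * sigma_x, 6 * sigma_x, n_grid)
    dx = x[1] - x[0]
    px = np.exp(-x**2 / (2 * sigma_x**2)) / np.sqrt(2 * np.pi * sigma_x**2)

    p1 = norm_cdf((x - mu1) / nu)  # P(y1 = 1 | x)
    p2 = norm_cdf((x - mu2) / nu)  # P(y2 = 1 | x)
    info = 0.0
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        pr_given_x = (p1 if a else 1 - p1) * (p2 if b else 1 - p2)
        pr = np.sum(px * pr_given_x) * dx  # marginal probability of state (a, b)
        with np.errstate(divide="ignore", invalid="ignore"):
            term = px * pr_given_x * np.log2(pr_given_x / pr)
        info += np.nansum(term) * dx
    return info

mean_threshold = 0.8  # stand-in for the metabolically constrained average threshold
deltas = np.linspace(0.0, 3.0, 61)
for nu in [2.0, 0.1]:  # large noise vs. small noise
    infos = [info_two_neurons(mean_threshold + d / 2, mean_threshold - d / 2, nu)
             for d in deltas]
    best = deltas[int(np.argmax(infos))]
    print(f"noise nu = {nu}: best threshold difference ~ {best:.2f}")
# Expected: large noise -> optimum near zero difference (redundant thresholds);
# small noise -> optimum at a nonzero difference (diversified thresholds).
```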
There is another question, from Colin: is the main reference for this part the one that is coming, or the one on information for Gaussian channels today? OK, the main reference will be Bialek's book, and for the next lecture it is mostly my paper. OK, maybe one idea is that you can add this information to the Slack channel, so that people can access the material. OK, thank you very much. Bye-bye. Have a nice weekend, and see you on Monday.