What I pointed out last time is that if you look at the average of t, it's 1 over k, right? And if you compute the variance of t, it's 1 over k squared, so the relative fluctuations are equal to 1, that is, 100%. Now, since the activity is strictly proportional to the time the molecule stays activated, this is the meaning of that, the same holds for the activity. So that's the puzzle: you get 100% fluctuations when you have something that's exponentially distributed. And the way nature solves this is by having several steps in the deactivation. So instead of going straight like this, it will go something like this. So instead of having one step, you have several steps. So how does that help you with reproducibility? This is the experimental reproducibility you see in the response to a single photon. This is the output current coming from the photoreceptors. And what you can see is that it's much less than 100% fluctuations; in fact, it's of the order of 25%. So why does that help? So maybe this is a poor choice of notation, sorry, let's call the number of steps small n, and call the rates k1, k2, et cetera, up to kn. Each of these is a chemical reaction rate, so the time the molecule stays in each state is also exponentially distributed. But the total time from the most activated state to the completely inactivated state, let's call that t, is a sum of these ti. So you spend some time ti in each of the states, and the total time you have to wait to go from the most activated state to the least activated state is the sum of the ti. So now you can calculate what you're interested in. We still use a model like this one, where the total activity generated by rhodopsin is simply proportional to the time it stays in any of the activated states. As soon as it gets into r0, it stops being active. 
So what I'd be interested in is calculating this kind of quantity, the fractional variance, or squared coefficient of variation. In the single-step case it was 100%. How much is it in this case? For this, I'll calculate the two moments. The first one is very easy: it's just the sum of the average times, each of which is 1 over ki. Now what about the second moment? Each of my ti's is independent of the others, and if you take the variance of a sum of independent variables, what you end up with is the sum of the variances. Is that OK for everyone? You've probably seen this in other courses as well. And remember, I already know the result of this calculation, I did it last week: it's 1 over ki squared. Each of these terms is the result of some integral. OK. The reason why it's written small is because I did it last week, so it's just to remind you, you don't have to copy it. It's just the calculation of the second moment of an exponentially distributed variable. I'll write it larger. Is this better? It's too low for some of you. All right, so now I'm going to take the ratio of these two quantities, and this is what I have. Can you think of a bound on this quantity? Here, in principle, if I were to design this system, maybe I can choose my ki's; I still have some degrees of freedom in my ki's. But what is it going to be bounded by? It would be a lower bound. Anybody have an idea? Mathematicians? Anyone? Come again? No. Well, it's larger than 0, yes. But it's also larger than another quantity, no matter what ki you take. 1 over k? There's no single k here, actually, right? It's ki. Besides, remember, this is a dimensionless quantity. 1? If it were 1, I wouldn't gain very much compared to the single-step case, right? And remember, I want this to be as small as possible; I want to reduce it. OK, you can give a wild guess, it doesn't have to be justified mathematically. But here, we took n steps 
instead of 1 step, to reduce the variance. It's kind of always the same story: if you do something over and over again, you reduce the variance. So the bound is going to be 1 over n, right? That's the only quantity there is in this problem, once I've told you that this is dimensionless, right? So do you know why this is true? Do you know how to prove it? Again, I'm asking the mathematically trained people. Let me rewrite this for you first. It doesn't look any more familiar? Thank you. Yes, yeah, sorry about that, thank you. Is that correct? OK, maybe now the light bulb will turn on. No? OK. So this is a trick, and the trick is to use Cauchy-Schwarz. Cauchy-Schwarz is the inequality that tells you that for two vectors x and y, the dot product is at most the product of the norms. So what should my x and my y be to make this work? Well, I'll say that my x is the vector with components 1 over ki, and my y is just a vector of ones. Then the norm of x is this, and the norm squared of y is just n. So it's fairly simple: here is my x, here is my y. OK, so what is this? I want to make this as small as possible, so I basically want to saturate this bound, which means I want to turn it into an equality. Do you know under what condition Cauchy-Schwarz becomes an equality? Come again? The cosine is one. Right, but here you're thinking of the cosine, which is already defined from this scalar product. What does it mean that the cosine is one? Maybe you can rephrase this. Yeah, the vectors point in the same direction. So I want x to be proportional to y, and since y is all ones, that just means xi is constant. In my case, that means ki is constant. So all the steps should be the same; all the steps should have the same residency time. And if you do this, you achieve this kind of precision. So this was this. 
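The argument above is easy to check numerically. Here is a minimal sketch (my own illustration, not from the lecture): the total active time T is a sum of n independent exponential waiting times, its fractional variance saturates the Cauchy-Schwarz bound 1/n when all rates are equal, and unequal rates do strictly worse.

```python
# Monte Carlo check: splitting shutoff into n exponential steps reduces the
# fractional variance CV^2 of the total active time T = sum(t_i) to 1/n.
import numpy as np

rng = np.random.default_rng(0)

def fractional_variance(rates, samples=200_000):
    """CV^2 of T = sum of independent exponential times with means 1/k_i."""
    t = sum(rng.exponential(1.0 / k, size=samples) for k in rates)
    return t.var() / t.mean() ** 2

n = 6
equal = fractional_variance([1.0] * n)               # equal residency times
unequal = fractional_variance([1, 1, 1, 1, 1, 10])   # one fast step

assert abs(equal - 1 / n) < 0.02    # saturates the Cauchy-Schwarz bound 1/n
assert unequal > equal              # unequal rates add relative noise
```

The rate values are arbitrary; any choice with unequal k's gives a fractional variance above 1/n, as the Cauchy-Schwarz argument predicts.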
So it turns out that one has identified on the rhodopsin molecule six sites of chemical modification. The exact mechanism is not known, but it's believed that these sites may be involved in several steps of the deactivation. And to test whether this had any influence on reproducibility, these researchers, in a Science paper from 2006, introduced mutations, one by one, at each of these sites where the chemical modification can happen. A mutation here means the site is made nonfunctional, so the chemical modification can no longer happen; in biology it's called a knockout when you make something dysfunctional. So they removed one site in one place or another, and then they removed more: two, three, four, five, and even six. They made these modifications and then did the same experiment as before, which was to see how reproducible the response to a single photon is. So this is the wild type; wild type is when there are no modifications. And you can see, as before, you get this nicely reproducible response. Already if you remove one of these sites, you disrupt the response a bit; it gets a bit more noisy. And if you remove more and more, you can see that it becomes more stochastic. And you can see here also, in this case you really get activation, and then, boom, all of a sudden, deactivation. Here, activation, deactivation. And this really corresponds to the first situation I was talking about: the thing becomes active, so it signals, and then, all of a sudden, it becomes inactive. So you really get this stepwise decrease. Yes? It's just the labeling; there's no difference from the conceptual point of view, they just knock out different sites. There are six sites that you can identify along the protein sequence, and they're marked by their positions along the protein sequence here. 
So you can choose to introduce a mutation at each of those sites. Now, it's still biology; it's not clear why the sites should be exactly identical. Maybe they don't play exactly symmetric roles, we don't know. Maybe the k's are not all exactly equal to each other; maybe there's an ordering in which this happens. We don't know for sure. But what you can do is calculate the reproducibility of this response. In practice you calculate the CV, the coefficient of variation, of the response as it is measured in picoamps. So you measure this response here, you calculate the coefficient of variation, and you do this for each of the chemical modifications. And this is what you find. So they compute the coefficient of variation for each of the mutants. This is, sorry, this is when they knocked out all of them, and this is the wild type. So the wild type sits here. I said 25% before; it's a bit more like 30%. Yes? If it's not related just to the number of sites, can you take that as the only parameter? It could also depend on the position of the removed site. Right, so that means you may get different points at different places here. But I think that's OK for the kind of approximation that's made here. This is still only semi-quantitative; if one wanted to do a more precise fitting of the effect of each of the sites, you would need much, much more data. And you see that this data is still quite noisy, and the agreement is not perfect. So this is a first step. But you should just be aware that in biological systems, it's rare to get really good reproducibility. So you're right: different mutants could have different effects. Here they took the average, but in fact, you see, I think there are two points, and they're very close to each other. 
So the thing is also, remember, these traces themselves are stochastic. So here, you say this one is more stochastic than that one; well, it could just be a matter of realization. So here they took the averages over the two mutants, and you see they were pretty close to each other, actually. But the point is that the more sites you remove, the more you increase the CV. And this is exactly the intuition we got from before: the more intermediate states you remove, the less reproducible you are. This is what you see here. But it's even better than that. This CV, they compared it with 1 over the square root of n, where n here is 1 plus the number of modification sites. So the idea is that the modification sites act as intermediate states, and then you have the first state as well, OK? And this is not a fit; they just plotted 1 over the square root of n plus 1. And it's not perfect, right? But you can see that, at least qualitatively, it really goes the right way. OK. So there is a catch to this, though. You can see that this is really a strategy for reducing the noise and improving reproducibility, but I want you to be aware that this actually costs energy to the system. Why is that? Because when I drew this chain of chemical reactions, I put single arrows from one state to the next. But in principle, in any chemical reaction, there can also be the reverse reaction, right? And if I introduce these back arrows, then maybe this will introduce some noise as well. So here, there's some noise because you need to wait for each step to go forward; but if you can also go backwards, you will add some additional noise, OK? The reason why in many chemical reactions we put one big arrow and basically prevent the reverse arrows from being there, we call that an irreversible reaction. 
And what this means, typically, is that the energy difference between the two states is too high to be able to go back to the previous state. So the way you can view it is that you have several states, say six states, along some reaction coordinate; it doesn't really matter. And you go down like this. And the point is, the reason you cannot go back up is that the energy drop between these steps is large enough, right? But you see that in order to be able to do this, you lose energy at each step of the way; this is dissipated. So you'd better make sure you provide this energy to start with, to then be able to go down this ladder of energy. Otherwise, you will not have an irreversible set of reactions. So here, where is the energy provided from? It's rhodopsin. What happens to rhodopsin? Where could it get its fuel? What does rhodopsin do? A photon, right? So this is where the energy comes from: it gets excited and then it goes down this ladder. But as I said, this energy is still finite, it's not infinite. So the reactions will only be approximately irreversible. Before, I wrote the bound in terms of the number of steps, with the fractional variance equal to 1 over n. But let's say that I'm not limited by the number of steps by design; I could design any sort of energy surface. I get excited up to here, and the conformational state will roll its way down to the inactivated state. Here on the x-axis is some reaction coordinate, and on the y-axis, always, the energy scale; I'm going down from here to here in energy. There will be some jitter due to thermal fluctuations, the same kind of jitter that makes the system go back and forth, right? Just thermal agitation. Can you guess what the lower bound on the CV will be? So the CV, again, is the coefficient of variation of the time it takes to go from here to here; let me write CV of t. Well, remember, this was, again, a bound. 
And there I was assuming that it was perfectly irreversible. But that was not realistic from the physical point of view: the energy expenditure necessary to get purely irreversible steps is infinite, right? If you want purely irreversible steps, you need an infinitely large energy drop between one state and the next. So here I'm putting myself in a different bound: I can have as many steps as I want, but now there's an energy constraint. I'm not going to do the calculation, because, it's not that complicated, but it's a bit hairy. But can you try and guess the answer? Yeah, that's the idea of the calculation: the ratio of forward rate to backward rate, by micro-reversibility and detailed balance, is equal to the exponential of the energy difference divided by kBT, okay? We're not going to do the calculation; here I just want an order of magnitude. And it's simple, you shouldn't look for something too complicated. Now n is not a constraint anymore; you just have an energy. So the answer should depend on the energy. How should it depend on it? Should the CV go down with energy or up with energy? The energy, well, the energy that I provide. Okay, so h nu. So I have one energy, and I need to make something that's dimensionless. What else do I have? Come again? No, because here I really have a continuous set of steps. So what is the other energy scale in the problem, the one that makes this not completely deterministic? It's temperature, right? The very fact that you can go back and forth is because of temperature. If you have zero temperature, you always go down in energy; you cannot go back up, right? So the only other energy scale you have is kBT, and one can do a more complete calculation. 
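To build intuition for why back-steps add noise, here is a small Gillespie-style sketch (my own illustration, not the lecture's calculation): a random walk down an n-step ladder where each step dissipates an energy dE, so by detailed balance the backward rate is suppressed by the Boltzmann factor. With a large energy drop per step the fractional variance of the first-passage time recovers the irreversible value 1/n; with a drop of order kBT it is clearly larger.

```python
# Biased random walk down an energy ladder: states n (fully active) down to 0.
# Forward rate kf; backward rate kb = kf * exp(-dE/kBT) by detailed balance.
import numpy as np

rng = np.random.default_rng(1)

def first_passage_cv2(n, dE_over_kBT, kf=1.0, trials=20_000):
    """CV^2 of the first-passage time from state n to state 0."""
    kb = kf * np.exp(-dE_over_kBT)   # Boltzmann-suppressed back-step rate
    times = np.empty(trials)
    for i in range(trials):
        state, t = n, 0.0
        while state > 0:
            back = kb if state < n else 0.0   # no back-step from the top
            total = kf + back
            t += rng.exponential(1.0 / total)
            state += 1 if rng.random() < back / total else -1
        times[i] = t
    return times.var() / times.mean() ** 2

nearly_irreversible = first_passage_cv2(6, 10.0)  # dE = 10 kBT per step
dissipation_limited = first_passage_cv2(6, 1.0)   # dE = 1 kBT per step

assert abs(nearly_irreversible - 1 / 6) < 0.03    # recovers the 1/n bound
assert dissipation_limited > nearly_irreversible  # back-steps add noise
```

The numbers (6 states, dE of 1 or 10 kBT) are arbitrary choices for illustration; the qualitative point is that the 1/n precision is only reached in the limit of large dissipation per step.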
I mean, it's not such a complex calculation; it's just a mean-first-passage-time kind of calculation on some energy surface like this with thermal agitation, and this is the answer. So there's a factor of two, but forget about that, it's not so important. You get kBT, which increases the CV, and the energy you provide, h nu, which decreases the CV, okay? So it's a competition between how much energy you provide and how much thermal agitation there is. Now that's interesting, because we know these numbers for rhodopsin, right? So that can give us a bound on how well rhodopsin could do even if it were perfectly designed by evolution. The fact that the photon can only provide so much energy puts a bound on the CV, and ultimately on the ability to detect single photons, right? So kBT is what? kBT at room temperature; I put everything in electron volts. That's kBT at room temperature, about 0.025 eV. And then h nu: I'll take the typical frequency of visible light, light that can be seen by cones, about 500 nanometers. In that case h nu is about 2.5 electron volts. So 50 is the answer, roughly. If you convert that into a number of steps, it's as if you cannot have more than 50, sorry, it's as if you have at most 50 different steps, right? And there are six, which is less. But you can see that, in order of magnitude, it's not that different, right? Maybe one order of magnitude. Come again? Please speak louder. It should be equal to one over 40? Sorry, you mean to get this? Why did I get this wrong? Yes, yes, yes, sorry. Thank you. One shouldn't trust one's notes. All right, so far we've been focusing just on the photoreceptors and the response to single photons. Yes? It's not a question of being realistic here; it's a question of a bound. 
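The back-of-the-envelope numbers from the lecture can be checked directly. This sketch assumes the bound takes the form CV squared greater than roughly 2 kBT over the photon energy, as stated above:

```python
# Plug in the lecture's numbers: kBT at room temperature vs. the energy of a
# 500 nm photon, giving an effective maximum of ~50 steps.
h = 4.136e-15    # Planck constant in eV*s
c = 3.0e8        # speed of light in m/s
kBT = 0.025      # room temperature, in eV

E_photon = h * c / 500e-9          # photon energy at 500 nm, about 2.5 eV
n_max = E_photon / (2 * kBT)       # effective maximum number of steps
cv_min = (2 * kBT / E_photon) ** 0.5

assert 2.4 < E_photon < 2.6   # ~2.5 eV, as in the lecture
assert 45 < n_max < 55        # ~50 effective steps
assert 0.13 < cv_min < 0.16   # best achievable CV of ~14%
```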
So what I was interested in here, as in many of these problems, is the maximally achievable performance as defined by the physical constraints. And here the physical constraint is energetic: in this calculation I'm saying, okay, I have so much energy budget, and with that energy budget I can achieve a certain precision, all right? And this is what this calculation, which I didn't do, is about, okay? So it's not about being realistic or not. It's that you cannot beat that bound, no matter how you do it, no matter how realistically you do it. If you do something realistic, like what happens in the actual system, you're going to be way above one over 50, right? But even if you optimize and fine-tune everything, you couldn't beat that bound, right? So that's the important lesson: the physics really constrains the design. The energy really constrains how well you can do. And then the next question is, how close does biology get to that? And here I would argue that it gets pretty close. It doesn't get to one over 50, but if you had to put in 50 steps, for instance, it would be really, really complicated, maybe two big proteins and stuff. So it's not doing so badly; at least the order of magnitude is roughly the right one. Okay, so now I want to tell you a little bit about later processing. I'm going to skip that. In particular, Alexandra already introduced information theory to you, and here we're going to make use of it to understand what happens in later processing. So you have the photoreceptors, and in the fly visual system these talk to so-called monopolar cells. We'll be interested in just this step for now. 
And after the break, I'll talk about something completely different, which will be maximum entropy modeling. All right, so let's call X the signal that comes out of the photoreceptors and Y the signal that comes out of the monopolar cells. You have to think of this as an information transmission device, right? You have a noisy signal that comes into these monopolar cells, and you have the signal that comes out. So it's the same kind of setting Alexandra was talking about, and what we're interested in is whether the information transmission from X to Y is, maybe not optimal, but well designed. Yes? You don't have to, I mean, it doesn't really matter for this purpose. The retina works in layers. This is the fly retina; in your retina, you have photoreceptors, then you have bipolar cells, and then later stages of processing. I'm interested in the first stage of processing, okay? In the fly, the first stage of processing is performed by the monopolar cells. What they do is take input from the photoreceptors and basically process this information, okay? And then out comes Y. So I want to study this as a channel, in the sense of information theory: I have some X, the input, and this would be the output. What Alexandra already did is look at the case where Y was linearly related to X plus some noise. Here I'll do something a bit different: I'll now allow g to be a nonlinear function. So g and the magnitude of the noise will characterize the properties of my channel. And ultimately, the question I'd like to answer is whether this g of X has a particular form, and I want to justify its form in terms of efficiency of information transmission, okay? 
But for this, I first need to calculate what the information transmission is in this simple system. So this is the mutual information, and remember, this guy is the entropy. It's the reduction of entropy: the entropy of Y, the output, considered alone, minus the entropy of Y knowing the input, okay? It's symmetric, I could do it either way, but for this calculation this formulation is more convenient. So I just need to calculate these two terms. I'll make a drawing. This is g of X, the average response. And if I did an experiment where I fixed X and measured Y, putting a dot every time I record Y, I would get something like this, okay? The difference between this and that is just epsilon. So this is the entropy corresponding to the uncertainty of Y if I fix X. If I fix X and look at the set of Y's I can get, it's Gaussian, right? So if the variance of my noise is sigma squared, I just have this. If I'm interested in the entropy of this, it's the entropy of a Gaussian variable; Alexandra told you how much that is, or you can redo the calculation. That's the answer. Okay, so the second part is to compute S of Y, the entropy of the probability distribution of the output. And you see, the thing is, this will depend on the probability distribution of the input, right? So I have a certain distribution of X, and maybe I can use a different color. Here I draw P of X, not necessarily Gaussian; I drew it bell-shaped. And here I represent P of Y given X in green. But if I just want to know P of Y, I need to draw my X at random, then add some random noise, and then histogram the results, okay? And this will be something wider. So in fact, it's not that easy to calculate. If I want to know P of Y, what I need to do first is draw X. So, just to be clear, I put labels here for X and Y. 
So first I draw X, then I draw Y, and then I sum over all the X's that could have given that Y. That I know: that's the input distribution. And the noise here is just a Gaussian. So you see, at the end of the day, I want to take the entropy of this, which itself depends on the input distribution, right? And that's an important feature: the mutual information always depends on the input distribution. Here, what is the input distribution? It's essentially the statistics of the input currents, and ultimately the statistics of the input currents is the statistics of the natural world. It's the statistics of the levels of luminosity that the fly will see in a natural environment. So in a way, that is fixed, right? You can view this as fixed. So let me just finish this calculation. Here we can't really go any further as is, so we'll do a small-noise approximation, which means that sigma is small. So I assume, in this integral, that sigma is very small, the noise is very small, and this lets me simply approximate the Gaussian with a Dirac delta function. I do a change of variable: I call y prime equals g of x, okay? Why do I want to do this? Because then dy prime equals g prime of x dx; that's just my change of variable. So here I replace the integral over x by an integral over y prime, and I put x equals g inverse of y prime, where g inverse is the inverse function of g, and I divide by g prime. And then the delta function here picks out y prime equals y. So all of this gives me that p Y of y is just p X of x divided by g prime of x, evaluated at x equals g inverse of y. But you see, there's a one-to-one mapping between p X of x and p Y of y; this is just a change of variable. 
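The change-of-variables result can be verified numerically. This is a minimal sketch with assumed ingredients (a tanh nonlinearity for g and a Gaussian input for p of X, neither of which is from the lecture):

```python
# Small-noise change of variables: if y = g(x) with negligible noise, then
# p_Y(y) = p_X(x) / g'(x) evaluated at x = g^{-1}(y).
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(0.0, 0.5, 500_000)  # assumed input distribution p_X
y = np.tanh(x)                     # noiseless limit of y = g(x) + noise

# Empirical density of y in a narrow window around y0
y0, dy = 0.3, 0.01
empirical = np.mean(np.abs(y - y0) < dy / 2) / dy

# Change-of-variables prediction at the same point
x0 = np.arctanh(y0)                                        # g^{-1}(y0)
p_x0 = np.exp(-x0**2 / (2 * 0.5**2)) / (0.5 * np.sqrt(2 * np.pi))
predicted = p_x0 / (1 - np.tanh(x0) ** 2)                  # p_X(x0) / g'(x0)

assert abs(empirical - predicted) / predicted < 0.05
```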
So through this change of variable, it's like the deterministic version: I have an input distribution p X of x, and I can turn it into an output distribution in the small-noise approximation, where y is just given by this, right? If you do this, you get that. And this is an equivalence; equivalence in quotes. So I'll keep this. To summarize, this is my mutual information, where this term here is given from the input distribution by this relationship. So what is, Alexandra already told you about this, the channel capacity C? It's when you maximize this mutual information over the input distribution. That's the engineering point of view. Typically you have a channel; a channel here can be any sort of channel, but in telecommunications it can be a cable or something like that. You have a relationship between the input and the output, given by the physical constraints of your cable. And if you're an engineer, you want to maximize information transmission. So what you do is tune the input distribution so that you maximize the mutual information between the input and the output, in other words, so that you maximize information transmission. When you do that, you reach the channel capacity, the maximum of information transmission. Here we can ask the same question for this processing device. And as I said, there's a one-to-one mapping between these two quantities, so instead of taking the maximum with respect to p X of x, I'll take the maximum with respect to the output distribution, because in my small-noise approximation it's the same. So what I do is simply take the functional derivative of my mutual information with respect to p Y of y, and you can see the functional derivative here will be fairly simple: it will be minus one minus log of p Y of y. Ah, okay, sorry, I should add something to this. 
Here I need to add a Lagrange multiplier to make sure my distribution sums to one. Okay, so what is this? It's minus one minus log p Y of y minus lambda equals zero. And this means p Y of y is a constant; it doesn't depend on y. So the best way to maximize information transmission is essentially to make sure that all your outputs are equally likely. So instead of having the sort of distribution I showed here, you would have something like that: a uniform distribution of outputs. This is interesting, because it's something one can actually test in the biological system. I can translate that back: saying that the output distribution is constant tells me something about the input distribution. So this is constant, let me call it alpha. What this means is that the input distribution should be proportional to the derivative of the transfer function. In other words, if I take the cumulative distribution of the inputs, defined as this, it should be proportional to the function g. So the function g is this thing here. That's nice, because it's a testable prediction. And people looked at this: Simon Laughlin, in the 80s, went out into the natural world to measure the distribution of inputs that impinge on the retina of the fly. So he measured essentially this: the distribution of light intensities. And then he measured g of x. For that you take your retina, you do some electrophysiology, and you measure the output, the mean output, as a function of the input. So you measure this g of x, and then you can plot one against the other. And this is what he found. Here the solid line is the cumulative distribution I just wrote here, right? And these are the results of the physiological experiments, where he measures the voltage of the monopolar cells as a function of the input. 
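The prediction that g should follow the cumulative distribution of the inputs is exactly histogram equalization, and it is easy to demonstrate. A minimal sketch (the lognormal input is my stand-in for natural light intensities, not Laughlin's measured distribution):

```python
# Efficient-coding prediction: if the transfer function g is the cumulative
# distribution of the inputs, the output distribution becomes uniform.
import numpy as np

rng = np.random.default_rng(3)

x = rng.lognormal(0.0, 0.5, 200_000)   # stand-in for natural light intensities

# g = empirical CDF of the inputs (the predicted optimal transfer function)
xs = np.sort(x)
def g(v):
    return np.searchsorted(xs, v) / len(xs)

y = g(x)   # outputs under the optimal g

# Uniform output: every decile of y should hold ~10% of the samples
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
assert np.all(np.abs(counts / len(y) - 0.1) < 0.01)
```

With any other monotone g, the output histogram would be peaked rather than flat, and the output entropy, hence the transmitted information in the small-noise limit, would be lower.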
And so you can see that it actually follows it very nicely. So what does it mean? Here's another way of interpreting this. You have this g of x, or rather you have an input distribution. Your light intensity is typically here, right? X is proportional to the light intensity in that case, because it's the output of the photoreceptors. So you have your light intensity here. If you want to place your transfer function, transfer functions typically look like this: they have some sigmoidal kind of shape, and they're most sensitive in the middle. Right? Because if you're sitting here on the flat part, you cannot distinguish between this input and that input; they all give the same output. So what you really want is to put the sweet spot, the place where you really have a lot of sensitivity, and sensitivity here is measured by g prime, how sensitive you are to changes in the input, you want to put the place where g prime is maximum, the inflection point of the curve, in the middle of your distribution of inputs, right? So this is the design principle that maximizes information transmission, and this is what's observed here. But there's a stronger prediction from this kind of theory, and this all goes into the general theme of efficient coding: the visual system in general will always try to do something like this. So the thing is that sometimes, you know, you can experience this: if you go into a dark room, in the beginning you won't see anything, and then after some time you adapt, right? You can think of it this way. When you were in the sun, let's say, your input distribution, so this is your x, the light intensity you experience, your input distribution is around here, right? 
So what happens is that right now you're adapted to set your response function here, okay? But now you suddenly go into a dark room, so all of a sudden the distribution of your intensities goes here. In the beginning, you don't see anything, because your response function is this one: it's really not doing anything, it's not firing, you don't see anything. And what happens during adaptation is that the retina changes its physiological parameters so that this response curve now moves here. And now you start seeing better, right? But it's even better than that. You can actually show in the retina that you get this adaptation, even at the level of photoreceptors. But it gets better still: it also works if you change the contrast. So this was adaptation to the mean light level. But now you can also imagine that sometimes, let's say, you're looking at just a white wall, right? If you think about it, the distribution of intensities will be very peaked, because all you see is white, okay? In that case, your visual system will adapt to have a fairly sharp response function like this, right? By virtue of this principle. But now you go into the woods on a sunny day; then you get huge contrast, because some places are lit by the sun and others are in the shade, and you get a very wide distribution. And then your visual system adapts not just to the mean level, but it also adapts its response function to be much more shallow, like this, right? And what's amazing is that this is also a prediction for how the retina should respond. And this was done, for instance, in these experiments on ganglion cells, in vertebrate ganglion cells; these are later stages of processing, still in the retina. But what they did here is change the contrast of what they were showing, right? 
So they were showing some stimulus, and all of a sudden they changed the contrast. The contrast is like the width of this input distribution. And you see that when you change the contrast, keeping the mean the same, you see a big change in the activity, and then it goes back to some level. Then you change the contrast again, and it goes back to the same level. And you can also see this in later stages of processing in the fly. And this is the same idea: here they did adaptation to different contrasts, in that case of speed, because these are cells that are sensitive to speed. And if you look at the response functions, you get these different green curves I showed here; the g of x here is the green one, and the white one is p of x. But now the prediction is that if, instead of showing the response function g of x as a function of x, you show it as a function of x rescaled by the width of the input distribution, so you renormalize by the standard deviation, then all these curves should fall on top of each other. And this is what they managed to show here: you see different response functions for different contrasts, and here you can see that they all fall on top of each other. All right, so we'll take a break now, a short one, five minutes, and we're back at 25, and then I'll start on maximum entropy. Let's start now, because there isn't much time left. So first a bit of motivation for the last part of the course. The goal is to learn something from highly correlated data. What I'm going to talk to you about is a general set of techniques. Can I have your attention, please? We're going to try and learn the collective behavior of complex biological data. And let me just give you a few elements of motivation. The idea is to really try to understand how many units come together and interact to give emergent behavior.
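The rescaling prediction described above can be sketched in a few lines: if the adapted response at contrast sigma has the form g_sigma(x) = f(x / sigma), then curves measured at different contrasts differ when plotted against x but collapse onto one master curve when plotted against x / sigma. The sigmoid below is a hypothetical stand-in for the measured response functions.

```python
import numpy as np

def response(x, sigma):
    """Hypothetical adapted response: a sigmoid whose slope tracks 1/sigma."""
    return 1.0 / (1.0 + np.exp(-x / sigma))

x = np.linspace(-5, 5, 101)
sigmas = [0.5, 1.0, 2.0]  # three illustrative stimulus contrasts

# Plotted against the raw input x, the curves differ...
raw = [response(x, s) for s in sigmas]
assert not np.allclose(raw[0], raw[2])

# ...but replotted against the normalized input u = x / sigma they collapse:
u = np.linspace(-3, 3, 61)
collapsed = [response(u * s, s) for s in sigmas]
assert np.allclose(collapsed[0], collapsed[1])
assert np.allclose(collapsed[1], collapsed[2])
```

This is only the geometry of the prediction, not a model of the retina: the experimental content is that real ganglion-cell responses behave like this family of rescaled curves.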
So this is something we see in physics, of course, like in magnets, but also in biology you see it at a wide variety of scales. Even at the molecular scale, maybe some of you are familiar with the protein folding problem; you can view this as collective emergent behavior coming from the interactions between the different amino acids. Another example is allosteric binding, which I think maybe Alexandra told you about when talking about Hill functions. But then, if you go up the ladder of scales, you see that even at the cellular scale, the way different cells interact with each other to form an organism is also a collective behavior. One example of this, of course, is the brain, where you get many neurons that interact with each other, leading to emergent behavior. But even at a larger scale, the population scale, different individuals interact with each other to give rise to collective behavior that you could not predict by just observing the individuals one by one. And you can see this, for instance, here: this is a termite mound, or an example I'll be talking about is schooling of fish, or collective movements of animals. Let me skip that. And, you know, remember I told you about the Bayesian way of thinking about this. The approach I'm going to take is different from the usual one, as I said, where you start from the model and calculate what you should get for the observables. Here I'm going to start from the observables and try to infer back the model. The general technique for doing this is maximum entropy modeling. And the idea is quite simple. Let's imagine that you have N agents. Your agents could be anything. They could be amino acids on a protein chain. They could be cells. They could be genes in a gene regulation network. They could be neurons. Or they could even be individuals in a group.
Each agent is characterized by a variable xi. Again, xi could be anything, but you can think of it as a number for the moment. So the collective state of the system is given by the vector of all the states taken together. And in the kind of modeling I'll be doing, one always assumes that the model is characterized by a probability distribution over all possible states. So I assume my model is stochastic. And what gives me the relationship between these different variables is not a deterministic rule; it's going to be given by some probability distribution. And I want to know how to build this probability distribution directly from the data. The naive way of doing this would be to observe your system many, many, many times, and for each observation record the x's of all your agents. And then you draw a huge histogram of that in this very high-dimensional space. The problem with this is what I just said: it's very high-dimensional. So you run into what's called the curse of dimensionality, which is that you cannot simply count the multiplicity of states there. So you need to make simplifying assumptions. Maximum entropy is a principle by which you focus on a few observables of your system, and you want a probability distribution that reproduces these observables but is otherwise as random as possible. So let's say you have k observables that you'll call OA, and that can depend on the entire state of the system. When I say observables, you have to think of something simple: OA of x, for instance, could simply be one of the xi, right? That would be a monomial. But you could also imagine that it could be the product of two variables, if x is real. Come again? It's a function of the x vector, the entire state of the system. These are examples; I'm going to derive a general proof.
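The curse-of-dimensionality point can be made concrete with a back-of-the-envelope count, here assuming binary variables for simplicity: the number of joint states grows as 2 to the N, so the naive histogram is hopeless long before N reaches biological sizes.

```python
# For N binary agents there are 2**N joint states, so the number of samples
# needed to populate the naive histogram explodes with N:
for n in (10, 50, 100):
    print(n, 2**n)

# Even recording a million samples per second, visiting each of the 2**100
# states just once would take over 1e16 years:
seconds = 2**100 / 1e6
years = seconds / (3600 * 24 * 365)
assert years > 1e16
```

This is why one retreats to a handful of constrained observables rather than the full joint histogram.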
For example, one of the observables could be just the numerical state xi. It could be a second moment. You'll see in a second how that comes about. But what you want is two requirements. You want the average of these observables within the model to be equal to the empirical value. So let's say I have M empirical samples. These are my measurements: I measure the state of the entire system M times. These will be my samples. And what I call the empirical value, the average over the data, is just something like this. This is a big O, not a theta. All right, so this is my first requirement. My second requirement is that the distribution is otherwise as random as possible. So what's a good measure of something being as random as possible? The answer is already on the board: entropy. So I want to maximize the entropy. I'll define a functional phi, which is just my entropy, and I'll add Lagrange multipliers to enforce this condition, and I'll add yet another one to enforce normalization. And I'm looking for a probability distribution, so I'm going to take a functional derivative of this with respect to the entire p of x. From the first term, I get minus 1 minus log p of x. Then if I break this down, what is this? It's just a sum over all x of p of x times the observable. So if I take the derivative of this with respect to p, I just end up with the observable. And as usual, I have my mu here. So here I use the technique of Lagrange multipliers, of course, and these are my Lagrange multipliers. Sorry, here there's a sum over a. And I can rewrite this in the following manner. So this is the result: if I do this entropy maximization subject to the constraints on the observables, I end up with this form, which I can write in this familiar form, which is the Boltzmann law: exponential of minus a Hamiltonian, where in the Hamiltonian I just have a linear sum of my observables with these Lagrange multipliers.
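The board derivation described above can be written out in full; this is a reconstruction from the steps as spoken, writing the empirical averages as angle brackets with a "data" subscript:

```latex
\Phi[p] = -\sum_x p(x)\ln p(x)
          \;-\; \sum_{a=1}^{k} \lambda_a \Big( \sum_x p(x)\, O_a(x) - \langle O_a \rangle_{\text{data}} \Big)
          \;-\; \mu \Big( \sum_x p(x) - 1 \Big)

\frac{\delta \Phi}{\delta p(x)} \;=\; -1 - \ln p(x) \;-\; \sum_{a} \lambda_a O_a(x) \;-\; \mu \;=\; 0

\Rightarrow\quad p(x) \;=\; \frac{1}{Z}\, e^{-H(x)}, \qquad
H(x) \;=\; \sum_{a=1}^{k} \lambda_a O_a(x), \qquad
Z \;=\; e^{1+\mu} \;=\; \sum_x e^{-H(x)}
```

The normalization multiplier mu is absorbed into Z, which is why only the lambdas remain as free parameters to tune.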
So in fact, when I've done this, I haven't solved the problem yet. Because, as always when I use Lagrange multipliers, I need to adjust the Lagrange multipliers so that I satisfy the constraints. These are the parameters of my model, if you like, and I need to tune the lambdas so that I enforce this. And that's not necessarily very easy. So let's look at a simple example. Let's say that xi is a binary variable. Actually, let's assume that xi is a classical spin, meaning it can take values plus 1 or minus 1. But it's the same as saying it's a binary variable; the values are just minus 1 and 1 instead of 0 and 1. And now let's assume that my observables are just the xi, okay? For i going from 1 to n. What this means is that I want my distribution to have the same mean value of xi as in the data, okay? It's a constraint on the mean value of xi. So if I use this formula, it gives me that p of x is 1 over Z times the exponential of the sum over i of hi xi, where I just redefined lambda i equals minus hi. You'll see why in a second. What you recognize, once you've written things this way, is that this is just a system of independent spins, each subjected to its own external field, the site-dependent external field hi. They're independent because I can factorize this distribution. And in fact I can calculate Z. Z is the normalization constant; it's my partition function in the language of statistical mechanics. It will be the sum over all x's of this product, and because it's a product of independent terms, I can factorize it. So to summarize, Z is a product. Now remember, what I want to do is calculate the h's. These are my Lagrange multipliers, right? By virtue of this definition.
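The factorization of Z is easy to verify numerically: for independent spins, the sum over all 2^n states of exp(sum_i h_i x_i) equals the product over i of 2 cosh(h_i). A quick brute-force check, with arbitrary illustrative field values:

```python
import itertools
import numpy as np

# Brute-force check of the factorized partition function for independent
# spins: Z = sum_x exp(sum_i h_i x_i) = prod_i 2*cosh(h_i).
h = np.array([0.3, -1.2, 0.7])  # illustrative site-dependent fields

Z_enum = sum(np.exp(np.dot(h, s))
             for s in itertools.product([-1, 1], repeat=len(h)))
Z_factored = np.prod(2.0 * np.cosh(h))

assert np.isclose(Z_enum, Z_factored)
```

The factorization works because the exponential of a sum over independent sites is a product of per-site exponentials, and each site's two states sum to 2 cosh(h_i).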
I want to tune my h's so that I actually get the right values for these averages, okay? These are spins, so I can call the averages local magnetizations; I call them mi by definition. And, sorry, this would be the data value. So how do I do this? All I have to do is calculate the average of xi within my model. That average is 1 over Z times the sum over x of xi times the product over j of the exponential factors. And here you essentially have two situations: either j is equal to i or it's not. For j equal to i, the sum over xi gives e to the hi minus e to the minus hi. Sorry about that. Remember, my Z was itself this product, so all the j not equal to i terms cancel out, leaving just one. And the final answer is 2 hyperbolic sine of hi over 2 hyperbolic cosine of hi, which is the hyperbolic tangent of hi. So my constraint now tells me how to calculate hi from what I measure: it's simply the inverse hyperbolic tangent of my measurements, okay? This is simple spin physics, because they're independent. But here you really need the inverse of the usual relationship, because that's what you measure, and you want to know what the h's are, right? To build your model. Because h in physics is usually what you call a field; this is to have an analogy with physics. Yeah, it's just a notation. This is Boltzmann's law. And again, you know, I have to set kB T to 1 in order to make this work. But okay, so that's the simple case of independent spins. I just wanted to show you that first; now it gets interesting. So I said the motivation for doing this is to describe emergent behavior and correlated data. There's a question here. Here I gave you an example of independent variables, okay? So if I want to know something about the correlated behavior, what I can do is add another kind of observable. These will be the correlation functions, the pairwise correlation functions between two variables. And in that case, the p of x I end up with looks like this, okay?
So these are my lambdas, right? I just call my Lagrange multipliers differently depending on whether they apply to the first-order terms or the second-order terms. So here I take my set of observables to be the entire set of first-order and second-order terms. And the model I end up with looks like an Ising model, but it's a disordered Ising model, because the Jij's can take any value, and the hi's, the fields, can also take any value. And then the inverse problem is: you measure the mi and the Cij, let's call that the connected correlation function. That's what you measure, and from that you want to calculate the hi and the Jij, okay? That's why it's an inverse problem: usually in, let's say, spin glass physics, you're given some information about how the fields and couplings are distributed, for instance, or what their values are, and then you calculate these observables, right? Here you have to do the opposite. But it's also a kind of different task, because here you're going to get a very heterogeneous system in principle, right? You have biological data; there are many values for these pairwise correlations and these magnetizations. So I think I should stop here. But tomorrow we'll apply this specifically to the case of correlated neurons, to see how we can model how many neurons act together. And if I have time, I'll also show an application to collective behavior, to flocking.
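Unlike the independent case, the pairwise inverse problem has no closed-form solution, but a standard approach is gradient ascent on the log-likelihood ("Boltzmann learning"): nudge h towards matching the magnetizations and J towards matching the pairwise moments. This is a toy sketch on 3 spins, where exact enumeration over the 2^3 states replaces the Monte Carlo sampling a large system would need; all parameter values are illustrative.

```python
import itertools
import numpy as np

# Toy inverse Ising problem: fit h_i and J_ij so the model reproduces target
# first and second moments, by gradient ascent on the log-likelihood.
states = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

def model_moments(h, J):
    """Exact moments of p(x) ~ exp(h.x + 0.5 x.J.x) by enumeration."""
    E = states @ h + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
    p = np.exp(E)
    p /= p.sum()
    m = p @ states                                      # <x_i>
    chi = np.einsum('s,si,sj->ij', p, states, states)   # <x_i x_j>
    return m, chi

# Targets generated from known "true" parameters (illustrative values).
h_true = np.array([0.2, -0.1, 0.3])
J_true = np.array([[0.0, 0.5, -0.2], [0.5, 0.0, 0.1], [-0.2, 0.1, 0.0]])
m_tgt, chi_tgt = model_moments(h_true, J_true)

h = np.zeros(3)
J = np.zeros((3, 3))
for _ in range(5000):
    m, chi = model_moments(h, J)
    h += 0.1 * (m_tgt - m)          # push magnetizations towards the data
    dJ = 0.1 * (chi_tgt - chi)      # push pairwise moments towards the data
    np.fill_diagonal(dJ, 0.0)       # x_i^2 = 1 always; no self-couplings
    J += dJ

m, chi = model_moments(h, J)
assert np.allclose(m, m_tgt, atol=1e-4)
assert np.allclose(chi, chi_tgt, atol=1e-4)
```

The updates follow the sign of the moment mismatch, which is the likelihood gradient for an exponential family; at the fixed point the model reproduces the measured mi and Cij exactly, which is precisely the maximum-entropy condition.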