 sheet for now. So we're going to change topic, but before we do, I'll just finish telling you a bit about this. These travelling waves, just to show you that this is not something completely out of nowhere. This is the cartoon we've been drawing, but what this actually is is the evolution of the flu from 1968 until now. And the more yellow or orange they get, the fitter they are. So you can actually see the flu strain getting fitter in the world. What "fitter" means here: this is the flu, and this is a part of the immune system of a host. They take the part of the flu that interacts with the immune system and measure how well the two are able to recognize each other. And there you clearly see this wave of things getting fitter and fitter, the colors changing with time. One quick point I want to make about evolution: we've mainly been talking about point mutations, but evolution doesn't only happen by point mutations. This is an experiment from Olivier Tenaillon on E. coli, where they did 115 parallel populations, as opposed to the 12 of Lenski, but only for 2,000 generations, and then they sequenced clones from each; they have a protocol. The main interesting thing is what happened in these bacteria. There are point mutations: on average, every clone had about 11 events, and these are the mean absolute numbers. But there are also indels, which means insertions and deletions: you get rid of some DNA, or you put in some other DNA, or both. There are insertion sequences, bits from elsewhere that just get inserted, plus duplications and large duplications. So things much larger than point mutations are going on. What's interesting is that at the level of point mutations, it's not very reproducible. But look at how many mutations these different populations shared.
So at the level of point mutations, the overlap is actually very small. But as you go to higher-order units, so if you look at all the mutations in a given gene, or in a given operon, which is a set of genes in some functional unit, as we said in the first lecture, you actually start seeing a lot of reproducibility. So at the level of function, of phenotype, it looks very reproducible, even though at the level of the actual mutations it is not. And what this shows is that while you haven't saturated all the possible point mutations in these generations, you are close to saturation for the higher-order changes. No, no, it's a reproducible environment; it's 115 parallel populations. And you ask: what is the number of lines sharing a given mutational type? So "mutation" here isn't just a point mutation, it includes all of these larger events too. For point mutations, the average number of lines sharing one is about one, but for large deletions it's over two. So these larger events are reproducible. That's one thing. A separate thing is to ask about reproducibility at the gene level: I'm not interested in a specific mutation, I'm interested in this gene, and I ask in how many lines I saw a mutation that, say, deleted the function of that gene. So what this shows is that you've run out of possible changes at that functional level, at the gene level, at the functional-unit level, so that in a way you are starting to be well adapted after this point, but not at the point-mutation level. So you can interpret this as: yes, functionally, after this number of generations you start being well adapted. And indeed, what they see here is a decrease in the speed of evolution.
And there's another experiment, from Michael Desai and Sergey Kryazhimskiy, that sees exactly the same thing in yeast. And they ask: what is the source of running out of these mutations? Because they see that most genes have at most one mutation per line, they put forward the hypothesis that one beneficial mutation makes any other beneficial mutation in that gene useless; essentially, the second one is either deleterious or neutral. So although if it had appeared first it would have been good, if it appears second it conveys no advantage. That's called draft; that's, again, hitchhiking with somebody who's good. And so they see that if you've had one mutation in any given gene, you're not going to have a second one; that's why this is white. This is the number of mutations cross-correlated between genes. So this is a hint of, again, these interactions, epistasis. In Michael's experiment they also see epistasis, and they see the speed slow down, but they actually find that it's interactions with anything anywhere in the genetic background. I got rid of that slide to save time. The other thing they see is that evolution happens in clusters: a given mutation determines an evolutionary path. If you get a given mutation, this sort of determines which mutations you'll get next, because the mutations cluster, so organisms that have found one will typically come to have the other ones. Again, this funnels you down an evolutionary path; that's what they're saying here. So this is just to put out some of the facts. You see that there is some higher-level predictability coming out of this, just as we saw in the antibiotic resistance example, right? So now let me show you some of the recent things that people are doing.
This is a very cool experiment from Isabel Gordo in Portugal, and Olivier is doing similar stuff in Paris. So, we talked about fluctuating environments, right? And so far I've been showing you experiments from test tubes. What Isabel said is: why don't I put my E. coli where they belong, in the mouse? Well, mice are at least more similar to humans. So basically she takes mice, cleans out the gut, sterilizes it, and then reinfects it with the E. coli she wants. She puts the strain of E. coli she wants into these mice, uses the mice as living test tubes, and lets the bacteria evolve in the mouse gut. A closer approximation to a real environment, because the mice eat, move, and do all that. Then she collects the poo of these mice, sequences it, and looks at the evolution. And again, you see the same thing we see in test tubes: clonal interference, strains competing, like the calculation we did before the break. And then she can compete them in the lab and figure out who is fitter, and all that. She can also compare the different mice, see whether they evolve the same things and fix them the same way, and look at different lineages. For example, they all produce a mutation in the operon responsible for galactitol metabolism. It's different mutations again, but they all have the same functional effect. So the same conclusion: function is reproducible, but not its actual molecular underpinning, okay? And then she looks at immune-compromised mice. You've probably heard about all these microbiome stories, right?
That the bacteria in our gut determine whether we're fat or thin, healthy and so on, okay? So these are bacteria in the gut, and she was interested in how interactions with the immune system shape the evolution of the bacteria, because that is also part of their environment. So she made some mice with a compromised immune system, an immune system that doesn't work. And the first thing she saw is that evolution is much, much slower in the immune-compromised mice than in the normal ones. In the normal ones, after six days you already had things going on; in the immune-compromised ones, nothing had happened after six days, you need to wait much longer for something to happen. But when she looked into this in more detail, it seems it's not an interaction with the immune system. The immune-compromised mice do behave differently in terms of the speed of adaptation, and she could show it's actually the selection pressures that differ while the mutation rate stays the same, but the cause is not the interaction with the immune system; it's the interaction with different bacteria. When they put an immune-compromised mouse in the same cage, essentially, as a normal mouse, it started to behave exactly like the normal mouse. What's really happening is the interaction of this E. coli with the rest of the bacterial environment, because, you should know, mice eat mouse poo, okay? So if they're in the same cage, they will eat the poo of the other mice, and then their gut will carry the same bacteria as the normal mice. And that's enough for them to recover the evolution that you see in a normal mouse, okay?
So they really form a community, they interact together, and that is very important for evolution; that's what this shows. Okay, I'll skip this. And just in terms of predictability and non-predictability, these are results from Michael Lässig and Marta Łuksza (Richard Neher, Boris Shraiman, and Colin Russell did something similar). They basically said: if we understand all this adaptation, if we understand the dynamics, because we've written down all these equations, then we should be able to predict how the flu evolves. The flu is a virus, so it evolves in essentially the same way as everything we've been talking about. So they built models based on this. Every year in February, the WHO, the World Health Organization, makes a decision about the flu vaccine strain for the next fall, the next autumn, okay? So there's a fixed time when you make a prediction. You have all the data up to that point, and the question is: can you predict, in February, which flu strain will be dominant in the Northern Hemisphere in the following year? And you can see here a prediction made in 2014, and they are able to actually predict the evolution of the flu strain on short timescales, okay? Of course, it breaks down after some time. But not only do they do reasonably well, it also means that we now understand a little bit about these dynamics, using exactly these types of equations, plus some element of interaction with the environment. Okay, so that's the end of basic evolution, and now we're going to switch to diversity, and maybe a bit of information. So let me start with diversity. I'll stay with microbes for a little bit. We've been talking about these bacteria, and now I'm trying to get closer to real life. Bacteria are everywhere, and very often they form what's called a biofilm, okay? They live in communities.
The biofilm you're probably most familiar with is this one, okay? If you haven't brushed your teeth, you can now run your tongue over them and feel your biofilm. Another biofilm you probably know well: if you go into a river and it's slippery and you have to watch out not to slip, that slime is also a biofilm, okay? Biofilms are both good and bad. From the point of view of us selfish humans, they foul ships, so we need to get them off ships, but they're also very much needed by plants: plants form symbioses with bacteria that build a biofilm around the roots and help them grow. Biofilms are also used in sewage treatment and many other things. The basic point is that bacteria form these communities not just with themselves but often with fungi and other microorganisms, and they form exactly what you feel in your mouth right now, right? Some sort of slippery surface. The one in your mouth is bad, by the way; no doubt about that. So these species interact together. There are biofilms in soil, in hot springs, in different parts of your body, and on cheese, okay? The rind, the sort of skin of the cheese, is also a biofilm. So they really are everywhere. Okay, maybe I should give you some historical background. We've been talking about E. coli, and I've mentioned K-12. K-12 is one strain of E. coli, one type of E. coli. Where does K-12 come from? Well, K-12 comes from the feces of a hospital patient, isolated around the 1920s, and then it became a lab pet, okay? Since the 1920s it has been passed from lab to lab, and scientists have been working on this specific K-12 lab strain of E. coli that was randomly isolated from somebody's poo, okay? So it's a completely arbitrary choice.
There's nothing special or interesting about it, except that we now know way more about it than about any other living organism. So a while ago, not that long ago, scientists realized: well, it's really biased sampling we're doing here, and there are all these other bacteria and other organisms out there in the world, why don't we figure out what they do? So they went out and started sequencing everything that moves or doesn't move: they went into your back garden and picked up some soil, they went into ponds, into hot springs, into every part of your body, into cheese, and so on. And what they discovered, lo and behold, great surprise, is that the world is way more complicated than you would have thought from studying poor K-12 in the lab, okay? Even if you just look at E. coli, it's way more complicated. Why did this become possible? Partly because of advances in sequencing, right? Remember, sequencing is the thing that allows you to read DNA like you read a book; it translates it into letters. And there have been some technical breakthroughs compared with the old days, when you had to pipette and literally read it base pair by base pair, letter by letter. One piece is that you can PCR the DNA, which means you can make many, many copies of it. The other thing, and this is probably the more important point, is that in order to do sequencing you need many, many copies of the material, just technically, to be able to read it out. And that's why we used to sequence only things we could grow in the lab. And the number is something like this: only about one percent of bacterial species can actually be cultured in the lab.
If you take species from the wild, any kind of wild, including your mouth, you can only culture, only make grow, about one percent of them, okay? The rest you have no clue about. So if you couldn't make them grow, you couldn't sequence them, because you wouldn't physically get enough material to put into your machine, which was actually your hand, because you were pipetting things. Through some advances that I don't want to go into but have listed here, people became able to take the DNA they actually find, amplify it without culturing the whole organism, and then sequence that. One of the techniques they use is called 16S ribosomal sequencing. The ribosome is the machine that translates mRNA into protein, so every cell needs a ribosome. And the 16S subunit of the ribosome, just a part of the ribosome, is very conserved across many, many different species, especially in bacteria. So people use it as a marker, a piece of DNA that should look the same. They go in, look at this specific part of the DNA, and ask: does it look the same, okay? And there are other markers like that for yeasts and fungi and so on. That's one thing they do. Maybe I won't even mention shotgun sequencing. The thing that really revolutionized the field is called PCR, the polymerase chain reaction, which is sort of like a nuclear chain reaction: it amplifies the DNA two-fold at each step, so exponentially, in a fast way. You start off with one double-stranded DNA molecule, and you end up with many, many more. Okay. So, as I said, people have been sequencing different stuff.
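The doubling-per-cycle arithmetic behind PCR can be sketched in a couple of lines. This is an idealized picture (the function name is mine, and real reactions fall short of 100% efficiency per cycle), but it shows why a single molecule quickly becomes enough material to sequence:

```python
# Idealized PCR amplification: each cycle doubles the number of
# double-stranded DNA molecules (assumes perfect efficiency, which
# real reactions only approximate).

def pcr_copies(initial_molecules: int, cycles: int) -> int:
    """Number of DNA molecules after `cycles` rounds of doubling."""
    return initial_molecules * 2 ** cycles

# One molecule after a typical 30-cycle run: over a billion copies.
print(pcr_copies(1, 30))  # 1073741824
```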
This is an experiment where people started to look at cheese, cheeses from different parts of the world. They collected many, many samples and sequenced them: 137 different cheeses, using 16S sequencing for the bacteria and ITS, which is the equivalent marker for fungi. In total, they found 14 different types of bacteria and 10 different types of fungi, and per cheese they found on average about 6.5 bacterial and 3.2 fungal types. So basically what this is saying is that there's a lot of diversity even in something as simple as cheese. I'll skip this, just to say that they can then figure out the interactions between the bacteria and the fungi, and they can build model cheeses and reproduce how the interactions evolve. But as you see from this, what you need in general is some measure of diversity. You need to be able to put a number on diversity. Okay. So, as a physicist, how would you quantify diversity? What? Variance. Okay, variance. So, you know, I've told you there are many, many different species, right? I have many, many different species on one cheese and many, many different species on another cheese, and I'd like to say which cheese is more diverse. Yeah. Okay. Of course. My paid colleague over here gave the right answer: Shannon entropy. So you're probably familiar with entropy in general, right? I seriously hope you're familiar with entropy. So let's define entropy. If we have a probability distribution, say the probability of seeing some bacterium, then the entropy of a given sample shouldn't surprise you. Minus. Thank you, minus. Okay, maybe it should surprise you if I write it like that. If we look at the microcanonical ensemble, this is what you would write down. And as we said before, we're looking at small numbers.
So we want to take the log, and then we want to reweight by how often we see something, to give more weight to things we actually see more often. So we're reweighting by the frequency; we're doing more than just counting. Since my paid colleague answered before anybody else, nobody said "just count the number of species", which would be another option. But then you give everybody equal weight, and you may not want to give everybody equal weight. The only difference from what you're probably used to is that I'm going to use the logarithm base 2 here, which gives me units called bits: S = -Σ_i p_i log2 p_i. For counting, this is useful, and we'll see in a minute why. But otherwise it's just like the microcanonical ensemble: essentially, in the microcanonical ensemble we count the number of states, and then we go to the other ensembles by introducing constraints. In the microcanonical ensemble you just count the number of states; you do this calculation where you write Omega and figure out what Omega is, right? Okay, so let's look at what this is. Let's first assume that all of my bacteria are equally likely. If all M species are equally likely, the entropy is simple: I just plug in p_i = 1/M, and that's why I need the minus sign, to get something positive at the end of the day for diversity. I get S = log2 M bits. So, for example, take a fair coin that has two outcomes, heads or tails. If it's fair, what is its entropy? Yes, right: if p_i is one half with two possibilities, the entropy of a fair coin is one bit. And it's true for any fair two-outcome process. So maybe this should be read the other way: if you ask what one bit is, well, one bit is the entropy of a process with two equally likely outcomes, like a fair coin, okay?
So whenever you see one bit, it means you have two equally likely choices, because the interpretation is that 2^S is the number of equally likely states, okay? 2^1 = 2: think coin. Then we can do another case: what if, in fact, the probability to be in one state is 1 and the probability to be in every other state is 0? So before I had a uniform distribution, and now I have a completely biased, delta-function-like distribution. So I sum over all the states that are not my chosen state, and they all have probability 0, and then my one chosen state (I'm losing my tools as I go along) has probability 1. I get 0, because log of 1 is 0. So I get 0 bits. That makes sense, and this is why entropy is a measure of diversity: if I have no diversity, if everybody is the same, I get 0, and if I have the flattest possible distribution, I get log2 M. So in general, the entropy is bounded between these two numbers: 0 ≤ S ≤ log2 M. Okay. Let me show you another application. Actually, I'll skip this application; it's yet another one, and if you were freaked out by some of the biology I've shown you, I won't go there. I'll show you this instead. We're going back to the gut and to microbes, which I think we're now very familiar with. This is a study where they took members of a lab, collected their poo for a number of days, and put them on special diets, two of them. First they were on their normal diet, then they were put on a plant-based diet here, and then back on the normal diet. And then the same people: normal diet for some time, then an animal-based diet, and then back to the normal diet.
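The three cases just worked out on the board (fair coin, delta-like distribution, uniform over M states) can be checked with a few lines of code. A minimal sketch; the function name is mine:

```python
import math

def shannon_entropy_bits(p):
    """Shannon entropy S = -sum_i p_i log2 p_i, in bits.
    Terms with p_i = 0 contribute nothing (the limit p log p -> 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Fair coin: two equally likely outcomes -> 1 bit
print(shannon_entropy_bits([0.5, 0.5]))        # 1.0

# Delta-like distribution: no diversity -> 0 bits
print(shannon_entropy_bits([1.0, 0.0, 0.0]))   # 0.0

# Uniform over M = 8 species -> log2(8) = 3 bits, the maximum
print(shannon_entropy_bits([1/8] * 8))         # 3.0
```

Any other distribution over 8 species lands strictly between the 0-bit and 3-bit bounds.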
What they looked at: they sequenced the bacteria from all of the poo samples and asked what the within-sample diversity is as a function of time. By within-sample diversity they mean how diverse the bacteria are within a given sample, and the way they quantify it, you can see here in small letters, is the Shannon diversity. So they calculate exactly this entropy for the bacteria they sample from these people's feces. And you see that it essentially doesn't change as a function of time; it doesn't really depend on the diet. Then they looked at the diversity relative to baseline: they compared the initial time point with later time points. Now they're looking at the difference between two samples, and here they use the Jensen-Shannon divergence. Remember, we defined the Jensen-Shannon divergence as a measure of how different two probability distributions are. They take the probability distribution from day one and the probability distribution from some later day, and ask how different the two distributions of sampled bacteria are. To do that they calculate the Jensen-Shannon divergence, because neither distribution is privileged; it treats the two symmetrically. On the plant-based diet they don't see any difference. On the animal-based diet they see that the diet actually makes the divergence from baseline increase: the community composition shifts, okay? The arrows here show when the diet actually started to interact with and influence the behavior of the bacteria, because they put some dye in the food: when the bacteria took up whatever came from the food, at some point they started to light up in a color that could be detected. You can see that here.
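The between-sample comparison used in this study can be sketched as follows. This uses the standard definition of the Jensen-Shannon divergence, JSD(p, q) = S(m) - (S(p) + S(q))/2 with m the midpoint distribution; the species frequencies below are hypothetical, just to illustrate the calculation:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = S(m) - (S(p) + S(q)) / 2, with m the midpoint
    distribution. Symmetric in p and q, and bounded by 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return entropy_bits(m) - (entropy_bits(p) + entropy_bits(q)) / 2

day1  = [0.5, 0.3, 0.2]   # hypothetical species frequencies at baseline
later = [0.2, 0.3, 0.5]   # hypothetical frequencies on the new diet

print(jensen_shannon_divergence(day1, day1))   # 0.0: identical samples
print(jensen_shannon_divergence(day1, later))  # > 0: composition shifted
```

Note that both toy distributions have the same within-sample Shannon entropy, yet the divergence between them is nonzero: exactly the situation in the animal-diet data, where composition shifts while within-sample diversity stays flat.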
One of these people, and this is kind of cruel if you ask me, was a lifelong vegetarian, and that person also took part in the animal-based diet. I do think that person stood out a little bit, but other than that, you know, this is take any lab and you have your sample of people with different backgrounds and different eating habits, and the error bars are over people. So there are some differences, but they couldn't really correlate them with any behavior. Yeah. I think they ate the diet for five days, because it runs from zero to four; I thought it was a week, but from the graph it seems to be five days. No, because you're going to stabilize somehow, right? And the animal-based diet isn't just eating meat; it was a very animal-heavy diet. They had this motivation of wanting to model different eating habits around the world. There are communities where vegetables hardly grow and people don't eat many of them to begin with; in Mongolia, say, you're going to be eating beef or lamb stew all the time, right? So they wanted to model these extremes. And frankly, there are a lot of problems with this experiment. One of them, as you point out, is that they only ate the diet for five days, so I wouldn't make a big deal about interpreting this for humanity or for diets. But I chose this paper because they used both the entropy and the Jensen-Shannon divergence, whereas a lot of other people, for example the cheese paper, show you what goes up and what goes down but don't do any quantification. These people actually tried to quantify it, in a way that is useful for my pedagogical purposes, okay? And it's a funny story.
Okay, so then they can correlate the bacteria: for the people on the plant-based diet, when they tried to look at where the bacteria come from, there was a very strong component from a parasite which is found on spinach. So, you know, nothing's safe. Okay, this is entropy. So far this is the entropy of discrete variables, which is very useful for quantifying diversity. Let me just make sure there's nothing else I should be saying about entropy. No, I said this. Okay: entropy can also be generalized to continuous distributions. Don't forget the base two. If we have a probability distribution P(c) of some continuous variable c, think for example of the concentration of some protein such as a transcription factor, something that looks like this, then we can also calculate its entropy: S = -∫ dc P(c) log2 P(c). And just as an exercise, and because we will need this result, let's do this for our favorite continuous distribution, the Gaussian: P(c) = exp(-(c - <c>)² / 2σ_c²) / sqrt(2π σ_c²). So I just plug this in: the log2 of the exponential gives -(c - <c>)² / (2σ_c²) times log2 e, and the log2 of the normalization gives -(1/2) log2(2π σ_c²). I pick up factors of log 2 from converting between bases, but otherwise the logs work out and I get this. Now, this second piece is a constant, so I have to integrate these two terms over the Gaussian distribution. But I know how to do that, because I know everything about the Gaussian distribution: the first term requires me to integrate (c - <c>)² over a Gaussian, and what is that? It's the definition of the variance, the mean square deviation; to compute it I also need the mean.
So this is by definition the variance of the Gaussian: the average of (c - <c>)² is σ_c². And the constant term just comes out as is, because the Gaussian distribution is normalized. Okay, I maybe didn't have to write all of this out, because it's just the thing squared, so I actually know exactly what it is. So these cancel, the σ_c² cancels, and here I'm left with 1/(2 log 2), which, translated back, is (1/2) log2 e. So, at the end of the day, let me just write the answer for this whole calculation, collecting terms: the minus signs cancel, the log2 e turns into an e inside the logarithm, and the entropy of a Gaussian distribution is S = (1/2) log2(2πe σ_c²). So, if we look at that, we can ask what it means. As you see, the entropy does not depend on the mean. Remember the interpretation of entropy: it counts the number of accessible states. And for a Gaussian, whether you put the mean here or there, the number of accessible states, the number of states under the distribution, does not depend on where the mean is, right? The entropy doesn't care where the states are located in space; you're really just counting. That's why the entropy does not depend on <c>, the mean of c, okay? But it does depend on σ_c, on the width, okay? So what does that mean, that it depends on the variance? It means the number of accessible states depends on the variance. And why is that? Well, notice also that the variance has units.
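The result just derived, S = (1/2) log2(2πe σ_c²) bits, is easy to check numerically by comparing the closed form against a direct Riemann sum of -∫ P(c) log2 P(c) dc. A minimal sketch (function names are mine):

```python
import math

def gaussian_entropy_bits(sigma):
    """Closed form: S = (1/2) log2(2 pi e sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def numeric_entropy_bits(mu, sigma, n=100000, width=10.0):
    """Midpoint Riemann sum of -p(c) log2 p(c) over [mu - w*sigma, mu + w*sigma]."""
    dc = 2 * width * sigma / n
    norm = sigma * math.sqrt(2 * math.pi)
    total = 0.0
    for i in range(n):
        c = mu - width * sigma + (i + 0.5) * dc
        p = math.exp(-((c - mu) ** 2) / (2 * sigma ** 2)) / norm
        if p > 0.0:
            total -= p * math.log2(p) * dc
    return total

print(round(gaussian_entropy_bits(1.0), 4))      # 2.0471
print(round(numeric_entropy_bits(0.0, 1.0), 4))  # matches the closed form
print(round(numeric_entropy_bits(5.0, 1.0), 4))  # same value: the mean doesn't matter
```

Doubling sigma adds exactly one bit (half of log2 4), consistent with the entropy depending on the width but not on the mean.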
So the number of accessible states depends on the units. The way to interpret it: let's say c is, as I said, a concentration. Whether I measure it in micromolar or in nanomolar (molar is a unit of concentration, moles per volume; I'm probably going to confuse you by saying this), the numerical width will change, so the variance will change, so the entropy will change. What it really means is that the number of accessible states depends on the binning, because that's in a way what units are. Another way to think about it: it depends on the precision of my measurement, okay? Because what this distribution is, in a way, is a sort of continuous histogram, and exactly how many states I put in there depends on how I bin the real axis. Experimentally, this is related to the precision of my measurement. So, in a way, we have a problem with the definition of entropy for a continuous variable. But then think about it in physics, okay? We also define entropy there, and do we ever measure entropy? Does such a thing as an entropy meter exist? What do we measure when we want to measure something related to entropy? Maybe you said it, but I'm hearing a general murmur. Temperature? You're close, but it's not energy, and it's not temperature. What? Microstates? No, not microstates either. Okay, what does a calorimeter measure? Heat capacity, exactly: specific heat, heat capacity, that's what we measure, that's what we have machines for. We can only measure specific heat or heat capacity. And what is specific heat or heat capacity in terms of entropy? Yes, it's a derivative: C = T dS/dT, right?
So you change the temperature in your calorimeter, you measure the specific heat at a given temperature, and from that you can go back to entropy. So we shouldn't be surprised that measuring entropy directly is not possible; in physics we cannot measure entropy either, right? We measure differences in entropy, and we can measure differences in entropy here too. Yes, I know you never thought you would be reminded of the existence of a calorimeter ever again in your life, right? Okay, so entropy measures uncertainty. So if I take the difference of two entropies, I'm measuring a reduction in uncertainty: exactly, information. Information is a reduction in uncertainty; I gain information and I reduce my uncertainty. So how do we write this? Let me consider the setup from the first lecture, don't freak out: I have some concentration of a transcription factor C that regulates my gene G, and I want to know how much my uncertainty about G, about how much protein I have, is reduced if I know how much transcription factor I have. So I write down the entropy of the distribution of G, which is the output of my regulatory circuit, and ask how much it is reduced if I've made a measurement at a given input concentration. This difference is the reduction of my uncertainty for a given value of C, okay? And what P of G given C is, is the conditional distribution of my output given the input, defined in such a way that if I multiply it by the distribution of the input, I get the joint distribution, which describes the full system.
So this is the conditional entropy, the uncertainty in the output if we know the input, and this is the total entropy of the output, okay? I come in knowing nothing about the output, so my uncertainty is large; then I make a measurement at a given input concentration, figure out what the entropy is there, and that reduces my uncertainty. And we're not done yet: we can measure this difference at different input concentrations, and we can average it over the distribution of input concentrations that we have, okay? And this difference, averaged over the input concentration, is called the mutual information. It's also measured in bits, because it's a difference of entropies. And this is just notation: here you have integrated over both C and G, so it does not mean it's a function. Mutual information only makes sense if you say what it's between, so typically you write down between which two variables it is. It doesn't depend on C, and it doesn't depend on G, because both have been integrated out; the notation is just to remind you that you're looking at the information between these two variables. It's like when we wrote down D_KL of P and Q: that also doesn't mean it's a function of P and Q. It just means it's the Kullback-Leibler divergence between the P distribution and the Q distribution. It's the same thing here, so don't think of it as a function. The mutual information is a number in bits, like two, or eight, or 0.5. It's just a scalar number, okay?
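For a discrete joint distribution, this number is easy to compute directly from the definition (a minimal sketch; the two example joints are invented):

```python
import numpy as np

def mutual_information_bits(p_joint):
    # I(C;G) = sum over c,g of p(c,g) * log2[ p(c,g) / (p(c) p(g)) ]
    p_c = p_joint.sum(axis=1, keepdims=True)   # marginal of the input
    p_g = p_joint.sum(axis=0, keepdims=True)   # marginal of the output
    nz = p_joint > 0
    return float(np.sum(p_joint[nz] * np.log2(p_joint[nz] / (p_c * p_g)[nz])))

# Noiseless one-bit channel: knowing C pins down G completely
perfect = np.array([[0.5, 0.0],
                    [0.0, 0.5]])
print(mutual_information_bits(perfect))      # 1.0 bit

# Independent variables: the joint factorizes, so the log vanishes
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information_bits(independent))  # 0.0 bits
```

Note the result is a single scalar for the whole joint, not a function of c or g.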
So what it tells you is how much your uncertainty about one variable, here the output, is reduced by knowing the other one, here the input. But it's mutual: it's a symmetric measure, and before I go into the properties, it's useful to show you why. So let me write it down explicitly (I forgot the minus sign again). Let me do it in order, because otherwise I'll probably confuse you. The first term, the output entropy, is minus the integral dG of P(G) log base 2 P(G), where I can put the integral over C inside, because integrating P(G, C) over C just gives me P(G). The second term, the conditional entropy, is the integral dC dG of P(C) times P(G given C) times log base 2 of P(G given C), and I can write P(G given C) as P(G, C) divided by P(C); everybody agrees that by the definition of conditional probabilities this is just P(G given C). So, minding the signs, this gives me at the end of the day the integral dC dG of P(G, C) times log base 2 of P(G, C) over P(C) P(G), okay? This is just moving things around, essentially, using conditional distributions. But what you see clearly from here is that it's a symmetric measure: in other words, the mutual information between C and G is equal to the mutual information between G and C, okay? And you can keep on calculating this further, expanding the log (of course I've lost my twos, but there are twos everywhere in the logs): you get the integral dC dG of P(G, C) log base 2 P(G, C), that's the first term, minus the integrals dC dG of P(G, C) log base 2 P(C) and of P(G, C) log base 2 P(G).
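The symmetry, and the entropy decomposition that follows from it, can be checked numerically (a sketch with an arbitrary 2x2 joint distribution):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_bits(p_joint):
    # I(C;G) = S[P(C)] + S[P(G)] - S[P(C,G)]
    return (entropy_bits(p_joint.sum(axis=1))
            + entropy_bits(p_joint.sum(axis=0))
            - entropy_bits(p_joint))

p = np.array([[0.30, 0.10],
              [0.05, 0.55]])
print(mi_bits(p))      # a positive number of bits
print(mi_bits(p.T))    # identical: I(C;G) = I(G;C)
```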
And so this you can rewrite as the entropy of P(C) plus the entropy of P(G) minus the entropy of P(G, C), okay? So from here you see that it's the sum of the entropies of the marginals, the independent distributions, minus the entropy of the joint. You can do this yourselves: you can also rewrite it symmetrically to what we started from, which shouldn't surprise you now that it's clear it is symmetric. But what you see from these two formulations, this one and this one, is that information is maximized when the input or output entropies are maximized. We'll talk about that a bit more in a second, okay? So I'll get back to going through the properties of information after the break. Let's take a 10 or 15 minute break. Oh yes, the note: Nargis wants you to find your best friend for the next three weeks of your life, if you haven't yet, okay? Because if you don't find a best friend, like right after this lecture, she's going to make a network graph and you're going to be a lone node not connected to anybody. And we all know what happens to those, right? I need various elements of this in a second, so let me erase, okay. So, properties of information. The first one, which half of you have already noted during the break, is the following. Number one: if two variables are independent, meaning P(G, C) equals P(G) times P(C), then the information is zero. And that is very easy to see from this expression, because if the joint factorizes, then everything inside the log cancels, right? The second one is very important, especially for biological applications, which is that information is reparameterization invariant. Reparameterization, okay?
So what that means is that if C goes to some function h of C, and G goes to some function f of G, as long as these are one-to-one functions, then the information between C and G is the same as the information between h(C) and f(G), okay? And why is this important biologically? Because we're usually interested in, say, the concentration of a protein, but we don't measure the concentration of the protein. We measure the intensity of light, right? You can think about this as the intensity of fluorescence coming from your fluorescent protein, and you don't need to worry about what that relationship is: it can be nonlinear, it just has to be a one-to-one relationship. And this is not true for correlations, okay? So maybe this is also the moment to make the point that information is a measure of correlation: it tells me, if I measure G, how much do I know about C? How correlated are they? If they're not correlated, I have zero information. We have other measures of correlation, like covariance, so we could calculate the covariance between C and G, from the definition of covariance. Covariance tells us about the strength of a linear relationship. So let me draw some random variables, C and G. If I measure a relationship like this, then the covariance here is around one, and the information is also large, larger than one. If I measure completely random variables (and as I said, it's impossible to truly draw random points), then the information is, what? Zero, and the covariance is zero, okay? Now for this example, which is the height of my drawing abilities: what's the covariance here? Zero. And what's the information? Not zero. So this is an example of something that is clearly correlated, but it's not captured by covariance; it is captured by information.
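Here's a small numerical version of that cartoon, assuming a symmetric three-level input and the nonlinear readout g = c squared (everything here is an invented toy example):

```python
import numpy as np

def mi_bits_from_samples(x, y):
    # Plug-in mutual-information estimate from paired discrete samples
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

rng = np.random.default_rng(0)
c = rng.choice([-1.0, 0.0, 1.0], size=200_000)   # symmetric input
g = c ** 2                                       # deterministic but nonlinear readout

print(np.cov(c, g)[0, 1])          # ~0: covariance misses the dependence entirely
print(mi_bits_from_samples(c, g))  # ~0.92 bits: information captures it
```

The symmetry of the input makes the linear correlation vanish, yet knowing c determines g exactly, and the mutual information sees that.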
So information is useful for these nonlinear types of relationships. Yes, exactly, because covariance is only good for linear ones. Why can it be bigger than zero here? Because I actually calculated this example numerically, so I have a precise number; I'm drawing a cartoon of something for which I have a concrete plot, and I didn't want to mislead you. Okay, point three is the data processing inequality. This is important. If C regulates G, and G regulates K, and K does not directly depend on C, the data processing inequality tells us that the information between C and K has to be less than or equal to the information between C and G. It's basically a statement that information cannot be created: information gets lost, or stays the same, but cannot be created. And then there's this interpretation, that 2 to the I distinguishable states of G can be reached by changing C. This is related to the interpretation we had for entropy, right? That it's a measure of state counting. And since information is a difference of entropies, you're asking: if I change the input C, how many different states of the output G can I reach, can I count? So let me give you an example of that, as I kill a fruit fly on my laptop. I'm going to talk about fruit flies now. We mentioned this before: this is a fruit fly, and these are the precise expression patterns that ensure the fruit fly will have its body parts in all the right places, and they are determined through these gene regulatory networks.
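The data processing inequality can be checked on a toy Markov chain C to G to K, where each arrow is a binary channel that flips its input with some probability (a sketch; the flip rates are arbitrary):

```python
import numpy as np

def mi_bits_from_samples(x, y):
    # Plug-in mutual-information estimate from paired discrete samples
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

rng = np.random.default_rng(1)
n = 200_000
c = rng.integers(0, 2, size=n)                  # one-bit input
g = np.where(rng.random(n) < 0.1, 1 - c, c)     # C -> G, 10% flip probability
k = np.where(rng.random(n) < 0.1, 1 - g, g)     # G -> K, depends only on G

i_cg = mi_bits_from_samples(c, g)
i_ck = mi_bits_from_samples(c, k)
print(i_cg, i_ck)   # the second stage can only lose information: I(C;K) <= I(C;G)
```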
So we can think about this as having an input, then this noisy gene regulatory network, and the output. If the fly is going to have these stripes in the right place, it has to transmit information from this Bicoid gradient that I showed you yesterday, there it is, to the downstream genes, such as Hunchback, right? And since the gene regulatory network in between is noisy, you can think about it as a channel, an information channel, specifically like this: you have Bicoid, you have Hunchback, you have the input gradient, you have some regulation, and you can ask how much information needs to be transmitted in order for all of this to work. There's a simple back-of-the-envelope calculation you can do first. Okay, let me do it here, it's easy. If you look at the fly embryo like this, there are these things called nuclei, okay? They're there everywhere, and when you get the stripe, what you're basically seeing is that the expression in each of these nuclei decides to be high or low; that's what's being stained there. And there are about a hundred nuclei, these cell-like objects, along this direction, okay? And each of them, at the end of the day, will have a distinct fate, because at the end the embryo gets partitioned so finely that you really have a one-to-one mapping. So there are a hundred distinct fates. How much information does one row of nuclei need in order to know what to do, for each nucleus to be able to tell that it should do something different from the one next door? You can do this calculation: if you have a hundred distinct states, each with probability one over a hundred, that is log base 2 of 100, roughly seven bits of information that you need to transmit.
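The back-of-the-envelope number is just log base 2 of the number of distinct fates:

```python
import math

n_fates = 100                  # ~100 nuclei along the axis, each with a distinct fate
bits_needed = math.log2(n_fates)
print(bits_needed)             # ~6.64 bits, i.e. about seven yes/no questions per nucleus
```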
So that means you need to be able to ask seven different yes-or-no questions for each nucleus, okay? And we'll actually see, well, maybe we'll get to see in a bit more detail, that in the first step, from Bicoid to Hunchback, you actually only transmit about 1.5 bits of information. But this is just one kind of application. Let me now erase this and do another calculation, that of a Gaussian channel, okay? So what is a Gaussian channel? We're going to calculate the information transmitted through a Gaussian channel. We assume that our output is given by the input plus some noise. So there's a linear relationship here, which can mean it's either really linear, or you take something that's not linear and linearize it, like we did yesterday, remember? Or was it two days ago? And we're going to assume that this noise, which is what P of G given C describes, is Gaussian, and we're going to further assume that the input is also Gaussian, okay? If we know that the input and the conditional are Gaussian, then the output is Gaussian too, because you're multiplying two Gaussians, and we know what happens when you multiply two Gaussians. If you don't, you can do it at home, or in your hotel room. Note here that since things are linear, the mean of G is equal to the mean of C, okay? So now we want to calculate the information between C and G for this Gaussian channel. Let's write it out: the integral over C of P(C), times the integral over G of P(G given C), times log of P(G given C) minus log of P(G), okay? Since I'm conditioning on C, the log of P(C) drops out, so I don't need to worry about it.
Okay, so that's just the definition, and now we're going to write it in terms of entropies, because everything in my problem is Gaussian and I know the entropy of a Gaussian distribution. If I write everything in terms of entropies, I'm done and I don't have to redo any integrals, okay? The second term is simple: I can collect P(C) times P(G given C) into P(G, C), and if I integrate that over C, it's just P(G), so this gives me plus the entropy of P(G). So I'm ready to plug in my entropies. The first term is the conditional one: minus log of the square root of 2 pi e sigma squared, where sigma is the noise width of P(G given C); since sigma is a constant, the integral over P(C) does nothing. The second is the entropy of the output, whose variance is the sum of the two variances, sigma_C squared plus sigma squared. Now I can take out the log 2 pi e factors from both, and you see they cancel, and from the square roots I'm left with one half log base 2 of (sigma squared plus sigma_C squared) over sigma squared. So the information of a Gaussian channel is one half log base 2 of (1 plus sigma_C squared over sigma squared), which is often written as one half log base 2 of (1 plus SNR). I'll write this out for you: SNR stands for signal-to-noise ratio, and it's the ratio of the signal variance to the channel noise variance. This is the capacity of a Gaussian channel, and it generalizes: if you have a higher-dimensional signal, the variances just get replaced by covariance matrices. So then we can ask another question, for Gaussian additive noise, so for this case: if the conditional is Gaussian, as we just assumed, and everything is additive, then the noise variance is a constant and given by that, okay?
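That derivation in code (a sketch; the variances are arbitrary): the output entropy minus the noise entropy reproduces the one-half log2(1 + SNR) formula.

```python
import numpy as np

def gaussian_entropy_bits(var):
    return 0.5 * np.log2(2 * np.pi * np.e * var)

def gaussian_channel_info_bits(var_signal, var_noise):
    # I = S[P(G)] - S[P(G|C)]; for G = C + noise, the output variance is the sum
    return gaussian_entropy_bits(var_signal + var_noise) - gaussian_entropy_bits(var_noise)

var_c, var_noise = 3.0, 1.0
print(gaussian_channel_info_bits(var_c, var_noise))   # via the entropy difference
print(0.5 * np.log2(1.0 + var_c / var_noise))         # 1.0 bit: 1/2 log2(1 + SNR) agrees
```

Note how the 2 pi e factors cancel in the difference, exactly as on the board.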
What is the input distribution that maximizes the information, if we assume a fixed input variance, okay? Fixed Gaussian noise in P of G given C, and we ask for the optimal input distribution. The reason I can leave the conditional term out is that if the noise is Gaussian with fixed variance, that term is fixed; I'm going to be optimizing over the input. So the way to maximize information is to maximize entropy. You agree? We want the optimal input distribution, which means we want to maximize information, but in fact what we're going to do is maximize the entropy of the input distribution. You'll see that it doesn't really matter whether you maximize the entropy of the input or of the output in this case, because the two are just related through the Gaussian channel, okay? So I need to write down the functional that I want to optimize. I want to maximize the entropy of the input distribution, but I have to do a little more work, because I have some constraints on my problem. The first constraint is that my distribution needs to be normalized: the integral of P(C) has to equal 1. But I also told you that the variance needs to be fixed, and if the variance is fixed, so is the mean; the variance is a difference of the two moments, which means I have to fix both the mean of C and the mean of C squared. They're fixed to some numbers, and the numbers don't really matter; they would just give me constants here, and I'm going to be taking derivatives anyway, which is why I don't have to write them, right?
So these three things are my constraints, because I want to solve this problem at fixed variance. Now we're going to optimize, okay? We take a functional derivative of this functional. From the first term, hitting the first P(C) leaves me with the log of P(C); from the log I get 1 over P(C), but it's multiplied by P(C), and the functional delta gets rid of everything else, so that gives me a minus 1. From the normalization constraint I get lambda naught, from the mean constraint lambda 1 times C, and from the second-moment constraint lambda 2 times C squared. Setting the whole thing to zero, all of these multipliers are just constants, so essentially I can put everything into a normalization, and what I'm left with, written in a different way, is P(C) proportional to the exponential of minus lambda 1 C minus lambda 2 C squared: a Gaussian distribution, where you can equate what these things are and select the Lagrange multipliers in such a way that they fit the constraints. This is also why we didn't have to worry about whether to maximize over the input or the output: since the channel is Gaussian, if the input is Gaussian, the output is Gaussian too, so everything is fine. So what this tells us is that the information of a channel with Gaussian additive noise and fixed input variance is maximized when the input, and therefore the output, are also Gaussian.
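A quick sanity check of that maximum-entropy claim, comparing textbook differential-entropy formulas for three unit-variance distributions (closed forms only, nothing fit to data):

```python
import numpy as np

# Differential entropies in bits, all at variance 1:
h_gauss   = 0.5 * np.log2(2 * np.pi * np.e)   # Gaussian: ~2.05 bits
h_laplace = np.log2(np.sqrt(2.0) * np.e)      # Laplace with scale 1/sqrt(2): ~1.94 bits
h_uniform = np.log2(np.sqrt(12.0))            # Uniform of width sqrt(12): ~1.79 bits

print(h_gauss, h_laplace, h_uniform)
# The Gaussian has the largest entropy at fixed variance, which is why the
# optimal input to an additive Gaussian channel is itself Gaussian.
```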
This isn't true in general, and if we had more time, maybe Thierry would show you more examples of cases where it's not true. But the reason the Gaussian channel is useful is that, since the information is maximized for it, it gives you an upper bound on how much information can be transmitted under these assumptions. That's important, because when you don't have these assumptions, things can be very different; Thierry is going to give you one example of when things are different. The other thing I should say is that there's a word for the maximum information a system can transmit: it's called the capacity. Okay, so now I'm going to finish by linking this all back to the fly, where we can ask the same question for a slightly different system. We're still going to assume that P of G given C is Gaussian, but we're going to say that now the noise depends on the concentration of the input, which is not what we assumed here, where the noise variance was constant. So this is a difference, and we're not going to make any assumptions about the input: P of C is free to be whatever, and there are no assumptions on P of G. So we have just the one Gaussian assumption, but it's more complicated because the noise now depends on the input concentration. The previous calculation does not hold, but you can still do the calculation under these assumptions. You're no longer constraining the variance, you're just constraining the mean, okay. So the functional you want to optimize becomes more complicated (this was my cross-out chalk, I just threw it somewhere): you actually want to optimize the information itself. So I'm also going to cross this out.
I'm going to motivate it, not do it, don't worry; I'm just going to show you the answer, okay. You optimize this subject to the constraint of normalization, and you optimize the same way you would optimize before. And if you do that calculation, you find that the optimal input distribution goes as 1 over sigma, up to the normalization you need for it to be a probability distribution. (Yes, I could write the normalization constant, but it doesn't matter, because when I take the functional derivative it's a constant; that's why I'm not writing it.) I'll show you the result on the slide. As I said, you make this assumption of Gaussian noise, so the first thing you may want to do is actually verify it, and this is an experimental verification for the fly system that you do have Gaussian noise. And when you do this calculation, at the end of the day, you get that the weight given to each input is inversely proportional to the noise there, and you can calculate the maximum information, the capacity, of how much information Bicoid transmits to Hunchback: you get 1.7 bits. So, what is interesting about 1.7 bits? First of all, in order to get it, you need to put in these things from experiments, the measured noise and the measured mean. And when you put them in, what does 1.7 bits mean? The most important thing about 1.7 bits is that it's larger than one, okay? Because one, we know what that means: one bit means two states. And remember how Hunchback looked, right? There's another picture here. It's stained differently, so maybe you won't like it, but basically, Hunchback looks like this: on, off.
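The shape of that answer can be sketched numerically; note the noise profile sigma(c) below is a made-up toy, not the measured Bicoid-Hunchback noise:

```python
import numpy as np

c = np.linspace(0.01, 1.0, 1000)   # input concentrations (arbitrary units)
sigma = 0.1 + 0.4 * c              # hypothetical noise growing with concentration

# The information-maximizing input weights each concentration by 1/sigma(c):
p_opt = 1.0 / sigma
dc = c[1] - c[0]
p_opt /= p_opt.sum() * dc          # normalize to a probability density

print(p_opt[0] > p_opt[-1])        # True: more weight where the channel is quiet
```

The qualitative point is that the optimal input spends its probability where the channel is least noisy.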
So, if you look at this, you think it's two states, and you would maybe expect one bit of information to be enough to transmit. But when you do the calculation, you find that in fact it's more than one bit. So, if you look at the probability distribution of the output, of Hunchback: you take this embryo and make a histogram, asking how many cells are expressing Hunchback at 0.2 of the maximum level, at one half, and so on. You do see the peaks at low and high expression, the red and the black. But you also see that this plateau in between is not at zero, okay? And that plateau actually manages to transmit the extra 0.7 bits of information. From the calculation you get 1.7; doing it directly from the experiment, you get 1.5. So this is the comparison: the theoretical calculation with the experimental mean and variance put in, versus the number directly from the measured probability distribution of the system. You get very similar numbers, so it does seem that the system is functioning close to capacity, and you get more than two distinguishable states. That means that although it seems Hunchback is just making an on-off decision, you're actually transmitting more information, maybe because you need it for something later. So, with that I am out of time. We'll stop here.