Which biologist made the fruit fly a popular model organism in biology? Yeah, so most biologists who study animals study the fruit fly. It's the most well-studied animal in the world. Does anybody know who made the fruit fly a model organism? Any guess? Anybody? There's a person named Thomas Hunt Morgan. And in fact, he has a unit named after him. Does anybody know the unit? There's a fundamental unit in biology named after Thomas Hunt Morgan. Does anybody know what the unit is? It's a very interesting unit. It's a measure of the distance between two genes on a chromosome. The further apart two genes are, the greater the chance that a crossover will occur between them during meiosis, and that both alleles will end up in the offspring. So you can measure how far apart genes are on a chromosome by taking one genetic variant at some position and another genetic variant at some other position in two different flies, mating them, and seeing what the chance is that the offspring gets both. The further apart they are, the higher the chance that this is going on. And so the distance on the chromosome measured in these units is called a centimorgan: a 1% recombination frequency, that is, a 1% rate of actually having both alleles in the offspring. Okay, everybody knows what a fruit fly is. So I'm going to start off today's lecture by giving you a little preview of the reading that I'd assigned to you, really what I find to be a really beautiful paper on how a fruit fly is made from an egg. So this is one of the papers that is assigned for reading. It's an older paper, from 2013, called "Positional Information, in Bits," by Bill Bialek and colleagues. Do read the paper; it's very nice. The only thing I want you to look at for now in this lecture is this little paragraph that I've highlighted. It says, you know, we're going to be talking about embryonic development. We're not really going to be talking about it.
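As a toy illustration of that definition (my own example, not from the lecture): a centimorgan is just the recombination frequency between two loci expressed as a percentage, so estimating it from a cross is a one-line calculation. The function name and counts below are made up.

```python
# Toy sketch: 1 centimorgan (cM) = 1% recombination frequency between two loci.
# The function name and example counts are hypothetical, for illustration only.

def map_distance_cM(recombinant_offspring, total_offspring):
    """Estimate the genetic map distance between two loci from a test cross."""
    return 100.0 * recombinant_offspring / total_offspring

# If 120 of 1000 scored offspring carry the recombinant allele combination,
# the two loci are estimated to lie about 12 cM apart.
print(map_distance_cM(120, 1000))  # 12.0
```

Real map distances stop being additive for widely separated loci (double crossovers), but over short distances this percentage is exactly Morgan's unit.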
This paper talks about it. And it asks, you know, how, cell by cell, does the whole embryo get made, and how much does it read the patterns of external information in order to make the whole system? And just this sentence is like a manifesto for trying to understand chemical randomness, right? It says answering these information questions is important in part because we know that crucial molecules involved in the regulation of gene expression, transcription factors, which I've introduced to you, are present at low concentrations and even low absolute copy numbers, so that expression is noisy. And the first half of my course has been to give you the tools and the background to understand why gene expression and the regulation of the amount of proteins in a cell is noisy. That's pretty much the goal of your homework over the weekend. Therefore, this noise, and noise means fluctuations, it means randomness, this noise must limit the transmission of information. So this sentence encapsulates the philosophy of what I've been trying to convey to you: because cells are small, because they have small numbers of molecules, there is chemical noise, stochastic chemical kinetics, and chemical noise causes randomness, fluctuations, variability in the actual numbers of certain protein molecules over time and in different individuals, right? And yet biology needs to somehow compensate and survive in spite of this noise. Okay, that's what they call noise. Just a caveat. Although my course is called Randomness in Biology, I've focused primarily on the chemical noise, the stochastic chemistry happening inside the cell at low copy number. There are, of course, many other types of randomness that occur in biology. One prominent example is the type of randomness that governs how a particular mutant allele, once it is created, will spread through a population. Okay?
And there's a whole series of equations that deal with how that happens, the same kind of formalism that I've shown here, right? There'll be master equations and so on. That entire field, I don't have time to give you an introduction, but if you want to go read about it, that entire field about how randomness arises during evolution is called population genetics. Yeah? And you can go read about population genetics. There are many, many good textbooks, and very, very accessible textbooks. The source of noise in population genetics is conceptually the same as the source of noise in stochastic chemical kinetics. It arises because of discreteness. Okay? It arises because genetic alleles are discrete and because populations have discrete numbers of individuals who live or die. So the same kinds of equations we wrote down for chemicals being made or destroyed, you write down for individuals making offspring or getting killed, and then you develop the whole theory of population genetics. Okay? So I do urge you to go and read about it. It's another very beautiful example of randomness in biology which I've given you the tools to understand, but I won't talk about it any further. Okay? So, having seen this little manifesto: that paper was from 2013, and I find it to be a very nice paper because it says how we can start to think about information content in a noisy biological system. However, at the time it was written, nobody had shown, you know, this is a paper about principle. It says that in principle there's a certain number of bits of information available in a biological system. It didn't show in practice how that information was read out. Okay? It tried to, but it was just a starting point. So it took many, many years until the follow-up paper; that one was finally published just a few weeks ago. It's called "Optimal Decoding of Cellular Identities in a Genetic Network."
A genetic network is the kind of network that you're analyzing in your homework. It says that d/dt of some protein concentration is a function of the existing protein concentrations, and different proteins can influence each other. That's a genetic network. The authors strongly overlap with the authors of the previous paper. So let me walk you through this. I'll start with this nice, it's nice these days that papers have these so-called graphical abstracts, right, which is nice for teaching. So a Drosophila fruit fly starts off life as an egg. An egg is a single cell. This single cell is then fertilized by a sperm, right? And that fertilized egg now has a full complement of a genome. The egg then divides into smaller and smaller and smaller cells, and then the cells start growing themselves, and that creates the whole larva, right? And then fruit flies have this other strange thing that many insects have, which, I mean, astonished me when I first learned about it. You've all seen a fruit fly larva. You've all seen an insect larva. Anybody not seen an insect larva? It looks a bit like a worm, right? It looks a bit like a squiggly worm. It's not a worm. It's the young form of an insect: it has no legs, and it moves around by sort of compression and expansion of its body. Anybody not seen a larva? Everybody seen a larva? Anybody bitten into a fruit and seen a larva inside? Anybody not done that? You've not done that? Okay, what's worse than biting into a fruit and seeing one larva inside? No, no. It's biting into a fruit and finding half a larva. What's worse than biting into a fruit and finding half a larva? Finding one third of a larva. What's worse than one third? One quarter of a larva. So the worst case is biting into a fruit and seeing no larva at all. So that's just a little lesson in taking limits appropriately, okay? So how that larva is made from an egg is the subject of this discussion.
So when you take the larva, you can actually put it on a microscope slide in its living form. You can actually watch these larvae develop over time. And in fact, I wish I had pulled up this movie. Maybe in the next lesson I'll show you this beautiful movie of how the fruit fly develops. Maybe I can even find it now. But what you can do then is, of course, all these sort of preps, they're called preps because you prepare them, all these preps are always in black and white because there's no color to any of this, right? So what biologists are very good at doing is labeling distinctly the thousands of different molecules that are always present inside a cell. You can pick one of those molecules, like the protein or the RNA corresponding to one gene, and you can label it in one color. And you can label many of these in many colors. And thereby you can get, over perhaps many, many images, a map of how these different proteins or RNAs are spread over time. Now it turns out that in the fruit fly egg, the first input, well, the fruit fly egg is, to a first approximation, cylindrical or a spheroid, longer in this direction and with a shorter radius in this direction. So it's a little sausage shape, like that, right? And the first symmetry-breaking event happens when the mother actually deposits certain molecules, which are called morphogens because they determine the form of the embryo over time. So the first symmetry-breaking event happens when the mother decides which end is anterior and which end is posterior and puts her mRNA into these eggs, okay? So these are called maternally inherited RNAs. And these molecules are then translated and they create these transcription factors, which are proteins, and these proteins, as I had shown in class, bind and unbind to bits of DNA, okay? Now imagine this.
Every cell in this growing embryo, anterior to posterior, initially has the same genome, the same genome, but eventually some cells will become the head, some will become the thorax, some will become the abdomen, some will become the posterior. So how does a cell know? How does a cell know whether it's supposed to become the head or the thorax? And later, how does it know whether it should become the legs or the antennae? The way it knows is that the cell is sitting in a medium surrounded by all kinds of other molecules. And by sniffing out, by literally engaging in chemical affinities and chemical kinetics with these molecules, the cell can figure out that it's closer to the head or closer to the tail, yeah? So it's actually quite easy to decide whether you're closer to the head or the tail, because there are two so-called gradients of morphogens, and one happens to be higher closer to the head, one happens to be higher closer to the tail. So you just say: whichever one is higher. If the tail type is higher than the head type, then I'm closer to the tail. That's very simple. The question being asked in this paper is: how precisely can a cell decide where it is along the embryo? Head versus tail, that's pretty easy. That's just one bit. It's a yes-no question. But can you decide, I'm 60% of the way to the head and 40% to the tail? Can I decide exactly where I am in that stretch? And that's the question that's being answered. So here's the cartoon. This is meant to represent a slice along the embryo. This would be, for example, anterior. This would be posterior, right? And these curves are, as a function of position, the level of expression, how many of those molecules there are per unit volume, of the so-called gap genes. They're just called gap genes. And in this case, there are four such genes, drawn in four colors. There's this sort of red-pink one, which is high and then low and then high again here.
There's this orange one, and then there's this blue and this green one. These represent the concentrations of various RNAs, which will then become various proteins. And this is just meant to represent how many proteins will come out of that, right? These proteins are discrete, so they've shown them as little dots. And because the numbers are discrete, as you now know from the first half of this class, there are fluctuations. And therefore, whenever you take many, many embryos, there'll be some variability in how much of this protein there is. So what you can do is plot the variability as a standard deviation around the mean, and that's what those light sort of error bars are around the dark curves. Okay, are there any questions about that? Now, by reading this information, there's some other protein, some other protein which will be expressed. That other protein is only meant to be expressed in this cell, this cell, this cell, this cell. So you can see this alternating pattern. So how come that cell doesn't express the protein, but this cell does? That cell does, that cell does, these cells don't. The reason, okay, I mean, mechanistically, is that each of those cells is somehow reading the information carried by these molecules. And this cell, for example, says, well, you know, I have a lot of this molecule and a lot of that one, but none of these, so I'm going to make a stripe. And that cell says, well, yeah, but I have a slightly different combination of molecules, so I'm not going to make a stripe. So you need to imagine yourself, it's a little information game, you need to imagine yourself sitting in each one of these cells, trying to see how many molecules there are and deciding: am I really this cell, or am I its neighbor? Right? And that seems to be a very, very fine decision to make, right?
So if there are roughly a hundred cells along this axis at this stage in the growth of the embryo, and every cell is able to tell almost exactly which one of those hundred cells it is, then there's sort of one percent precision in how accurately a cell can guess where it is. Yeah, that's the setup to the problem. Now the question is, and this seems pretty astonishing, can a cell really tell where it is just by reading these signals, or is there some other hidden information which we haven't seen? Because after all, we can't look at everything that the cell knows; maybe there's mechanical information, maybe there are other chemicals we don't know about. Right? Is there something else that's telling the cell where it is? So the whole point of this paper is to show that, in principle, this collection of input signals is sufficient to actually achieve this level of precision. It need not have been that way. Question? Good. So a few funny things about this fruit fly. The fruit fly embryo is actually a syncytium. So it's a hollow system, and the cells are on the boundary. And this gene expression, the concentration of mRNA, you can think of it as just in the volume. But the RNAs are actually then expressed inside these individual cells; eventually everything gets packaged into cells. So yes, you should think of it as the amount of that particular molecule inside each cell. Eventually, that's what it is. It's told by other genes. So some genes are given by the mother, and the mother's genes will tell the cell. I think this is also covered in the other paper. So let me see... maybe not here. But basically, some genes are given by the mother, and then, based on reading those genes, the second wave of genes is expressed. So the genes given by the mother are spread throughout the embryo, and due to the genes given by the mother, the second layer of genes is then expressed in each cell.
That's how you should think about it. So it's a cascade of newer and newer gene expression. Fine. A little point about genes. If you haven't studied biology very much and you haven't seen many different examples, it's a bit confusing. We are told that genes are how we pass on our traits from parents to offspring. That is true, because we give the offspring our entire chromosomes. And because I've received entire chromosomes of humans, I'm human; if you receive entire chromosomes of elephants, you'll be an elephant, and so on. But genes also have a second life. The chromosomes that you inherit from your parents contain copies of all 23,000 genes that you'll ever need through your whole life. But individual cells in your body don't use all those genes. So the way I like to think of it is: if you have an orchestra, the orchestra contains many, many people who play different instruments. And all these different instruments are like the different genes that you have. You have some woodwinds and you have some strings and you have some percussion and so on. Now if I'm playing a particular song, a particular composition, I don't have to use all the instruments. So each cell in your body is playing a different composition. Each cell in your body will play a different subset of the total set of instruments it has received. So genes have two different roles. One role is to specify the organism throughout its whole life. And we're not talking about that here. We're talking about how all these cells, which all have the same genes, some of them decide to use some of those genes and others decide not to use them. That's called gene expression. It's a very subtle point. Are there any questions about this point? So genes have two lives. You have them, and then you can use them. And those are two different things. And whether you use them or not depends on what information comes from the outside.
Based on the existing information, I can choose to use or not use this gene. That's what's going on here. Question? So the symmetry is broken by the mother, and then everything carries on from there. And sometimes symmetry is broken spontaneously. There has to be a symmetry-breaking event, because initially there's only one cell. But once symmetry is broken, even once, then everything follows. So typically there's an anterior-posterior symmetry breaking, and then there'll be at least one other symmetry breaking, because we have three axes. And then you make the whole organism. There are some people, by the way, in whom left and right are totally inverted. So your heart is on this side and so on. And they won't even notice. They're perfectly fine. So now let me explain a little further. Here's the setup of the problem. Now, I'm going to be spending a little time on this, but later in this class I'm going back to lecturing on the board. For the moment I'm just going to be showing more and more stuff from this paper. So here's the fly. And as I mentioned, the fly embryo is sort of a hollow little spheroid. So the cells are all on the outside. They're all on the outside. And what you do is you can label different genes with different colors. And then you can sort of pretend that the embryo is just an anterior-posterior one-dimensional system. So for the purposes of this paper, the whole embryo is a one-dimensional system. It just runs from anterior to posterior. I think they ignore a few of the cells in the highly curved regions at the ends. So there are some genes that will tell the cell it's closer to the anterior or to the posterior, sort of to the tail region. And from that, from the genes that it got from the mother, in this paper they track four different genes. Which are... it's these four genes: Knirps, Kruppel, Hunchback, and so on. They have funny names. In fruit fly genetics, a gene is named by what happens when you knock out the gene. So it's very confusing.
So if a gene is called Hunchback, then when the gene is knocked out, the larvae will look like a hunchback, or the adults will look like a hunchback. So actually there's a sort of negative sign in the naming of these genes. So these four gene expression levels are derived, because the cells have already read out what the mother has told them and have expressed certain things. So at the time this paper starts, it's like Star Wars Episode IV. A lot of stuff has already happened and you're just starting in the middle of the movie. The 1977 Star Wars was Episode IV. So this is like Star Wars Episode IV. All this stuff has already happened, Luke Skywalker has been born and so forth, and now you're at the start of the action. So you have these four gene expression levels: G1, G2, G3, G4, as a function of X. And for the purposes of this paper, again, X is normalized to go between 0 and 1. You just normalize the lengths of all the embryos to a standard length. And then, at every value of X, you find the amounts of these. G is a function of X. And this G as a function of X can be noisy. You don't always have exactly the same amounts of genes 1, 2, 3, and 4. So there's a certain amount of variation over there. The game we're going to play is: if, at any given position, I know what my amounts of these four genes are, can I guess where in the embryo I am? Can I make a map inferring what X* is, given these four values that I read? If I can make such a map, and in fact I have to make such a map if a cell is to express the next set of genes and go forward in order to make the whole embryo. Are there any questions about this? So here are the four genes: Knirps, Hunchback, Kruppel, and Giant. If I read those, and only Knirps and Giant are expressed, then I'm going to be here, for example. If only Hunchback and Kruppel are expressed, then I'm going to be here. This is the kind of inference problem the cell is trying to do.
Any questions? Is the game clear? Yes? Yes. And if you think you're in that part of the embryo, then you commit yourself to being a thoracic cell, which later will express further thoracic genes and actually become the thorax. Yes? Very good. That's exactly the question. Suppose I have only one of these variables, then I can't, and that's exactly the question. So hold your question until I get to the punchline of this. Yeah? Okay. So now, ignoring all that, let's just walk through a particular case. So, for example, here's the expression pattern of a gene called Kruppel. It's one of these so-called gap genes. Kruppel is expressed in this part of the embryo. Yeah? And I just plot its expression as a function of position, with its variation. This experiment is done over many, many, many embryos, and let me make one simple remark, by the way: this kind of experiment is incredibly difficult to do well. You have to have a lab where you can grow these embryos precisely. You have to stop them at exactly a certain point in their growth phase, because they're growing very rapidly, so there's some synchronization involved, and you take exactly that stage and then you stain it with exactly the right kind of molecular stain to get this fluorescent image, and you do this over many embryos, reproducibly, without adding noise. So you get this curve, right? This curve is basically, as a function of X, the concentration plus the variability across many, many embryos. Okay? This particular curve is averaged over 38 embryos. Yeah? Now, here's a question. What if I were to take a cut across that? So here they've flipped the figure on its side. This is the same figure, but flipped, right? So this figure is position versus gene expression; that figure is gene expression versus position. They're exactly the same figure.
But now I ask: if I'm at this gene expression level, which position am I at? And this is not a one-to-one map, right? I've inverted the map, and unfortunately the inverse is not one-to-one. If I have, let's say, 0.5 gene expression, I could be at either of these positions, yeah? And so on at various levels of gene expression, right? It's only really at this level of gene expression that I know precisely where I am, which is roughly at the midpoint of the embryo. So I can plot my guess of where I am as a function of where I actually am, right? Here's the actual position; here's where I guess I am just by reading this gene expression level. And you see this big mess here. What that says is, if I'm anywhere here, I can't really guess where I am, because the gene expression is not giving me any information. Yeah? But if I'm in one of these regions, I can guess where I am, but with a degeneracy of 2. I can't really tell if I'm on the upward part of the curve or the downward part of the curve, right? And hence you see this X shape. It's only up here that I can really guess very precisely that I'm roughly halfway through the embryo. So this is a map that tells the cell where it thinks it is as a function of where it is. How did they get this map? I'll explain to you in just a second. But conceptually, the map is very easy, right? In practice, how is this map determined? It's determined by doing a Bayesian inference. Everybody knows Bayes' rule, so you do a Bayesian inference. This is Bayes' rule, okay? It says: if I know what gene expression levels I'm supposed to have as a function of position, and I guess this by... I don't guess this, I actually measure it over many, many, many embryos. That's what that original curve was. That peaked curve is gene expression as a function of position. The curve I showed you was for a single gene.
In principle, you could get all four genes as a function of position very, very accurately, with their variation and covariation, right? So these are random variables, and I get the full distribution P({g} | x*). If I know this, and I have a prior distribution, which is where I think I am before I read any gene expression, what is my prior distribution for a Drosophila embryo? The prior is essentially uniform. I could be at the head, I could be at the tail, and there's roughly an equal number of cells all the way from head to tail, right? So my prior distribution, P(x*), is very nearly uniform in this case, yeah? Then I just invert the probabilities, P(x* | {g}) = P({g} | x*) P(x*) / Z, where Z is a normalizing factor, and I can guess where I am as a function of gene expression, okay? So is everybody happy with that? Anybody not seen Bayes' rule before? Everybody's seen it, right? So just implementing this rule numerically, literally implementing that rule numerically, is how you get this picture, right? That picture shows the implied posterior distribution as a function of the actual position: what is my guess for where I am as a function of where I actually am? Any questions? Okay, so this is quite nice, and clearly this gene is totally useless for guessing where you are outside there, and it's not that useful here, either, because there are two different positions you might be at: you might be closer to the head or closer to the tail. So now, what do we do? The question was asked: what happens if we add more genes? And this is where this paper really shines, right? If I just use Kruppel, then I don't know where I am in those regions, and I have a degeneracy of two in this region. If I now use two genes, Kruppel and Hunchback, and I do the same Bayesian inference, then you see that all the uncertainty here goes away very nicely, because Hunchback is providing a lot of information there, right? I mean, pretty much linearly, you can guess where you are based on the Hunchback level, right?
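The single-gene decoding just described can be sketched numerically. This is a minimal toy version, with a made-up bump-shaped profile and noise level rather than the paper's measured data: Bayes' rule with a uniform prior and Gaussian readout noise reproduces exactly the two-fold degeneracy behind the X-shaped map.

```python
import math

# Toy single-gene decoder (invented profile and noise level, not the paper's
# data): Bayes' rule with a uniform prior over normalized position x in [0, 1].

N = 101
xs = [i / (N - 1) for i in range(N)]  # positions 0..1 along the embryo

def mean_expression(x):
    # Bump-shaped, Kruppel-like mean profile: high mid-embryo, low at the ends.
    return math.exp(-((x - 0.5) ** 2) / (2 * 0.15 ** 2))

SIGMA = 0.05  # assumed embryo-to-embryo readout noise (standard deviation)

def posterior(g_obs):
    # P(x | g) is proportional to P(g | x) * P(x); the uniform prior cancels
    # when we normalize.
    likelihood = [
        math.exp(-((g_obs - mean_expression(x)) ** 2) / (2 * SIGMA ** 2))
        for x in xs
    ]
    z = sum(likelihood)
    return [v / z for v in likelihood]

# A mid-range reading (g = 0.5) is consistent with BOTH flanks of the bump,
# so the posterior has two peaks: the degeneracy of 2 seen in the X shape.
post = posterior(0.5)
peaks = [xs[i] for i in range(1, N - 1)
         if post[i - 1] < post[i] > post[i + 1]]
print(peaks)  # two peaks, symmetric about x = 0.5
```

The shape of `mean_expression` and the value of `SIGMA` are assumptions for illustration; the paper instead uses the measured means and covariances across embryos.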
Until you get to this flat region, right? And it's also giving a nice dose of information there, with a degeneracy of two over there also, because you don't know if it's on the up phase or the down phase, right? Then you add Giant, right? And you pretty much nail the whole thing, except for some variation down here. And finally, you add Knirps, right? And it's beautiful, right? So you've added four genes with their known level of variation and covariation, and you do Bayes' rule, and that's all you do. You say: I'm guessing where I am based only on these gene expression levels. According to Bayes' rule, this is what you get, right? So just stare at this for a second. This is pretty astonishing. It didn't have to be this way, right? If for some reason we didn't know of the existence of Hunchback, suppose one of these genes was missing and we didn't know about it, we would not be able to reconstruct this. So one of the lessons is that the fly actually has sufficient information for positional determination at this stage in development. That's step one, okay? So the degeneracy is gone. There's only a sort of local uncertainty; the global uncertainty has gone away. Second thing: look at the thickness of this band. The thickness of that band is of the order of 1%. I know where I am to within 1% of the actual position from 0 to 1, based on just inverting some sort of stochastic measurement of gene expression. That's pretty astonishing, because it turns out 1% is exactly the number you need to tell one cell from the next cell, okay? So by this stage in Drosophila development, there's enough information for every cell, in principle, to know: is it the first, the second, the third, the fourth, or the hundredth? From this stage onwards, you won't even have to invoke anything else, assuming that the cell actually makes use of this information.
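The way a second gene kills the degeneracy can be seen in the same kind of toy sketch (again with invented profiles, not the measured gap-gene curves): a bump-shaped gene alone is ambiguous between its two flanks, but a monotonic, Hunchback-like gradient says which half of the embryo you are in, so the joint posterior collapses onto a single peak.

```python
import math

# Toy two-gene decoder (hypothetical profiles and noise): reading a bump-shaped
# gene AND a monotonic gradient together removes the two-fold degeneracy.

N = 101
xs = [i / (N - 1) for i in range(N)]

def g_bump(x):   # bump-shaped gap gene (Kruppel-like), high mid-embryo
    return math.exp(-((x - 0.5) ** 2) / (2 * 0.15 ** 2))

def g_grad(x):   # monotonic gradient (Hunchback-like), high at the anterior
    return 1.0 / (1.0 + math.exp((x - 0.5) / 0.1))

SIGMA = 0.05     # assumed independent Gaussian readout noise on each gene

def posterior(obs_bump, obs_grad):
    # Joint likelihood of both readings at each candidate position x,
    # uniform prior over x.
    like = [math.exp(-((obs_bump - g_bump(x)) ** 2
                       + (obs_grad - g_grad(x)) ** 2) / (2 * SIGMA ** 2))
            for x in xs]
    z = sum(like)
    return [v / z for v in like]

# True position x* = 0.68, on the posterior flank of the bump. The bump level
# alone is ambiguous (x = 0.32 gives the same value), but the LOW gradient
# reading says "posterior half", so the posterior has a single peak.
post = posterior(g_bump(0.68), g_grad(0.68))
best = xs[max(range(N), key=lambda i: post[i])]
print(best)  # 0.68
```

This is the mechanism behind the figure: each added gene with a different spatial profile vetoes some of the positions the others leave degenerate.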
I haven't proved that, but assuming the cell actually makes use of this information, the rest of the fly's development will work out pretty well, okay? Now, how do you prove that the cell is actually making use of this information? You remove something, yeah? So it's interesting, right? How do you tell that a cell is making use of this information? Let's say you remove one of those genes. You remove one of those genes; you can always knock out a gene by mutagenesis, right? But you use the same decoding algorithm. The cells don't know they're in a mutant fly. They're all still decoding things exactly the same way. But they're in a mutant fly, and therefore their expression levels will change, okay? There's a problem with this, because this is a finite system. You're just measuring a few hundred embryos, and you're just measuring a few gene expression levels. When you destroy something in a high-dimensional space, there could be new gene expression patterns that have never been seen before. And if a gene expression pattern has never been seen before, how do you know what the cell is supposed to do? Because you don't have a mechanistic model. You just have an optimal model of what the cell should do, assuming it's sitting in a wild-type embryo, right? Now, one of the things they say in this paper is that even when you knock out these genes one by one, the combinations of expression levels that you see are still within the framework of the original decoding. And therefore they're able to use the original, so-called optimal decoder to figure things out. So with that caveat in mind, let me go through here and show you an example. Let me zoom in here. I'm not able to zoom in... ah, good. So here's an example. Here's an example where this is the axis of the mutant fly, okay? And this is the axis of what would have been the wild-type fly, okay?
In the wild-type fly, we know that a certain amount of information is available to all these cells. Now, how do you prove they're using the information? You prove it because I know that in the wild-type fly, there are some cells up here which express this particular gene, right? And this particular gene is expressed in this sort of stripy pattern. So in particular, I know that a cell sitting at about 0.75 should express this gene, because it thinks it's sitting at 0.75. Now look what happens in the mutant. For a cell that should be expressing that gene, the conditions under which it would express it have actually been shifted slightly to the right. And therefore these stripes are slightly spread out. And I'm not going to belabor this point, but here's a much more drastic mutation, where a cell that ought to have been expressing certain gene expression levels here really can't figure out where it is. And so the position of the stripe becomes very noisy, and so on. So the reason this paper is already very influential, and is going to remain influential, is that they've managed to predict what the gene expression pattern in mutant flies is based on no molecular information other than images of flies and their gene expression. They've not worked out which transcription factor binds to which DNA, binding affinities, unbinding affinities, downstream processes, nuclear details, nothing. They've just assumed the original encoding is optimal, then assumed the same code works in the mutant, and then they see what happens. Here's a very nice one. In this particular case, once you knock out these genes, there's a degeneracy, so that what would have been a single stripe in the wild type now becomes two stripes in the mutant. This is very, very interesting. It leads to the prediction that both these stripes are determined by the same transcription factor binding to a certain promoter and enhancer region on the DNA.
And you can actually go test all these things. So good. Are there any questions about this? So this is where I'm going to stop with the paper. I want you to go read this and find out more about how this thing works. Yes. The assumption is that in order to guess where you are as a function of where you actually are, you need a rule. The rule spits out a number as a function of four input variables. If I give you the concentrations of Knirps, Giant, Hunchback and Kruppel, you should give me a number between zero and one. That function is fixed. This is that function. I use the same function, but I mess up the amounts of these. And now I don't know where I am. Rather, I still think I am somewhere, right? Because I'm using the same function, but it turns out that in real life I'm in multiple places because I mutated the system. It's no longer operating according to my original assumption. Yes. Yeah. So that's one percent. Oh, there's still variation. It's stochastic gene expression, right? There's plus or minus square root of n. Yeah. Exactly. So the question is, if you want to get the noise lower, you need more molecules. And if you want more molecules, you need more energy. So you can get this uncertainty even lower if you pay more energy. Just energy per molecule, right? One other point, which is a point that they spend a lot of time on in their 2013 paper. Very interesting. Look at the thickness of this line across the whole embryo. The uncertainty is basically equal up and down the entire system. This also didn't have to be the case, because the variations up here are sort of very, very different across the whole embryo. But the posterior uncertainty turns out to be very close to 1% up and down. Again, this did not have to be this way. It just turns out to be this way. And that's why this is a cell theory paper and not a theoretical biology paper. You could have written down these equations 100 years ago. This is the paper that proves this is what's going on.
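To make the decoding idea concrete, here is a minimal numerical sketch of this kind of position decoder. Everything in it is made up for illustration: the four "gap gene" profiles are arbitrary Gaussian bumps, not the real Hunchback, Kruppel, Giant and Knirps data, and the noise model is a simple uniform Gaussian, whereas the paper's actual decoder is built from measured means and covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos = 100                       # cells along the anterior-posterior axis
x = np.linspace(0, 1, n_pos)

# Made-up smooth "mean expression profiles" for four hypothetical gap genes
# (placeholder bumps, not the measured Drosophila profiles).
means = np.stack([np.exp(-((x - c) ** 2) / 0.02)
                  for c in (0.2, 0.4, 0.6, 0.8)])    # shape (4, n_pos)
sigma = 0.05                      # assumed expression noise, same everywhere

def decode(g):
    """Posterior P(x | g) under a Gaussian noise model with a flat prior."""
    log_post = -np.sum((means - g[:, None]) ** 2, axis=0) / (2 * sigma ** 2)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# A cell at true position ~0.75 reads noisy levels and infers where it is.
true_i = 75
g = means[:, true_i] + rng.normal(0, sigma, size=4)
post = decode(g)
print("MAP position:", x[post.argmax()])
```

Running this same fixed decoder on levels generated from perturbed profiles (a "mutant") is exactly the move described above: the map does not know the fly is mutant, so the inferred positions shift or split.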
I don't know why the fly has optimized itself so well. The fly did not have to be optimal. You should never assume optimality. Assuming optimality, you should make a prediction and see if the prediction is correct. But without such evidence, you should not just assume optimality. Any questions? Well, so that's what I wanted to show you from this paper. And if there's nothing else left. So this is just the last thing. So this sort of plots where the peaks are predicted to be as a function of where the peaks actually are. And this is the final result. It just really shows that cells are extracting position information down to the 1% level that seems to be the limit. It's pretty cool. So please read the paper. It's a very nice paper to read because they've spent a lot of time explaining the background. So now I'm going to disconnect from the projector and get into the meat of today's class. So all that was motivation. The motivation is that it's relevant, and obviously in some cases very useful and powerful, to think of biological systems as somehow extracting information in the presence of noise. Now I know that the first day I asked all of you how many of you have taken a class in information theory. Many of you have. Can I see a show of hands again? How many of you have taken a formal class in information theory? So that's roughly half of the class. Was this part of a physics course or a standalone course? In what context was information theory taught to you? Anybody else in the context of electrical engineering or anything? Fine. So it's good. It's about three quarters of the class who have not seen information theory. And for those of you who have seen information theory, I apologize for the redundancy. But as usual, my caveat is it's always nice to see the same thing with a different view. So please bear with me.
So for today, tomorrow, and the day after tomorrow, I'm going to be spending time giving you the basics of information theory, so that you can read papers like this and understand the calculations they are doing. For the next few classes there's going to be very little direct biological motivation. All the biological motivation is in this paper and papers of this type. So if you want more motivation, go and find interesting papers of this type. So let's get started. The central idea of information theory is that there's a sender and a receiver, and there's a channel. And typically there's some message, X, which is sent, which goes through the channel and comes out as some other thing, Y, which could be the same as X. And the receiver has to guess what's going on. In the context of the fruit fly, these X's are actually the gene expression levels, and the Y is actually the little x, the position that the cell is trying to guess. It's doing an inference problem. So that's one way to think about it. There are many ways to draw the little information diagram for a particular model. Another way to think about it is the sender is sending some X, the channel happens to be a bunch of gene expression levels, and the receiver is trying to receive the same X. That's another way to think about it, where the sender and the receiver are dealing with the same message, and the noise enters at the level of the channel. So going forward, we're going to have a few common letters that always mean the same thing. Here are those letters. M is the number of distinct possible messages that the sender is trying to send. In the fruit fly example, the possible messages are 1 to 100: I am cell number 1, or 2, or 3, or cell number 100. That's the message the sender is trying to send. M is also often written as the size of an alphabet, a squiggly X, a script X.
So if you don't think of a fruit fly example, think of the most standard example where you're trying to send a telegraph message. Has anybody here ever received a telegram in their lives? I actually received a telegram once. I don't even know how it got to me, but anyway. So they used to send telegrams. Information theory was basically invented for the telegram. So the messages you're trying to send are basically individual letters. That's an example. So one case we're going to be spending a little time with today, the message you're going to send could be there's a horse race and there are a bunch of horses that are going to participate in the race and let's say there are 8 horses, they all have different names. The message is you're going to send which of those 8 horses. Now in the case of the horse race or the letter and so on, I'm not trying to transmit the shape of the letter so that they can reconstruct the image of the letter. So those of you who use, all of you, use your emojis on your mobile phone. When I send an emoji from one person to another, I'm not sending you a picture. When I send a photograph, I'm sending a picture. When I send the emoji, I'm just sending which emoji it is in the Unicode list of emojis. So it's a discrete number. It goes from 1 to the maximum number of emojis. The emoji itself could be very complicated. So let's say there are a thousand... How many emojis? Let's say there are a thousand emojis or 1,024 emojis to make it a nice power of 2. And I say I'm sending you the 617th emoji. How does your phone know what to show? If I sent this one, how does your phone know what to show? Because you already have the answer. All you're looking up is a code book. So if there are M distinct messages, these are M messages, then this is called a code book. And the code will be some sort of ones and zeros. A particularly simple code, for example, will be all the messages are numbered all the way. This is a code book. 
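The fixed-length code book just described can be sketched in a few lines (using the 1,024-emoji numbers from the lecture):

```python
import math

M = 1024                          # number of distinct messages, e.g. emojis
k = math.ceil(math.log2(M))       # bits needed for a fixed-length code: 10

# The code book: message i maps to its index written as a k-bit string.
codebook = {i: format(i, f"0{k}b") for i in range(M)}

print(k, codebook[617])           # message 617 travels as just 10 bits
```

The receiver holds the same dictionary, so the ten bits are enough to look up which of the 1,024 previously agreed messages was meant.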
And the information theory problem I'm going to describe today is where the sender and the receiver ahead of time were both in the same room, and they both decided on a code book. So they had the entire dictionary of what messages there could possibly be and what code we're going to use to discuss these messages. Then all the receiver has to do, once the sender has sent information, is to look up the code book and see which message was sent. It's that simple. I'm not going to discuss how the receiver looks up a code in a code book. Sometimes these code books are very, very large. So in some sense, the entire theory of computation is about how to efficiently find which code was sent. I don't care about it. I assume that computationally you're not limited. The sender and the receiver have arbitrarily high computational power. The sender sends some code. The receiver receives that code. And they look up in the code book which emoji was sent or which horse was sent or which letter or which cell in the Drosophila embryo. Okay. I'm going to use k is equal to log base 2 of M. And this is a relevant thing because naively, if I had 1,024 different messages, I have to use 10 bits, because there are 1,024 different rows in this code book. So let's discuss a particular event, a horse race. And these are the names of the horses. Let's call them horses 1, 2, 3, 4, 5, 6, 7, 8. There are 8 horses that are going to run a race. And the game I'm going to play, there's a beautiful movie with Paul Newman and Robert Redford called The Sting. Somebody seen it? So The Sting has this little sting operation where they delay the telegraphic message from the horse race track by just enough so that these guys can place bets, and then they get the delayed message and these guys make money. So the idea is I'm going to transmit which horse won the race by using 0s and 1s. So how many bits do I need to send if there are 8 horses? It's very simple. It's 3 bits.
And what's the code? 0, 0, 0 to 1, 1, 1. Now why would I send, or how could I possibly send, fewer bits if there are 8 horses in this one race? Can I do any better than this? Is there any other code I can use which has a length shorter than 3 bits per race? Any guesses? Suppose I use 2 bits, what will go wrong? There's degeneracy. It's not decodable. By the pigeonhole principle, at least 2 different horses will have to have the same code word. So 2 bits is not going to work. So you need at least 3 bits. And yet my claim is I can use fewer than 3 bits per race, on average. So how on earth can I use fewer than 3 bits to transmit information about 8 horses? The answer is because, yeah? Very good. So N is the total number of races or total number of events. Events or messages, transmissions, right? So imagine there are N races. As soon as there are N races, your options become much more interesting. So the point is, if there are many, many races, there are two ways to decrease the number of bits you send per race, if all you're interested in is the length of the code. So let's call this the code. I'm going to shift this a bit to the right. K is log M and N is the number of messages sent. So this is the length of code word i, L sub i. In this case it's 3 for each of the 8 horses. So this is i, L sub i, and there will be some number of bits here. For one race there's no way to do this any better than 3 bits for 8 horses. But if you're willing to wait for many races, then you can do better in two rather different ways. And today we're going to develop what those two very different ways are. One way is you remove the requirement that all the code words are the same length. Now why would you do that? You do that because you have some prior information about which horse is more likely than which other horse.
Or you have some prior information about which English letter of the alphabet is more likely than which other letter. So let's write down the probability of each horse winning the race. Suppose these horses are sorted from the favorite to the least favorite, right? So let's say this horse is one half probability, one fourth, one eighth, one sixteenth. And then let's say the last four are all the same: one sixty-fourth, one sixty-fourth, one sixty-fourth, one sixty-fourth, so that everything adds up to one. Now suppose I knew this ahead of time. I'm not saying how you know it. But let's say this is reliable information. Then what code should I use? So somebody please propose a code for the first horse that's going to win half the time. Zero. Okay. So what should the code be for the second horse? One. Okay. That's it. It's all over. What's the problem with this code? So remember, I'm sitting over here and I'm sending, and you're sitting over here and you're receiving, and all you see is a stream of zeros and ones. And maybe you know the starting point. But you don't know anything else. You're only allowed to send zeros and ones. You can't send anything else. Okay? If that's the case, what's the problem with this code? You don't have a comma, you don't have a separator. You don't know when the race code has finished. Okay? So as far as I'm concerned, this means horse one, one, two, one, two and so on. I can't tell them apart. No matter what you write down here, I can't distinguish that from this other one. So for this code, for the moment, let's say this horse, you make it zero zero and so on. You can make them all different. It's not degenerate, but it's not decodable. It cannot be decoded. Okay? So can you come up with a better code? This is a terrible code. Come up with a better code. This is awful. Yeah. So what's the problem? If you're using zero for the first horse, then somehow zero, in a sense, has to be commented out.
You have to have some exception. So what do you do for the second horse? One zero. Okay. That's a good one. Right? So as soon as I see zero, I say that's it. The horse won the race and I put a little block around it. I see zero again and I say the first horse won. Now I see one and I say, wait a minute, the second code is one zero. So I put a block around both those because I know one zero is a code. And I say one zero again. So horse one, one, one, then two, and then two. What should the next horse be? One one zero, okay? Let's see. What happens to horse four? It can't be one, it can't be zero, it can't be one zero, it can't be one one zero, because those are all taken or would be confused with the others. It could certainly be one one one zero, right? Right? This is possible, right? One, two, three, four, five, six, seven, eight. This is a possible code, right? Zero is the delimiter. And then you just count how many ones show up between the zeros. This is a unary code for the horse race. Yeah? Now what's the average length of this code? So it's half times one, plus one fourth times two, plus one eighth times three, plus one sixteenth times four, and so on, up to one sixty-fourth times eight, right? Whatever that is, it's a number. And somebody who can quickly do the computation while I'm talking, work out what that number is, right? So this is a terrible code.
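The computation the lecturer is asking for, done explicitly. Note that for the probabilities to add up to 1, the last four horses must each have probability 1/64 (a small sketch in Python):

```python
from fractions import Fraction as F

# Win probabilities: 1/2, 1/4, 1/8, 1/16, and 1/64 for each of the
# last four horses, so that the distribution normalizes.
p = [F(1, 2), F(1, 4), F(1, 8), F(1, 16)] + [F(1, 64)] * 4
assert sum(p) == 1

# The unary code from the board: (i) ones followed by a terminating zero.
codes = ["1" * i + "0" for i in range(8)]      # "0", "10", ..., "11111110"
lengths = [len(c) for c in codes]              # 1, 2, 3, ..., 8

avg = sum(pi * li for pi, li in zip(p, lengths))
print(avg, float(avg))                         # 65/32 = 2.03125 bits per race
```

So even this "terrible" unary code already beats the 3 bits per race of the fixed-length code, because the likely horses get the short words.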
This is a terrible code because you're really using very long code words to compress what's happening in the horse race. Yeah? But it does work. Okay? So in terms of the space of possible codes: all codes are here. By all codes, I mean any way that I can write down rows of zeros and ones for a bunch of messages. There could be good codes, bad codes, whatever. And inside there is a collection of what are called non-singular codes. A code is non-singular if no two rows are the same. That still doesn't mean it's decodable, right? So then there's a further set in here which is called decodable. And decodable captures this idea that all you know is the starting point of the whole process. You don't know the delimiter. You don't know where to stop. And yet you're able to tell, as soon as you've received the code, which horse won. So hold on to that for a second. Right? So this is a terrible code. So let's come up with a shorter code. How would you come up with a shorter code? The answer is that you really want to think a little more systematically. So think about where each of the letters is taking you. This is where you go if it's a zero; if it's a one, what happens? Right? And so on. And maybe further, right? So what this is is a decision tree. The decision tree starts off with... somehow you know this is the beginning of the code. And by the end of the decision tree, you should be able to tell which of these eight horses won. One option for a decision tree is one that stops right here. One, two, three, four, five, six, seven, eight. So if a decision tree stops right here, then there are eight leaves, and the code must be 000 up to 111. There's no option. Now what we're going to try and do is to make the first horse zero. If the first horse is zero, then all these guys are out of bounds. Because if the first horse is zero, you don't want another code that could be zero plus a bunch of other things.
Because you don't know whether that means the first horse, or some other horse that also starts with zero. Okay? So this is horse number one. And then you go to one. And okay, we can make the second horse one zero. So let's do that just as we had before. And then all these guys are out of bounds, because another horse whose code is one zero one will not be distinguished from horse one zero plus something else. So that's horse number two. This is one. The next horse could be 110, which is what we did. But bear with me, this is going to get more interesting. There are going to be more trees here. Horse number three. We have now horse number four, which is the last one of this pattern, which is 1110. So far so good. And now, because the remaining four horses all have the same probability, there's no reason for us to give some of them smaller codes and some of them bigger codes. So what we really want to do is give them codes of a uniform size. So we're going to give them all the prefix 1111. And now how do we make them all different? Well, there are four horses, so there are four two-bit suffixes. Is everybody happy with this code? It is a code. Not only is it a code, it's something called a prefix-free code. Yes. I'll explain why this method. It's a bit strange, right? But what I'm trying to do is to give horses that are more likely to win shorter words. And that part of it is quite clear, right? So that I can get the average length to be short. That's what I'm trying to do. However, these four horses are all equally likely to win. So why should I give some of them shorter words than others? If you used the unary code from earlier here, it would have an average length greater than the one I've actually given it. So you can do the calculation. And if you notice something: is there a relationship between the length of the code word and the probability of the horse winning in this particular case? Yes, there is.
The length of the code word, if it's L_i, then the probability of the horse winning is like 2 to the minus L_i: 2 to the minus 1, 2 to the minus 2, all the way down to 2 to the minus 6. So there's some pattern going on here. I sort of know what the answer should be. I'm going to work it out. You've asked exactly the right question, right? So does everybody agree, though, that this is a correct code? It's correct because it's non-singular. I know it's non-singular because no two rows are the same. Yeah? It's also decodable. How do I know it's decodable? Because as soon as I get to one of these leaves, I say next horse and then I keep going. So there's no code word which is a prefix of some other code word. So if you're prefix free, you're uniquely decodable. So that's kind of nice. And this is how you make the code. So here's my question to you guys. If you work out the average length of this code, it's actually shorter than the previous one. So here's a code where, as it happens, I used L_i equal to log base 2 of 1 over p_i. And in that case, the average length of the code is the sum of p_i log 2 of 1 over p_i. Or, if I pull out the minus sign, minus the sum of p_i log p_i, where all my logs are going to be base 2. And this is the first case so far today where you've seen a function, an expression, that's derived from a bunch of probabilities. And that function is a famous one, and it's known as the entropy. It's usually written H. So let's just pause here for a second. I've given you a code for a horse race. For the average length of the code, I've used a trick for reasons I haven't explained to you, where the length of every code word is shorter for more likely horses. But not in some arbitrary way. In fact, the length of the code word is log of 1 over the probability. If the probability is small, the length is large. I've used that.
In this particular case, I've shown that this code is a prefix-free code, because no code word lies downstream of a previously existing code word. And I've shown that the average length of this code, in terms of the p_i's, is a very famous function which is called the Shannon entropy. The Shannon entropy is defined as minus the sum of p_i log p_i. And we'll get more familiar with this tomorrow and the day after. Let's pause and look at the Shannon entropy. It's named after Claude Shannon; he was the person who came up with this idea. Just like I said with emojis and so on, you're not interested in transmitting what the horse looks like, or transmitting the genome of the horse, or whatever. You just want to transmit which horse won the race, from one to eight. So information theory removes the meaning of the message completely. So maybe it's a sort of misnomer: information theory is not a theory of meaning. By looking at these ones and zeros, you can't understand whether I'm talking about horses or cells or letters of the alphabet, nothing. You're just talking about which entry in a list of previously agreed entities. The previously agreed entities could be entire horses, could be people in this room, whatever. The second thing: in the simple formulation of information theory I'm giving you, these races are independent and identically distributed. It is an assumption. If the assumption is valid, then all that matters to calculate the entropy is the individual probabilities. In fact, it doesn't even matter in which order the probabilities are given, because this expression is unchanged by permuting the probabilities. So all you have is a list of numbers, and it only matters what that list is. The permutation of that list doesn't matter. So that's the Shannon entropy. And this is a prefix-free code: I gave you the code, and you can work out that this is the expected length of the code.
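That expected-length claim can be checked directly. A small sketch, using the horse-race probabilities from the board (with the last four equal to 1/64 so they normalize):

```python
import math
from fractions import Fraction as F

p = [F(1, 2), F(1, 4), F(1, 8), F(1, 16)] + [F(1, 64)] * 4

# The prefix-free code built on the tree: l_i = log2(1/p_i) for each horse.
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

# Prefix-free: no code word is the beginning of another one.
for a in codes:
    for b in codes:
        assert a == b or not b.startswith(a)

avg = sum(pi * len(c) for pi, c in zip(p, codes))
H = -sum(float(pi) * math.log2(float(pi)) for pi in p)
print(float(avg), H)          # average length equals the entropy: 2.0 bits

def decode(bits):
    """Read the stream left to right; a horse is known the instant its
    code word completes, which is why the code is called instantaneous."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in codes:
            out.append(codes.index(buf) + 1)
            buf = ""
    return out

print(decode("0" + "10" + "0" + "111101"))    # horses [1, 2, 1, 6]
```

So this code spends 2.0 bits per race on average, beating both the 3-bit fixed code and the unary code, and the average hits the entropy of the distribution exactly.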
How do you know that there's not some other code which is even better than this one? Maybe I could make some of these shorter. How do you know it's not possible? In general, the question is: if there are M horses whose probabilities are given as P1 to PM, how do you know that there is not some other prefix-free, uniquely decodable code that does better? The answer is interesting and important. By the way, let's just pause here for a second. Suppose instead of the 8-horse race I have 4 horses. I'm going to give you 4 potential codes and I want you to see which of these categories each one falls into. This is code 1, code 2, code 3, and code 4. These are four possible codings for the same event. Now can you tell me which of these... So this code is what? Where in this little diagram does C1 lie? It's here. It's a code, but it's singular. What about C2? C2 is non-singular, certainly, every row is distinct, but it's not decodable. Code 3 is interesting. Well, let's look at code 4 first. Is code 4 completely, totally fine? Code 4 is certainly decodable, because nothing is a prefix of anything else. So code 4 is here. What about code 3? Let's take a look at that. It's very subtle. Take a look at it and see what the problem is with code 3. So suppose I sent 0, 1, then you know which horse won: 3. Right? And then I can send a 1, 0, then another horse won. But the problem is I could send... So the idea is that there's a problem with this code: certain strings of these cannot be decoded right away. So try to come up with a string where, up to a certain point, you don't yet know what the answer is. 0, 1, 1, 0 is okay. You finish 0, you finish 1. Then the next one is certainly 1, 0. 1, 1, 0 is still okay. 0, 1... We seem to be picking very easy cases here.
So I'll tell you the answer and then maybe you can construct the string. This code is decodable but it's not decodable at the end of every message. If I finish the message, you have to wait for further information to know that the message is actually finished some time in the past. So we need to construct a sort of string where that thing doesn't work. So let's see if we can construct a string that goes like that. I'll leave it as a little homework problem. It's actually pretty trivial. You just construct a bunch of 0s and 1s here and you'll find that there's still some ambiguity. There's still some ambiguity to what the answer is until you get to the end. So this is called a non-instantaneous code. 0, 1, 0, 0, 1... Okay, fine, fine, fine. Here we go. I'll leave you guys to work it out. Okay. So let's work out now why I cannot compress this system to any smaller than this guy. And the answer turns out it's a bit like you have a balloon that's inflated to some level. If you try and compress one part of the balloon, some other part actually has to expand. Okay? And so the way we're going to do it is the following thing. We're going to take one of these trees up to some level. We'll take that tree up to some level. And let's call the total depth of this tree Lmax. All right? So L is 0, 1, 2, in this case Lmax is 4. Okay? And I'm going to put a bunch of code words on this tree. And I'm going to highlight the code word, let's say in some other color. It's green. Let's say this is a code word. And that code word is at some length L1. Okay? And this code, remember, 0s and 1s, right? 0, 1, 0, 1. Always 0s on top and 1s on the bottom, for example. Right? If 0 itself is a code word and the code is instantaneous, I can decode it as soon as the code word is over, what are all the other code words I can't use? I can't use any of these. So all those code words go out the window. So how many code words will there be at the tip if I have a code word of length L1 inside? Right? 
It's 2 to the power Lmax minus L1, because that's the number of leaves that I'm not allowed to use since I've used this one. Right? So 2 to the power Lmax minus L1, that's gone. And let's say I put L2 over here. Right? Then I lose 2 to the power Lmax minus L2. Is everybody getting the logic here? If a code has to be free of prefixes, then as soon as I make a code word, any other code word up to a certain length that has this as the prefix is no longer allowed. Right? This is the number of code words that's no longer allowed. And let's say the last two code words are just these two. Right? For each one of those, I lose exactly one. I lose 2 to the Lmax minus L3, 2 to the Lmax minus L4, and in this case that's 2 to the 4 minus 4, just one leaf each. Maybe in fact I didn't even have four code words. Maybe there are only three horses running this race. So what can you say about these numbers? One thing we know is that the total number of leaves at depth Lmax is exactly 2 to the Lmax. Okay? So it must be that the total number of leaves is greater than or equal to all the leaves that you had to have used up in your system. I'll go through the logic again. If I want to make a code for horses, as soon as I use a code word, anything that has this as a prefix is out of bounds. And so using a small code word throws out a large number of subsequent code words. In particular, anything that starts with a zero, up to a certain maximum length, is not allowed anymore. The shorter the length, the larger the number of words that's not allowed. The longer the length, the fewer. In fact, if I use a word at Lmax itself, it's just that one code word which is not allowed. Okay? If I rewrite this and divide by 2 to the Lmax, I get 1 greater than or equal to the sum of 2 to the minus L_i.
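This bound is easy to check numerically for the codes built so far (a quick sketch):

```python
def kraft_sum(codes):
    """Sum of 2^(-l_i) over all the code-word lengths."""
    return sum(2.0 ** -len(c) for c in codes)

# The prefix-free horse-race code: it uses the tree exactly, so the sum is 1.
prefix_free = ["0", "10", "110", "1110",
               "111100", "111101", "111110", "111111"]
print(kraft_sum(prefix_free))       # 1.0

# The unary code: also prefix-free, but wasteful; the sum stays below 1.
unary = ["1" * i + "0" for i in range(8)]
print(kraft_sum(unary))             # 1 - 2^-8 = 0.99609375

# Eight horses with 2-bit words would need a sum of 8 * 1/4 = 2 > 1,
# so no prefix-free code of that shape can exist (the words must repeat).
too_short = [format(i % 4, "02b") for i in range(8)]
print(kraft_sum(too_short))         # 2.0
```

A sum of exactly 1 means the tree is fully used; a sum above 1 means the proposed lengths are impossible for any prefix-free code.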
Okay? Is the logic clear about how I calculated this little thing? This is called the Kraft inequality for prefix-free codes. Okay? Are there any questions about this? This code is instantaneous. So for the moment, I'm going to do instantaneous codes. There are non-instantaneous codes, and those in principle allow more flexibility, because you don't use this construction. So for the moment, I'm only going to do the calculation for instantaneous codes. Instantaneous doesn't quite mean prefix-free, but pretty much: if it's not prefix-free, it's not instantaneous. For sure. That's true. There was a code which was uniquely decodable and not instantaneous. So now we're just looking at instantaneous codes. So there's another little line in here which is prefix-free. Sorry, not prefix-free. Instantaneous. Okay? So you're asking what is the sufficient condition for instantaneous. So I'm saying prefix-free is certainly a sufficient condition for instantaneous. And you're asking, is it a necessary condition? And I won't treat that case. So let's assume it's prefix-free. Then it's definitely instantaneous, right? Instantaneous code, yeah. I think it's then worth going back to the little codes that I'd set up previously, and let's work out exactly why there's a problem with one of them. So here's a code. I mean, I really should have done this in preparing for the class. But here's a code: 10, 00, 11. So this is fine. It's fairly obvious why it's not instantaneous, right? Suppose horse number 3 wins, then I send 1 1, right? Now, if all you've seen so far is 1 1, you don't know if horse number 3 is the winner or if horse number 4 is the winner. Okay? Even if it's 1 1 0, I don't know if horse number 3 is the winner followed by horse number 2, or horse number 4 is the winner. Okay? But the next bit will totally resolve the issue. Right? If it's 1 1 0 0, then I know that it splits like this. Right?
If it's 1 1 0 1, then I know that it splits the other way. But the next one could either be 1 0 or 1 1. So it's a subtle situation where I have to go beyond the first race and look back to see who actually won the first race. Yeah. So that was the problem. That was my mistake earlier. So this is the obvious one. Yeah. You don't need any more information than just looking at this code, right? Once I've given you a whole string, it will be uniquely decodable into a collection of zeros and ones. There's no further information. All you have to do is to make sure that it is just a collection of legal code words. Yeah? But you can't tell at the end of each race, whereas for a prefix-free code, as soon as the race is over, you can tell. Yeah? So my mistake last time was, I wrote down, I think, something else here, 0 1, and that wasn't even uniquely decodable, right? So there was some issue with that. Anyway, so here's an example of a uniquely decodable but not instantaneous code, okay? Now, if it is a uniquely decodable and instantaneous code, then it must be prefix-free. If it's instantaneous, it has to be prefix-free, right? Because otherwise, one word is a prefix of another one and you don't know if the first has finished before the other one has started, okay? So in particular, any constraint that applies to every prefix-free code applies to every instantaneous code, okay? So here's the constraint. For every prefix-free code, the sum from i equals 1 to M of 2 to the minus L_i is less than or equal to 1, by the little method of cutting the tree. This is the Kraft inequality. And you'll see why this is important in just a second. But let me explain what's going on here. What are the parameters here? The parameters are the total number of different messages, and the length of each code word, okay? In general, if you have more messages, you're going to have more terms in the sum, yeah?
And if you have more terms in the sum, but they all have to add up to less than 1, then the lengths have to be longer. So in general, the more messages you have, the longer the code words you have to use. That's the first lesson, okay? The second lesson is even more powerful. If some of your code words are very short — for example, if one has length 1, that term alone is one half, which already gets you halfway to 1. So if some of your code words are short, all the other ones are forced to be quite a bit longer, yeah? Now, your code words can be arbitrarily long and still satisfy this, right? If there are eight horses, but each of your l_i's is like 100, there's no problem satisfying this. If all your code words are equal in length, then l_i is something like the ceiling of K, where K is log base 2 of m. If l_i is the ceiling of log base 2 of m, then this works out: the sum is m terms, each roughly 1 over m, and that adds up to about 1, okay? So this really makes sense. This captures, in a sense, all the internal constraints of making an instantaneous code. You can have the lengths all the same, in which case you need about log of the number of messages. You can have them all very, very long — no problem at all. But if you make some of them short, other ones have to be much longer, right? So it's like squeezing a balloon on one side, and the balloon expands on the other side. That's what this inequality is doing for you. Just feel the mechanics of it, okay? So now we're going to use this Kraft inequality to calculate what the optimal lengths would be for the codes we had seen earlier, okay? So in general, I want to minimize the expected length of the code, L, which is the sum over i of p_i times l_i. Minimize over what variables? Over all the l_i's. In other words, I want to minimize this quantity over all the l_i's. If I want to have a code — I mean, information is expensive, right? It used to be.
So sending one bit of information costs money. Sending one bit of information to Pluto is very expensive, for example, yeah? And yet we manage to do it. So we want to use as few bits as possible to do the job. So we'd like to minimize the expected length of the code word. Now you could come and say, well, let me just make all the l's as short as possible — let me make all the l's one. If I make all the l's one, it's not allowed, because the code becomes singular; it's not even uniquely decodable. So the objective by itself doesn't prevent you from making all the l's as small as possible. What prevents you from making all the l's as small as possible is this inequality: subject to the sum from i equals 1 to m of 2 to the minus l_i being less than or equal to 1, right? This is the optimization you want to do. This piece — the expected length — pushes you to make all the l's small. This piece — the constraint — says, well, wait a minute: if you make some of the l's small, others are going to become big. So you're not allowed to control all the l's separately and arbitrarily, right? So in general, how do you solve this problem? You solve it using a Lagrange multiplier, right? So you add lambda times the constraint. How many of you have not seen the method of Lagrange multipliers? You've all seen it. Very good. This is just a constrained optimization method, right, which allows us to do optimization with simple calculus. So you want to minimize this function of all the l_i's. So let's minimize it. Question? Yeah? So, the standard way: differentiate with respect to each l_i. So dF/dl_i will be equal to p_i — because the derivative of the expected length with respect to l_i picks out only the p_i l_i term — plus lambda times the derivative of the constraint, and only one term in the constraint depends on that particular l_i, right?
So you get lambda times minus 2 to the minus l_i times the natural log of 2 — you get that factor because differentiating 2 to the x brings down a factor of ln 2, yeah? Yeah. Okay, good, good, good. Thank you. Good point. This is an inequality, okay? And now we've made it an equality. The justification for that will come a little while from now. But the justification is roughly that equality is the best you can do. Okay? So now we're making two leaps of faith — or not leaps of faith, but two justifiable relaxations, right? The first one is: let's find the solution over all l_i's, whether they're integers or not. That solution must be at least as good as the true solution. So allowing the l_i's to vary over real numbers is the first piece. And the second piece is replacing the inequality by an equality: given any code that satisfies the strict inequality, I can make the l's smaller until it becomes an equality, yeah? So by doing this, we get a hard lower bound on the length. The actual solution will be worse than this for two reasons. Reason number one, the true l's have to be integers. And reason number two, we've treated the inequality as an equality. Thank you for the question. So I cheat, right? I do this all the time — I differentiate with respect to things which ought to be integers. I did this in the first week also. So: the derivative of the constraint piece with respect to l_i is minus lambda times 2 to the minus l_i times the natural log of 2. Is this fine? It's log base e of 2, yeah? So what do we find? Setting the whole derivative equal to 0, you find that p_i is equal to lambda times ln 2 times 2 to the minus l_i, right? So now, what is the next step of the Lagrange multiplier method? You take the derivative with respect to lambda, and that simply enforces the constraint, right?
In this case, the constraint is that the sum of all the 2 to the minus l_i's has to equal 1. So let's put the pieces together. We have p_i equal to lambda times ln 2 times 2 to the minus l_i — that's correct. So 2 to the minus l_i is equal to p_i over lambda ln 2. Thank you. That's the way it goes, right? Now sum both sides over i: the sum of the 2 to the minus l_i's is 1, by the constraint, and the sum of the p_i's is 1, because they're probabilities, right? And therefore lambda ln 2 equals 1, so lambda is equal to 1 over ln 2. And so the lambda and the ln 2 cancel each other, and you get p_i equal to 2 to the minus l_i, right? And this is the reason why I used a code like I did here, yeah? So just look at the logic. I said the code has to be instantaneous. If it's instantaneous, in particular, it has to be prefix-free. If it's prefix-free, I can't make some lengths shorter without making other lengths longer. Nevertheless, it is quite useful to make the lengths different.
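Written out cleanly in one place, the board calculation is:

```latex
\begin{aligned}
F(l_1,\dots,l_m,\lambda) &= \sum_{i=1}^{m} p_i\, l_i
  + \lambda\Big(\sum_{i=1}^{m} 2^{-l_i} - 1\Big),\\
\frac{\partial F}{\partial l_i} &= p_i - \lambda\,(\ln 2)\,2^{-l_i} = 0
  \quad\Longrightarrow\quad p_i = \lambda\,(\ln 2)\,2^{-l_i},\\
1 = \sum_i p_i &= \lambda \ln 2 \sum_i 2^{-l_i} = \lambda \ln 2
  \quad\Longrightarrow\quad \lambda = \frac{1}{\ln 2},\\
&\Longrightarrow\quad p_i = 2^{-l_i},
  \qquad l_i = \log_2 \frac{1}{p_i}.
\end{aligned}
```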
So to find the optimal collection of lengths, I used this little calculation, which shows that if I assume all the l's can vary continuously, and if I assume that instead of an inequality I can have equality — which is the best I can do — then I find that p_i is equal to 2 to the minus l_i. So the lengths are l_i equals log base 2 of 1 over p_i. That's the best I can do. And our code hits that limit exactly. The expected length is then the sum of p_i times log of 1 over p_i, which is the entropy — proving, in a sense, that the entropy of the distribution is the shortest average code length for that distribution. Now, this entire thing that I did just now is to motivate the definition of the entropy. The entropy H of p turns out to be the answer to a very interesting question: on average, what is the shortest code that you can make for some collection of messages that are independent and identically distributed? So what we're going to do at the beginning of next class is two things. One is we're going to show that any real code can only do worse than this, right? But it's only off by one bit. So any real optimal code will be at most H of x plus 1. And that plus 1 happens because the l_i have to be integers. So there are optimal codes. Question? The second piece we're going to prove: if you want to code a certain distribution and you know all the p_i's, there's a recipe to get the code, and the recipe is the following thing. Are there any questions about the derivation before I erase it? Yes — is it p over log 2 in this one? Did I move things around again? Yeah. Whichever way it is, it turns out to be the correct factor, and it always turns out to be the correct factor, because changing the base of the log doesn't change the result of the calculation. Okay.
So I might have put something in the top and the bottom, but it's an easy calculation. Okay. So I'm going to give you a recipe for making a code. The recipe is called Shannon coding, and it works the following way. Suppose I'm given a bunch of p_i's. Next to every p_i, write down the length you want to use. Okay. And what length would you want to use? You want to use l_i equal to the ceiling of log base 2 of 1 over p_i. This little ceiling symbol means the lowest integer that is greater than or equal to that thing. So 3.3 goes to 4; 4 stays at 4. Okay. Now, the reason I have to take the ceiling is that in general these p_i's are not all conveniently powers of 2. They're not. So this number will have a little bit of excess. So how much bigger is the ceiling of log base 2 of 1 over p_i than log base 2 of 1 over p_i itself? Right? It's some number delta_i less than 1. Yeah? So if I write down all these ceiling functions, a couple of things happen. If the original p_i's satisfy normalization, then these l_i's must satisfy the Kraft inequality, because these l_i's are at least as big as log base 2 of 1 over p_i. Right? Look: if l_i were exactly log base 2 of 1 over p_i, then 2 to the minus l_i would just be p_i, and the sum of all the p_i's is exactly 1. So 2 to the minus something bigger will obviously sum to something smaller than 1. Right? So the Kraft inequality comes for free with this recipe. Now, how much is the excess? The excess is delta_1, delta_2, and so on. When you add up the expected length of the code, you get the entropy piece, and then an additional sum, which is the sum of p_i times delta_i. That's less than the sum of p_i times delta_max, and the sum of the p_i's is 1 and delta_max is at most 1.
Therefore, whatever excess you incur by rounding up the lengths adds at most one bit to the average length of the whole code, because that excess is spread over all the code words. Right? So the excess is at most one bit. So l_i is the ceiling of log base 2 of 1 over p_i. Okay? So the l_i's for our horse race would have been 1, 2, 3, 4, 6, 6, 6, 6. Right? For our eight horses, those would have been the l_i's. And once I know all the l_i's, here's the way you generate the code. You walk through the tree. Right? You make a tree which is as deep as the longest length — in this case, depth 6: 1, 2, 3, 4, 5, 6. So you make a big tree; there are two more branches I haven't drawn, right? And then you merely start. The first code word always goes on top: there's one of length one. And you erase everything below it. And then you say, here's one of length two, and you erase everything below it. Then you say, I need one of length three, and you erase everything below it. And then you say, I need one of length four, which is this guy. And then you make the ones of length six down there. Okay? And then it's one o'clock — I'm going to stop now, but my point is: given a collection of alphabet symbols or horses or whatever it is you want to encode, and assuming you want to decode instantaneously, all you have to do to find a working code is take all the probabilities and take the ceiling of log base 2 of 1 over p_i. Yeah? At most you pay one extra bit for that rounding. Right? And then to get the actual code words, since the lengths satisfy the Kraft inequality, there's always a collection of code words that will work. In particular, you just walk through the tree and stop at a node at the right depth. Choose that as the code word. Remove all the downstream nodes. Then go somewhere else in the tree, down to the correct depth, choose that as the code word, remove all the downstream nodes, and keep going till you've finished.
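The whole prescription — ceiling lengths, the automatic Kraft inequality, and the tree walk — fits in a few lines. This is a sketch of the idea, not the lecture's exact board construction; the tree walk is implemented by handing out codewords in order of increasing length and "pruning" each chosen subtree by moving to the next free node:

```python
import math

def shannon_code(probs):
    """Build a prefix-free binary code with lengths ceil(log2(1/p_i)).

    The 'tree walk' is done arithmetically: codewords are assigned in
    order of increasing length; after each assignment we step to the
    next free node, which removes the whole subtree below the chosen
    node from consideration.
    """
    lengths = [math.ceil(math.log2(1.0 / p)) for p in probs]
    # Kraft comes for free: 2^-l_i <= p_i and the p_i sum to 1.
    assert sum(2.0 ** -l for l in lengths) <= 1.0
    code = {}
    node = 0      # next free node at the current depth, as an integer
    depth = 0
    for i in sorted(range(len(probs)), key=lambda i: lengths[i]):
        node <<= (lengths[i] - depth)   # descend to the new depth
        depth = lengths[i]
        code[i] = format(node, "0{}b".format(depth))
        node += 1                       # prune this subtree, move right
    return code

# Dyadic horse race: p = 1/2, 1/4, 1/8, 1/16, and four horses at 1/64.
probs = [0.5, 0.25, 0.125, 0.0625] + [1.0 / 64] * 4
code = shannon_code(probs)
print([len(code[i]) for i in range(8)])  # lengths 1, 2, 3, 4, 6, 6, 6, 6
```

For these dyadic probabilities the ceilings are exact, so the expected length hits the entropy (2 bits) with no excess at all.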
Okay? It's a direct prescription. So this is called Shannon coding. It's a prescription for the code. Okay? So what will we do next time? A few things. What if I was wrong about which horse is going to win the race? What if I think horse one is the most likely to win, but horse two has actually improved its performance since the last time I made my measurement, so horse two is actually winning more often? Then the actual performance of the code will be worse than my assumed performance. Right? So how much worse will it be? We're going to calculate that. And then we're going to go through the definition of something called mutual information, which tells me, given a code word, what is the best inference I can make about the actual thing that was sent. Okay? Since we started about five minutes late, if you just give me five more minutes, I'm going to give you a little roadmap of where this course is going to go for the next two classes, right? So five minutes, okay? Since your whole afternoon is free anyway. So here's the roadmap. If there are M messages — say I want to send information to somebody on the other side of the room or the other side of the solar system — if there are M messages, naively I need K bits, where K is log base 2 of M, right? That's the naive expectation. I need K bits, yeah? Now, can this number of bits be decreased, or does it have to be increased? What we've shown is that if there are M messages, I can actually send fewer than K bits, okay? I haven't fully proved it, but I can actually use H of p bits on average — meaning after a hundred races, if I divide the total number of bits I sent by the total number of races, I'll come very close to the average, which is H of p. So I can actually use fewer bits on average, okay? And this is interesting.
The little calculation we did just now demonstrates that for instantaneous codes, you can't do any better than this. Now, you might wonder: for non-instantaneous codes, can you do better? And next time I'll prove that for any coding scheme you want, you can't do any better than this, okay? This H of p is an absolute limit on data compression. And it has to do with the statistics of the original collection of events — it has to do with these p_i's. Now, here's a question. Why on earth would you want to send more than K bits? That doesn't make any sense. Why would you send more than K bits? So does anybody have a scenario where you might have to send more than that number of bits? Because of noise. It could be that half your bits are corrupted, and therefore you can't trust that your partner is going to receive the information that you send. Now, this first part is fairly straightforward — I've been able to prove it in a single class, right? In fact, just half a class. The other side of it is difficult, subtle. I hope I can get to it before the end of the week, at least to give you a flavor of how it works, right? In principle, you'll need some number of bits bigger than K, and what controls it is a quantity — call it I — that represents something about how accurately this channel sends information. If the channel gets just half the bits through, then I will be one half. It's always a number less than or equal to 1, if you work in bits, use log base 2, and your channel sends zeros and ones. This is called the mutual information, and I'll define it next time. Whatever it is, it's a measure of how much information you're losing. If the mutual information is low, then the number of bits you have to send is quite high. But to finish this motivating example, suppose I have a channel which, if I send zero, then with probability 90% you get zero, but with probability 10% you get one.
And if I send one, with probability 90% you get one, and with probability 10% you get zero. So the channel sends zero to zero with probability p and zero to one with probability one minus p, and conversely. Suppose this is the channel. There's noise: if I send a zero, you might get a one. Here's a very simple way to fight the noise. Instead of sending one zero, I send five zeros in a row, and I call that the true zero. And instead of one one, I send five ones in a row, and I call that the true one. The chance of five ones in a row becoming five zeros in a row goes like a power of the flip probability. That's called a redundancy code. Even with a redundancy code, there's still a small chance you'll make an error. Even if I send five zeros, with a chance of one minus p to the power of five it might become five ones, and then you will decode it incorrectly. So making the code longer does not get you to zero error. It gets you smaller and smaller errors. That much is obvious, right? The magic of information theory is this result from Shannon's original paper: making the code just the right amount longer gives you codes whose error can be driven to zero — essentially zero error for a finite increase in the number of bits you send. Even with a crappy channel like this, right? And it's sort of amazing that that's even possible. It should shock you. The kinds of inference problems we were looking at with the Bialek paper — transcription factors doing inference and so on — in some sense, that inference problem is the reconstruction of the message that was sent, even though there was noise. And the point is, such reconstructions can happen with essentially zero error. And that's where we're going to go. So this is called error-correcting codes. And this is the full spectrum of information theory: to make the codes smaller than K or to make them bigger than K. You make them smaller than K if there's redundancy in your data and you want to compress it.
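The five-fold repetition idea is easy to simulate. Here is a minimal sketch, assuming a 10% flip probability; I decode by majority vote, which is the natural decoder for a repetition code (the all-five-flips event described above is just the worst case):

```python
import random

def bsc(bits, flip_prob, rng):
    """Binary symmetric channel: flip each bit with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in bits]

def encode(bits, n=5):
    """n-fold repetition code: each bit becomes n copies of itself."""
    return [b for bit in bits for b in [bit] * n]

def decode(received, n=5):
    """Majority vote over each block of n received bits."""
    blocks = [received[i:i + n] for i in range(0, len(received), n)]
    return [int(sum(block) > n // 2) for block in blocks]

rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(100_000)]
received = bsc(encode(message), flip_prob=0.1, rng=rng)
decoded = decode(received)
errors = sum(m != d for m, d in zip(message, decoded))
# A block is decoded wrongly only if 3 or more of its 5 bits flip:
# roughly C(5,3)*0.1^3*0.9^2 + C(5,4)*0.1^4*0.9 + 0.1^5, under 1% per bit.
print(errors / len(message))
```

So the raw 10% error rate drops to under 1%, at the price of a five-fold increase in the number of bits sent — smaller error, but never exactly zero, which is what makes Shannon's result surprising.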
You make them bigger than K if there's noise and you want to use redundancy to recover the original message. So I'll stop here, and we'll take this up tomorrow.