Of course, a lot of that you can find in this textbook by Gardiner, called the Handbook of Stochastic Methods. You can also find other material in the PDFs I've uploaded as part of the reading material, linked to the first lecture on the schedule. And for this week's material, if you don't want to take notes, that's fine: you can find literally everything I'm teaching this week in this book by Cover and Thomas, called Elements of Information Theory. It's a really wonderful book. In fact, they're both very, very good books, and these are the kinds of books you should really buy and keep; you'll keep referring to them for the rest of your lives. OK. So let's get started. Sorry: Elements of Information Theory. It's a famous book. And today at 2:30 I'll be giving an ICTP seminar in this room. Presumably other people from campus will be here, but I do invite all of you to come, because the seminar is about one particular random process that happens inside a cell, which we have been studying. In fact, the starting point for that lecture will be the kind of material I've been teaching in this course. We're going to write down a Markov process for a cell, show that things are very, very random, and then I'm going to show how a cell can overcome that randomness to achieve something reproducible, which is a problem cells confront all the time. By the way, just to get a sense of how reproducible the processes in living cells are: are any of you identical twins? Nobody in this room? OK, that's fine. If you had an identical twin, you could have done an experiment to check how reproducibly your genome is expressed into an adult human: compare all your measurements with your twin's measurements and see how much variation there is. But since none of you have identical twins, you can do the same experiment by comparing your right and left hands. Your right and left hands develop totally independently from the same genome; there's no feedback coupling between the two. The fact that the genome manages to make such reproducible macroscopic physical objects, with the only variation being slight differences at smaller length scales, tells you how precisely biology is able to overcome microscopic randomness and achieve reproducible outcomes, using the kinds of tricks we saw in the Drosophila paper. OK, so let's get right into it. Yesterday we covered codes. Remember, we had this hierarchy: all codes, non-singular codes, uniquely decodable codes, instantaneous codes. A couple of people asked me about the definition of an instantaneous code. So remember the setup: I'm standing here, somebody else is standing there, and I'm sending a stream of 0s and 1s to that person. All that person knows is where the stream started. There are no commas, no full stops, no way to separate the code words; the code words themselves have to tell you how to separate them. One very simple way is to make all the code words the same length. Then I know the first 3 bits are the first code word, the next 3 bits are the second code word, and so on. Very simple. But if the code words have different lengths, then there has to be some cleverer way to figure out when a code word has finished.
An instantaneous code means that as soon as a code word is finished, you can interpret it. Just because a code is decodable doesn't mean it's instantaneous: we saw an example yesterday of a code where you have to read ahead a few places before you can go back and decide whether what you saw was a complete code word or not. That problem happens when one code word is a prefix of another code word. So for a code to be instantaneous, it has to be free of prefixes: if one code word is the prefix of another, you don't know whether I meant to send the short word itself or the long word of which it is a prefix. So instantaneous means prefix-free. And so we discussed how to do all these things, and we figured out some interesting things. Remember the horse race? You had eight horses with probabilities of winning 1/2, 1/4, 1/8, 1/16, and 1/64 for each of the last four. And we assigned code words of lengths 1, 2, 3, 4, 6, 6, 6, 6, in a certain interesting way. I went through the theory of how to construct these code words. It involves going up and down a decision tree, and the idea that no code word can be a prefix of any other code word means that once you pick a code word at some node, all the downstream nodes are not allowed. This leads to an interesting condition called the Kraft inequality: the sum over i of 2^(-l_i) must be at most 1. If you then minimize the average length, the expectation value of the length, subject to this constraint, with the usual Lagrange multiplier trick, you find that l_i = log2(1/p_i). And that solution shows that the expectation value of the length is equal to this thing called the Shannon entropy, defined as H = sum over i of p_i log2(1/p_i). The reason I went through all this is that it's actually quite an interesting game to figure out how to make codes. And through this interesting game, and a little bit of calculus, you discover this functional form: a single number built out of a discrete probability distribution. You take all the p_i's, put them together according to this formula, and you get a single number, whose properties we will discuss in more detail today. This single number is the answer to how many bits you need, on average, to encode the results of this horse race, and we discovered that through this little calculus problem. And remember the approximation we made; not really an approximation, but we found a lower bound to the description length, because we allowed the lengths to be non-integer when we solved the calculus problem. Since real lengths have to be integers, when you do Shannon coding you take the ceiling of log2(1/p_i); in other words, you round it up. And when you take the ceiling, you pay at most one extra bit of penalty, because that extra fraction of a bit is spread over the various code words. So for the optimal code, the expectation value of the length satisfies E[L] <= H + 1. The plus 1 is just because you're rounding things up; that's the only penalty you pay. So remember, this was for instantaneous codes, because that's where the Kraft inequality applies. Now it turns out to be a much more general result.
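If you want to check these numbers yourself, here is a minimal sketch in Python. The probabilities are the ones from the board; the rest is just the formulas above, so treat it as an illustration rather than anything canonical.

```python
import math

# Horse-race probabilities from the board: 1/2, 1/4, 1/8, 1/16, then 1/64 four times.
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Shannon code lengths: round log2(1/p_i) up to an integer.
lengths = [math.ceil(math.log2(1/pi)) for pi in p]

# Kraft inequality: sum of 2^(-l_i) must be <= 1 for an instantaneous code to exist.
kraft = sum(2**-l for l in lengths)

# Shannon entropy and average code length.
H = sum(pi * math.log2(1/pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(lengths)          # [1, 2, 3, 4, 6, 6, 6, 6], as on the board
print(kraft)            # 1.0, so an instantaneous code exists
print(H, L)             # the p's are powers of 2, so E[L] = H = 2.0 exactly
print(H <= L < H + 1)   # True: the general guarantee
```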
That more general result is what I'm going to prove to you today. And in that proof, we're going to use the idea of the true meaning of entropy, which is not that formula. That formula is the result of doing certain calculations of this type; it is not the meaning of entropy. So let me save some space on the board and go forward. Before we start, remember: you can only build this code if you already know the p_i's. Once you know the p_i's, you've fixed the length of each code word; in other words, you've given the most likely horse the shortest code word. Suppose you were wrong. Suppose some other horse is more likely to win, or the probabilities are subtly different from what you assumed. Then you're going to incur a penalty, and the question is how much of a penalty. So let's just calculate that. Suppose you thought the probability of horse i winning was some q_i, which is not equal to p_i. The q's are my assumption; the p's are the actual probabilities. Since q is my assumption, all my code word lengths are built according to q; but since p is the actual distribution, the expectation value of the length is calculated over p. So let's write down that little calculation. The expectation value of the length is the sum of p_i l_i, but now l_i is worked out according to the wrong probability distribution: l_i is the ceiling of log2(1/q_i). Notice there's no minus sign because I write it as 1/q_i: the q's are always less than 1, so the lengths are always positive. So, summing over the m horses in the race, E[L] < sum over i of p_i log2(1/q_i) + 1, because the ceiling function adds less than 1 to every term, and that 1 is spread out through the whole code. Now we're going to do a little trick, a multiply-and-divide trick: write 1/q_i as (p_i/q_i) times (1/p_i). So the sum becomes the sum over i of p_i log2((p_i/q_i) times (1/p_i)), plus 1. The reason I'm writing it like this is very important: because the log of a product is the sum of the logs, this splits into two pieces. One piece is the sum of p_i log2(p_i/q_i); I'm going to keep that as one piece. The other piece is the sum of p_i log2(1/p_i), plus the 1. And that second piece we already recognize: it's H of the p's. That's exactly why I did the multiply-divide trick: that piece is the same entropy as before. So when we make a code based on some incorrect assumption about the world, the average length contains one piece, which is what the length would have been had your assumption been correct, had you known the p's. But then there's this interesting new piece. And that piece is sufficiently interesting, and it keeps showing up in information theory, to the point where we actually give it a symbol and a name. It's written D(p||q), with two vertical lines. So let's keep that as a definition: D(p||q), by definition, is the sum from i = 1 to m of p_i log2(p_i/q_i).
And this is called the Kullback-Leibler divergence. We're going to look into some of its properties. For the moment, see what we have here. If you make a wrong assumption, if you assume the probability distribution is something other than what it really is, the effective performance of your code is going to drop, because you'll give short code words to the wrong horse and long code words to the horse that comes in most frequently. Imagine you thought the order of the horses was exactly backwards, for example: your code is going to perform very badly. And in performing badly, on top of the original length you were going to get, you pay this much penalty. So this D(p||q) is the penalty you pay. It has the flavor of some sort of distance, but not really, as I'll show you in a minute. If p is equal to q, you can see directly what the value of D is: it's 0, because every term is the log of 1. So if your assumption about the world is correct, and your q is exactly equal to p, your code performs exactly as before. But if your assumption is wrong, and q differs from p, then that difference, evaluated according to this strange-looking formula, is exactly the extra number of bits you pay as a penalty for your mistake. So it's an interesting quantity that shows up in many, many places. It's like a penalty term, and in fact we can write the answer from before as E[L] <= H(p) + D(p||q) + 1. This D, even though I keep comparing it to a distance, is not really a distance, among other reasons because it's not symmetric: D(q||p) is a completely different number. It has the flavor of a distance because it's zero when p equals q, but other than that, we make no claims. So let me prove something about this. Any questions so far? By the way, suppose this were not a horse race. Suppose I'm trying to transmit English text. Then i = 1 to m runs not over eight horses but over 26 lowercase letters, 26 uppercase letters, a bunch of punctuation marks, spaces, and so on; say about 100 symbols you want to transmit. How would you figure out the actual p_i's for English text? I give you a book, and you just count the occurrences of the various letters. But instead of a book, suppose I gave you a single sentence, and from that single sentence I asked you to estimate the p_i's for English text. What mistakes might you make? The major mistake is not that the proportions of the letters are off; they will be, but the e's and so on will mostly be there. The point is that letters like q's, z's, and x's might be totally missing. And if they're totally missing, you have no way to treat them in this system: you're going to assign them arbitrarily long code words. So what do you do in practice with totally missing entries of some expected event? You add what's called a pseudocount.
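Here is a minimal sketch of that fix, combined with the penalty formula we just derived. The four-letter alphabet, the "true" distribution, and the sample string are invented for illustration, and the smoothing is the simplest add-one variety:

```python
import math
from collections import Counter

def kl(p, q):
    """D(p||q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

alphabet = "abcd"                    # toy alphabet (an assumption, not from the lecture)
true_p = [0.5, 0.3, 0.15, 0.05]      # the actual distribution p

sample = "aababcabaa"                # a short sample in which 'd' never appears
counts = Counter(sample)

# Add-one pseudocounts: every symbol, seen or unseen, gets one extra count.
# Without this, q_d = 0 and the code length log2(1/q_d) would be infinite.
total = len(sample) + len(alphabet)
q = [(counts[a] + 1) / total for a in alphabet]

print(q)                             # the estimated distribution
print(kl(true_p, q))                 # extra bits per symbol you pay for using q
```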
You pretend that everything you don't see still occurs with some small frequency in the whole text; in fact, you can add a pseudocount to everything, seen and unseen. These are all ways in which we estimate the p_i's. The reason for this little aside is just to show you that in practice we don't always know the p_i's, and we don't always know them accurately. Therefore in practice we're going to get close to optimal, but not optimal, and this little calculation tells you exactly how far from optimal you're going to be. So let's talk about this D(p||q). It's an interesting quantity. Any questions? OK, so it's called the Kullback-Leibler divergence, and it's going to show up a couple more times in this course. How do I pronounce it? Good question; since I taught myself information theory from this book, I've never heard anybody actually say it. I just say "D p q", but you can say "the Kullback-Leibler divergence between p and q". The Kullback-Leibler divergence is, in general, a measure of the difference between two probability distributions: given a distribution p and a distribution q, it's an asymmetric measure of the difference between them. Another way you could have defined the difference between two probability distributions is to treat p and q as vectors with m coordinates each and take the Cartesian distance between them, or the cosine distance, or some such thing. This is just one more way of measuring a difference. The reason this particular formula keeps showing up in information theory is that it appears as the actual answer to various questions, whereas the Cartesian distance between two probability distributions, while it has some intuitive flavor, doesn't actually enter into the answers to useful questions. Now, the most important thing about this D, I'm going to calculate right now. Let's calculate the quantity minus D(p||q), which by definition is minus the sum of p_i log2(p_i/q_i), which is the sum of p_i log2(q_i/p_i). The reason I write it this way is the following. The log function, as a function of its argument, is zero at x = 1 in any base, and it curves downward: it's what's called a concave function. And for any concave function, if I take two points, say x1 and x2 (let me exaggerate the curvature a little to make the point), and I move to some intermediate point x = alpha x1 + (1 - alpha) x2, the chord that joins the two points on the curve always lies below the function itself. In other words, alpha f(x1) + (1 - alpha) f(x2), which is a point on the chord, is always less than or equal to f(alpha x1 + (1 - alpha) x2). So the average of the function always lies below the function evaluated at the average of the x values. That is just the definition of a concave function, and the log is a concave function.
So that being the case, look at this term: it's the average of the log evaluated at various points. So it must be less than or equal to the log of the average. Everybody happy with this? Inequalities are an important thing for you to get used to; you're rarely taught how to deal with them formally. There's no trick to it: you just have to keep track of things like convexity or concavity, and of plus and minus signs. Those are the easy mistakes to make. But the statement that this is less than that is exactly the statement that for a concave function, the chord always lies under the function itself; that's why there's a less-than sign here. And the right-hand side is very easy to evaluate, because the p_i's cancel: the sum of p_i times (q_i/p_i) is just the sum of the q_i's. So minus D(p||q) <= log2(sum over i of q_i). Now, what is this sum of q_i? These q_i's are the values of q at the points where p_i was non-zero; the values of q where p_i was zero never entered the sum. So technically this thing may not even sum to one; in general it will be less than or equal to one. So this is less than or equal to log2(1), which is zero. So because the log is a concave function, this D has a very interesting and important property: it's always greater than or equal to zero, because minus D is always less than or equal to zero. Very, very important. And in fact, if you trace through the inequalities here, you'll find that D(p||q) >= 0, with equality only when p = q. So in that particular respect it behaves a little bit like a distance. Sorry, this sum? Just look at the formula: for the terms where p_i is zero, the q_i's never enter the sum at all, so you're only adding up part of the q's. That's the only reason it can come out below one. Everybody happy with this? You should see it once; after that, you never have to see it again. All you have to remember is that D satisfies this condition: D(p||q) >= 0, and D(p||q) = 0 if and only if p = q. Only if the two probability distributions are exactly the same over the whole support is D(p||q) zero; otherwise it's strictly positive. This is interesting and important because it will show up in many little proofs we're going to do. So let's do the first of those proofs right now. Entropy and the Kullback-Leibler divergence. Let's go back and look at this entropy function. It came up last time as the result of a little calculation. Let's look at the simplest case, where the probability is over just two possible states: a Bernoulli variable, heads with probability p and tails with probability 1 - p.
Just give me a second. So you take H of this probability distribution (p, 1 - p), and you want to plot it as a function of p; p of course goes from zero to one. Let's work out what it looks like. It equals p log2(1/p) + (1 - p) log2(1/(1 - p)). It turns out to look like an arch, but it's not a parabola; it's this funny formula. It's obviously symmetric in p, and obviously zero at the two endpoints. And at the midpoint, what is its value? It's one half times log2(2) plus one half times log2(2), which is log2(2), which is one. So the curve peaks at value 1 at p = one half. This is often compressed: I might just write it as h(p). Sometimes I use the loose notation they use in this textbook: instead of writing H of the entire probability distribution, if there are just two states with probabilities p and 1 - p, I use the shorthand h(p), where p is the x-axis of this plot, since permuting heads and tails doesn't matter. Question? Yes: it's not symmetric, and that's why it's not a distance. You can also check things like the triangle inequality, and you'll have problems with it. So don't think of D as a distance. It is a perfectly well-defined notion of the difference between two probability distributions. It's asymmetric, so what? No problem, as long as you keep track of which distribution is on the left and which is on the right, which sometimes even I lose track of. But it's sort of easy to remember: D(p||q) means p comes first, you get p over q inside the log, and the first one is the important one; the first one is the one the average is taken against. So here's a question. What is the maximum value of H for any probability distribution? Over, let's say, m events, m horses, m letters, whatever: which probability distribution maximizes the entropy? You all know the answer, but let's prove it. The answer is presumably the uniform distribution, so let's see if we can show that. The way I'll do it is by writing the following equation: let's work out the Kullback-Leibler divergence between an arbitrary probability distribution p, over any number of states, and the uniform distribution u. It looks a little unmotivated, but you'll see where I'm going. So put the formula down: D(p||u) = sum from i = 1 to m of p_i log2(p_i/u_i). This equals minus the sum of p_i log2(1/p_i), that's the piece with p in the bottom, plus the sum of p_i log2(1/u_i). Now what is u_i? It's the probability of event i under the uniform distribution, so u_i must equal 1/m, because all m events are equally likely. So I can just put log2(m) there; and since log2(m) is a constant and the p's are normalized, the second piece is just log2(m). And the first piece is just minus H of that probability distribution. And this whole thing, because of the positivity of D, must be greater than or equal to zero. This is the power of defining something like D: from now on, you can just use the fact that it's positive, and its other properties, to prove things. So when the dust settles, what do you find? Is there a minus sign missing? No, that's right: the first piece is minus H.
So the whole thing is log2(m) minus H, and it's non-negative. Therefore H of any probability distribution over an alphabet of m letters is less than or equal to log2(m). You can redo the calculation yourself; I'm just erasing it to keep some space on the board. So this is interesting. The entropy, according to that formula, could a priori be anything, but it is maximized by the uniform distribution, because we know the divergence is zero only when p equals u. Only when p is uniform does the entropy reach log2(m). So if there are eight horses, the entropy is equal to three bits exactly when all the horses are equally likely; in any other case the entropy has to be less than three. And of course the entropy is greater than or equal to zero; we know that because every term p_i log2(1/p_i) is non-negative. And the entropy is equal to zero in what case? The entropy is equal to zero when only a single event can happen. Any questions about this? These are the properties of the entropy formula. Now I want you to set all that aside. OK. If you have previously taken an information theory course, you will have seen stuff like this, and that's good; it gives you some familiarity with the formulas. However, those formulas do not accurately capture what entropy really is. So now let me tell you what entropy really is, and it's not that formula. That's what entropy looks like in certain situations; what entropy is in general is something rather different. Yes? No, it's just log2(M). It's what we've been calling K: K is the log of the number of messages. Log base two, because for the purposes of this class I'm going to stick to log base two forever. If you like other logs, just use the change-of-base formula; everything goes through just fine. But if you want information measured in bits, you use log base two. It's like a choice of units, and a very practical one, because electronic circuits typically use two-state systems to encode things. OK, now what is entropy really? Remember, last time there were M distinct possible messages, we defined K as log base two of M, and we defined N as the number of messages transmitted. For the horse race, M is eight, K is log2(8), which is three, and N is, say, the 20 races run by the end of the day. Now I want you to think about what entropy really is. For the horse race problem, we worked out that you can get by with fewer than three bits per race on average. There are eight horses, so the naive code for eight horses is the eight 3-bit words 000, 001, and so on up to 111. Using the naive code, after N races you would have used 3N bits. But using the clever code, you can work out the average bits per race: it's less than three; in fact, for these probabilities it works out to exactly 2 bits per race. The other interesting thing about this code, as you can convince yourself, is that literally every string of zeros and ones that you send has a meaning.
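Here is a little simulation sketch of that bit-counting claim, assuming the codeword lengths from the board; the number of races is arbitrary:

```python
import math
import random

random.seed(0)

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # the codeword lengths from before

N = 10_000                           # number of races to simulate
winners = random.choices(range(8), weights=p, k=N)

clever_bits = sum(lengths[w] for w in winners)
naive_bits = 3 * N                   # 3 bits per race with the fixed-length code

H = sum(pi * math.log2(1/pi) for pi in p)
print(naive_bits / N)                # 3.0 bits per race
print(clever_bits / N)               # close to 2.0, and it tightens as N grows
print(H)                             # the entropy: 2.0 bits per race
```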
Every string of zeros and ones can be interpreted as the result of some series of races. So let's assume Q bits have been transmitted, and that Q is a very large number: we've been transmitting the results of horse races for a year, or ten years, whatever; there have been, say, a thousand horse races. After Q bits have been transmitted, how many horse races have been run? N must be something like Q divided by H, because H is the bits per race and Q is the total number of bits. And because of the law of large numbers, eventually you'll be very, very close to this value. So here's the funny thing. How many possible configurations of Q bits are there? There are 2^Q configurations, so there are only 2^Q possible messages. Everybody gets that? After Q bits, the first bit could be zero or one, the second bit zero or one, and so on. So after this number of races, there are only this many possible messages: 2^Q = 2^(N H). Now something really weird has happened, and maybe you didn't notice it. After N races, how many possible outcomes could there have been? In each race, any of the eight horses could have won. So after N races, the total number of possible ways the thing could have happened must be 8^N. Is that clear? And that can be written as 2^(N log2 M) = 2^(N K). So something really weird just happened. Look at these two numbers. We know H is less than K, because this is not a uniform distribution. So something is wrong. What's wrong? This 2^(N K) is the total number of ways the races could have happened, and 2^(N H) is the total number of messages I could have transmitted to my friend; and the total number of ways the races could have happened is in principle much larger than the total number of messages I could have sent. So can somebody resolve this paradox? I seem to have lost some information. Where did that extra information go? Sure, but in the limit of large N, this one is overwhelmingly bigger than that one, so that's not the resolution; in fact, the larger N gets, the more exact this counting becomes. So it's not a large-N artifact. OK, so here's what I'm getting at. The definition of entropy captures the following flavor. Suppose you're sitting as an experimentalist at the betting track, recording the results of races every day for many, many years. As a function of the number of races, you plot the log of the number of different results you have observed; by different results I mean distinct sequences of which horse won which race. One thing you're hardly ever going to see is the worst horse winning 1,000 races in a row. Probably because it would be dead by then; but remember, we are assuming ideal horses, with independent, identically distributed races.
So a horse dying is not an issue; that event never happens. And various other events never happen either. You know what other event doesn't happen? The most likely horse winning 1,000 races in a row. That never happens either: even though it's the most likely horse, that horse winning every single race never happens. And there's a whole collection of other events that never happen in practice. So if you plot the log of the number of results that actually happen, as a function of the number of races you run, you're going to get some empirical curve; it'll be an increasing curve, an empirical fact about a real experiment. If I sit and watch independent, identically distributed random events play out, and at the end I write down the total number of distinct strings I've seen, where each string records which horse won which race, certain strings will never be on my list. I just take the log of the number of distinct strings I see. That's all. The slope of this curve is called the entropy. That's what the entropy really is. Let me explain this a little further. The true meaning of entropy is this: I write down the number of distinct total outcomes I actually see, where each outcome is broken up into n individual races, or n individual letters, or whatever it is, and I plot the log base two of that number as a function of n. First of all, I will tend to get a straight line; and secondly, the slope of that straight line is the entropy. For the horse race, everything we did yesterday in effect proved that this slope is H, because 2^(nH) is the number of distinct ways things actually happen: the log of that is nH, so the slope is just H. And what is the formula for H? The formula happens to be the sum of p_i log2(1/p_i). But that's not the meaning of H; the meaning of H is the slope of this line. In fact, in some cases the formula is more complicated, for example when the events are not independent and identically distributed. Take the English language itself: if I predict the total number of strings of up to n letters, the total number of English sentences and paragraphs and texts, just from the probabilities of individual letters, that number will not be 2^(n times the entropy of the letter distribution), because one letter influences the next; these are correlated events. So the formula for the entropy may be more complicated than this, but the empirical value of the entropy is still given by the slope. And in Shannon's original paper, he does a very beautiful calculation where he works out the true entropy of the English language, not by assuming something simple-minded like independent, identically distributed letters, but by taking a huge text and looking at the correlations: two-point, three-point, four-point, and so on. You can read his original paper to see how he did it.
And this is back in the day when, if you needed a random number to do sampling, you couldn't just type rand into your computer. You had to buy a big book which was a published list of random numbers, millions of little random numbers, and people bought these books of random numbers to do these calculations. OK. Why is this a straight line? Can somebody tell me why it should be a straight line? Think about English. English is the prototypical example of an information system: there are individual letters, and the letters are strung together to make a long string. I could imagine enumerating all possible English texts of length 1,000 letters. How would I actually calculate that? Doesn't matter; in principle I could do it by generating a random string, checking with you whether it's really English, generating another random string, checking again, and so on. Quick aside: how many words do you have to string together in a row to make a proper, legal English sentence that nobody, living or dead, has ever spoken before? We can actually test this, because we have Google: just type some random English sentence of yours and see if it appears anywhere; Google is some sort of repository. The answer is about three or four. If I say "my hippopotamus died in pain", I'm guaranteeing that nobody has ever said that before, and it's a valid English sentence. So the question of what the entropy of English is contains little questions like this. Doesn't matter. The fact is, I can in principle count the number of distinct English sentences or paragraphs or texts with n letters, where n is a thousand, or 2,000, or 3,000. And I'm claiming that this curve, the log of the number of distinct possibilities as a function of n, will be a straight line. Can somebody give me a justification? For small n there are squiggles, statistical squiggles, and there may even be systematic deviations from the straight line at low values; obviously for three letters there are only a few ways to put English text together. But once you get to large n, this is going to be a straight line. Can somebody explain why? You find this curve by picking some n, say a thousand (for English this is very hard in practice, but bear with me), and literally counting every valid piece of English text with 1,000 letters, and taking the log of that count. Practically I'm not saying how you'd do it, but imagine you had a machine and infinite time. OK, so why should it be a straight line and not some other kind of curve? To understand that, imagine I have a big block of 1,000 letters followed by another big block of 1,000 letters. Suppose I know the number of ways to write down the first block of 1,000 letters; call that f(1000). That's the quantity whose log we're plotting here. And suppose I know the same thing for the next block, the number of ways to make another 1,000 letters.
Then what should f(2000) be? If I know all possible ways of stringing together English text for 1,000 letters, can I guess the number of ways for 2,000 letters? It's just the product: f(2000) is approximately f(1000) squared. The reason is that any conceivable correlation effect, or memory, or whatever it is, of the first 1,000 letters must have decayed away by the time you get to the end of the next 1,000 letters. There's no way the first word or the first paragraph of a book influences what counts as a legal English string 100 pages later. Because of this fact, because in any real system, including the English language, the memory of previous effects dies away after some distance, the total number of ways to make something of length 2,000 is, for large n, the square of the number of ways to make something of length 1,000. Take the log, and log f(2000) = 2 log f(1000), which is why this is a straight line. It's a straight line because the possibilities become essentially independent for different subsets of the string. Any questions? OK, that's a very important fact, and it's a sort of empirical fact. You could easily construct meaning systems, ways of generating strings, which do not obey this behavior; strange mathematical constructions, for example series expansions or decimal expansions of some number. The decimal expansion of pi doesn't have this property. But the English language does, and most stochastic processes do. In particular, independent identically distributed events trivially have this property, because every letter is independent, so you just get the sum of the logs. Are there any questions about this? So what is entropy? Entropy answers the question "how many?". That is what H is. If you see H, then somewhere in the background there is a question "how many?", and the answer is not H itself: the answer is 2^(NH). Whenever you see H, it's a how-many question, and the answer is 2^(NH). Yes? No, it's not exact for small n; that's why there are systematic deviations down there. The number of six-letter words is not the square of the number of three-letter words, because letters influence each other. Down at small n it fails; up at large n it definitely holds. You have to use your eye, and the usual tricks, to see where you've approached an asymptote that's approximately straight; you could do a regression, fit it, and see how bad it is. In practice these kinds of curves are very hard to generate, but the idea is straightforward: you need to go up to values of n at which any correlation with the past has died away, just as we discussed on the first day with bacteria measuring concentrations of molecules. You want to go far enough that previous events don't influence future events; there will be some scale beyond which that is true to a very good approximation. OK, so is this okay with everybody?
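Before moving on, here is one way to see the slope definition numerically. This sketch is my own construction, not something from the lecture: for a biased coin it counts how many of the most probable length-n strings you need before you've covered 99% of the probability, and the log of that count per letter is the slope in question. Convergence is slow, roughly like 1/sqrt(n), so don't expect the numbers to hit H on the nose:

```python
import math

def typical_count(n, p, mass=0.99):
    """Number of length-n strings from a Bernoulli(p) coin (p > 1/2) you must
    keep, taking the most probable strings first, to cover `mass` of the
    total probability."""
    covered, count = 0.0, 0
    for t in range(n + 1):           # t = number of tails; fewer tails = more probable
        # probability mass of all C(n, t) strings with exactly t tails,
        # computed in log space to avoid underflow at large n
        log_mass = (math.lgamma(n+1) - math.lgamma(t+1) - math.lgamma(n-t+1)
                    + (n-t)*math.log(p) + t*math.log(1-p))
        covered += math.exp(log_mass)
        count += math.comb(n, t)
        if covered >= mass:
            return count

p = 0.9
H = -(p*math.log2(p) + (1-p)*math.log2(1-p))
for n in (100, 1000, 10000):
    c = typical_count(n, p)
    print(n, math.log2(c) / n)       # creeps down toward H = 0.469 as n grows
```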
I don't know if you've seen this before in an information theory course, but this is the heart of Shannon's idea. The heart of the idea is that any information transmission problem can be worked out if you know how many options there are; the number of options always increases exponentially with the length of the text; and when you measure something exponentially increasing, you measure its log, because that's a reasonable quantity; and the log divided by n is called the entropy. That's it. So we're going to use this idea to give a completely different proof of how to encode independent identically distributed events, without using the Kraft inequality or anything like it. And we're going to use this fact to resolve the paradox of the missing events. We worked out that there's no way I can send as many distinct strings as there are distinct sequences of events. The resolution is that in practice most of those sequences never occur, so I just have to work out which events actually do occur and encode only those, not bothering with the one where the worst horse won every race, or even the one where the best horse won every race. Any questions? May I erase this and move on? How much time do I have? Oh, I have time. OK, let's deal with this part of the board now. Here's an interesting question. Suppose I have letters of the alphabet, as I've been discussing, and let's assume they're independent and identically distributed. I'm going to make a string: a string of letters of length N. Previously each symbol was the name of a horse; now each symbol is a letter of the alphabet. It doesn't matter; it's always the same. Again, we assume independent, identically distributed, so in particular this does not apply to the English language; keep that in mind. And let's assume the value of each letter is drawn according to some probability distribution q(x), where x is a random variable. It's the same q as before, the actual distribution; I just erased it. So now I ask: what is the probability of getting a particular string? That's actually easy: it's the product from i = 1 to N of q(x_i), because the letters are independent and identically distributed. And now we're going to do a little trick. We're going to convert this from a product over the N letters of the string into a product over the m distinct possible letters of the alphabet. How? Easy: the probability is the product over j of q_j raised to the power of the number of times symbol j occurred in the string. Is this clear? If I have a string of 1,000 letters, the first expression is a product of 1,000 factors, but the second is a product over the 26 letters of the alphabet: the chance of getting an 'a', raised to the power of the number of a's, times the chance of getting a 'b', raised to the power of the number of b's, and so on. A very simple rewriting.
So call that exponent n_j, where n_j is the number of times letter j occurred in the string. The probability can then be written as the product from j = 1 to m of 2^(n_j log2 q_j). The usual thing. And now, if I do some gymnastics, the same multiply-and-divide manipulations as before, let's define p_j = n_j / N. What is this p_j? It looks like a probability distribution: all the p_j's are between zero and one, and they sum to one, because the total count of all letters adds up to N, the length of the string. So p_j is simply the empirical frequency distribution of that particular string. Note the distinction: q is the probability distribution from which the letters are drawn; p is the empirical frequency distribution of the string you got; and those are two different things. The string could be all a's, or some line from Shakespeare, whatever; the underlying probability distribution is different from the empirical frequency distribution. If you go through the pluses and minuses, what you get is: the probability of the string equals 2^(-N (D(p||q) + H(p))). I want you to stare at this for a second. To go from the previous expression to this one, you replace n_j by N p_j, and then you multiply and divide by p_j inside the log. The first of those multiply-and-divides gives you the divergence term, and the second gives you p_j log2(1/p_j), which gives the entropy term. It's the same manipulation we did earlier, which is why I'm not repeating it. It looks very similar to the things we calculated before; that's why these definitions have become so integrated into the study of information theory. So stare at this, and let me explain a few things. Suppose I have a string 32 symbols long (letters plus spaces and commas and so on, so 32 symbols, 2 to the 5), and every distinct symbol occurs exactly once. Then the empirical distribution is 1/32 for every symbol, and H(p) is log2(32), which is 5 bits. And the D term is the divergence between the empirical distribution, which here is uniform, and the actual distribution from which the string was drawn. And remember: the empirical one occurs on the left, the actual one on the right. These things become difficult to keep track of as you go on, and the divergence is not symmetric, so be careful. So here's a question. If the true distribution is truly uniform, which empirical distribution is most likely? Which is the single most likely empirical distribution? The answer is: also uniform. If all the letters are equally likely, the single most likely type of empirical distribution is the one where all the letters occur equally often. This is something which, in other contexts, is proved by a sort of law of large numbers.
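You can check the type formula directly. Here is a quick numerical sketch; the distribution q and the string are invented for illustration:

```python
import math
from collections import Counter

def prob_direct(s, q):
    """P(s) under IID draws from q, multiplied out letter by letter."""
    prob = 1.0
    for ch in s:
        prob *= q[ch]
    return prob

def prob_by_types(s, q):
    """The same probability via P(s) = 2^(-N (D(p||q) + H(p))),
    where p is the empirical distribution of the string s."""
    N = len(s)
    p = {ch: c / N for ch, c in Counter(s).items()}
    D = sum(p[ch] * math.log2(p[ch] / q[ch]) for ch in p)
    H = sum(-p[ch] * math.log2(p[ch]) for ch in p)
    return 2 ** (-N * (D + H))

q = {"a": 0.5, "b": 0.3, "c": 0.2}   # the "true" distribution (an assumption)
s = "aababcbaacab"

print(prob_direct(s, q))             # the two numbers agree to rounding:
print(prob_by_types(s, q))           # the identity is exact, not an approximation
```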
If it's a Bernoulli distribution with a one-third chance of heads and a two-thirds chance of tails, the single most likely string of, say, 900 tosses is 300 heads and 600 tails. It's not obvious how you'd prove that. But this formula shows you: the divergence term goes to zero exactly when the empirical distribution matches the true one, and then you're left with just 2^(-N H), the largest possible value. So 2^(-N H) is, in this sense, a measure of the probability of the most likely kind of outcome. So let's keep this formula in mind, and I'm going to walk you through a series of calculations. You might think some of this is fairly obvious, but it's not; if you think about these statements slightly more carefully, you'll realize you've never seen proofs of them before. You may have known the answers, but you've been taking them for granted. In particular, this is exactly the formula at the heart of large deviation theory. Suppose I have a coin with a one-third chance of heads and a two-thirds chance of tails, and I want to know the chance of getting one-tenth heads and nine-tenths tails. You just plug into this formula: it's exponentially unlikely, and the exponent is the divergence between the distribution (0.1, 0.9) and the distribution (1/3, 2/3). Large deviation theory is completely contained in this formula. So look at the string x_1 through x_N. All I've done is group the string into a bunch of a's, a bunch of b's, and so on down to a bunch of z's, some counts possibly zero: n_a, n_b, ..., n_z. I know that the underlying q that makes this particular string the single most likely outcome is q equal to the empirical distribution p, with p_j = n_j / N. So let's assume q actually equals the empirical distribution p. In that case the probability of the string, as we've already seen, is the product from j = 1 to m of p_j^(n_j), which is the product of p_j^(N p_j), which is 2^(-N H(p)). That's one way of calculating, based on the formula. Now, if you had never seen a class in information theory and I asked you: how many possible ways are there of making this string, with exactly those letters occurring those numbers of times, but otherwise permuted? What's the answer? It's just the multinomial coefficient: N! divided by n_1! n_2! ... n_m!, where m is the total number of letters. Now, the total number of such strings, times the probability of each one of these strings, must be less than or equal to one, because the total probability of all possible outcomes is at most one. This immediately shows you that the multinomial coefficient is at most 2^(N H): 2^(N H) is an approximation for the multinomial coefficient. In particular, the binomial coefficient n choose k is approximately equal to 2^(n h(k/n)), where h is the binary entropy function. These are the same statement. This is actually very interesting.
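A quick check of that approximation, nothing more than the formula evaluated next to the exact coefficient:

```python
import math

def h2(x):
    """Binary entropy in bits."""
    return -x*math.log2(x) - (1-x)*math.log2(1-x)

n = 1000
for k in (100, 300, 500):
    exact = math.log2(math.comb(n, k))
    approx = n * h2(k / n)
    print(k, exact, approx)          # the approximation always comes out slightly high
```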
It's a slight overestimate, but it's there. And if you actually work it out with Stirling's approximation, put in n!/(k!(n-k)!), apply Stirling, go through the motions, and convert log base e to log base two, this is what you get. Fascinating, yes? Letters: m is always the total number of distinct letters in the alphabet; N is always the length of the string. So this is a little tidbit: whenever you're using multinomial coefficients in practice, it's often much better just to use the Shannon entropy approximation for them. A very useful little trick. Does this make sense? Let's think about it. Take strings of three bits, and take the cube whose vertices are the set {000, 001, 010, 011, 100, 101, 110, 111}: a hypercube. This hypercube shows you all the ways of choosing some number of ones out of three positions: 0 out of 3, 1 out of 3, 2 out of 3, or 3 out of 3, and the binomial coefficients in this case are 1, 3, 3, and 1. You start at one corner of the cube; there are three possible ways to step away; then three more vertices two steps away; then the single far corner. In general, a hypercube has one vertex where you start, an increasing number of options as you move away, and then a decreasing number on the other side. And the log of the number of options is exactly this: 2^(n h(k/n)) measures how many vertices of a hypercube are k steps away from the origin. It's a very, very good approximation, but you know it's not perfect, because at p equal to one half, exactly halfway across the cube, h is 1, so you just get 2^n. So this approximation predicts that n/2 steps away from the all-zeros string there should be 2^n strings, and we know that's not true, because there are only 2^n strings in total. That's why it's a slight overestimate; the next order of correction in Stirling's approximation fixes it. All this is side material, but it's nice, and when you do a lot of stat mech calculations, this little approximation will be very useful for you. Any questions? It's an aside, not the main point; it's another case where the entropy just shows up naturally. It's asking for the probability of a string arising from the very distribution the string implies: take the empirical distribution of a string, and ask what the chance is that this string arose from that empirical distribution. Now we're going to do something slightly different: we're going to ask for all possible strings that can arise from certain distributions. And for the tutorial tomorrow, there's another aside I need to do. These probability distributions are weird objects. They don't satisfy the sort of Cartesian measures you're used to. They're not normalized to 1 by adding the squares and taking the square root; that's not the normalization. The normalization is L1: you just add up all the p's and make sure the total sum is 1.
So the geometry of probability is slightly different from the geometry of the standard vector spaces you've worked with. For example, take a distribution over exactly three letters: a, b, and c. Then there are just three numbers, p_a, p_b, and p_c, each between 0 and 1, and their sum must equal 1. Therefore you must lie on this triangle, which is a simplex. All probability distributions lie on these kinds of simplexes, the surfaces satisfying the summation constraint. When I want to draw an example of a space of probability distributions, I'll sometimes just draw a triangle like this, and you should think of it as the simplex viewed face-on: it's really one dimension down, because the summation constraint removes a dimension. So whenever I draw these little triangles in the future, every point in the space is a different probability distribution. Now look at the empirical points in this space. First of all, what is the midpoint of this equilateral triangle? Which probability distribution is it? It's the uniform distribution. So the center of the triangle is the uniform distribution, with entropy log2(3) in this case. The three tips are the certain distributions, where you're all a, or all b, or all c, with entropy 0. As you get closer to the center of the triangle, the entropy gets higher and higher, reaching its peak in the middle; it's like the two-dimensional analog of that binary entropy curve. Now, the empirical probability distributions. Suppose I have an alphabet of three letters but a string of length one: what strings can I possibly have? Just "a", or "b", or "c". So the three corners are the only empirical distributions I can get when n is 1. How many empirical distributions can I get if n is 2? The strings are aa, ab, bb, bc, and so on. Where do they lie? The aa distribution is at the a corner; ab is at the midpoint of the a-b edge; likewise ac and bc. So for n = 2 there are six points: three corners and three edge midpoints. If n = 3, the strings are aaa, abc, abb, and so on, and there are exactly 27 of them. What are the empirical distributions? You can have all one letter (three ways), or two of one letter and one of another (six ways), or one of each (the center): ten lattice points in all. Am I missing something? No, that's right. So those are the empirical probability distributions for strings of length 3. In general, the empirical distribution, the observed frequency of letters in a string, lives on a lattice in this space. It's a lattice because the empirical distributions must have denominator n: rational numbers with denominator n that sum to 1, three of them, and that makes a lattice. The true space of probability distributions, of course, is the whole continuum of real numbers that fills up this plane. We're very happy with this description.
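Here is a small enumeration sketch of that lattice picture, brute force over all strings, so only sensible for tiny n:

```python
from fractions import Fraction
from itertools import product

def empirical_types(n, alphabet="abc"):
    """All distinct empirical distributions of strings of length n."""
    types = set()
    for s in product(alphabet, repeat=n):
        types.add(tuple(Fraction(s.count(ch), n) for ch in alphabet))
    return types

for n in (1, 2, 3):
    print(n, len(empirical_types(n)), 3**n)
# n=1: 3 lattice points, n=2: 6, n=3: 10, while the number of strings
# grows like 3^n; the lattice slowly fills in the simplex as n grows
```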
And the middle of that is the uniform distribution; that's the three-dimensional analog, a tetrahedron for four letters. Given that I've told you how to do it for three letters, I'm going to fall back into a more general notation. So how do we define this? Suppose I know that the underlying probability distribution is q(x). Which string is the most likely string to occur? I've asked a slightly different question: previously I was looking at empirical distributions, and now I've just given you some q. Which string is the most likely string to occur? To answer that, you have to look at this picture. If this q consists of, for example, irrational numbers, or rational numbers which don't have denominator n, then no string can actually match the exact distribution, and therefore there's always an extra term that carries you away a little bit. So that's one interesting wrinkle: true distributions do not have to respect this lattice. They could be anywhere in this space, and using this metric you have to find the closest lattice point, which is going to be the most likely one, because the further away you are from that point, the more unlikely it gets. Again, the idea of large deviation theory. So what I'd like to do is work out what all the typical events are going to be after n horse races or n letters of the alphabet, and only pay attention to those. So let me start you off along that process. I might finish by one o'clock; if I don't finish today, I'll take it up again tomorrow at the same time. So let me pose my question again. I have some underlying distribution. One problem is that it may not be written with exactly denominator n, and therefore I may not be able to find a string of length n that exactly matches it on the lattice. But this is not the real problem, because for sufficiently large n you'll get close enough to it; the rationals are dense in this plane. So for example, suppose my probability distribution over three letters is three numbers that add up to one. So if the probabilities p_A, p_B, p_C are two-thirds, one-fourth, and one minus (two-thirds plus one-fourth), which is one minus eleven-twelfths, so it's one-twelfth. Suppose this is the probability distribution, and let's assume that the length of my string is some nice multiple of 12: 144, whatever, some multiple of 12. Actually, let's just take it to be 12 itself. If it's 12, then I'm going to have eight A's, three B's, one C, and some permutation thereof. This, by this little formula, is the single most likely sequence to occur from this probability distribution. What is the probability of this occurring? The probability of this occurring is 2^(-nH), where H is the entropy of these probabilities. Now this is a problem, because n is 12 and H is the entropy of this thing. Now suppose I take a sequence of length 24, or 36, or 1,200. In a sequence of length 1,200, I expect 800 A's, 300 B's, and 100 C's.
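As a sanity check, here is a minimal sketch of my own (not from the lecture): for this distribution, the in-proportion sequence of length 12 has probability exactly 2^(-12H), because its letter counts exactly match n times the probabilities.

```python
from math import log2

p = {'A': 2/3, 'B': 1/4, 'C': 1/12}         # the lecture's example distribution
H = -sum(q * log2(q) for q in p.values())   # Shannon entropy, about 1.189 bits/letter

n = 12
prob = (2/3)**8 * (1/4)**3 * (1/12)**1      # 8 A's, 3 B's, 1 C
print(prob, 2 ** (-n * H))                  # both about 5.08e-05: p(x) = 2^(-nH)
```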
That sequence, and permutations thereof, is still the single most likely type of sequence to occur. But the actual probabilities keep on falling, because they become 2 to the minus 1,200 times the same H. So your homework problem, problem 6, if you've already tried it, asks you to go work out these probabilities: work out the probability of this sequence and work out what the chance of each one of these things is. So there is something that's going to rescue you here, and I want you to see if you can spot what it is. So what puzzle have I put forward for you? The puzzle is: here is a distribution I give you. It's two-thirds, one-fourth, one-twelfth. It's totally fine: rational, adds up to 1, it's a point in this space. Here's a sequence I gave you: 8 A's, 3 B's, 1 C. That's the probability of getting the sequence; it's just a formula, no problem. If the sequence becomes longer but still maintains the same proportion of A's, B's, and C's, its probability keeps dropping, even though it's the most likely type of sequence. Why is the probability dropping? You have many combinations, very good. So how many ways are there of making this sequence? The number of ways of permuting it is the multinomial coefficient, 12 choose (8, 3, 1). So that's going to rescue you, you hope, right? Although the probability of any individual sequence seems to be dropping exponentially, that's because you forgot that you can permute the letters in the sequence. And so you think, ah, great, not bad, right? If I can permute the letters, then although the probability of an individual sequence is dropping as a function of the length of the sequence, the number of distinct sequences is increasing quite nicely. It's increasing as some n-factorial type of thing. So here's the question. If you multiply the multinomial coefficient, n choose (8n/12, 3n/12, n/12), by 2^(-nH(2/3, 1/4, 1/12)), this is the total weight of all sequences that have that distribution of letters: all the ways you can permute it, times the probability of each one. And now I ask you: as n increases, what do you think happens to this number? Yes? There are others, of course; there are many other points in this space. But let's just focus on the one that we know is the most likely one. And we know that all the sequences that have exactly this composition of letters are equally likely, so they're all degenerate: you just have to multiply the probability by the degeneracy factor. So here's the question. I was being silly earlier when I said that the probability of any individual sequence goes down exponentially, because the total number of such sequences is going up nearly exponentially. So my question to you is: if I've finally taken this factor into account, and I multiply this by that and take the limit as n goes to infinity, what do you think happens to this combination? What is the total likelihood, yeah? Okay, how many people think it goes to one? How many people think it goes to zero? How many people think it goes somewhere else? Okay, one over pi. The surprising answer is that it goes to zero, okay?
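You can check this numerically. Here is a minimal sketch of my own, computing in log space with lgamma so the huge factorials never overflow:

```python
from math import lgamma, log, log2

p = (2/3, 1/4, 1/12)
H = -sum(q * log2(q) for q in p)

def log2_multinomial(counts):
    """log2 of the multinomial coefficient, via lgamma to avoid huge integers."""
    n = sum(counts)
    return (lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)) / log(2)

for n in (12, 120, 1200, 12000):               # n must be a multiple of 12
    counts = (2 * n // 3, n // 4, n // 12)     # the exactly-in-proportion type
    weight = 2 ** (log2_multinomial(counts) - n * H)
    print(n, weight)   # permutations times per-sequence probability: shrinks toward 0
```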
The total statistical weight of the most likely type of sequence, in all its permutations, even once you've taken this degeneracy factor into account, goes to zero, okay? So this is a highly surprising statement. If the product went to one, that would mean the multinomial coefficient is approximately 2^(nH), and it exactly cancels the 2^(-nH). The fact that Stirling's approximation has a higher-order correction is what kills this cancellation, and that higher-order correction drives this thing to zero, okay? And now, I mean, you thought everything was going fine. You thought: if I know that horses are winning in the ratio two-thirds, one-fourth, one-twelfth, the only races I ever have to encode are the horse races where exactly two-thirds, one-fourth, one-twelfth of the wins go to each horse. And you're totally wrong, because almost all races, in the limit of large n, will not look like this. Think about this for a second. It's kind of spooky. So this probability distribution is somewhere over here: it's mostly A's, a few B's, and very few C's. That's where it is. That exact collection of sequences, the total statistical weight of it, including the degeneracy of permutations, goes to zero. And this has just thrown a wrench in my whole approach of trying to look only for typical sequences, because my best definition of typical sequences just let me down. Even though these sequences are the most likely to occur, in all their permutations, in the limit of large n they hardly ever occur. In the limit of large n, there's hardly any race where all the horses win in exactly the right proportion in which they're supposed to win. So now I open it up to you. I'm glad there was a gasp when I said this thing goes to zero; I was also surprised when I first saw this result. Since it goes to zero, I now need a better definition of typical events than this very closed-minded thing that the typical event has exactly two-thirds, one-fourth, one-twelfth of A's, B's, and C's. So how do I expand my notion of typical events? Any suggestions? Sorry? This is the very sequence whose probability is maximized, exactly that. Yes? Take a value plus or minus something, very good. So the idea is: instead of taking exactly that single point in this space as the typical one, we're going to take a little region around it. You know the reason this happened to us? Because the statistical weight of every single point in this space, even including the permutations, goes to zero. But as you increase n, what increases inside that little circle? The total number of points, because this lattice becomes denser and denser in this space. Okay? And the hope is that the total statistical weight in some small but finite little region does go to one rather than zero. Okay? So now, in the 10 minutes that remain, I'm going to prove to you that such a definition can be made, and I'm going to prove all the limits that are needed for this purpose. Any questions so far? So the approach is clear, right? What we're going to do? Okay. So here we go. And I need some more space.
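In fact, that higher-order Stirling term even tells you how fast it dies: for a three-letter alphabet, the weight of the exact type class should decay like a constant over n (this rate is my extrapolation from the correction term, not something stated in the lecture). A quick numeric check of that claim, again a sketch of my own:

```python
from math import lgamma, log, log2

p = (2/3, 1/4, 1/12)
H = -sum(q * log2(q) for q in p)

def log2_weight(n):
    """log2 of (permutations of the in-proportion type) times (each one's probability)."""
    counts = (2 * n // 3, n // 4, n // 12)
    log2_perms = (lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)) / log(2)
    return log2_perms - n * H

# If the weight really goes like constant/n, multiplying n by 10
# should divide the weight by about 10:
for n in (12, 120, 1200):
    print(2 ** (log2_weight(10 * n) - log2_weight(n)))   # each close to 0.1
```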
I think I'll get rid of this. Okay? I'll leave that. So the trick turns out to be defining the typical set. The typical set is all strings that satisfy a certain property. These strings are not all exactly the same as the most likely sequence, but they're close to it. Okay? For example, you could take the empirical probability and the predicted most likely sequence and take the Cartesian distance between them. But instead of doing that, we're going to use a slightly better definition, and the definition is the following. The probability of the string lies between 2^(-n(H + epsilon)) and 2^(-n(H - epsilon)). So what we're going to say is that the probability of the string is not exactly equal to 2^(-nH), which was the mistake we made earlier; we're going to allow a little variation. And note: if you just wrote 2^(-nH) times 2^(plus or minus epsilon), that wouldn't scale correctly. What we really want is to put the epsilon inside, so the n multiplies the epsilon and the epsilon itself becomes n-independent. So can everybody see what I've written? 2^(-n(H + epsilon)), which is a small number, and 2^(-n(H - epsilon)), which is a bigger number. The probability of the sequence occurring lies between these two values. Okay? This is a very subtle definition of typicality. A sequence could contain some quite unlikely letters, and then enough of the most likely letter rescues it, right? So what a typical sequence actually looks like is something we're going to explore in the tutorial tomorrow; I'm going to sit down and calculate some typical sequences. But if you are curious and interested, look at problem 6 in your homework and work out what sequences seem to be showing up. So is this definition clear? In other words: the absolute value of minus (1/n) log of the probability of the sequence, minus H, is less than epsilon. Did I say that right? It's the log of the probability: p is less than one, so log of p is negative, and minus (1/n) times that makes it positive; the difference between that positive quantity and H is less than epsilon. Less than or equal to, if you want, but in fact let's say strictly less than epsilon. That's the definition. So this is my definition of the typical set: the set of all sequences that satisfy this property, and I'm going to call it A, indexed by epsilon. A sub epsilon: the typical set defined up to some divergence epsilon. One of the sequences that's going to be in the typical set is the kind of sequence we had earlier, the exact in-proportion sequence, but there are going to be many other sequences in this typical set, okay? So let's try to work out what the geometry of this space looks like. I need some more space, so let me make a big figure up here. On the x-axis, I'm going to plot the log of the probability of the sequence. The logs of the probabilities are always negative, because the probabilities are always less than one. That shouldn't bother you; it just means that the zero point is somewhere further to the right.
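As a concrete version of this definition, here is a minimal sketch of my own (the function name is hypothetical): the A_epsilon membership test applied to our running distribution.

```python
from math import log2

def in_typical_set(seq, q, eps=0.1):
    """A_eps membership: is |-(1/n) log2 p(seq) - H(q)| < eps?"""
    H = -sum(v * log2(v) for v in q.values())
    log2_prob = sum(log2(q[ch]) for ch in seq)
    return abs(-log2_prob / len(seq) - H) < eps

q = {'A': 2/3, 'B': 1/4, 'C': 1/12}
print(in_typical_set('AAAAAAAABBBC', q))   # the in-proportion string: True
print(in_typical_set('CCCCCCCCCCCC', q))   # all C's: far from -H, so False
```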
However, more likely sequences are on this side and less likely sequences are on that side, okay? For example, in the English alphabet, assuming letters are independent and identically distributed according to the known empirical distribution of letters, the single most likely sequence is what? It's all E's. And the single least likely sequence is all Q's. So on this axis, this point will be n times log of the probability of E, and that one will be n times log of the probability of Q. Okay? Now, because I want to take n out of the system, I want to renormalize my graph. I'm going to renormalize it by just dividing by n; that's why this one-over-n is here, so that I can keep on drawing my graph with the same scale. So let me extend this a little bit and label the axis. The thing being plotted is (1/n) log of the probability of the whole sequence. This sequence is all E's, and (1/n) times the log of its probability is just log p(E); for all Q's it's log p(Q). And since p(Q) is much less than p(E), all-Q's is to the left and all-E's is to the right. Okay? Now, first of all: these are discrete things, right? For a string of length n, where n is some number like a thousand, there can only be a certain discrete number of values that the probability of the sequence takes on the x-axis, because there's only a certain discrete number of strings. Not all probabilities can be obtained. So the x-axis is actually discrete, and maybe even bunched in some regions and unbunched in others. So if I wrote down all possible sequences of length n over the English alphabet of capital letters A to Z, calculated their probabilities according to this formula, and wrote down all those probabilities, every one of these bins would actually have many sequences in it, because all sequences that are permutations of each other have the same probability. So let me first stop and ask: what will the total number of sequences in each of these bins look like? How many sequences are in the all-Q's bin? There's only one. And in the all-E's bin? Only one. Which bin has literally the most sequences? It's the bin with equal numbers of all the letters, because then there are many, many more permutations. So there's some bin that has the most sequences. If I were just to plot the total number of permuted sequences in each bin, it would look something like this, where the peak is the totally uniform composition, okay? However, the bin that has the most sequences, the one with equal numbers of everything, has very low probability per sequence, right? And how does the probability go? It increases as you move to the right; I'm plotting the log of the probability on the x-axis, and the actual probability is increasing. The number of sequences times the probability is the thing we're trying to capture. So it's the product of these two terms, okay? So the total statistical weight of all the bins will be some shape that's slightly shifted to the right. It looks like this.
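Here's a small sketch of my own that builds exactly this histogram for our three-letter alphabet, with a hypothetical choice of n = 8 to keep 3^n manageable: every sequence is dropped into a bin by its value of (1/n) log2 p, and permutations of each other land in the same bin.

```python
from itertools import product
from collections import Counter
from math import log2

q = {'A': 2/3, 'B': 1/4, 'C': 1/12}
n = 8

bins = Counter()
for seq in product(q, repeat=n):               # all 3**n = 6561 sequences
    x = sum(log2(q[ch]) for ch in seq) / n     # (1/n) log2 of the sequence probability
    bins[round(x, 6)] += 1                     # permutations share a bin

for x in sorted(bins):
    print(f"{x:+.3f}  {bins[x]}")   # counts peak at the uniform composition
```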
It's the degeneracy of the sequences times the probability of each sequence. In stat mech terms, this is the entropy term and that's the enthalpy term, if you're used to that kind of thing. Doesn't matter. Where is the peak of this? We just proved where the peak is: the peak is exactly where the probability of the sequences is 2^(-nH). That's what this little calculation told you. So this peak, where you take the total degeneracy times the probabilities and look where you end up, sits at minus H on the axis. Over here is the uniform bin; over there is the single most likely sequence. Why is it minus H? Because everything on this axis is negative and H is a positive number: the probability of each sequence at the peak is 2^(-nH), so (1/n) log of it is minus H. That's what this tells you, okay? Good. So now we're done, right? All we have to do is take plus or minus epsilon around this peak, between minus H minus epsilon and minus H plus epsilon, and take that collection of sequences. Careful: the peak is the total weight, with permutations, of the single most likely type of sequence; it's not the single most likely sequence itself, which is all E's. Thank you. Okay, so it's one o'clock, so I'll just wrap up with the following point, and we'll take this up next time. But everybody's following me, right? What I urge you to do is to work this out as an exercise, and if we have time, I'll take it up in the tutorial tomorrow. Take the probability distribution two-thirds, one-fourth, one-twelfth, okay? Generate a bunch of sequences and see if you can plot these diagrams as a function of n. Now, the quantity I plot on the y-axis, the total statistical weight, will be 2^(-nH) times some number. What is the height? As I rescale by n, the x-values here don't change, by design. As I increase n, I can still keep plotting the graph with the same axes, because I divided by n precisely so that I could keep on plotting the same plot again and again on the same axes. So what is the general vertical scale of this plot? The vertical scale is the total statistical weight of the highest bin: the total number of sequences in the highest bin times the likelihood of each. The height of this graph is of order one, because the approximation is not off by much, right? So anyway, we don't know exactly what the height is, but let's say it's of order one. But the x-axis is over here. So I urge you to plot this diagram for the simple case of the probability distribution two-thirds, one-fourth, one-twelfth. Make all the sequences, see where they lie, see if you can get this histogram. And then do it for a sequence length of 100, then 1,000, then 10,000. What do you think is going to happen to the green curve? Once I get an answer to that, I'll stop today's class. So as I increase the length of the sequence, what happens to the green curve, which is the total probability weight of all these sequences? It narrows, very good. But remember what happens: it doesn't just narrow. Because this thing goes to zero, the height of this curve is an ambiguous thing, since the number of sequences on the x-axis, the density of those points, keeps on increasing. So as you make these plots again and again, if you made a movie of it, I can see the movie in my mind, right?
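If you want to see the punchline numerically before next time, here is a sketch of my own (summing type by type in log space, with an assumed epsilon of 0.05): the total weight landing within plus or minus epsilon of minus H climbs toward 1 as n grows.

```python
from math import lgamma, log, log2

p = (2/3, 1/4, 1/12)
H = -sum(q * log2(q) for q in p)
eps = 0.05

def typical_weight(n):
    """Total probability of all length-n strings whose (1/n) log2-probability
    lies within eps of -H, accumulated one type (kA, kB, kC) at a time."""
    total = 0.0
    for kA in range(n + 1):
        for kB in range(n - kA + 1):
            kC = n - kA - kB
            log2_prob = kA * log2(p[0]) + kB * log2(p[1]) + kC * log2(p[2])
            if abs(-log2_prob / n - H) < eps:
                log2_perms = (lgamma(n + 1) - lgamma(kA + 1)
                              - lgamma(kB + 1) - lgamma(kC + 1)) / log(2)
                total += 2 ** (log2_perms + log2_prob)
    return total

for n in (12, 120, 1200):
    print(n, typical_weight(n))   # the weight inside the window climbs toward 1
```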
The number of discrete points on the x-axis goes on increasing, but the height of the plot decreases. And the combination of those two still means that the total statistical weight in this range eventually approaches a limit. Everything will be within plus or minus epsilon. And how many sequences will there be in here? Lots of them. But since the sequences are crowding closer together, approaching that value, the total statistical weight they all contribute becomes higher and higher. The actual probability of every one of those sequences will still be small; it's just that there will be lots of sequences of those types, not just this many, but many, many more. And on the whole, when everything settles, there's a fixed amount of weight in here, and that amount approaches one. It's like one minus epsilon. And there's very, very little out there. So I'll stop here. This is the starting point: the idea of defining a typical sequence as a sequence that is, in some sense, close in probability to the best sequence you might expect. This is a form of coarse-graining. This whole theory works only at finite epsilon; it doesn't work for epsilon equal to zero. And so we're going to develop the theory at finite epsilon. And for those of you who like doing epsilons and deltas, I don't know how many of you like doing that, but next time we're going to play some epsilon-delta games to see where this takes us. Okay. 2.30, if you'd like, you can come back in here for the seminar. Yeah.