So, you're aware of the decision: I'm going to do the exam tomorrow. If that's a surprise to some of you, I hope it's not. The idea is that tomorrow, Thursday, at 11:15 we'll do the exam here, so that on Friday you only have to do one exam. It also gives me enough time to grade, and in fact the material that's in the exam has already been covered, so you are already equipped to answer it with everything you've done. I will also be doing the tutorial today at 5 p.m., which will go over the material from the course, cover everything you need for the exam, and I'll answer any other questions you might have about what we've covered. The homework should already have been turned in; the homework solutions will be posted later today, and the exam solutions will be posted immediately after the exam. I hope to be able to grade it by Friday. And yes, all the answers to all the homework questions will be up on the website soon, so even if you haven't solved every question fully you can look at the solutions in your own time and catch up with the material.

So last time we covered quite a bit of ground. We stuck with this idea of Shannon entropy: we defined the Kullback–Leibler divergence, and we proved that it is always greater than or equal to zero, and that it is exactly zero if and only if p equals q. It's a sort of measure of the difference between two probability distributions. We figured out other aspects of how it behaves, and then we went on to try to understand this idea of typical sequences: if you have a probability distribution, what kind of sequence is most likely to show up? We made a figure of this sort, so I'm just going to copy the figure out again. On the x-axis we have (1/n) log2 of the probability of some sequence of length n, where the letters belong to the alphabet. As usual, m is the number of distinct messages and K = log2 m is the naive number of bits per message that you would expect to send. We also wrote down a couple of important things.
So if, for example, X was drawn from the English alphabet, then m would be something like 26 capital letters plus some punctuation marks and so on. The single most likely sequence is all e's, and the single most unlikely sequence is all q's. So this end is less likely, this end is more likely. I told you to maybe work out a problem where you can actually see what probabilities you get for various different sequences. The point is that because this is a discrete problem, there's only a finite number of sequences you can possibly make out of these letters, and each of those sequences has some probability. The probability of a sequence occurring is given by a product: the product over j = 1 to m of p_j raised to the power n_j, where n_j is the number of times letter j occurs in the sequence. That's the probability of each individual sequence. For example, the probability that you get all e's is the product of the probability of each letter being e, which is p_e to the power n; take the log, take 1/n, and you just get log p_e. The all-q's case becomes log p_q. These numbers are negative because all probabilities are less than one.

Then we tried to see what other probabilities can occur, and there are various discrete values that can happen. Just for fun, if you want to work it out, don't use the full English alphabet; in a very simple case we discussed an alphabet with three letters, a, b and c, and you pick some probability distribution over them, say p_a, p_b, p_c equal to — I forget — two-thirds, one-fourth and one-twelfth, something like that. If you take this distribution over three letters and make a bunch of strings out of it, you'll find there's only a discrete set of strings, and each one has some probability of occurrence. Each of the little ticks I'm making on this axis corresponds to one of those values. Don't choose a very large n; choose a small one, like n = 5 or n = 10, and see what kinds of probabilities you get. This x-axis is not a continuum — there's only a certain discrete collection of values, and each tick corresponds to the value of this quantity for some sequence. But how many sequences sit at each of those values? The number of sequences at each value is all the ways you can permute exactly that collection of letters, so there's a sort of degeneracy factor, which is "n choose n_1, n_2, …, n_m" — the multinomial coefficient n!/(n_1! n_2! ⋯ n_m!), not a product of separate terms. All those sequences have the same probability, because we're assuming everything is independent and identically distributed: it's just all the ways you can permute a given collection of letters, and all of those have the same probability. It may also happen, coincidentally, that two completely different collections of letters have the same probability. Okay, let me ask a question.
Suppose the distribution over the three letters is uniform. Then how many ticks will I have on this axis? Just one — they all have the same likelihood, one-third to the power n. If two of the letters had the same probability and one had a different probability, then there are more sequences at a given tick than just permuted versions of the same type: it can coincidentally happen that two completely different collections of letters have the same probability. So when you do the numerical calculation, you'll find that the number of ticks is a somewhat non-trivial number. Nevertheless, roughly speaking there's a degeneracy attached to each tick, and unless there's a coincidence of the kind I just mentioned — a uniform distribution, or two letters with exactly the same probability — you can think of the degeneracy inside each tick as given by this multinomial coefficient.

If you think about it that way, the degeneracy is a peaked function: the single bin that contains the most sequences is the bin where every letter appears equally often, because that maximizes the number of ways to permute, that's what maximizes the multinomial coefficient. But that's just the degeneracy; remember that the probability of each individual sequence increases in the other direction. So the total statistical weight inside each bin — the probability of each sequence in the bin times the number of distinct sequences in it — is the product of those two things; I'll try to draw it in bold. It looks something like this: degeneracy times probability. And we proved last time that the peak of this is at the value minus H. In other words, the probability of a sequence sitting exactly at the peak is something like 2 to the minus nH. I say "something like" because there may not be a sequence with exactly that probability: the empirical frequencies of a sequence are rational numbers with denominator n, while the actual probability distribution may not be rational. But modulo that, that's where we are.
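If you want to see that peak concretely, here is a minimal numerical sketch (using the 2/3, 1/4, 1/12 three-letter distribution mentioned above; the value of n is just an illustrative choice): it runs over all possible letter counts, multiplies the degeneracy by the per-sequence probability, and checks that the heaviest bin sits where −(1/n) log2 P is close to H.

```python
# Check that the statistical weight (degeneracy x per-sequence probability)
# peaks where the per-sequence probability is roughly 2^(-nH).
from math import comb, log2

p = [2/3, 1/4, 1/12]
H = -sum(pj * log2(pj) for pj in p)
n = 30

best = None
for na in range(n + 1):
    for nb in range(n + 1 - na):
        nc = n - na - nb
        degeneracy = comb(n, na) * comb(n - na, nb)     # multinomial n!/(na! nb! nc!)
        prob_one_seq = p[0]**na * p[1]**nb * p[2]**nc   # probability of any one such sequence
        weight = degeneracy * prob_one_seq              # total weight of this "tick"
        if best is None or weight > best[0]:
            best = (weight, (na, nb, nc), prob_one_seq)

weight, counts, prob_one_seq = best
print("entropy H =", round(H, 4))
print("peak counts =", counts, " compare with n*p =", [round(n * pj, 1) for pj in p])
print("-1/n log2 P at the peak =", round(-log2(prob_one_seq) / n, 4))
```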
Okay, that's the peak. And what you did, or what you will do, for the homework is to figure out that even if the probability distribution has a nice rational form — so that you can find sequences whose letter counts exactly match the expected values, sequences that belong to that central bin — the actual probability of getting such a sequence goes to zero as n goes to infinity. So although they do count as typical sequences, they don't cover the space. What we did instead was to expand around H. Let me get rid of all these curves and just draw the total statistical weight. We take a little region from H minus epsilon to H plus epsilon — it looks asymmetric in my drawing, but it should be symmetric — and we say that anything in that region is something I'm going to count as a typical sequence. Every sequence in there is in my typical set A_ε. And again, I urge you to do this numerically: take the three-letter alphabet, make this chart for something like n = 10, see where the ticks are, and see what the actual statistical-weight curve looks like.

Then we play the game where we take the limit as n goes to infinity. A couple of things happen. The endpoints don't change — that's precisely why we scaled by n. What does happen, first, is that you get more and more ticks in here, because for larger n there are more and more sequences interspersed along this fixed axis. The second thing, which is also very important, is that this curve gets sharper and sharper — actually, let me be very clear, that's not quite the whole story. The curve does get sharper and sharper, but it also gets shorter and shorter, because it's not a probability density of the kind you're used to; it's a sum of probabilities over a bunch of discrete ticks. So the curve gets very sharp but also very short. What saves you is that more and more points come into the zone, so when you add up over that large number of points the total reaches a finite value — in fact it goes to one. That's what we're going to prove now.

Let's do it quickly. Remember the definition: A_ε is the set of sequences (x_1, …, x_n) of length n such that the probability of the sequence lies in this zone — if you multiply by n and exponentiate, 2^(−n(H+ε)) ≤ p(x_1, …, x_n) ≤ 2^(−n(H−ε)). That's the definition of the set.
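For the numerical exercise suggested here, a rough sketch might look like the following (again the 2/3, 1/4, 1/12 alphabet; ε and the values of n are just illustrative choices): it computes the total probability that lands inside the band (H − ε, H + ε) and lets you watch it creep toward one as n grows.

```python
# Total probability of the typical set A_eps for several n, by exact
# enumeration over letter counts (the probability of a sequence depends
# only on how many times each letter appears).
from math import comb, log2

p = [2/3, 1/4, 1/12]
H = -sum(pj * log2(pj) for pj in p)
eps = 0.2

def typical_weight(n):
    """Total probability of length-n sequences whose label
    -1/n log2 P lies within (H - eps, H + eps)."""
    total = 0.0
    for na in range(n + 1):
        for nb in range(n + 1 - na):
            nc = n - na - nb
            prob_one_seq = p[0]**na * p[1]**nb * p[2]**nc
            degeneracy = comb(n, na) * comb(n - na, nb)   # n!/(na! nb! nc!)
            label = -log2(prob_one_seq) / n
            if H - eps < label < H + eps:
                total += degeneracy * prob_one_seq
    return total

print("H =", round(H, 4))
for n in (5, 10, 20, 50, 100):
    print(f"n = {n:4d}   P(A_eps) = {typical_weight(n):.4f}")
```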
Is that obvious? It is, because I'm just taking H plus or minus epsilon. The epsilon has to be inside the bracket so that as I scale n, the width of the band on the rescaled axis stays the same — that's why the n is outside in this definition. And the signs, if they confuse you: 2^(−n(H+ε)) is the smaller number and 2^(−n(H−ε)) is the larger one, because the first is 2 to the minus n times a bigger number and the second is 2 to the minus n times a slightly smaller number. So that's the definition of a typical set.

How many sequences are in there? That's the interesting question. As n becomes bigger and bigger, more and more ticks lie in here, so the sheer number of ticks, and how many sequences sit at each tick, is one interesting quantity. The second is: what is the total statistical weight of this set out of all sequences — in other words, multiply the degeneracy by the probability of occurrence. That's what we're going to calculate now. Are there any questions about the setup? It's actually quite non-trivial, and you should go make a little animation of this to see how it plays out, because the naive expectation would have been that the single most typical sequence — the one right at the center — is enough to define typicality. It's not.

So let's work this out. Now we have to play the epsilon–delta games, and I'm going to essentially follow the proof from the book, to keep all the epsilons and deltas straight in my mind, because it's important that we go through the proof. The first thing we want to know is: what is the total statistical weight of this set? What is the probability that any sequence you draw lies in that set? We can write that as the probability that |−(1/n) log2 p(x_1, …, x_n) − H| < ε. It's the probability of a statement about a probability — keep this in mind, it's a tricky thing. Something sort of dangerous has just happened: the outer thing is a probability, and the thing inside is built out of a probability, and if you're not keeping them straight it looks rather confusing. So let me step back and explain.

What did I do? In principle I could write down all m^n sequences — a giant list of all the ways the alphabet letters could be drawn over some length n. If I were looking at the sequence space, there would be m^n dots in that space. Now I'm going to, in some sense, coarse-grain it: to each dot I attach a label. I could have labelled them by colour, by anything; I choose to attach to each sequence a number, namely −(1/n) log2 of its probability. It's a perfectly simple deterministic calculation: since somebody already gave you the p's, for every sequence I can simply compute this number. So when I write "probability of the sequence" here, don't worry about it.
All I'm calling it is a label attached to that sequence. So now you have a space of m^n sequences, each dot carrying a little number as its label, and I collapse all the sequences with the same label. That's what this picture is: it's a histogram of those labels. What I'm asking is: what is the chance that the label minus H is smaller than epsilon in magnitude — how many ticks do I have close to the center point? The label is deterministic; what's random is the sequence. If I draw a random sequence, it could lie anywhere here, and I'm asking for the chance that its label drops it in that bin, within plus or minus epsilon. That's what the question is asking.

It turns out that this probability has to be greater than or equal to some number — in other words, it gets arbitrarily close to one — and we're going to have to work out why. The proof just involves the law of large numbers: if you take an average, eventually it comes arbitrarily close to the thing you're interested in. So let me show you.

Any questions? Yes — it's minus one over n because p is less than one, so the log is negative and this whole thing is a positive number. And yes, what I'm doing is taking this minus that, or that minus this — it's the difference between the two, which is why you get the extra minus sign. Yes, that's what I'm proving. And the question being asked is: how much of this histogram lies within this zone? The total statistical weight of the histogram in this zone is the answer to the question. The answer will depend on epsilon and on n, and we're going to work out what it is for any problem of interest. Is that clear?
Okay, so here's how we do it. Since the letters are independent and identically distributed, the left-hand quantity, −(1/n) log2 p(x_1, …, x_n), is just −(1/n) times the sum from i = 1 to n of log2 p(x_i), because every letter is totally independent. This is like a sample average — remember, we are always talking about sample averages versus expectation values. This is a sample average, and the sample average converges to the expectation value over the probability distribution itself. In other words, it converges to the sum from j = 1 to m of −p_j log2 p_j: as usual we convert a sum over the positions in the sequence into a sum over the distinct letters, and that expectation is H. Is that clear? I'm adding up n quantities and dividing by n, so I'm taking some sort of average; instead of the sample average I can take the expectation value of the same quantity over the distribution p, and that's H.

Now, this arrow — "converges to" — is a very loaded arrow. What does it say? The quantity on the left is a random variable: every time I draw a sequence I get a different number for it. The claim is that for sufficiently large n, in some sense as n goes to infinity, this number gets very close to that number. What is the actual claim using epsilons and deltas? It says there exists some n_0 such that for all n greater than or equal to n_0 — so for sufficiently big n — the difference between the left-hand side and the right-hand side is less than some delta. That's what the arrow means. If you've done precalculus and really paid attention to the people doing all these proofs: when you write a limit like this, it's not a vacuous statement, it's like a game. You give me any delta — that delta is a sort of guarantee of how close this number has to be to that number, and you can make it as small as you want, 0.00001, whatever — and then I go away and come back and say: look, I found an n_0, which could be a billion, and for all n bigger than that, the claim holds. Are there any questions about this? Yes — n_0 will depend on delta: the smaller you make delta, the bigger I'll have to make n_0 to make the whole business work. This is your standard epsilon–delta game; if you're not used to it, just stare at it for a second. Therefore I can make this greater than 1 − δ for all n greater than some n_0, which is a function of delta. If you actually take a course in math, that n_0 is very difficult to find — you have to go play some games to make it as small as practical.
But for sufficiently large n it will work out. If it doesn't work for some n, make it bigger and bigger; at some finite value you will find it. In practice, how to find it is the whole clever game of doing limits in precalculus, which many of you may have forgotten — the first time you do limits you spend a lot of time trying to find the n for a given delta, and finding the best n for a given delta is a fun game. How to find it depends entirely on the structure of the distribution and on various other things.

Question: what if some letter does not appear — you're saying its probability is zero? If a letter has nonzero probability but doesn't occur in the sequence, the chance of that happening over large n also approaches zero, and that's what this statement is implicitly taking care of. It's a subtle point, but yes.

Okay, so as long as you trust this, it's saying I can get the left-hand side close to the right-hand side. Actually — sorry, let me make this clear, because your question is absolutely right: I made a mistake in the original formulation I wrote down, and thank you for the question. Originally I made the wrong claim that this side and that side can always be made arbitrarily close. That's obviously not true, because some sequences will have the wrong probability. The claim I'm making is that the probability that the left-hand side is within epsilon of H goes to one for sufficiently large n. Of course there are sequences that lie outside the range; it's a statement about the chance of that happening, and that chance can be made arbitrarily small. For any delta I can find an n that makes this work. It is the standard law of large numbers: the sample average converges in probability to the expectation value — for sufficiently large n the probability goes to one. How close to one? Delta-close: if you give me a delta, I'll find an n. Suppose your delta is 0.001 — you want 99.9% of all the sequences you draw to come close to this — then I'll find you an n such that that's true, and I'll only be wrong 0.1% of the time. So that's done; this is a given.
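If you'd rather see this convergence in probability than take it on faith, here is a small Monte Carlo sketch (the distribution, ε, and the number of trials are just illustrative choices): it draws random i.i.d. sequences, computes the label −(1/n) Σ log2 p(x_i) for each, and estimates the chance of landing within ε of H as n grows.

```python
# Law of large numbers for the label -1/n sum log2 p(x_i): the fraction of
# random sequences whose label lands within eps of H approaches one.
import random
from math import log2

p = [2/3, 1/4, 1/12]
H = -sum(pj * log2(pj) for pj in p)
eps, trials = 0.1, 2000

for n in (10, 100, 1000):
    hits = 0
    for _ in range(trials):
        seq = random.choices(range(3), weights=p, k=n)   # one i.i.d. sequence
        label = -sum(log2(p[j]) for j in seq) / n        # -1/n log2 P(sequence)
        hits += abs(label - H) < eps
    print(f"n = {n:5d}   estimated P(|label - H| < {eps}) = {hits/trials:.3f}")
```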
So what have we found? Let me write a summary. The total weight of the typical set is greater than 1 − δ for sufficiently large n: for large enough n, most of the weight of this histogram lies between H − ε and H + ε, where the n you need depends on the epsilon and on the delta. Sorry for the earlier confusion, but that is now the exact correct statement. In the picture, the tails of the histogram together carry total weight δ, and I can make δ as small as I want for sufficiently large n — for all n greater than some n_0, where n_0 is a function of epsilon and delta. This matters because we're not doing this only in some abstract limit; the information theory will actually be valid for finite cases of the system.

The second property is obvious, because it's just the definition: for all sequences (x_1, …, x_n) in the typical set, 2^(−n(H+ε)) ≤ p(x_1, …, x_n) ≤ 2^(−n(H−ε)). The probability of each sequence in there lies between those two limits, by construction.

Now what we want to do, and this is the interesting part, is to find the total number of sequences in here. We've found the total statistical weight of the shaded region, and we know the probability of every sequence in it between certain limits; now we want the total number of distinct sequences — how many ticks there are and the degeneracy of each. How would I do that, given those two statements? Yes — you just have to add up. So work through the inequalities with me. I'm interested in the size of this set, |A_ε|, the number of sequences in it. Remember I labelled all the sequences, and I call a sequence typical if its label lies close to H; I want to know how many there are in total, and the strategy you outlined is perfectly correct. Start from the fact that the total probability of everything must be one: one is the sum over all possible sequences of their probabilities, and that must be greater than or equal to the sum over only the sequences in the typical set of their probabilities. Okay, so follow along.
I want to know how many sequences are in here, and I'm going to do it by forcing the count to lie within limits, because I have two inequalities I can really hold on to: one guaranteed because n is big, and one that holds by definition. The total weight of the whole distribution is one, and the shaded weight must be less than or equal to one. Now, to keep the inequalities running in the same direction, what do I put in next? I know every term in the sum lies between the two limits, so which one do I use? The small one, because I want the result to be even smaller: each probability is at least 2^(−n(H+ε)). I'm assuming the worst case, that every sequence in here is as unlikely as allowed. Since that's a constant, the sum is the size of the set times 2^(−n(H+ε)). So 1 ≥ |A_ε| · 2^(−n(H+ε)), and flipping the inequality, the size of the set is at most 2^(n(H+ε)).

That's one side; now the other, the only remaining bit. I want inequalities going the other way. I've got 1 − δ, which is less than the probability of the typical set, which is the sum over sequences in A_ε of their probabilities; each of those is at most 2^(−n(H−ε)), so the sum is at most the size of the set times 2^(−n(H−ε)). Therefore |A_ε| ≥ (1 − δ) · 2^(n(H−ε)). By the way, this delta is just a label — I could even use epsilon for delta; if you happen to choose the same number for both in your original game, that's perfectly fine with me, I can always find an n to make it work.

So that's it, we're done. This completely characterizes the typical set, and it's called the asymptotic equipartition property, written AEP. We got a little lost doing epsilons and deltas, so let me bring it back and explain what we found. Epsilon and delta are both small numbers you gave me; if you give me the same number for both, I can use the same number. I had a math professor who would say "pick a big number, let's call it epsilon" just to confuse people — they're just labels. So in particular I can choose delta equal to epsilon, some small number you gave me. The whole point is that for that small number there's some n hanging around in the background guaranteeing that all these statements are true. That n could be a billion, it could be 10^23 — we don't know how big it is, but in any particular case you can find it.
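As a quick numerical check of these two counting bounds (same toy alphabet as before; n and ε are arbitrary small choices), the following sketch counts |A_ε| exactly by enumerating letter counts and compares it against 2^(n(H+ε)) and (1 − δ)·2^(n(H−ε)), with δ read off as the weight left in the tails.

```python
# Verify (1 - delta) 2^{n(H - eps)} <= |A_eps| <= 2^{n(H + eps)} at a small n.
from math import comb, log2

p = [2/3, 1/4, 1/12]
H = -sum(pj * log2(pj) for pj in p)
n, eps = 40, 0.2

size, weight = 0, 0.0
for na in range(n + 1):
    for nb in range(n + 1 - na):
        nc = n - na - nb
        prob = p[0]**na * p[1]**nb * p[2]**nc
        label = -log2(prob) / n
        if H - eps < label < H + eps:
            deg = comb(n, na) * comb(n - na, nb)
            size += deg              # number of typical sequences at this tick
            weight += deg * prob     # their total probability
delta = 1 - weight                   # weight left in the tails

print(f"|A_eps|                        = {size:.3e}")
print(f"upper bound 2^(n(H+eps))       = {2**(n*(H+eps)):.3e}")
print(f"lower bound (1-d) 2^(n(H-eps)) = {(1-delta)*2**(n*(H-eps)):.3e}")
```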
So what have we found? Something very important. First, all the sequences in the set have labels that are, by construction, numerically very close to each other — that's what it means to be in this band. In other words, all these sequences are essentially equally likely. And the point is that all the other sequences, the ones not in this band, hardly ever occur. So in the large-n limit, roughly, you have 2^(nH) sequences, each with probability 2^(−nH). Step back and digest that; it's very interesting. When you run a bunch of horse races, say, then with extremely high probability — and you can make this delta very small — every kind of outcome you actually ever see is equally likely, the likelihood of each is 2^(−nH), and the total number of them is 2^(nH). The fact that those multiply to one is by design, because very little lies outside, and the little epsilons are there to take care of the rest. That's why it's called the asymptotic equipartition property: in the large-n limit, any event that actually happens is just as likely as any other, to extremely high precision.

This is the real proof of data compression. Remember I asked you yesterday: how come, in the horse race, you were managing to use nH bits to transmit the races, whereas naively you should have needed nK bits, with K = log2 m, since that's what it takes to index every event that could happen? The reason you could get away with nH is that all the other events lie out in these tails, and those tails become arbitrarily small.

This suggests a method of coding that is different and new. Previously, the method of coding I used made guarantees on the expected length of the code — the average number of bits per message in the limit of a large number of messages — and it was an error-free code. Remember, for the horse race, for the instantaneous code, we could write down that little table: you could run n races and never make an error; you always convey the right answer. There are no errors there, but I only guarantee the expected length, and sometimes the actual length comes out longer. Now I'm going to ask you to do a completely different kind of code, and here's the game — I want you to tell me the answer. I'm only going to allow you nH bits. Previously I gave you any number of bits: sometimes you went over, sometimes under, but on average you used about nH. Now I'm only giving you nH bits, full stop, and I want you to encode the outcome of n horse races. How would you do it? Can you do it at all? You can't do it perfectly, because the total number of possible race outcomes is larger. So what do you have to give up?
Last time my guarantee was a length guarantee; now I have to make a different kind of guarantee. Imagine a customer who's buying my telegraph machine for horse-race transmission. Previously I said I'd guarantee the average number of letters, but this customer doesn't like statistics: "I don't like averages, I want certainty — you may only use this many bits." So you go back and say: well, then I cannot solve the problem perfectly, because the total number of ways the races could come out is much bigger than the number of messages I can send with nH bits. Something has to give — what? You only label the races in here, the typical ones, and then there's a chance something goes wrong. So what guarantee do I make to my customer? An error guarantee. The customer says "I want zero error", and you say that's impossible: the total number of ways the races can run is much larger than what fits in nH bits. Is this part clear? K is the log of the number of horses, so the total number of ways the horses can run is m^n, in other words 2^(nK). The number of distinct messages I can transmit with nH bits is just 2^(nH), and H is less than K. So something has to give, and the guarantee you offer is: you tell me an error, and I'll give you a code that works with that error — the fraction of days in the year on which I fail will be less than, say, 1% — but the length stays constant. So we've gone from an expected-length guarantee to a fixed-length code with an error guarantee, and this is going to be very important when we get to channel capacity.

Fine. Now that we know what kind of code we want, what code should we use? Picture some large set, which is all sequences — there are m^n of them — and inside it the set of typical sequences, A_ε, plus a bunch of sequences outside.

By the way, just to step back for a second: remember the horse-race example where I didn't get the probability distribution right — I built the code from some distribution q but the actual distribution was p. How did the code fall short of my guarantee? Remember the answer. If I assume probabilities q_i and set the lengths to l_i = ⌈log2(1/q_i)⌉, but the true probabilities are p_i, then my expected length suffers a little: it becomes at most H(p) — the entropy of the true distribution — plus one, plus a penalty, which is the Kullback–Leibler divergence D(p‖q). A penalty for being wrong. So remember the previous case: we made a code just as we always have — an instantaneous code, a bunch of bits — and we transmitted with it, but for some reason I was wrong; I built the code assuming q while the actual distribution was p. It's not catastrophic: the only thing that happens is that the expected length goes up a little bit.
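Here is a tiny numerical illustration of that benign penalty (the two distributions are made-up toy numbers, not anything from the lecture): it builds lengths ⌈log2 1/q_i⌉ from the wrong q, evaluates the expected length under the true p, and checks it against H(p) + D(p‖q) + 1.

```python
# Expected length of a code designed for the wrong distribution q,
# evaluated under the true distribution p, versus the bound H(p) + D(p||q) + 1.
from math import ceil, log2

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (hypothetical)
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution

lengths = [ceil(log2(1 / qi)) for qi in q]            # l_i = ceil(log2 1/q_i)
expected_length = sum(pi * li for pi, li in zip(p, lengths))

H_p = -sum(pi * log2(pi) for pi in p)
D_pq = sum(pi * log2(pi / qi) for pi in p)            # Kullback-Leibler divergence

print("expected length     =", expected_length)
print("H(p) + D(p||q) + 1  =", H_p + D_pq + 1)
```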
What happens here, though, if I assume the wrong probability distribution and use it to define the typical set? I built the typical set in the picture from the true distribution; what if I build it from the wrong distribution q while the truth is p? What happens to all the guarantees? If you've been following the argument so far, you should be able to tell me.

Yes — you're suggesting the error still goes to zero? No, and this situation is different from the D(p‖q) story, so let me step back and say it again. Everybody remember how we derived that result? It's a simple calculation; it answers the question "how long will my code be on average if I built it using q but the true distribution is p?" The only penalty you pay is length: your code still works, there's no error penalty, it's just slightly longer. Now I'm proposing to do something different: I'm only going to encode the sequences in this zone — there are about 2^(nH) of them, so I need only nH bits — and I'm not even going to encode the ones outside. Anything outside the zone causes an error, and my claim was that that error can be made arbitrarily small. So now I'm asking: what happens to my error if I got p wrong and assumed some wrong q? Everything goes to the wrong place. If my bin is in the wrong location, my probability of error goes to one, because this curve is guaranteed to bunch up close to the true H. If I get H off by even a tiny amount, my error is guaranteed to go to one as n goes to infinity.

So these are two very different ways this kind of coding can fail. Previously it was benign: I made a mistake and I pay for it by the expected length going from, say, 2.5 bits to 2.6 bits, but I can still perform — I missed my length guarantee, but only by a little. Here, if I only pay attention to the sequences inside and ignore the ones outside, I run the risk of a catastrophic error: if q is not equal to p, my band sits around the wrong value, the peak of the true distribution runs off somewhere else, nothing lands in my bin, and the error goes to one, because the bin was built on the wrong q. So when you do this kind of coding it's very important to know exactly where the typical sequences are. The statement is: if I build the typical set with the right distribution, my error goes to zero as n goes to infinity; if I build it with the wrong distribution, my error goes to one as n goes to infinity, and there's essentially no halfway zone between the two. Any questions about this?
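A quick Monte Carlo sketch of that catastrophic mode (p, q, ε, and the trial counts are all just illustrative): it builds the acceptance band around H(q) using the wrong q, draws sequences from the true p, and estimates the error probability, which should be seen climbing toward one as n grows.

```python
# Typical-set coding with the wrong distribution: sequences drawn from p are
# judged against the band built from q, and the rejection (error) rate -> 1.
import random
from math import log2

p = [2/3, 1/4, 1/12]        # true distribution
q = [1/2, 1/3, 1/6]         # wrong distribution used to build the typical set
H_q = -sum(qj * log2(qj) for qj in q)
eps, trials = 0.05, 2000

for n in (10, 100, 1000):
    errors = 0
    for _ in range(trials):
        seq = random.choices(range(3), weights=p, k=n)   # drawn from p
        label = -sum(log2(q[j]) for j in seq) / n        # judged against q
        errors += not (H_q - eps < label < H_q + eps)
    print(f"n = {n:5d}   estimated error probability = {errors/trials:.3f}")
```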
So now let me actually work out the code and the error. I want to make a code for this thing, and here's the code I'm going to suggest. If the whole sequence belongs to the typical set — and there are a bunch of such sequences in my code book, at most 2^(n(H+ε)) of them — then I encode it somehow. What's the easiest way to encode this collection of things? I have a bunch of events and I want to make a code in the old-fashioned way; you can think of each of them as a "horse". It happens to be a very large number of events, but what code should I use? The Shannon code? Fine — and what does the Shannon code look like in this case? Homogeneous. Why homogeneous? Because these are all essentially equally likely, so my coding problem has become trivial: no sequence in here is meaningfully more or less likely than any other, so there's no need to be clever. All the codewords have the same length: 000…0, 000…1, 000…10, and so on. How many bits is that? n(H + ε). Why am I allowed to do this? Because these are all uniform — I shouldn't waste effort giving more likely events shorter codewords when they all have the same likelihood. If they all have the same likelihood and I need this many distinct codewords, I just line them up. In practice, how would I do this? These are all strings of letters, so I could list the typical sequences alphabetically, and the codeword for a sequence is just its index in that list.

What if the sequence is not in the typical set — this is the typical set and that's its complement? Then we just send an error. How do we send an error? Maybe we reserve the all-zeros codeword for an error, and send that whenever one occurs. So we're done: what we now have is an error-guaranteed code. If somebody says "I want an error of epsilon, or delta," I find the n that makes it work, and I encode the sequences in the typical set using this many bits, because that's how many there are. And of course it won't work if my H is wrong — if my H is wrong, almost all the time I'll be sending the error symbol, because in practice nothing will land in my set.
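Before moving on, here is a toy end-to-end version of this error-tolerant code (a brute-force enumeration at a very small n; the alphabet, probabilities, n and ε are all toy choices): typical sequences get a fixed-width index as their codeword, and everything else maps to a reserved all-zeros error word.

```python
# Fixed-length typical-set code: list the typical sequences, sort them, and
# use each one's index (in binary, fixed width) as its codeword.
from itertools import product
from math import ceil, log2

p = {'a': 2/3, 'b': 1/4, 'c': 1/12}
H = -sum(pj * log2(pj) for pj in p.values())
n, eps = 8, 0.3

def label(seq):
    return -sum(log2(p[ch]) for ch in seq) / n

all_seqs = (''.join(s) for s in product('abc', repeat=n))
typical = sorted(seq for seq in all_seqs if H - eps < label(seq) < H + eps)
width = ceil(log2(len(typical) + 1))          # +1 leaves room for the error word

def encode(seq):
    if seq in typical:
        return format(typical.index(seq) + 1, f'0{width}b')   # indices 1..|A_eps|
    return '0' * width                                         # error codeword

def decode(bits):
    i = int(bits, 2)
    return typical[i - 1] if i > 0 else None                   # None means "error"

msg = 'aabacaba'
print("codeword:", encode(msg), " decoded:", decode(encode(msg)))
print("|A_eps| =", len(typical), " bits used =", width, " n(H+eps) =", round(n*(H+eps), 2))
```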
Now let me ask a final question. Suppose I didn't want an error-guarantee code, but again a length-guarantee code, using these ideas. I want zero error, but I still want to transmit information using this typical-set idea, assuming I got H correct. How would I do it? Yes — by adding bits for everything else. Okay, and what code would you use for all the others? Are the others all equally likely among themselves? No — even among the others, some are more likely and some less. You could use a Shannon code for them, but then you'd have to work out the probabilities of all those sequences, which we don't actually know and don't want to compute. Try something even easier. For the typical ones we use about nH bits; for the others, what's the worst case? Pretend they're all equally probable — then how many bits do you need? Not one: you need nK bits, that is, n log2 m. We pretend we know nothing about the statistics of the others, treat them as uniform, spend nK bits on each, and hope we rarely have to.

So what is the expected length of this code we've just made up? If x does not belong to A_ε, just use nK bits. The expected length is the sum over all sequences of p(x) times l(x), and we split it into the two classes, inside and outside the typical set — they cover the space and they're mutually exclusive: the sum over x in A_ε of p(x) l(x), plus the sum over x not in A_ε of p(x) l(x). For the first term, how many bits are we using? n(H + ε), plus one — you always have to add one in this kind of coding because these quantities may not be integers. Same for the second term: for x not in the typical set, the length is n log2 m, or nK, plus one.

But there's something missing: this is not yet an instantaneous code. The codewords are going to use up all the binary strings, so how do you know up front whether I'm encoding something inside or outside the typical set? When I'm decoding, I have to know whether to stop after n(H + ε) bits or after nK bits. How do you tell the receiver? There are only two sets, so you can just prefix one bit: start with a 0 if the sequence is in the typical set and with a 1 if it's outside, and append that to the code we already had. Then I know how far to read — if it starts with 0 I read n(H + ε) more bits, if it starts with 1 I read nK more bits — so it is an instantaneous code. And if I'm adding that one bit to everything, the extra term is +2, not +1: one for the ceiling and one for the flag. Then we just have to go through the motions.
So the expected length becomes at most P(A_ε) · (n(H + ε) + 2) + P(A_ε^c) · (n log2 m + 2). The two +2 pieces together contribute exactly 2, because the two probabilities add up to one. Then I'm going to be very sloppy: the total probability of the typical set is certainly less than or equal to one, and the probability of the non-typical set — everything in the tails of this distribution — is at most epsilon. So the expected length is at most n(H + ε) + ε · n log2 m + 2. Shuffling this around, you get E[L] ≤ n(H + ε′), where all the epsilons come together into ε′ = ε + ε log2 m + 2/n, and that ε′ can be made as small as you want. It's the usual game. So if you tell me you want a length-guaranteed code instead of an error-guaranteed code, I can still do it using this kind of compression.
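Going back over that bound, here is a quick exact-enumeration check of the two-part code's average length (toy alphabet and ε again; the lengths use one extra bit for the ceiling and one for the flag, as above): the average should stay below n(H + ε′) with ε′ = ε + ε log2 m + 2/n.

```python
# Expected length of the two-part code (typical -> short fixed length,
# non-typical -> n*log2(m) bits), versus the bound n(H + eps').
from math import comb, log2, ceil

p = [2/3, 1/4, 1/12]
m = len(p)
H = -sum(pj * log2(pj) for pj in p)
n, eps = 60, 0.2

len_typical = ceil(n * (H + eps)) + 1        # index width rounded up, plus flag bit
len_other = ceil(n * log2(m)) + 1

expected_len = 0.0
for na in range(n + 1):
    for nb in range(n + 1 - na):
        nc = n - na - nb
        prob_one = p[0]**na * p[1]**nb * p[2]**nc
        deg = comb(n, na) * comb(n - na, nb)
        label = -log2(prob_one) / n
        is_typical = H - eps < label < H + eps
        expected_len += deg * prob_one * (len_typical if is_typical else len_other)

eps_prime = eps + eps * log2(m) + 2 / n
print("expected length per symbol:", round(expected_len / n, 4))
print("bound H + eps'            :", round(H + eps_prime, 4))
```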
I'm going to stop with the AEP for now and just summarize what we found. If you don't want to go through the epsilon–delta proofs, that's fine — the idea is simple: when you have a large number of independent, identically distributed events, then at the end of the day the kinds of outcomes that actually occur are all essentially equally likely, there are 2^(nH) of them, and the likelihood of each is 2^(−nH). From that idea you can work out everything else. And all I've shown is that these statements don't only hold in the infinite-n limit: for any practical error or length guarantee you give me, I can find some finite n where these codes actually work. So it's not purely a large-n-limit idea — if you're happy with, say, ten percent error, the n may not even be that large. Are there any questions before I move on? Yes — for this inequality to hold, the smaller you make epsilon, the larger n_0 has to be for the whole thing to work out; but in practice, for any practical case, if you give me an epsilon I'll carefully work out an n that's guaranteed to make it work.

So I've said two things. The first is the very simple one: when you have a bunch of independent, identically distributed events, there are 2^(nH) of them and they all have probability 2^(−nH). This fits with what I said last time: whenever somebody asks what entropy is, entropy is the answer to the question "how many are there" — and if I ever ask you how many there are, you should come back and say 2^(nH). The second thing is that this is actually a practical theory; it works even at finite n. Those are the two lessons. Are there any questions? Yes — about this inequality? So far it's all correct; all I'm doing is being very sloppy: the probability of the typical set must be less than or equal to one, that's one part, and the probability of the non-typical set must be less than or equal to epsilon, that's the other. I could make the bound tighter, but I'm deliberately being sloppy. Okay, I'm going to move on.

This is important, and it will matter on Friday when I give you the proof of Shannon's channel capacity theorem, which I'm not going to do today — you don't need it for your exam, so I'm leaving it for the last day, the cherry on the ice cream sundae. The channel capacity theorem relies on exactly the same kind of ideas we used today, so let me unpack them. The first idea is that the code only works if you're willing to wait: you wait for n races before you decode the answer for any of them. The real power of these compression ideas does not show up in instantaneous coding — if you want an instantaneous code in the presence of error, there's no way to get these gains. Shannon's single most important idea is that you're happy to wait a long time and decode a large number of events together. That's why it's called a block code: you decode a block of a thousand events, or a million events, and it's only by being willing to wait for the entire block that you extract the usefulness of these results. That's the first thing. The second thing is that Shannon's theorem gives an error guarantee: if you come to me and say "I want the error to be less than 0.001", the theorem turns around and says "here's the n that works for that". So it allows a small error, for large n.

Now, what is really interesting about the capacity theorem, and about this result? If I take a codeword and send it through a channel, there are going to be errors — some zeros flip to ones, some ones flip to zeros; that kind of thing happens, and we'll go over it on Friday. Of course I can reduce the error by sending the same codeword again and again; the error then goes down geometrically — if the error probability is p, sending the codeword twice gives roughly p squared, three times p cubed, and so on. Shannon's theorem is not that trivial statement. It is not saying that if I make the total number of bits per message very long the error goes to zero — that's obvious. Shannon's theorem says I can keep the number of bits per race constant and still drive the error to zero. Then you ask: what is the large parameter that lets the error go to zero? It's not the number of bits per race; it's the total amount of time you're willing to wait before you decode all the races — which is exactly the kind of idea behind this coding. Are there any questions about that?
That's how the magic actually works. Any questions? We'll go over this again on Friday, and now I'm going to start defining the idea of mutual information, which is the key to defining channel capacity and proving the theorem.

Yes — no, the typical set is fixed if p is fixed. Of course, in a real physical system the distribution might change, but the point is that information theory is a theory of limits: it tells you what you can and cannot do in a certain number of bits. It's not a theory of coding. There's an entirely different discipline called coding theory which tells you how to construct these codes. Typical-set coding, as a practical matter, is a painful experience: you have to find all these sequences, figure out which ones are in the set, make the list, share it with your friend — this is not how you do it in practice. What we proved here is a limit: you can do no better than about nH bits, that's for sure. You can always do worse, and in some limits you can approach nH, though those limits may in principle, or rather in practice, be very hard to decode. Shannon did not spend time thinking about the efficiency of decoding; that whole question of decoding complexity belongs to the modern theory of computation — complexity classes, algorithms and so on. I'll leave this on the board, since we took a lot of time to derive it.

Now we're going to move on to a slightly different thing. Remember: entropy is the answer to the question "how many". The formula may be minus the sum of p log p or whatever, but that's not the important thing; the answer to "how many" will be something like 2^(nH), and therefore when you make these plots, the log of "how many" grows linearly in n, and the slope of that line is H — I went over this last time.

So now we get to the idea of the sharing of information. This is really the meat of the information theory course; I've spent a lot of time on entropy and coding because all the tricks and techniques for proving theorems that we picked up in that part of the course will be needed to prove the final result. So let's step back — forget epsilons, forget deltas — and think intuitively. Suppose there are Ω options, Ω possible messages. I'm not even talking about m possible messages; I'm talking about the results of large numbers of races and so on, some very large class of possibilities. Typically this class of possibilities carries some label n, which describes how many individual events happen before the result is revealed, and it's so large that we'll just represent it as 2^(nH). That's what this curve is: it's log Ω, and if you're used to statistical physics, this is exactly Boltzmann's fundamental contribution.

So what is the role of information? Call the number of possibilities Ω_0 to indicate "before you saw some further piece of information". After you see something, the number of possibilities decreases.
Here's a trivial example. You have a bunch of people in this room, and the set of possibilities I have is the list of all your names. Then somebody comes and says, "I have a piece of information: the person I'm thinking about is sitting on the left side of the room." That's one bit of information, and the number of possibilities drops. So Ω₁, the number of possibilities after you see something, is less than or equal to Ω₀, the number before. The question is: how would you like to capture that collapse numerically? One perfectly viable way is just to quote the numbers: there are 50 people in the room, so Ω₀ is 50, and once I have one bit of information, that the person is sitting on that side, there are 25 options left, so Ω₁ is 25. But since these Ωs are very large, we choose to measure information by the exponent; we take logs. So we define the information as the log of Ω₀ over Ω₁.

I'm trying to build intuition for why we measure things this way, and I've said two things. First, it's useful to measure things in logs because these are big numbers. Second, measuring in logs gives us certain additivity properties that are quite intuitive. How much information did I get from learning that somebody is sitting on that side of the room? Literally 50 divided by 25. You could count that as two, or you could count it as log two; I don't really care, as long as you remember which unit you're talking about. If you take the log in base two, the answer is in bits, the same unit of information as before.

Now let me ask you another intuitive question. These days you can buy a one-terabyte thumb drive. I give you a one-terabyte thumb drive, and there's some notion of how much you can store on it. I give you a second one-terabyte thumb drive, and there's some notion of how much you can store on that one too. Intuitively, if I give you two drives, you have twice as much storage as with one; you don't say you have some product of storage capacities. So additivity is a very natural requirement for a measure of information. It works with our intuition, and that's yet another reason why we choose to measure things in logs. But if you're really obstinate, you can go back and literally live in a world where you count the number of options before and the number of options after, and you can build a perfectly good theory of information just doing that. These are just conventions: conventions that allow us to represent big numbers using small numbers, and conventions that allow us to say simple English sentences like "you have twice as much storage with two hard drives as with one." Are there any questions?
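A one-line check of the convention, using the numbers from the room example; the second pair of numbers below is made up, purely to show the additivity.

```python
import math

# Information as log2(options before / options after), per the room example.
print(math.log2(50 / 25))                      # 1.0 bit

# Two independent pieces of information: the ratios multiply, so the logs add.
print(math.log2((50 / 25) * (8 / 2)))          # 3.0
print(math.log2(50 / 25) + math.log2(8 / 2))   # 3.0
```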
In Shannon's original paper, he shows that with these conventions there is literally no other way to measure information that satisfies the simple properties we want, other than the definition I'm going to show you. From some very simple requirements you might have in mind, the definition of information comes out absolutely unique, up to these conventions and up to one other harmless thing, the base of the log. The base of the log is just the unit: you can measure in bits, in log base 3, or in log base e. That's the only degree of freedom you have when it comes to measuring information. It's worth going and reading that paper.

Okay, so this part is easy. And in real life, always, when I have some number of options between which I'm going to choose and then some information is revealed, the number of options afterward is going to be less than or equal to the original. So this quantity is going to be greater than or equal to zero. Is that fine? So this information has a nice positivity property. In particular, for this room: there were 50 people before, there are 25 people on that side, and the information you gave me was "left or right." Intuitively that works out, because the log of 50 divided by 25 is exactly one. I got one bit of information from you, and I exploited it to cut the number of options in half. Simple stuff.

Now it's actually going to turn out not to be so simple, and to see why, I'm going to give you a simple table. And always, I want you to keep this sentence in mind: the answer to the question "how many" is always 2^{nH}. And I, by definition, is the log of the Ω before you saw something over the Ω after you saw something, and that's going to be greater than or equal to zero.

The idea of "before" and "after" seeing something brings us to confront the idea of multiple random variables. So far we've only been looking at one random variable, X. Now we're going to be seeing another random variable, Y, and we're going to have to confront joint probability distributions over x and y. This is the fundamental object of Shannon's information theory, these joint distributions, and I'm going to draw you one as a little matrix. We're going to spend the rest of today's class just exploring the properties of this matrix. So here's a matrix: this axis is x, this axis is y. In this example y can take on four values and x can take on four values, and here's the joint distribution.

Okay, so that's the mathematical setup. In practice, what are these x's and y's? They are real things in the real world. They are both random variables, but they influence each other. It could be that the number of birds sitting in the tree, which is a random variable (every time I look it's a different number), has something to do with the temperature of the day, which is also a random variable.
Every time I measure the temperature, it's a different number. So x could be the number of birds and y could be the temperature, and the question we want to ask is: to what extent does measuring the temperature allow us to estimate the number of birds? The answer to that question is basically this: how many possible numbers of birds could there be, that is, what are all the options before? And once I measure the temperature, how many options are left after? I divide one by the other, take the log, and I say that's how much information the temperature gives me about the birds. Or any other pair of things you want to imagine. So far so good; it's very, very simple.

Now, of course, for a probability distribution I don't have an explicit integer that is "the number of options." So in what sense am I even given a number of options? I'm given one because I'm always going to assume that every Ω is of the form 2^{nH} for some entropy H. We spent the whole of the last several days proving that H is the answer to "how many," and that the answer is 2^{nH}. So whenever I see a "how many" question, somewhere lurking in the background is a 2^{nH}. The only thing we have to do is cleverly figure out what the formula for H is. We did that for codes: the formula was H = - Σ p log p. But that's for one-dimensional distributions. Now we have to figure out what the formulas for H are when there are two variables, so let's work it out.

Everybody understands what a joint probability distribution is: the sum over x and y of p(x, y) equals one. Adding up all these numbers is easy here, because each row happens to add up to one quarter and there are four rows, so the whole thing adds up to one. Everybody also knows how to make marginals out of joint distributions: p(x) is the sum over y of p(x, y), and p(y) is the sum over x of p(x, y). You also know how to extract conditionals: p(x, y) can be written as p(x) times p(y | x), and equally well as p(y) times p(x | y), where p(x) and p(y) are the marginals we just defined. Flipping this around gives the definitions p(y | x) = p(x, y) / p(x) and p(x | y) = p(x, y) / p(y). This is very simple stuff; I'm just writing it down to make sure it's all on the board.

The way I like to think about these conditional distributions, as I mentioned very early on in the stochastic-processes part of the class, is that the thing on the right of the bar is just a label, and the thing on the left is the random variable. So the normalization condition for an object like p(y | x) is that when you sum over y it must give one, independently of x, and that works because you divide out the probability of getting that x in the first place. Are there any questions about this?

Everybody can derive all these things. For example, if I ask you: what is p(x) given that y is equal to two? It's this row times four, because the sum of this row is one quarter. If you're asking for p(x | y), it has to be a normalized probability distribution; this row as it stands is not normalized, so you have to divide by the sum of the row to normalize it.
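Here is a small sketch of that bookkeeping. The exact entries of the board matrix aren't fully legible in the transcript, so the matrix below is an assumption, chosen to be consistent with every number quoted in class: rows summing to 1/4, p(x) = (1/2, 1/4, 1/8, 1/8), uniform p(y), and the y = 2 row proportional to p(x). It happens to be the standard textbook example with those properties.

```python
import numpy as np

# Assumed 4x4 joint distribution p(x, y); rows are y = 1..4, columns are x = 1..4.
P = np.array([
    [1/16, 1/8,  1/32, 1/32],   # y = 1
    [1/8,  1/16, 1/32, 1/32],   # y = 2
    [1/16, 1/16, 1/16, 1/16],   # y = 3
    [1/4,  0.0,  0.0,  0.0 ],   # y = 4
])

assert np.isclose(P.sum(), 1.0)        # the joint distribution sums to one

p_x = P.sum(axis=0)                    # marginal p(x): sum over y (down the columns)
p_y = P.sum(axis=1)                    # marginal p(y): sum over x (along the rows)
print("p(x) =", p_x)                   # [0.5  0.25 0.125 0.125]
print("p(y) =", p_y)                   # [0.25 0.25 0.25  0.25 ]

# Conditional p(x | y = 2): take the y = 2 row and renormalize (multiply by 4).
print("p(x|y=2) =", P[1] / p_y[1])     # same as p(x) for this matrix
```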
That's what this does. Any questions? Okay. In the same way that we defined all the probabilities, and remembering that everything we've done over the past three days doesn't depend on what the labels are (the names of the horses, which letter is which, and so on), I can simply pretend that this four-by-four matrix, which is a probability distribution, is in fact a one-by-sixteen vector as far as information theory goes. So I can easily define H(X, Y) in the same way we defined entropy earlier; this is not even a new thing:

H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y),

the entropy of the joint distribution, with the minus sign. And of course I can always define H(X) and H(Y) by using the marginals; I'd always do that kind of thing.

Now, here's the interesting question I want to ask. I know something about x, and the number of options there were for x is 2^{nH(X)}; if x were the number of birds over the whole year and so on, the number of options for x is measured by the entropy H(X). In other words, 2^{nH} is the number of ways that n races could have been run when x is the alphabet. How do I measure the number of options for x once I've seen some value of y, for example y = 2? Just look at this matrix and tell me the obvious answer. Right: replace the distribution. If I've seen y = 2, then the number of options for x is given by the entropy of that row. So in the earlier language, Ω₀, before I see anything, is 2^{nH(X)}, while Ω₁, after I've seen something, is 2^{nH(X | Y = 2)}, or y = 4, or whatever it is. So what is the actual entropy of that row? The definition is

H(X | Y = y) = - Σ_x p(x | Y = y) log p(x | Y = y),

where I've used the lowercase y to indicate 1, 2, 3, or 4. It's nothing but taking that row, normalizing it to get this conditional distribution, and then using your standard entropy definition. Fine.

And now you will instantly see there's a problem, and the problem arises for the following reason. Let's work out the marginals. Adding down the first column: one sixteenth plus one sixteenth is one eighth, plus one eighth is one quarter, plus one quarter is one half. The remaining columns give one quarter, one eighth, and one eighth, so p(x) is (1/2, 1/4, 1/8, 1/8). Each row sums to one quarter, so p(y) is (1/4, 1/4, 1/4, 1/4). Okay. Initially x has some entropy. Can you say something about that entropy? That entropy is certainly less than 2. How do I know the entropy is less than 2?
Because it's not uniform. If it were uniform over four options, the entropy would be 2; so the entropy of x is something less than 2. Now suppose I know that y is equal to 3. Then what is the entropy of x, given that y is equal to 3? It's 2, because that row is uniform. So, unlike the very intuitive statement that after having seen something the number of options goes down, here the entropy of x was something less than 2, because the marginal is not uniform, but after seeing y = 3 the entropy appears to have gone up. If you tried to define information based on this, you would write down something like n [ H(X) - H(X | Y = 3) ] (forget the n, it's just a scaling factor), and unfortunately in this case that quantity is less than zero.

Here's an easier one, though: if y is equal to 4, what is the entropy of x given y? Zero. In that case you've gone from some entropy less than 2 down to 0, which is totally fine. But the other case is very bad news. And what about this row: what is the entropy of x if y is equal to 2? You have to multiply the row by 4, and you get one half, one quarter, one eighth, one eighth; it's the same as p(x). So: by observing this row you get no collapse; by observing this one you get no collapse; by observing this one you get a full collapse; but by observing this one, the number of possibilities went up. So something is wrong with our candidate definition of information, which we were hoping would be exactly like this, one H minus the other H. It doesn't actually work. So how do I fix this?

Is the problem clear? When I set up the motivation intuitively, it was very obvious: initially there were more options, and the number of options could not possibly increase. I know the number of options is something like an entropy, and I want a formula where the information is never negative; I can't lose options by seeing something. This definition doesn't have that property: here you gain no information, here you gain no information, here you gain a lot of information, and here you apparently lose information by seeing something, which obviously can't be right. So how should I fix this problem? There's a very easy fix. Suppose I define I(X; Y = y) as H(X) - H(X | Y = y). That's just the same definition as before: if the Ωs are 2^{nH}, it's the same thing except for a factor of n. But this doesn't work, because this I can sometimes be negative. So what's the easiest way to make this I positive? Taking the absolute value? Oh my god, no, no. Look at this matrix: here you get nothing, here you get nothing, here it collapses completely, here it goes up. That's no coincidence. So what's a good way to make the whole thing positive? Average it out.
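Continuing with the same assumed matrix as in the sketch above, here is the row-by-row calculation that exposes the problem: one of the conditional entropies comes out larger than H(X).

```python
import numpy as np

# Same assumed 4x4 joint distribution as above (rows are y = 1..4).
P = np.array([[1/16, 1/8,  1/32, 1/32],
              [1/8,  1/16, 1/32, 1/32],
              [1/16, 1/16, 1/16, 1/16],
              [1/4,  0.0,  0.0,  0.0]])

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p)) + 0.0   # + 0.0 just normalizes a possible -0.0

p_y = P.sum(axis=1)
print("H(X) =", entropy(P.sum(axis=0)))                      # 1.75 bits
for y in range(4):
    print(f"H(X | Y = {y + 1}) =", entropy(P[y] / p_y[y]))   # 1.75, 1.75, 2.0, 0.0
```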
Because y itself is a random variable: it's not that every time you run this experiment you're going to get y = 3. In this example, y occurs with equal likelihood in each of these rows, and when it does, sometimes you lose information, sometimes you gain information, and sometimes you neither lose nor gain; but you hope that on average you gain. Okay, so this is going to be our definition of information. We define it this way to fit our intuitive notion that this information has to be positive, because if we define it the naive way, it doesn't work out. So here's what we're going to do: we're going to average, weighting by the probability that y equals each particular value little y, and we're going to sum over all values of y. We define that as the information between these two variables. But now the labels have gone, because I've averaged over the values of y, and I had already averaged over x to calculate the entropy, so there are no more labels; this just gives a single number that characterizes the whole matrix. Okay, cool.

So I have just enough time to finish this up. Everybody's with me, right? I'm trying to define something useful. I failed in my first attempt; I'm now attempting again, and this looks plausible. Unfortunately we'll have to plug a lot of stuff in there to see what this formula becomes, so let's go for it. The definition is a sum over y of p(y) times a bracket, where p(y) is itself Σ_x p(x, y). The first term in the bracket is H(X) = - Σ_x p(x) log p(x), and the second term, which gets subtracted, is the uglier one, H(X | Y = y) = - Σ_x p(x | Y = y) log p(x | Y = y); and the whole bracket gets multiplied by p(y). All I've done is expand the H's, which are after all defined in terms of the p's. When the dust settles (and I only have two minutes, so I don't want to kick up the dust, but assume it was kicked up and there were three or four lines of algebra here), you'll find that there's a single formula:

I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ].

It literally just comes out if you go through all these additions and summations, substitute the conditional probabilities the correct way, and so on. So trust me; it's one of those derivations where there would be fifteen mistakes on the board before we got to the answer.

There are a few surprising things about this, but before that, I'm going to say one last thing. In the expression we started from, this H(X) is independent of y, so it comes out of the sum, and the sum over y of p(y) is one, so that part is just H(X). The other part is the entropy of each row, considered as its own probability distribution, averaged over all possible rows. That second part is given a special notation: it's called H(X | Y), and it's defined as

H(X | Y) = Σ_y p(Y = y) H(X | Y = y),

where H(X | Y = y) means the entropy of a single row (that's why I put the little y there), and H(X | Y), without a specific value, is not the same thing: it's the average. It's called the conditional entropy, and it is the average entropy of each of these rows, considered as probability distributions, weighted by the chance of getting those rows. Having done all that, I'm going to write the final answers up somewhere on the board.
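With the same assumed matrix, here is a hedged check that the averaged definition and the single formula it collapses to give the same number.

```python
import numpy as np

# Same assumed joint distribution as in the earlier sketches.
P = np.array([[1/16, 1/8,  1/32, 1/32],
              [1/8,  1/16, 1/32, 1/32],
              [1/16, 1/16, 1/16, 1/16],
              [1/4,  0.0,  0.0,  0.0]])

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x, p_y = P.sum(axis=0), P.sum(axis=1)

# Averaged definition: sum over y of p(y) [ H(X) - H(X | Y = y) ].
I_avg = sum(p_y[y] * (entropy(p_x) - entropy(P[y] / p_y[y])) for y in range(4))

# Single formula: sum over x, y of p(x, y) log2[ p(x, y) / (p(x) p(y)) ].
I_formula = sum(P[y, x] * np.log2(P[y, x] / (p_x[x] * p_y[y]))
                for y in range(4) for x in range(4) if P[y, x] > 0)

print(I_avg, I_formula)    # 0.375 0.375 (bits)
```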
Let me see if I can make myself some space. This has to go, and this has to go. Okay, so here are the formulas. The information formula is already there, and it turns out that I(X; Y) can be written as

I(X; Y) = H(X) - H(X | Y),

which is exactly the averaged expression, with our new definition of H(X | Y). Splitting the log in the single formula into its numerator and denominator parts, it can also be written as

I(X; Y) = H(X) + H(Y) - H(X, Y),

and since the whole thing is symmetric in x and y, it can equally well be written as

I(X; Y) = H(Y) - H(Y | X).

Because it's symmetric, it's called mutual information. Strangely, it doesn't depend on which variable is x and which is y; it doesn't establish causality. These formulas are encapsulated in a little mnemonic. Draw two overlapping circles: the whole figure-eight region represents H(X, Y), one circle represents H(X), the other circle represents H(Y), the part of the Y circle outside X is H(Y | X) (the uncertainty reduces by exactly that wedge), the part of the X circle outside Y is H(X | Y), and the lens in the middle is I(X; Y). You can verify that this little Venn-diagram trick gives you the right formulas. It's not a deep thing; it doesn't even work for three variables. It's just a trick, not a geometric derivation of mutual information; the derivation is what we wrote down here.

Yes? Excellent question; I'll answer it now. Just to finish off the point before I answer: remember why we even went down this route. We went down this route because H(X) minus the entropy of each row was sometimes positive and sometimes negative, and we hoped (we didn't know, but we hoped) that by averaging over all the rows, this I would become positive. Now, can somebody look at the board and tell me whether they think I is always going to be positive? Is there any reason to think so? The individual terms could be positive or negative, because p(x, y) could be greater than or less than p(x) times p(y). And what is p(x) times p(y)? It's the distribution we would get in this table if x and y were independent. In fact we can write it down: each row is one quarter times p(x), so it's basically one eighth, one sixteenth, one thirty-second, one thirty-second, all the way down. So this is p(x) times p(y), the distribution you would get if x and y were totally independent, and this is p(x, y), the actual joint distribution, which indicates that they somehow influence each other; they are correlated. If x and y were truly, totally independent, then p(x, y) would exactly equal p(x) times p(y), each of these terms would be zero, and measuring x would give you no information about y, and vice versa, because I would be zero. Okay, so that case is very easy. My question is: when we add up a bunch of positive and negative terms like this, how do we know the result comes out positive? How do we know the positives will outweigh the negatives?
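For the record, here is a quick numerical check, with the same assumed matrix as in the earlier sketches, that the three expressions on the board agree, and that for this matrix the total does indeed come out positive.

```python
import numpy as np

# Same assumed joint distribution as in the earlier sketches.
P = np.array([[1/16, 1/8,  1/32, 1/32],
              [1/8,  1/16, 1/32, 1/32],
              [1/16, 1/16, 1/16, 1/16],
              [1/4,  0.0,  0.0,  0.0]])

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x, p_y = P.sum(axis=0), P.sum(axis=1)
H_X_given_Y = sum(p_y[y] * H(P[y] / p_y[y]) for y in range(4))
H_Y_given_X = sum(p_x[x] * H(P[:, x] / p_x[x]) for x in range(4))

print(H(p_x) - H_X_given_Y)       # 0.375
print(H(p_y) - H_Y_given_X)       # 0.375
print(H(p_x) + H(p_y) - H(P))     # 0.375
```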
Exactly, it's a D. This is actually a Kullback-Leibler divergence between the joint distribution and the independent, product-of-marginals distribution, and we already proved that such a divergence is greater than or equal to zero, and equal to zero exactly when the two distributions are the same. Okay, so we've come to the end of the class, and I think it's a very surprising result; it's the sort of thing you have to wrap your head around. Observing certain values of y can actually increase your uncertainty about x, but that means some other value of y has to compensate for it, and it literally has to be that way for any joint distribution you write down. It provably has to be that way, because this quantity is a KL divergence and is therefore non-negative; but intuitively, why it has to be that way, I don't have a very simple answer for you. In a sense, this is Shannon's genius: he pulled this out as the right measure, the measure that captures our intuitive idea that the Ω before must be greater than the Ω afterwards, with the log of that ratio being n times the information.

Okay, so we'll stop here; I think this is actually a perfect place to stop. So what's going to happen now? Today at five o'clock we have a tutorial. Tomorrow at 11:15 we have the exam. On Friday at 11:15, after your other exam, I'm going to start off from this definition of mutual information and go through the proof and the explanation of Shannon's channel capacity theorem. The homework solutions will be posted online, so if you haven't submitted already, it's almost too late; you should have submitted yesterday, and you have a few seconds left to submit. Speak to me, because I want to check that I got your emails and that I got everything right. And please do make sure that he has your homework, because the homework is 50 percent of the grade. Okay.