I noticed that some people have obviously seen this material before, in particular many of the people in the master's program. So I don't know if I've just gone over things you already saw, but I hope you've at least been looking at it in a slightly different way than you were first exposed to. For me, it's always nice to learn the same thing again and again; the fifth time, I finally understand it.

Okay, so what we're going to do now is the capstone of what I planned to do. Any course on information theory should, at least once, go through Shannon's channel capacity theorem, so that's what we're going to do, and let me write down all the conditions here. The setup of information theory is what we've been saying all along: there's a sender, there's a receiver, and there's some sort of code book — you agree in advance on which messages you might want to send. Everything up to now, through the second week of this course, has been about data compression. In fact, the exam had a question on data compression: how do I send messages to my friend using the fewest number of bits, and make sure my friend decodes them accurately, with arbitrarily small error?

So it's the same setup. There are M different messages, and we write k = log₂ M for the naive number of bits you need: if there are eight messages, you naively need three bits, and trivially, using the code words 000 through 111, you can always transmit the message. There's a code book, as usual, with M rows — the messages — along with their probabilities, their lengths, and their code words. The naive code for eight messages is just the eight three-bit strings, which cover everything.

But in practice we don't send k bits; we send some q bits, where q ≠ k. In fact, the whole time until now, q has been less than k: that's the data compression limit. Compression works on average — the average length is some number of bits per message — and all these results hold in the limit where n, the number of messages sent, is large. So if I want to send n messages repeatedly, I would have naively thought I need n·k bits, which means there are 2^{nk} possible collections of signals that could occur. Instead I send n·q bits, and the q we calculated was the entropy H, with H < k. That's data compression. And the reason I could get away with it is that of all the 2^{nk} things that could have happened — for your eight-horse race, any horse could have won; the worst horse could have won every one of a thousand races — only a comparatively small number of situations actually occur in practice, and those are called typical. We know that 2^{nH} is the number of typical sequences, and it's only the typical sequences we have to encode, because the chance of an atypical one is some small epsilon.
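(Aside: here is that typical-set counting in a few lines of Python. The horse-race probabilities below are made up for illustration; they are not from the lecture.)

```python
import math

# Hypothetical win probabilities for an 8-horse race (made up for
# illustration; any distribution with H < 3 bits makes the point).
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

H = -sum(pi * math.log2(pi) for pi in p)   # entropy: bits per message
k = math.log2(len(p))                      # naive bits: log2(8) = 3

n = 1000                                   # messages sent in a block
print(f"H = {H} bits per message, naive k = {k}")
print(f"possible sequences: 2^{n * k:.0f}")   # 2^3000
print(f"typical sequences:  2^{n * H:.0f}")   # 2^2000, vastly fewer
```

With this particular made-up distribution, H comes out to exactly 2 bits, so the typical set has about 2^2000 members out of 2^3000 possible sequences.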
If you give me a very small epsilon, I'll say: well, you might have to use a larger block length to achieve that epsilon. But once you've done that, everything works out. Are there any questions about data compression? No questions.

So now we're going to work in the other limit. Today we work in the case where q is greater than k. For some reason — the reason, of course, is errors — we're going to send more bits than are naively necessary to encode the situation. Take a simple example: suppose one of the bits could be flipped due to noise. If I'm using the naive code, which is already pretty long — three bits per message, eight messages — then if any bit flips, the receiver will make an error, because the message you sent now looks like some other valid message. They won't even know there was an error, and that's the most dangerous thing. So when you're transmitting information in the presence of possible corruption, you need a couple of things. At the very least, the receiver must know there was an error, so they can ignore that message. At best, they should be able to reconstruct the message they think you meant to send, even though there was an error. That's the whole game. In order to do that, you have to build some excess redundancy into the code, so you're going to have to send more than k bits. It's that simple.

So we're going to define the rate of a code. The code words are not going to be length k but length q, and the rate — I think of it as the efficiency, but in the information theory literature it's called rate — is R = k/q. A very efficient code has rate one: you're sending just as many bits as you need to index all the messages. In a very inefficient code the rate is much less than one: there are only eight messages, so k should be three, but you're using 30 bits. With 30 bits you could have sent 2^30 different messages — incredibly many — so in that sense the code is inefficient. Of course, you need that inefficiency in order to fix errors. Technically it's called rate; practically and intuitively you can think of it as efficiency.

So here's the setup of the problem; everything above was just background flavor. There's the sender, and they send some message W, where W is the index of a row in the code book: they pick one message out of the M. You look up the code book — somebody has given it to you already; that's the encoder — and out comes a string of bits, zeros and ones, q of them. So the length of each code word is q. And then here's the most important piece. Previously, in the whole class so far, the receiver simply gets the exact same string back — there's no error — and guesses what the message was.
Everything we've done so far assumes the receiver gets exactly the string you sent, and the only trick is decoding. We talked about many kinds of decoding: instantaneous codes, and the typical-sequence code, where one of the outputs the receiver can get is the one that says "error" — at least then they know there was an error, and you can make the chance of error arbitrarily low.

But now we're going to put a channel in the middle. What does the channel do? The channel takes an input x^q — I'm staying consistent with my notation from the earlier part of the class, so q is always the length — and gives an output y^q. That is, the input is a string x₁, x₂, …, x_q, and the output looks like y₁, y₂, …, y_q. But it messes up the signal, so you don't always get the same bit coming out: send a 1, you might get a 0; send a 0, you might get a 1. Bad things happen. And now you realize why, last class, we spent so much time on joint probability distributions of two variables — temperature and the number of birds — and on the definition of mutual information. That definition is going to come back, in a very useful way, as the answer to a very relevant question.

So what is this p(y|x)? It comes from the usual thing, a joint distribution p(x, y): some table of rows and columns, a joint distribution of x and y, and this joint distribution characterizes the behavior of the channel. The physics of the channel is encoded in it. How do you find it? The same way that, earlier, the pᵢ — the chances of the various messages, the chances of the various horses winning — were something you had to go out into the real world and work out. The p(x, y) encodes the physics of the channel; it depends on the situation, and we'll assume you know everything, so you know p(x, y). And p(y|x) works the usual way: p(y|x) = p(x, y)/p(x), the conditional probability distribution — each x that goes in induces a distribution on y.

Before we look at a couple of channels in practice, remember this mnemonic from last time: the whole double circle represents the joint entropy H(X, Y); one circle is H(X), the other is H(Y); the little crescent on one side is H(X|Y), the one on the other side is H(Y|X), and the overlap in the middle is I(X;Y). We went over all of this last time, so I won't go over it again, but we had a formula for I, measured in bits, and we called it the mutual information. Mutual information is essentially a Kullback-Leibler divergence between the actual joint distribution and the distribution you would get if the two variables were independent but had the correct marginals. So we have all these properties; just recall all this stuff.
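(Aside: since we'll be computing I(X;Y) repeatedly below, here is a minimal Python sketch of exactly the definition just recalled — the Kullback-Leibler divergence between the joint distribution and the product of its marginals. The function name and conventions are mine, not the lecture's.)

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint distribution given as a 2-D array
    (rows = values of x, columns = values of y), computed as the
    Kullback-Leibler divergence D( p(x,y) || p(x) p(y) )."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of y
    prod = p_x @ p_y                         # "independent" version
    mask = p_xy > 0                          # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / prod[mask])))
```

For example, `mutual_information([[0.25, 0.25], [0.25, 0.25]])` returns 0: independent variables share no information.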
In your homework, you're supposed to work out these quantities for various examples. Just to remind you how we interpreted this information: mutual information measures the log of how much your options collapse once you've picked up some information. If there were Ω₀ possibilities for y before you knew the value of x, and Ω₁ possibilities after — which is smaller, on average — then the log of the ratio Ω₀/Ω₁ is the mutual information. It tells you how much everything collapses: if the number of options went down by a factor of 2, you picked up one bit. I'm going to leave this up here, in fact, and keep referring to it.

So what is the goal? The goal is for the receiver to receive the message with arbitrarily small error. Even though the channel is causing bits to flip and definitely inducing errors in the individual letters being sent through it, somehow the message itself must not be corrupted — arbitrarily small error is a stringent goal. But that first goal, on its own, is sort of easy. For example, if you have a channel that flips 0s to 1s with some low probability, you can just keep sending repeated bits and take the majority: send a string of ten 1s, and the chance that all of them, or even half of them, flip to 0 is small, so you can decode the majority as a 1 (see the sketch below for the numbers). You can always make the error small by making the message longer — but that's not what we want to do. The stipulation is: achieve arbitrarily small error while keeping the number of bits per message, q, fixed. And this is the strange thing: q is not a large parameter. This is only going to work in the limit where you send a large number of repeated messages. That's going to be the idea. Any questions about this?

The answer to this question turns out to involve precisely the quantity I. So, before we start working harder, let's work out what I is for a few types of channels. Now hold on a second: if I give you p(y|x), can I calculate the mutual information? Technically, no. To calculate mutual information I need the entire joint probability distribution; I cannot calculate it without the whole thing. So how do you get p(x, y) given just p(y|x)? The answer is that there's this p(x) sitting over there: some distribution of the symbols that go into the channel. Unknown or fixed? I don't want to say unknown — arbitrary. You pick one. If somebody hands you a physical channel, you say: fine, let me just assume some distribution of input symbols x.
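(Aside: here is the majority-vote arithmetic promised a moment ago, as a quick sketch. The flip probability of 0.1 is an assumption for illustration; with ten repeats a 5-5 tie is possible, and I count ties as errors to be conservative.)

```python
from math import comb

def majority_error(n, p):
    """P(wrong majority) for an n-fold repetition code over a BSC(p);
    ties (possible when n is even) are counted as errors."""
    half_up = -(-n // 2)                       # ceil(n/2)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(half_up, n + 1))

print(majority_error(10, 0.1))   # ~1.6e-3, down from the raw p = 0.1
```

The error drops from 0.1 to roughly 2 in a thousand, but at a terrible price: the rate has collapsed to 1/10. The point of what follows is that this trade-off is not necessary.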
So, back to the channel: with some p(x) assumed, you just go through the motions. You know p(x, y) = p(y|x)·p(x), so you have the whole joint distribution, and from that you can calculate I(X;Y). That's actually a very important part of this whole theorem: the channel only tells you how x gets converted to y; it doesn't tell you how often to use the different x's. You have to pull that out of a hat. And the real trick, what Shannon says, is that there's a thing called capacity. Right now it's unmotivated — right now it's just part of the same mathematical recipe we've been using — but the capacity of the channel is the maximum value of I(X;Y), maximized over possible choices of p(x). For different ways of using the inputs, this quantity I, which I haven't yet motivated very well, could have high or low values, and for a reason I'll explain in just a second we want it as high as possible — maximized over the different ways of using the input symbols.

So let's look at a few examples. Here's a channel that takes 0 to 0 and 1 to 1: a channel with no noise. What is p(y|x), and what p(x) would you use to reach the capacity of this channel? Very simple. The conditional distribution is trivial: p(y|x=0) = (1, 0) and p(y|x=1) = (0, 1), with the entries labeled y = 0 and y = 1. In other words, if you send a 0, you're guaranteed to get a 0 out; if you send a 1, you're guaranteed a 1. It turns out the maximum is achieved trivially, with p(x) equal to half and half, so the joint distribution p(x, y) is the matrix with 1/2 on the diagonal and 0 off it — this conditional distribution times p(x) = (1/2, 1/2). Everybody sees how I jump back and forth between these? Each row of the joint matrix has to sum to the corresponding p(x) = 1/2, which it does, and it correctly respects the conditional distribution. In this case you can calculate the mutual information by just going through the motions, and it comes out trivially: I(X;Y) = 1. This is the very simplest case, the noise-free channel.

Yes? Ask me, please. [Question: why not just send the exact information, without error?] Because you're sending it, say, by radio, and the signal has to bounce through the atmosphere and come back down, and by the end of it there may be corruption. There will be fluctuations, noise induced by the environment, and the noise induced by the environment is, by definition, independent of the signal. That's the idea that the channel's behavior is independent of the signal being sent. You cannot avoid noise, so we have to deal with it, and this is going to be a theory of how to deal with noise, assuming we know how the noise behaves. But this first example is a very special case: a noise-free channel, which induces no noise at all. It's the case we were implicitly working with earlier.

So let's keep a little zoo here. The capacity of this channel is one bit. What about the channel that deterministically flips 0s to 1s and 1s to 0s? It's the same: it's noise-free, since which symbol we call 0 or 1 is arbitrary. So C is one bit.
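(Aside: here is the capacity definition as code — a crude random search over input distributions p(x), reusing `mutual_information` from the sketch above. A real implementation would use the Blahut-Arimoto algorithm; this brute-force version is just to make the definition concrete.)

```python
import numpy as np

def capacity(p_y_given_x, trials=20000, seed=0):
    """Crude estimate of C = max over p(x) of I(X;Y), by random search
    over input distributions. Rows of p_y_given_x are p(y | x)."""
    W = np.asarray(p_y_given_x, dtype=float)
    rng = np.random.default_rng(seed)
    best = 0.0
    for p_x in rng.dirichlet(np.ones(len(W)), size=trials):
        p_xy = p_x[:, None] * W          # joint: p(x, y) = p(x) p(y|x)
        best = max(best, mutual_information(p_xy))
    return best

# Noise-free channel: 0 -> 0, 1 -> 1. Capacity should be 1 bit.
print(capacity([[1, 0], [0, 1]]))        # ~1.0
```

Random search only gives a lower estimate of the true maximum, but for channels this small it lands within a fraction of a percent.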
Here's an interesting one. I never actually said how many symbols there have to be: in principle, the channel can have more than two input symbols. We've been dealing with channels that take 0s and 1s as inputs and give 0s and 1s as outputs, but your channel could have four inputs. So here's a channel with four inputs, A, B, C, D, and two outputs: A and B both go to 0, and C and D both go to 1. What is the capacity of that channel? It's clearly noisy, in the sense that if I receive a 0, there's no way to know whether A or B was sent, and if I receive a 1, it could have been C or D. You can go through the motions: write down the conditional probability distribution, find the input distribution p(x) that maximizes I — by symmetry it turns out to be uniform — and once you finish, the capacity of this channel, even though it has four inputs and you might have thought you could squeeze two bits of information in, turns out to be one bit. Because in practice, if there's going to be that much ambiguity, you should just not use B and not use D; then you have a noise-free channel sending A to 0 and C to 1, and we're back to the previous case. So the intuition is captured exactly.

Now let's work out the capacity of a very relevant channel: the binary symmetric channel. With probability 1 − p it sends 0 to 0, but with some error probability p it sends 0 to 1; likewise it sends 1 to 1 with probability 1 − p and flips 1 to 0 with the same error probability p. That's why it's called binary symmetric. This seems really bad: unlike the four-input channel, where by avoiding certain inputs you essentially get noise-free transmission, this one is really terrible. If I'm the receiver and I get a 0, I don't know whether you sent a 0 or a 1; same if I get a 1. I can guess, depending on the value of p, but I can't really tell. So, naively, with a channel like this there seems to be no way to prevent errors — and yet we're going to find a way.

For the moment, let's just calculate the capacity of the binary symmetric channel. We have to calculate I(X;Y) and maximize it. Remember, we don't know p(x): p(x=0) is an unknown quantity, and so is p(x=1), and the two must sum to 1, so call them Q and 1 − Q. There's one unknown quantity Q over which we want to maximize the mutual information. And there's an easy way to do it, just using the definitions: I(X;Y) = H(Y) − H(Y|X), as we've used earlier. What is H(Y|X)? It's the average entropy of each row of the conditional matrix, averaged over the ways of getting x: p(x=0)·H(Y|x=0) + p(x=1)·H(Y|x=1). I've just written it out explicitly.
That's the definition of H(Y|X), which I spent a lot of time motivating last class. It's a bit strange: the average entropy of the rows of the conditional distribution, each row weighted by the chance of that x. Any questions about this?

So what is the entropy of Y given that x = 0? Very simple: if x = 0, you get y = 0 with probability 1 − p and y = 1 with probability p, so it's −p log p − (1 − p) log(1 − p), which we agreed a long time ago to call H(p). And the other term is the same, H(p), because the channel is symmetric: it doesn't matter which x you use, the uncertainty in y is the same. So H(Y|X) just falls out and becomes H(p). That still leaves H(Y), and now you want to maximize the whole thing. How would you get the maximum entropy of Y — what combination of x's should you use? It's totally obvious: uniform. Use x = 0 with probability one half and x = 1 with probability one half. So in fact I(X;Y) = H(Y) − H(p) ≤ 1 − H(p), because the entropy of Y is maximized, at one bit, when the two values of y are equally probable. And that bound is achieved, so this is the capacity: C = 1 − H(p).

So even though this channel has two symbols, and you might think the capacity is one bit, that is corrupted by the cross-over possibilities, and this is the answer you get. If I plot it — capacity on the y axis, error rate p on the x axis running from 0 to 1 — then when p = 0, H(p) = 0, and when p = 1, H(p) = 0, so at the two extremes the capacity is one bit, and in the middle the capacity dips all the way down to 0. Imagine what's happening in the middle: I have a channel where 0 goes to 0 or 1 with equal likelihood, and 1 also goes to 0 or 1 with equal likelihood. In that case, literally, there's no way to send any information through.
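(Aside: with the `capacity` sketch from above we can check this formula numerically, and also the four-input channel from earlier. These are illustrative checks, not the lecture's own code.)

```python
import numpy as np

def H2(p):                                   # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

for p in (0.0, 0.1, 0.3, 0.5):
    bsc = [[1 - p, p], [p, 1 - p]]           # binary symmetric channel
    print(p, capacity(bsc), 1 - H2(p))       # search agrees with 1 - H(p)

# The four-input channel (A,B -> 0 and C,D -> 1): still only ~1 bit.
print(capacity([[1, 0], [1, 0], [0, 1], [0, 1]]))
```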
Now I'm going to do one last calculation, which is from the book and is actually quite interesting, just to show you that these calculations are not as trivial as they might appear at first glance. Follow along, because it shows that computing a channel capacity can be slightly more complicated than you might naively assume. Here's the channel: it has inputs 0 and 1, and 0 goes to 0 with some probability, 1 goes to 1 with some probability. But unlike the binary symmetric channel, which flips 0s to 1s and 1s to 0s, this channel is a little better behaved: when there's an error, it says "oops, there was an error." You might imagine constructing such a channel. Both 0s and 1s can result in the erasure symbol; the chance of an erasure is some α, a small number, and the chance of no erasure is 1 − α. A slightly more complicated channel. Of course, when you get an erasure, it's totally useless to you: you don't know whether it came from 0 or 1, because it's equally likely to have come from either. But when you get a 0, it must have come from a 0, and when you get a 1, it must have come from a 1 — conditional on what you see.

So what is the capacity of this channel? Write down p(x): say p(x=0) = Q and p(x=1) = 1 − Q, and the trick is to maximize over Q. Same recipe as before: the capacity is the maximum over Q — that is, over p(x) — of I(X;Y), which is the maximum of H(Y) − H(Y|X), which is the maximum of H(Y) − p(x=0)·H(Y|x=0) − p(x=1)·H(Y|x=1). Same as last time, and again it's symmetric. So what is the entropy of Y given that x = 0? Careful: can y take three values here? Given x = 0, it can only take two: 0 and the erasure symbol e, with probabilities 1 − α and α. So this entropy is just H(α), even though there are three output symbols, because only two of them can actually occur. H(Y|x=1) is also H(α). So the whole thing is H(Y) − H(α), just as before.

Now, α is a fixed parameter, so there's only one thing left: maximize the entropy of Y. What is the maximum possible entropy of a three-valued Y? Log 3 — log 3 appears to be the ceiling. But is there any value of Q that gives a uniform distribution on y? It turns out there isn't. So although you might naively have thought the answer was log 3 — really squeezing information through there — you can't achieve log 3. You can sit down and work out the distribution of y as a function of Q, do the conditional probability calculation, write H(Y) as a function of Q, take a derivative, set it to zero, and find the maximizing Q. After a lot of steps, various things cancel, and the answer turns out to be 1 − α. The capacity of this channel, even though two symbols go in, is not one bit: it's one bit minus α.

Does this make sense? It actually does. If you send a long bit string, what fraction of the bits arrive successfully? 1 − α. A fraction α of them are just erasures, but the ones you do get are never wrong. So in principle, at least, you can hope for error-free coding at that rate.
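(Aside: the same numerical check for the erasure channel, reusing the `capacity` sketch. Output columns are ordered (0, e, 1); the α values are illustrative.)

```python
# Binary erasure channel: p(y|x) rows for x = 0 and x = 1,
# with output columns ordered (0, e, 1).
for alpha in (0.0, 0.1, 0.25, 0.5):
    bec = [[1 - alpha, alpha, 0], [0, alpha, 1 - alpha]]
    print(alpha, capacity(bec), 1 - alpha)   # matches C = 1 - alpha
```

Note the search never finds anything close to log₂ 3 ≈ 1.58: the erasure symbol carries no usable information about x.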
In fact, can somebody tell me how, without even thinking very hard, I could use this silly channel to send messages at that efficiency? I'm standing here, the channel is in the middle, and my friend is on the other side. How should we agree to work? If I send a 0 and they get a 0, no problem. If I send a 1 and they get a 1, no problem. And if I send a 0 and they get an erasure? Resend. For that, I have to know they got an erasure, so they need to say "look, I got an error," and then I resend. That's called a channel with feedback. If I know the symbol was erased, I can always resend it, and that only happens a fraction α of the time. With feedback I can definitely send messages where, on average, for every bit I send, 1 − α bits get through — so the rate is 1 − α. This is my first motivation for saying that this quantity I really does capture the achievable rate.

If I had the binary symmetric channel, with its 0s and 1s, could I do the same thing? I can't, because there the receiver doesn't know there was an error; if the receiver has no idea there was an error, they can't say "resend it." Of course, with feedback, 1 − α does turn out to be the capacity. And here's a very strange feature of information theory: if the channel is memoryless — if the chance of an error in one transmission is independent of the chance of error in the next — then this kind of feedback, the receiver saying "oh, I got an error," doesn't actually increase the capacity at all. You can be clever enough with your codes to get the message through with vanishing error, no feedback needed. And that's the proof I'm going to show you.

Okay, so all of this was motivation; let me now give you the big picture. How many possible code words are there? Not how many I use — of course I have M code words — but how many possible ones? Assuming your channel carries 0s and 1s and the length is q, there are 2^q possible strings. So there are 2^q possible strings on the sender's side, and likewise 2^q possible strings on the receiver's side. What we want is for just a few of those to correspond to messages: M of them — one, two, three, …, M — out of the 2^q possible strings are the ones you actually use in your code book; those are the rows of the book.

Now, if the channel were not noisy, then whatever code word you send would arrive at exactly one point on the receiver's side. But because the channel is noisy, each code word you send can end up at many points: every time I send the same code word, I get a different corrupted version, so the channel expands each uncorrupted input into a whole cloud of things the receiver could get. And it could happen that another code word, which in the uncorrupted world would also have landed on a single point, gets expanded into a cloud that overlaps the first one. That's bad news, because for a received string in the overlap region I don't know which of the two code words was intended. So the trick, somehow, is to space the code words widely enough that the clouds they map to are also spaced widely enough — totally non-overlapping.

Question? [Student: but surely a code word doesn't have to land in that cloud?] Yes, of course it doesn't have to — in principle, for a binary symmetric channel, it could go to anything. Absolutely true. But typically, where does it go?
It lands in the cloud, because the probability of anything else is finite but very small. Every argument I make from now on is obviously about the typical case; all the atypical events are stuck in a low-probability set, of order epsilon. Thanks for the question. So these circles are just like the typical-sequence circle I drew earlier inside the set of all 2^n strings: each circle is where that message typically goes. Here the input is x^q, the q-times repeated x, and the output is y^q.

So here's the game. Somehow, very cleverly, I've chosen my code book. And by the way, what's the worst case for sending messages? The worst case is when the messages themselves have no structure, so every message is equally likely; then I really do have to send all this information. If that weren't true, you might as well first compress the messages, after which all the options are equally likely anyway, so assuming it adds nothing. So: all M messages are equally likely up front, and I have some very clever code book. According to that code book, there's some distribution p(x) — literally the statistics of the symbols in the code words when every message is equally likely. Since I know p(x), and I know the channel p(y|x), I also know p(y).

So, in the space of all possibilities, how many typical x-sequences am I going to get? Over the choice of all possible code words, there will be 2^{qH(X)} typical sequences on the input side. And likewise there will be 2^{qH(Y)} typical sequences on the output side — some number smaller than 2^q — because we know the distribution of y, and only the typical outputs will occur. Now, how big is each little circle? How wide does a single input string become when it comes out the other side? We know that too: whenever I ask "how many," the answer is an entropy. For a given x, averaged over the possible inputs, the size of each circle is 2^{qH(Y|X)}.

So if I want to pack circles, each of size 2^{qH(Y|X)}, into a total space of size 2^{qH(Y)}, how many can I pack? The total number of circles is 2^{qH(Y)} / 2^{qH(Y|X)}: the size of the whole blob divided by the size of the little blobs. And that, if you're paying attention, is exactly 2^{qI} — q, not n, sorry. So I cannot pack more blobs than a number set by the mutual information. If your channel capacity is 1, there are 2^q blobs; if the capacity is less than 1 — say 1 − α, or 1 − H(p) — there are fewer than 2^q blobs. And that is the number of messages you can send: each blob maps backwards to one code word, and the code words are cleverly chosen so that the blobs pack nicely over there. So this is the maximum possible value of M. And we already knew M = 2^k. So, just looking at the definition of the rate, R = k/q is at most I. That's the whole motivation for defining mutual information in this way: it's a sphere-packing problem.
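(Aside: to put numbers on the blob-counting, here is the arithmetic for an assumed BSC with flip probability 0.11 — chosen because H(0.11) ≈ 0.5 — and an assumed block length q = 1000.)

```python
import math

q = 1000                         # block length (assumed, illustrative)
p = 0.11                         # BSC flip probability; H(0.11) ~ 0.5
Hp = -p*math.log2(p) - (1-p)*math.log2(1-p)

H_Y, H_Y_given_X = 1.0, Hp       # uniform inputs: H(Y) = 1, H(Y|X) = H(p)
I = H_Y - H_Y_given_X
print(f"one blob:      ~2^{q*H_Y_given_X:.0f} typical outputs")
print(f"output space:  ~2^{q*H_Y:.0f} typical sequences")
print(f"max messages:  ~2^{q*I:.0f} non-overlapping blobs")
# About 2^500 messages in 1000 channel uses: rate k/q ~ 0.5 = I.
```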
If you want to pack a bunch of spheres into a constrained volume without overlapping — I'm not saying it's achievable, but at the very most, this is the number you can pack. There's no way, even in principle, to have more messages than this, because what happens if you add even one more? Its cloud of output sequences will overlap at least one of the other clouds, and then you cannot decode with vanishing error — and we want vanishing error. So the rate — the efficiency of the system, the naive number of bits divided by the number of bits you actually send — can, by this little geometric argument, be at most I(X;Y); and the maximum possible rate is at most the maximum of I, which is precisely the definition of the capacity. So it says you can hope to achieve the maximum rate by cleverly choosing your code so that the induced p(x) maximizes I. But that's secondary to me: even if you use the wrong p(x), so that I is not maximized, R can still only be equal to that I at best.

Are there any questions about this? It's just a geometric argument, and now I'm going to get into the full-blown proof — for zero-error codes, or rather asymptotically zero-error codes, on real channels. Basically the proof of the theorem. Any questions? Yes — good question. First of all, yes, this is hand-wavy. Secondly, even if it weren't hand-wavy and these geometric arguments could be made rigorous, this is merely an upper bound: it doesn't tell me how to discover the code words or where to put them, and those are exactly the questions one would like to ask in practice. That's what the rest of this lecture is for; we have half an hour left, and I'm going to cover it.

So: we're going to choose a code. Does somebody remember how we chose the code for data compression, using the AEP, the asymptotic equipartition property? The way we encoded it was: here are the 2^n possible sequences, and here's the typical set, with 2^{nH} elements. There's this huge bunch of sequences, but we choose to use only nH bits, because we only encode that little typical piece, and we accept an error if the source falls anywhere outside. Remember how we did this? This recipe is a well-defined recipe for making a code.
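(Aside: here is that compression recipe in miniature, for a biased coin with made-up parameters. At n = 16 the typical set doesn't yet capture most of the probability — the guarantees are asymptotic — but the counting is already visible.)

```python
import math
from itertools import product

p, n, eps = 0.3, 16, 0.1                  # assumed source and tolerance
H = -p*math.log2(p) - (1-p)*math.log2(1-p)

def is_typical(seq):
    """AEP condition: |-(1/n) log2 p(seq) - H| < eps."""
    k = sum(seq)                          # number of ones in the string
    logp = k*math.log2(p) + (n - k)*math.log2(1 - p)
    return abs(-logp/n - H) < eps

typical = [s for s in product((0, 1), repeat=n) if is_typical(s)]
print(len(typical), "typical of", 2**n, "sequences")
print("index bits needed:", math.ceil(math.log2(len(typical))),
      "vs n(H + eps) =", n*(H + eps))
```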
How do you work it out? You say: I know some distribution p — the H depends on the p's — over many letters. If I know that, I can explicitly work out which sequences are typical, up to some plus or minus epsilon, and for exactly those I use nH bits to encode them; for all the others I declare an error. Remember how this was done. Now we're going to use a very similar strategy to build a code for the channel.

What we do is define typical sequences in the same way as before — let me make sure I get all the epsilons and deltas correct. I'm going to look at pairs of sequences (x^q, y^q) — x repeated q times, y repeated q times — and call them jointly typical if they have the following properties. The first two are the same as we had earlier, one for x and one for y:

|−(1/q) log₂ p(x^q) − H(X)| < ε,
|−(1/q) log₂ p(y^q) − H(Y)| < ε,

and one more, very important one, for both of them together — the joint statistic has to be close to the joint entropy:

|−(1/q) log₂ p(x^q, y^q) − H(X, Y)| < ε.

It's the obvious extension of what we did earlier. And yes, it is 1/q: p(x^q) is the probability of the whole x-sequence under its marginal distribution, p(y^q) likewise for y, and p(x^q, y^q) is the probability under the joint distribution — the whole collection of ways that for this x you can get that y. Each pair has some probability under the joint distribution, and you keep exactly the pairs for which this third difference is also less than epsilon. It's the same construction as before: if you understood typical-sequence coding in the context of data compression, you'll understand jointly-typical coding in the context of transmission with errors.

This is almost my code — I haven't yet told you which code words to use — but first let's establish a few important properties of this jointly typical set. The first property is the usual one: almost all pairs of sequences will belong to it. The probability that (x^q, y^q) belongs to the typical set goes to 1 as q goes to infinity, for sufficiently large q. By the way — I said I wasn't going to take q large, and now I just did; how am I allowed to do that? I don't want you to think I pulled the wool over your eyes, so let me be fair about it: I'm not taking q large for fixed M. I'm going to take q large and k large together, with the ratio k/q held constant. So although I'm saying large q, I'm sticking to my original principles; please bear with me for a little while. [Student: you mean minus the joint entropy of x and y.] Thank you, yes — as written above. So this is clear; we proved this last time too, and it's just the law of large numbers. You can also prove, in the same way as earlier, a bound on the size of the set.
Specifically, the size is squeezed between a slightly smaller and a slightly bigger number: (1 − ε)·2^{q(H(X,Y) − ε)} ≤ |A_ε| ≤ 2^{q(H(X,Y) + ε)}. So just as the total number of typical sequences was 2^{nH} earlier, the total number of jointly typical pairs here is about 2^{qH(X,Y)}. But now we're going to ask something slightly different. Previously, for typical-sequence coding, we asked: how many are there, and what's their probability? There are 2^{qH(X,Y)} of them, and since they're all roughly equally probable, each has probability about 2^{−qH(X,Y)}. That's not the interesting question here. I'm going to ask a slightly more interesting one. So: that was property 1 and property 2; here is property 3, and it's the whole crux of the argument, so please pay attention for the next thirty seconds.

Suppose x̃ comes from the distribution p(x) and ỹ comes from the distribution p(y), each repeated q times, independent and identically distributed — but drawn separately. So x̃ is a typical sequence for x and ỹ is a typical sequence for y; but if x and y are drawn from their respective distributions separately, there is no force causing the value of ỹ to be correlated with the value of x̃. If x had been drawn and sent through the channel and y had come out the other side, then x and y would be totally correlated — they would be jointly typical. But if I just pull out some random typical x and some random typical y, what's the chance they happen to be jointly typical? Remember, for x and y to be jointly typical, the y-sequence has to lie in the typical projection of that x — in its ball. If I pull them out at random, what's the chance ỹ lands in there? This is exactly the question behind the geometric argument I made earlier — what's the size of that little ball divided by the size of the whole ball — but now with the correct epsilons and deltas.

Here's the answer, and then I'll work it out for you. The probability that the pair belongs to the jointly typical set — in other words, the probability that the ỹ-sequence lands in that ball — lies between these two values (everybody able to read down here? it's quite important):

(1 − ε)·2^{−q(I(X;Y) + 3ε)} ≤ P((x̃^q, ỹ^q) ∈ A_ε) ≤ 2^{−q(I(X;Y) − 3ε)}.

(It's the same 1 − ε as before, and the 3ε is just what comes out of the proof; it doesn't matter much, but I'll stick with the book's notation. The left side is the smaller number, the right side the bigger.) Stare at this. It's important.
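(Aside: here is the jointly-typical test itself, sketched for a BSC with uniform inputs and assumed parameters. Under uniform inputs the first two conditions hold automatically; the real content is the third.)

```python
import numpy as np

def jointly_typical(x, y, p=0.1, eps=0.1):
    """Check the three AEP conditions for a BSC(p) with uniform inputs.
    With uniform inputs, y is also marginally uniform, so conditions
    one and two are automatic; condition three is the real test."""
    x, y = np.asarray(x), np.asarray(y)
    q = len(x)
    Hp = -p*np.log2(p) - (1-p)*np.log2(1-p)
    Hx, Hy, Hxy = 1.0, 1.0, 1.0 + Hp        # H(X,Y) = H(X) + H(Y|X)
    d = int(np.sum(x != y))                 # Hamming distance
    logp_x = -float(q)                      # log2 p(x^q) = -q (uniform)
    logp_y = -float(q)                      # same for the y marginal
    logp_xy = logp_x + d*np.log2(p) + (q - d)*np.log2(1 - p)
    return (abs(-logp_x/q - Hx) < eps and
            abs(-logp_y/q - Hy) < eps and
            abs(-logp_xy/q - Hxy) < eps)
```

If y is produced by actually sending x through the channel, the distance d is near pq and the test passes with high probability; for an independently drawn y, d is near q/2 and it fails — except with probability about 2^{−qI}, exactly as in the bounds above.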
So here's my code. It's a very complicated code to actually build and use, but it is a code. Since I know p(x) and p(y), I can, with sufficient computation, calculate all the typical x-sequences and all the typical y-sequences. My code is: if I receive some y-sequence, I decode it to an x-sequence that is jointly typical with it. If I receive any sequence in this ball, I decode it to this x. There could be other sequences jointly typical with it, so I'm going to choose a subset of x's — that's the code book — such that I decode to this one. You ask how I choose the code; that is my code, with one final question left open: which x's should I choose so that all these balls are non-overlapping? That's actually a difficult question — you'd have to go through all possible combinations of x's, figure out for each where the balls sit, and check that they don't overlap. So actually discovering the code is quite difficult. But once you've discovered it, I have a bunch of x's and their typical projections, and the chance that any two of those balls overlap at all is controlled by that 2^{−qI}. Therefore I can pick roughly 2^{qI} values of x before two balls must overlap. That's the idea. Any questions?

In this kind of code, notice, we're fixing both the error and the length. We send q bits for sure — all the code words have the same length — and given a target error, we choose q large enough to make the error that small, while packing as many messages on the left side as possible. So in fact, as q increases, you get more messages. It is not a redundancy code where I just repeat the same signal and you take a majority vote, where the number of bits per message blows up; here the messages go up along with the bits.

So here's the setup — if everything else gets confusing, just keep this in mind. If there are M messages, naively you should send k bits, but I'm sending q bits, with q big. I want to make k as big as possible such that no two of these balls overlap. How much can I fit? Surely no more than 2^{qI}, because as soon as I have more than 2^{qI}, the chance that two of them overlap becomes large.

So here's the sequence of codes. I define a code with some rate R — R is in fact k/q — and the code has the property that the probability of error is less than some epsilon for those values of k and q, with the rate held fixed: as q increases, k increases, and the ratio stays constant, for all q greater than some q₀. It's a bit of a subtlety, but all I'm saying is: you name an error; given that error, I find a q, and given that q, a k, such that I can definitely send that many messages through with probability of error less than whatever you chose.
The smaller you make the desired probability of error, the larger you make q — and the larger k becomes too, because R is a constant. And what constant is R going to be? You can guess: it's going to be I.

For the rest of this I'm going to move away from the board, because it's a series of arguments best seen in plain text. I'm going to pull up the proof of the theorem from the book — it's nice to see it once — walk you through it, and urge you to go read more about it later. But the punchline is very simple. For a given channel I can work out an I — and in fact a C, the maximum of I over the p(x)'s. If my R is less than that I, I can always make this happen. Think of I as measuring the size of the pipe that the channel represents: if my desired rate, my desired efficiency, exceeds what the pipe can take, I'm going to make errors. But if what I'm trying to squeeze through — R = k/q — is less than what the pipe can take, I can always send it through with vanishing error. One way to prove that is with jointly typical codes; other codes can achieve it too, but this is one code that does.

Any questions? I'm going to go straight to the book, because I only have ten minutes and I do want to get through it — and I don't want to write on the board, because I'd make enough errors with epsilons and deltas to make your lives miserable. Okay, there it is: this is Cover and Thomas, and here's the theorem. One note on notation: I've been using q consistently throughout the course for the number of bits you send, but for the purposes of this theorem they use a little n for the same thing; their n is my q. The theorem says: for a discrete memoryless channel — that's your p(y|x) — all rates below capacity are achievable. Specifically, for every rate R < C, there exists a sequence of (2^{nR}, n) codes whose probability of error goes to zero; "achievable" means you can do it with asymptotically zero error. What does that notation mean? Their n is my q, the length of each code word; nR is my k, so 2^{nR} is my 2^k, the number of messages M. So all this says, in my language, is: I can send M messages using q bits, as long as (log M)/q = k/q = R is less than the channel capacity. Is the setup of the theorem clear?
And conversely: if you try to send more messages with the same q bits, or the same number of messages with fewer than q bits, so that R exceeds C, then you will have a finite probability of error. For all rates below capacity you can have asymptotically zero error; for any rate above capacity — too many messages for the same q — you will make errors. That's the setup of the theorem. Any questions? In a sense, if you've understood this statement, that's the end of the class: regardless of the proof, this is all you need to know. The next ten minutes, walking through the proof, is optional extra. So let me pause here and ask if there are any questions about the content of the theorem. The theorem says: given the channel, for a fixed q there's only a certain number of M's you can send; or, for a certain number of M's, there's a minimum q you must use — and the critical ratio is the capacity.

[Student question about the limits involved.] This is in the limit of large q, but M is large too. In a sense this is a block code: your code will not really achieve these bounds unless the number of bits you're actually sending is rather large — but then the total number of messages you can send becomes arbitrarily large as well. Another way to think about it: suppose I have a binary symmetric channel, and the messages I'm sending are themselves just 0s and 1s — no complicated list of horses or letters. The best thing to do is to bunch them into blocks of some size k; once your block size is k in the fundamental units, the 0s and 1s, there must be 2^k rows in the code book. So if you like, k is the block size, q is the number of bits you send per block, and q must be sufficiently large. Is that fine for everybody? It's a good question, because usually we're not dealing with an exponentially large number of messages per se.

So let me walk you through this proof. There are many proofs — there's one in Shannon's original paper, which is an approximate proof — and I'm just giving you the flavor of how these things go. It's a very strange proof. It starts by making a code book. How many rows in the code book? There are M rows; and we know M = 2^{nR}, and nR is k, so there are 2^k rows. Same picture as before. But how did I make the code? In a very absurd way: I picked code words of length q — their length n — and I filled in the 0s and 1s at random, according to some underlying distribution p(x). A random code. It's terrible. It might not even be decodable; two rows of the code book could be identical; all kinds of bad things could happen. Nevertheless, that's my code, and — it's insane, but — I'm going to use this random code book to try to send information.

So: there's a random code, and the sender and receiver share it. A message is picked uniformly; I look up the corresponding code word and send it out. It goes through the channel, runs into the physics of the channel, and gets corrupted, and out the other side comes some sequence of y's, which depends on the x's that were sent. Now comes the real trick: how does the receiver decode? Like this: the receiver looks at the received sequence and checks it against my code book.
If the received sequence is jointly typical with one of the code words, the receiver assumes it came from that code word. That's fine, but bad things can happen. It could be — very rarely — that the received sequence is jointly typical with two code words; if so, declare an error. It could also be that it's not in the typical projection of any code word; if so, declare an error. Those are the only two types of error: either it's jointly typical with nothing, or with more than one thing.

The rest is just the calculation of the error probability. You define the error as the chance of an error for a given code book, averaged over all possible messages. And then you do a strange thing: you study the behavior of this error rate averaged over all possible code books — because the code book, remember, was itself random. Somewhere among all the random code books is the one you want. Starting from here the proof becomes a bit weird: this is the average error over all possible code books, the cursive C, and if you want to understand why it's done this way, you need to go read the proof of the theorem in detail. The point is that only by averaging over code books do you get a number that depends on generic quantities, independent of the specific choice of code book.

So here's the error. Suppose you sent message number one. Either the received version of message number one — the y you got for code word x(1) — is not jointly typical with x(1); that event is called E₁ complement. Or it is jointly typical with at least one of the other code words; then you also get an error. You write down the probabilities of these events and work out that both are unlikely. The chance that your y is not jointly typical with the x that was sent is less than epsilon — that's one of our AEP properties. The chance that it is jointly typical with any one particular other code word is about 2^{−qI} — that's the other AEP property — and there are fewer than 2^{qR} others, so that contribution stays small as long as R < I. So when you finish the whole thing, the chance of any error at all is less than some epsilon, as long as R is less than I. That's the whole proof.

I have three minutes, so let me see if I can squeeze a lot out of them. This argument says that, averaged over all random code books, the probability of error is smaller than any number you care to name, as long as the ratio k/q is sufficiently small — as long as you've padded the code with enough extra bits. But that's for a random code book; how do you know it for an actual code book? That's the rest of the proof, and the bound is strengthened by a series of code-selection steps. First, you choose the p(x) that maximizes I, so you can replace I with C; that part is easy. Then — here's the important bit — get rid of the average over code books. Since the average error over code books is small, there must exist some code book whose error is at least as small as that average. Among all possible code books there are some terrible ones and some very, very good ones, but since the average is epsilon, some code book is at least that good.
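(Aside: here is the random-coding experiment in miniature — all parameters assumed. Decoding is by minimum Hamming distance, the maximum-likelihood rule for a BSC, standing in for joint-typicality decoding. At these toy block lengths the error is nowhere near zero, and the exact numbers jump around with the random seed; what matters is the trend — for R below C ≈ 0.5 the error shrinks as q grows, and for R above C it doesn't.)

```python
import numpy as np

def random_code_error(q, R, p=0.11, trials=2000, seed=1):
    """Empirical block-error rate of a random code of rate R over a
    BSC(p), decoded by minimum Hamming distance (ML for the BSC)."""
    rng = np.random.default_rng(seed)
    M = 2 ** max(1, int(R * q))              # M = 2^k messages, k ~ Rq
    book = rng.integers(0, 2, size=(M, q))   # the random code book
    errors = 0
    for _ in range(trials):
        w = int(rng.integers(M))             # message, picked uniformly
        noise = (rng.random(q) < p).astype(int)
        y = book[w] ^ noise                  # corrupted code word
        w_hat = int(np.argmin((book != y).sum(axis=1)))
        errors += (w_hat != w)
    return errors / trials

# Capacity of a BSC(0.11) is about 0.5 bits per channel use.
for q in (10, 14, 18):
    print(q, random_code_error(q, 0.25),     # R < C: error shrinks
             random_code_error(q, 0.75))     # R > C: error stays large
```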
Again, it's an averaging argument. And secondly, there's a little trick showing that for that code book, not just the average error over messages but the maximum error is still small: you throw away the worst half of the code words, which at most costs you one bit of rate, and the maximum error over the remaining ones is at most twice the average. So that's the end. Again, I urge you to go look at the proof of this theorem in this form and in others; one of the most illuminating treatments is not this one but Shannon's original paper, so go read that — it gives you an idea of how this whole project was motivated.

So let me turn this off and leave you with the final result. This week started with the idea of cells decoding — in a fruit fly embryo, cells decode their position by measuring various chemicals, but those chemical signals are stochastic. You asked: where does the error come from? It comes from the fact that this is a physical system. So in our case, what channels are we using? Channels built of biological molecules: transcription factors, diffusion, binding and unbinding, chemical kinetics. All of those can be used to figure out p(y|x) — the whole first half of the course was figuring out these p(y|x)'s from the sources of randomness we know exist in living cells.

This second half of the course is more speculative. It says that, in principle, a cell is able to extract information even in the presence of noise, provided certain things hold. One is that it shouldn't try to send too many different signals, because then they're not all mutually decodable. But the second piece is a bit odd in a framework designed for mobile phones rather than cells: it says you have to wait essentially infinitely long — these large q's — to really reach the zero-error performance, and let me tell you, a cell is not willing to wait infinitely long. A cell needs to solve a practical problem: it needs to solve the decoding problem on some relevant timescale, and on those timescales, if there is noise, there will be error. This proof in fact lets you work out what those errors ought to be. So it suggests a research program: go and check how a cell dynamically makes its inferences, on what timescales, therefore how many samples it effectively picks up, and what its actual performance is — with the prediction that there's a bound on that performance; the error cannot possibly be lower than the one dictated by these codes. I think this research program is still quite nascent; very few people have gone and looked in living cells. The paper we read at the beginning of this week is one example of such a case: they found roughly one percent error, which is optimal in the information-theoretic sense.

Finally, just in case somebody is interested: do mobile phones actually use decoding schemes that reach these error bounds?
Not quite, because they also have to decode rapidly, so they use different kinds of codes. If you want a code that achieves these kinds of errors, you have to wait a long time: there is a huge number of candidate messages to choose between, and choosing among them based on what you see — in a maximum-likelihood sense or otherwise — is a complicated computation, so you need ultra-fast circuits and so on to actually approach zero error. It doesn't come for free; it comes at the cost of a huge amount of computation. As it happens, there are families of modern codes — turbo codes, and also low-density parity-check codes — that come quite close to this bound. So this is not just an abstract theory that hands you some impossibly unwieldy code book: there are real codes, in real use, that come close to the Shannon bound — not in biology, but in mobile phones and similar electronic contexts. Okay, it's one o'clock, so I'll stop — and I think we have some nice things to hand out at the front, so let me call up Matteo.