Welcome back to the course on data compression with deep probabilistic models. Today we're going to derive a fundamental lower bound on the bit rate of lossless compression. But before we do that, let's briefly recap where we left off last week.

Last week we first talked about communication in a very general manner. Our goal was to transmit a message from some sender to a receiver in a way that is fast and reliable. We saw that in order to do that, we have to think about redundancies in the message, and we have to do two things: first remove the redundancies that are already in the message to begin with, and then, since the channel could introduce some noise, add some strategically placed redundancies back in, so that even if there is noise in the channel we can still decode the received message on the other end. We then saw that we can, at least in principle, separate the removal and the addition of redundancies; this is called source coding and channel coding. And because we can separate the two, and because channel coding only needs to care about properties of the channel and not properties of the message, we can build ourselves an abstraction of a noise-free channel that consists of a physical channel combined with some error correction tools.

We then looked at our first class of lossless compression methods, so-called symbol codes. Symbol codes are very simple: they operate on a discrete alphabet and assign a so-called code word, a sequence of bits or, more generally, B-ary symbols, to each symbol of the alphabet. To encode a message, which is a sequence of several of these symbols, they just concatenate the corresponding code words. We gave examples for our favorite game of simplified Monopoly, where different code books had different properties, and in order to compare these, we defined three properties of code books in the tutorial. One is the expected code word length: if you have a probabilistic model of your source, or more precisely a model of the probability with which each symbol occurs, then you can calculate the expectation value of the length of the code word. Generally we want the expected code word length to be short, because we want to be efficient. We then defined two more properties. One is unique decodability: a symbol code is uniquely decodable if it maps any sequence of symbols to a unique compressed bit string. This is a property of C*, where C* of some message x (underlined) is the concatenation of the code words of all the symbols in the message. The second definition is that of a prefix-free symbol code. This is a property only of the code book: a code book is prefix free if no code word is a prefix of another code word. And we argued on the problem set that every prefix-free code book leads to a uniquely decodable symbol code, but the converse is not necessarily true.
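To make the recap concrete, here is a minimal Python sketch (with a made-up binary code book, not one from the lecture) of how a symbol code encodes a message by concatenating code words, and of how the prefix-free property can be checked:

```python
# Hypothetical binary code book (B = 2); the symbols and code words are made up.
code_book = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(message, code_book):
    """C*(x): concatenate the code words of all symbols in the message."""
    return "".join(code_book[symbol] for symbol in message)

def is_prefix_free(code_book):
    """A code book is prefix free if no code word is a prefix of another one."""
    words = list(code_book.values())
    return not any(w1 != w2 and w2.startswith(w1) for w1 in words for w2 in words)

print(encode(["a", "c", "b"], code_book))  # "011010"
print(is_prefix_free(code_book))           # True
```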
So with that, let's now get to the topic of today's lecture: theoretical bounds for lossless compression. Today we're going to answer a number of questions. The first question will be: what is the theoretical lower bound on the bit rate of lossless compression? The second question will be: is this theoretical lower bound just some kind of academic result, or can we actually come close to it — is there, in particular, a symbol code that comes close to this theoretical lower bound? And the third question is even more practical: if we can prove that such a symbol code exists, is that just a proof of existence, i.e., some symbol code that we know exists out there, or can we actually construct that symbol code?

In order to answer these questions, let's look at the first theorem that we will prove, the so-called Kraft-McMillan theorem. In the proof of this theorem I'm going to partially follow the videos from the YouTube user who calls themselves "mathematical monk", and I'll add references to their videos in the video description. The Kraft-McMillan theorem has two parts. Part A starts from a uniquely decodable symbol code C, and the statement is: for any uniquely decodable symbol code C, the code word lengths, which I denote as l(x) — the number of bits (or B-ary symbols) in the code word for symbol x — always satisfy the property that the sum over all symbols of one divided by B to the power of l(x), where B is the base (typically two), is not larger than one. So part A says: if we have such a code book, then its lengths satisfy this property. Part B goes the other way around: if we have any function l(x) that assigns to each symbol of some alphabet an integer such that this inequality is satisfied, then we can always construct a B-ary prefix code — and therefore a uniquely decodable code — such that the lengths of the code words are exactly l(x). In other words, if somebody gives us the task of constructing a code book and tells us, for every symbol in the alphabet, how long the code word for that symbol should be, then as long as those lengths satisfy this inequality, we can always construct such a code book, because such a code book always exists.

Okay, let's prove this theorem. In order to prove it, we'll start with a lemma. The lemma is something very simple, but we need to write it down formally so that we don't get confused later. The lemma is: let's take some integer s and some uniquely decodable symbol code C, and define a set Y_s, the set of all messages — so all strings of arbitrary length of symbols from our alphabet — whose encoded representation C*(x) has exactly length s. So we give ourselves a length s and consider all messages that encode to a bit string of length s. The claim is that there are at most B to the power of s such messages.
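Written out in formulas (this is just the statement from the lecture in symbols, with $\mathcal{X}$ the alphabet, $l(x)$ the code word lengths and $B$ the base), the two parts of the theorem are:

$$\text{(A)}\quad C \text{ uniquely decodable} \;\Longrightarrow\; \sum_{x\in\mathcal{X}} B^{-l(x)} \le 1, \qquad\quad \text{(B)}\quad \sum_{x\in\mathcal{X}} B^{-l(x)} \le 1 \;\Longrightarrow\; \text{a } B\text{-ary prefix code with lengths } l(x) \text{ exists.}$$

And the lemma, in the same notation, says that the set $Y_s = \{\underline{x} : |C^*(\underline{x})| = s\}$ of messages whose encoding has length $s$ satisfies $|Y_s| \le B^s$.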
So the size of the set is not larger than B to the power of s. There is a very simple argument for this: if C is a uniquely decodable symbol code, then C* is injective, so it maps different messages to different encodings. But there are only B to the s bit strings of length s. So if there were more than B to the s messages whose encoding has length s, then two different ones of those would have to be mapped to the same bit string, and C would no longer be uniquely decodable.

Okay, so this was the easy part. Now let's go to the proof of part A. For part A, let's take some integer K, and let's call the left-hand side of the Kraft inequality R. Now I'm going to calculate R to the power of K. I can take this expression and raise it to the power of K; to save some space, instead of one over B to the power of l(x) I'm going to write B to the power of minus l(x), which is the same thing. Taking this sum over all symbols to the power of K just means writing out K factors, where each factor is a sum over all symbols of B to the minus l(x). Now, the x in these sums are independent summation variables, so I can rename them: I'll call the first one x_1, so here we have the sum over x_1 of B to the minus l(x_1); the second one x_2; and so on up to x_K, because there are K such factors. The reason I'm doing this is that I can now rearrange the expression: using the distributive property, I can pull all of these sums out to the front into one big sum over all combinations of x_1, x_2, ..., x_K, and then multiply the terms together, which gives B to the power of minus the sum over i from one to K of l(x_i). This large sum I can rewrite as a sum over all messages of length K — over all messages x (underlined) from the product space, the alphabet to the power of K — of B to the power of minus the sum from i equals one to K of l(x_i), the sum of all the lengths.

What I'm going to do now is call this sum of the lengths — the quantity in the exponent, up to the minus sign — s. So s is the total length of the encoding of the message. Now, there will usually be several different messages whose encodings have the same total length s, but not all messages have the same length, so different values of s appear in this sum. What we can do, therefore, is rewrite this as a double sum. The outer sum runs only over s, over all the lengths that can appear — we'll think about the limits of this sum in a second. The inner sum runs over all messages x (underlined) whose encoded length is s; in other words, over the messages in the set Y_s that we defined earlier.
The inner term then simplifies to just B to the minus s. So what are the limits for the sum over s? Let's first assume, just for now, that the alphabet is finite. Then there is some maximal code word length l_max, so every code word has length at most l_max. Therefore, if we have a message of K symbols, the concatenation of all those code words has length at most K times l_max, so s runs from zero to at most K times l_max. It's fine if there isn't actually any message whose encoding has some particular length s; in that case the set Y_s is just empty and that term of the sum is zero. But in general, s runs here from zero to K times l_max.

Now we can use our lemma from earlier. First note that the inner term no longer depends on the message x, so we can rewrite the whole thing as the sum over s from zero to K times l_max of the size of the set Y_s times B to the minus s. By the lemma, the size of that set is not larger than B to the positive s, so in total we get at most the sum over s from zero to K times l_max of B to the s times B to the minus s. Each of those terms is exactly one, so the sum just evaluates to the number of terms, and since we started at zero, that is K times l_max plus one. And remember, this is not an equality but a less-than-or-equal, because the lemma only tells us that the set has at most size B to the s; it could be smaller.

So in total we have now shown that R to the power of K is at most K times a constant plus one, and this holds for all K. You can already see that if you make K very large, the left-hand side grows exponentially in K (if R is larger than one) whereas the right-hand side grows only linearly. So if R were larger than one, the left-hand side would eventually grow faster than the right-hand side, and for some K the inequality would no longer hold. So R cannot be larger than one. A more formal way to show this is to divide by K (we only apply this to positive integers K): R to the power of K divided by K is at most l_max plus one over K. If you take the limit of K to infinity, the one-over-K term goes to zero, so the right-hand side goes to l_max, whereas the left-hand side, if R were larger than one, would go to infinity, because the numerator grows faster than the denominator. Since the inequality also has to hold in the limit, this would be a contradiction. So we know that R is not larger than one: R has to be smaller than or equal to one, and that is exactly what we wanted to show — in part A of the Kraft-McMillan theorem, the statement was that R, which is this quantity, is at most one. All right, so far we've assumed that the alphabet is finite.
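To summarize the whole chain of steps for the finite-alphabet case in one line:

$$R^K \;=\; \Big(\sum_{x\in\mathcal{X}} B^{-l(x)}\Big)^{K} \;=\; \sum_{\underline{x}\in\mathcal{X}^K} B^{-\sum_{i=1}^{K} l(x_i)} \;=\; \sum_{s=0}^{K\,l_{\max}} |Y_s|\, B^{-s} \;\le\; \sum_{s=0}^{K\,l_{\max}} B^{s}\, B^{-s} \;=\; K\, l_{\max} + 1,$$

and since the left-hand side would grow exponentially in $K$ if $R > 1$ while the right-hand side grows only linearly, we conclude $R \le 1$.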
Now what happens if the alphabet is infinite? Well, remember that when we defined symbol codes, we required that if the alphabet is infinite, it has to be countably infinite. So without loss of generality we can assume that the alphabet is the set of natural numbers. Then R is again the sum over all symbols of one over B to the l(x), which we can now write as the sum from x equals one to infinity of B to the minus l(x); and that is defined as the limit, as N goes to infinity, of the finite sum from x equals one to N. For each finite N we are back in the finite case: if we restrict our symbol code to just a finite subset of the alphabet, it is still uniquely decodable — it only becomes simpler. So our result still applies, and each finite sum is at most one. And if you take the limit of a quantity that is at most one, the limit is also at most one. So again the Kraft inequality is satisfied. This was the proof of part A of the Kraft-McMillan theorem.

Part B of the Kraft-McMillan theorem states something like the converse. Here is the theorem again: in part B, we start from arbitrary given code word lengths, and we assume that these lengths satisfy the Kraft inequality. Now we want to show that if they satisfy this inequality, then we can construct a prefix-free symbol code C whose code words have exactly these given target lengths. The nice thing about this part of the theorem is that we can give a constructive proof: not only can we show that such a symbol code exists, we can actually provide an algorithm that constructs it.

The algorithm goes as follows. First we sort the alphabet. Remember, we are not given a symbol code yet — we want to construct one — we are only given the target lengths of the code words. So we sort the alphabet such that we start with the longest target lengths and go down from there; if some of the target lengths are equal, we don't care how those are ordered. That's the first step. Then we initialize a real-valued variable xi to one, and we iterate over all the symbols in our alphabet in this order, going from long target lengths to short target lengths. In each step we do the following. First, we reduce xi, which we initialized to one, by B to the power of minus l(x). These are exactly the quantities that, by our assumption, add up to something not larger than one. So if we keep reducing xi by these terms in each step of the iteration, xi can never become negative; it always remains within the half-open interval from zero (inclusive) to one (exclusive). That means we can write xi in its B-ary expansion, so as zero point something in the base-B system. For example, in base two, if we're thinking about binary codes, we can write it as zero point followed by some sequence of zeros and ones. And then we assign the code word for x as follows.
We just take this B-ary expansion and read off the first l(x) digits — l(x) is the target length of the code word that we are required to satisfy — so we take the first l(x) digits after the "zero point", which are marked here as question marks. And if the B-ary expansion is actually shorter, we can always pad it with zero digits at the end; that always works. Now the claim is that this algorithm really does construct a prefix-free code — a prefix-free symbol code, or a prefix code for short. The proof of that will be on the problem set, for which you will find a link in the video description; the problem set will guide you through the proof. So instead of proving it here, we're just going to illustrate the algorithm with an example.

The example we're going to look at is, again, our favorite simplified game of Monopoly. As a reminder, in the simplified game of Monopoly we have an alphabet of symbols from two to six, and I've reordered them already, because we know from the symbol codes we've looked at so far that there exists a symbol code with the following lengths: length three for the code word of the symbol two, again length three for the code word of the symbol six, and length two for each of the other symbols. So we know this exists. And if we didn't know that, we could still check whether such a symbol code exists using our Kraft-McMillan theorem. The way we check this is by evaluating R, the sum over all symbols of B to the minus l(x). In our case we're interested in a binary code, so B is two. We have two symbols with length three, which gives two times two to the minus three, plus three symbols with length two, which gives three times two to the minus two. The first part adds up to one times two to the minus two, so in total we have four times two to the minus two, which is exactly one — and therefore not larger than one. So the Kraft inequality is satisfied, therefore a prefix-free code with these lengths exists, and we can apply our algorithm from the last page.

How does this algorithm work here? We start with xi equal to one, and then we go through the symbols in this order, because I already ordered them by decreasing target length. In the first step we compute xi as one minus two to the power of minus three, because our base is two and the target length is three. I can write this in binary notation: 1.000 in binary minus two to the minus three, which is 0.001 in binary. You can do this subtraction just like you would in the decimal system, only with binary arithmetic, and you get 0.111 in binary. Now we can read off the code word: we need three bits, so we take the first three bits after the "zero point", and our code word is one, one, one. Then we go on, starting from this new xi. We subtract again two to the minus three, so again this term, minus 0.001, and that leads to 0.110. Notice that I padded with an additional zero bit here, which from the mathematics isn't strictly necessary.
But now we want to read off our target length of three bits, so the code word is one, one, zero — we have to include the terminal zero bit. And you can see that you actually can't leave it out: if we left out this padding zero bit, then this "one one" would be a prefix of the other code word, and we would no longer have a prefix-free code. That would be bad. Now we can go on. From now on we are only interested in code words of length two, so we start from what we had here, 0.11, and subtract this time two to the minus two: 0.11 in binary minus 0.01 in binary results in 0.10 in binary — and again there's an additional zero that we mustn't forget, so the code word is one, zero. Next step: we start with 0.10, the result from here, subtract again two to the minus two, which is again 0.01, and we arrive at 0.01. So the code word here is zero, one. And finally, we start with this 0.01 and subtract again 0.01, because we're again interested in a code word of length two. We get zero, which we can write as 0.00 in binary, and since we want to read off two fractional bits, we get the code word zero, zero. And you can see this is indeed a prefix code: all of these code words are different, and none of the shorter ones is a prefix of any of the longer ones. And they indeed have the target code word lengths. So our algorithm worked here, and again, on the problem set we will prove that it always works as long as the Kraft inequality is satisfied.
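To make the constructive proof of part B concrete, here is a small Python sketch of the algorithm we just stepped through by hand, assuming base B = 2; the function name and the use of exact fractions for xi are my own choices, not from the lecture:

```python
from fractions import Fraction

def construct_prefix_code(lengths):
    """Given target lengths l(x) that satisfy the Kraft inequality (base B = 2),
    construct a binary prefix code whose code words have exactly those lengths."""
    assert sum(Fraction(1, 2 ** l) for l in lengths.values()) <= 1, "Kraft inequality violated"
    code_book = {}
    xi = Fraction(1)
    # Iterate over the symbols from the longest to the shortest target length.
    for symbol, l in sorted(lengths.items(), key=lambda item: -item[1]):
        xi -= Fraction(1, 2 ** l)      # xi always stays in the half-open interval [0, 1)
        frac, bits = xi, []
        for _ in range(l):             # read off the first l bits of the binary expansion of xi
            frac *= 2
            bit = int(frac)            # 0 or 1, since 0 <= frac < 2
            frac -= bit
            bits.append(str(bit))
        code_book[symbol] = "".join(bits)
    return code_book

# The target lengths from the simplified-Monopoly example above:
print(construct_prefix_code({2: 3, 6: 3, 3: 2, 5: 2, 4: 2}))
# {2: '111', 6: '110', 3: '10', 5: '01', 4: '00'}  (ties among equal lengths may be ordered differently)
```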
Okay, so this algorithm generates a symbol code with some given code word lengths. But usually we really want to tune these code word lengths and make them as short as possible, because we want to communicate our data as efficiently as possible — we want to compress as much as possible. So the question is: how short can we make these code word lengths? This is the question we're going to answer next. In particular, what is the minimal expected code word length of a uniquely decodable symbol code? In other words, we want to minimize the expected code word length, which we defined as capital L, the expectation value of little l(x), weighted by the probability with which each symbol appears. We assume that this probability distribution is given, because it is a property of the data source: we have no influence on the data source, we just want to compress data from it. What we are free to choose are the lengths of the code words. But here is where the Kraft-McMillan theorem becomes important. We know that in order to construct a uniquely decodable symbol code, these lengths have to satisfy the Kraft inequality: the sum must not exceed one. And we really use both parts of the Kraft-McMillan theorem here. We use that the lengths have to satisfy the Kraft inequality — that is our constraint for the minimization. But we also know that if they satisfy it, then we can actually construct a symbol code with those lengths, so the minimum we find is achievable; we will not end up with a value that is too low to realize. So, to answer the question of the minimal expected code word length, we have to minimize this expectation over all code word lengths that satisfy the Kraft inequality. This is a constrained, discrete optimization problem: constrained because the lengths have to satisfy the Kraft inequality, and discrete because the code word lengths are obviously all integers.

And that last part — the fact that it's discrete — is actually the more difficult part of this optimization, because discrete optimization is typically harder than continuous optimization. So in order to answer this question, we're going to follow the following strategy. We first derive a lower bound on L, and we do this by ignoring the discreteness of the lengths. Then, once we have this lower bound, we reintroduce the discreteness and show that there actually exists a symbol code that gets close to this lower bound. In other words, we first derive a lower bound and then show that this lower bound is non-trivial — you could obviously always say that zero is a lower bound, since L cannot be negative, but that would be a trivial lower bound. In step two of our strategy we'll show that our lower bound is actually non-trivial.

So let's start with step one. In order to derive a lower bound on L, as I already mentioned, we look at a relaxed optimization problem where we minimize a target function with the same mathematical structure as before, and with the same constraint — the Kraft inequality still has to hold, so the sum over all symbols from our alphabet of B to the minus l(x) has to be smaller than or equal to one. We minimize over all lengths, but we no longer have the constraint that the lengths are integers; we now just assume that the lengths are arbitrary positive real numbers. Of course, in the end we want integer lengths; but the reason this relaxed minimization problem is a useful thing to look at is the following: if we find the minimum of the relaxed problem, then we know that the minimum of the discrete problem cannot be lower than that. The reason is simply that any integer lengths are also real-valued lengths, so if there were integer lengths leading to a shorter expected code word length, the relaxed optimization problem would find them too.

So let's look at this relaxed optimization problem. We have a target function and a constraint. The first thing to notice is that the constraint is an inequality, but we can easily convince ourselves that the optimal solution — the one that makes this expectation smallest — will actually satisfy it with equality. The reason is simply that we want to make all the little l's as small as possible. If the sum adds up to something strictly smaller than one, then we can always make one of the l's a little smaller, which makes that term, and thus the left-hand side, larger, so that it eventually adds up to one — and that is better, because a smaller l means a shorter code word. In other words, if we find a solution where the left-hand side is strictly smaller than one, then that solution wastes something: we still have some room, so we can make some of the l's a little smaller while still satisfying the inequality, and that makes the expected code word length shorter.
So the optimal solution will always satisfy equality here. And that's a nice thing, because it means we can minimize this function with an equality constraint by introducing a Lagrange multiplier. As a reminder: if you have a continuous optimization problem with an equality constraint, you can solve it by optimizing a new objective function, which is the original objective function plus some real parameter lambda — over which you also have to optimize, or at least you have to set the derivatives to zero — multiplied by the constraint. Here the constraint term is the sum over all symbols from our alphabet of B to the minus l(x), the left-hand side of our constraint. And now we have to minimize this new objective over all l(x), treating these real-valued lengths as free parameters: we no longer have to deal with the constraint explicitly, so we can treat all these values as independent parameters, because the constraint is already taken care of by the optimization over lambda.

So let's do this. In order to carry out this minimization, and to make the equations easier, I'm going to introduce a shorthand for the term that appears in the constraint: I define q(x) to be B to the minus l(x). That just makes some of our equations shorter. Inverting this, l(x) is the negative logarithm to base B of q(x) — that's just solving the definition for l(x). Now we have to minimize our objective function, so for every symbol we set the derivative of the objective with respect to q(x) to zero. Whether we optimize over l(x) or over q(x) is really irrelevant, because there is an invertible mapping between the two. So we take the derivative of the objective with respect to each q(x), and I'll call the value of q at which all these derivatives vanish q-star; so we take the derivative and evaluate it at the point q-star.

More precisely, this evaluates to the following. The first part of the objective was just the expected code word length: a sum over the probability distribution — I'm summing over x-prime here, because x is already taken — of the probability of the symbol times the length, and the length is just the negative logarithm to base B of q(x-prime). The second part of the objective was lambda times the constraint term, which we renamed to q. Taking the derivative is now very easy: all terms where x-prime is not equal to x drop out, because their derivative is zero, and we're left with just two terms. For the first term, remember that the logarithm to base B is the natural logarithm divided by the natural logarithm of B, and the derivative of the natural logarithm is one over its argument. So we get p(x) as a constant prefactor, times negative one over the logarithm of B times q-star of x. And the second term just evaluates to lambda, because we're differentiating with respect to q(x).
For the single term where x-prime equals x we just get a one, multiplied by lambda. We know the whole derivative has to be zero, so we can solve for q-star — and q-star is really just a reparametrization of the optimal lengths that we want to find. Solving this equation for q-star, we get that q-star is some constant times p(x). In other words, the optimal q is proportional to p(x). The proportionality constant depends on lambda, our Lagrange multiplier, which we would get by reconsidering the constraint.

So let's look at the constraint. We want to make q-star as large as possible, because making q large means making l small (because of the negative sign), and we want small lengths. We have the constraint — the one for which we introduced lambda — that the sum over all q-star must not exceed one. On the other hand, we know that the sum over all p is exactly one, because p is a probability distribution. So if q-star has to be proportional to p, and at the same time the sum over all q-star must not exceed one but should be as large as possible, then the best thing we can do is set q-star of x equal to p(x): we couldn't make it any larger, because then we would violate the inequality, and if we made it any smaller, we would make our lengths unnecessarily large, which we don't want to do. So the optimal choice is q-star of x equals p(x), which, solving for l, means that the optimal lengths l-star are the negative logarithm to base B of the probability that the symbol occurs. This is an important quantity: it's called the information content of the symbol. So if we could choose arbitrary real-valued code word lengths, the ideal length for each symbol would be its information content, the negative logarithm of the probability with which it occurs. And this makes sense: if a symbol is more likely, if its probability is higher, then due to the negative sign it gets a shorter code word, which is what we expect.

We can now take this l-star and insert it into our expression for the expected code word length, and we find that in the relaxed optimization problem the expected code word length evaluates to the negative sum over all symbols of the probability of the symbol times the logarithm to base B of that probability — and you may recognize that this is exactly the entropy of the probability distribution p. So what we've shown so far is that for all uniquely decodable symbol codes, the entropy of the probability distribution of the symbols is a lower bound on the expected code word length of that symbol code. Why only a lower bound? Well, we could reach it if we were allowed to choose fractional code word lengths, but since we are only allowed integer code word lengths, we may have to settle for lengths that are slightly larger. So this is an important result — our first real fundamental lower bound for compression: you cannot beat the entropy of your data source, at least in expectation.
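To put the whole derivation into formulas (I'm writing the Lagrangian as $J$; the lecture just calls it the new objective function): with $q(x) = B^{-l(x)}$, i.e. $l(x) = -\log_B q(x)$,

$$J \;=\; -\sum_{x'} p(x')\,\log_B q(x') \;+\; \lambda \sum_{x'} q(x'), \qquad \frac{\partial J}{\partial q(x)}\bigg|_{q^*} \;=\; -\frac{p(x)}{\ln(B)\, q^*(x)} + \lambda \;=\; 0,$$

so $q^*(x) \propto p(x)$; the constraint then forces $q^*(x) = p(x)$, and therefore

$$l^*(x) = -\log_B p(x), \qquad L^* = \sum_{x} p(x)\, l^*(x) = -\sum_{x} p(x) \log_B p(x) = H_B(p).$$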
So far we've only considered the relaxed optimization problem; let's now get back to the discrete case. The question I want to ask here is: how close can we get to this lower bound? We know that it is a fundamental lower bound, but is it a purely academic result, or does it actually matter — can we actually construct a symbol code that comes close to it? The answer we'll find is that we can get within less than one bit of this lower bound — less than one bit per symbol. So we can construct a symbol code, regardless of the probability distribution and regardless of the alphabet, whose expected length is obviously not shorter than the entropy and will in general be somewhat larger, with some overhead, but that overhead will be less than one bit per symbol. That may sound very nice, but the "per symbol" is really important here, and we will see later in the course that this per-symbol overhead is actually not acceptable in many machine-learning-based compression algorithms. But before we get to that, let's first prove this claim.

The proof is very simple; we just return to our discrete optimization problem, where the lengths are integers again for all symbols. We have these optimal real-valued lengths, and while we cannot use real-valued lengths for a real symbol code, we can round them up to the nearest integer. If we round up, we know that the rounding costs less than one bit: if you round a real value up, you add less than one. We also know that rounding up only makes the code words longer, so the lengths will still satisfy the Kraft inequality: the sum over all symbols of B to the minus l(x) will be smaller than or equal to the same sum with l-star, because the l's are larger and there is a minus sign, and for l-star that sum is equal to one — that was our constraint in the relaxed optimization problem. So the left-hand side is at most one.

Now that we have integer lengths that satisfy the Kraft inequality, we can use the constructive proof of part B of the Kraft-McMillan theorem to construct a symbol code that actually has these lengths. I'm going to call this symbol code C with subscript "Shannon". For this symbol code, the code word lengths are exactly the target lengths, which are the optimal lengths rounded up, and those are less than the optimal lengths plus one. If you take the expectation of these lengths, the "plus one" comes out of the expectation, and so we can sandwich the expected code word length of the Shannon code: any uniquely decodable symbol code has an expected code word length that is not shorter than the entropy — that's the fundamental lower bound we proved on the last page — but we also know that there exists a symbol code, namely the Shannon code, that comes within less than one bit of this lower bound. So it's sandwiched between the entropy and the entropy plus one. And as a remark: this procedure — calculating the optimal lengths, i.e. the information contents, rounding them up, and then applying the construction algorithm that we went through on one of the previous pages — is called Shannon coding.
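As a small sketch of the two steps just described — compute the information contents, round them up, and check that the resulting lengths still satisfy the Kraft inequality — here is some Python; the example distribution and the function names are made up for illustration:

```python
import math

def shannon_lengths(probs):
    """Code word lengths of the Shannon code: the information content, rounded up."""
    return {x: math.ceil(-math.log2(p)) for x, p in probs.items()}

def expected_length(probs, lengths):
    return sum(p * lengths[x] for x, p in probs.items())

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values())

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # a made-up example distribution
lengths = shannon_lengths(probs)
assert sum(2.0 ** -l for l in lengths.values()) <= 1        # the Kraft inequality still holds
print(lengths)                                               # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(entropy(probs), expected_length(probs, lengths))       # 1.75 and 1.75 for this dyadic example
```

For this dyadic example the Shannon code hits the entropy exactly; for general distributions, like the Monopoly probabilities in the next example, the expected length exceeds the entropy, but by less than one bit.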
And that's our first systematic way of designing a symbol code. Now, as an additional remark: we showed that Shannon coding satisfies this sandwich, so in some sense it comes close to the fundamental lower bound, but in general it is sub-optimal. To see that, let's look at an example — again our simplified game of Monopoly. Again we have the symbols two, three, four, five and six, and I've rearranged them in order from low probability to high probability. Now we can actually do Shannon coding for this and derive what code words the Shannon coding algorithm would give us.

The first step of Shannon coding is to calculate the information content: the term in the brackets here is l-star of x, the information content of the symbol x, and then we have to round it up. What is it? We just plug the numbers into a calculator. For probability one over nine I get 3.17, so rounding up gives a target length of four. The next symbol has the same probability, so the same target length. For two over nine I get 2.17, which rounded up gives three; here again, same probability, same target length. And for probability one third I get 1.58, so rounded up, a target length of two.

Now we can run our Shannon coding algorithm. I'm not going to write out all the intermediate values in full anymore, but you can think about how to calculate them. You start at one; when you subtract two to the minus four you get 0.1111, so you get the code word 1111 — we write down the first four digits after the "zero point". For the next symbol, you again subtract two to the minus four (think of the "zero point" in front, and subtract 0.0001), and what you get is 0.1110, so the code word is 1110. In the next step we're subtracting only two to the minus three, so we subtract a one at this position, and then we only take the first three bits, so we get 110. Again subtracting a one at this position leads to 101. And finally we subtract something at this position, and we only need to retain the first two bits, so we get the code word 01.

You notice again that this is a prefix code: all the code words are different and none of them is a prefix of any other. But you can already see that it is somewhat suboptimal. For example, you could drop the trailing bit of the code word 101, so you could define a new code book, C-tilde of x, that has only the code word 10 there instead. You could not drop the trailing zero of 110, because then the result, 11, would be a prefix of both of the length-four code words; so you have to keep that one at its original length. And you also cannot drop anything from 1111 or 1110, because if you drop the last bit of either of them, it becomes a prefix of the other one. And in fact, you could also drop the trailing bit of 01 and keep just 0, because this will still be a prefix code. So you can see some obvious ways to improve this code. But even that will not — as we will see in a minute — lead to the optimal code; it will, however, still satisfy both of these inequalities. So let's calculate that now.
So pause the video and try to calculate these expected code word lengths yourself, and then compare with my results. Here are the results that I get. For our Shannon code I obtain an expected code word length of 26 divided by nine, which is approximately 2.89. For our improved Shannon code C-tilde I obtain seven divided by three, which is approximately 2.33. And then it's also interesting to calculate the entropy, the left-hand side here: for these probabilities, to base two, I obtain — just by plugging it into a calculator — approximately 2.20 (the last digit really is a zero). So you can see that none of these expected code word lengths is smaller than the entropy, and that's good, because that's our fundamental lower bound. Also, none of them exceeds the entropy by more than one; in fact, the improved code is actually quite close to the lower bound, and even the Shannon code exceeds it only by about 0.69, which is smaller than one. So both of these inequalities are satisfied. You also see that the Shannon code is suboptimal: you can obviously make some improvements just by stripping off some of the trailing bits. But if you compare with some of the examples we had earlier, you'll see that you can still do slightly better than this.

As I remarked, there is an algorithm that always finds the optimal code book, and it's called Huffman coding. I'm going to go through Huffman coding by example, because I think that's the easiest way to introduce it. The way Huffman coding works is that you first write out your symbols again — those are the symbols two, six, three, five and four — and for each of those symbols we need to know their probabilities, which are one ninth, one ninth, two ninths, two ninths and one third. You don't necessarily have to order them this way, but we will now draw a tree with these symbols as leaves, and the tree would have a very complicated shape if we didn't sort them, so this just makes it slightly easier to read.

The algorithm constructs a binary tree that has all the symbols as its leaves, and the construction rule goes as follows. We always look at all the nodes that we currently have and take the two with the lowest probabilities — at the moment those would be this one and this one — and we introduce a new node in the tree that becomes the parent of these two. This new node also gets a weight, which is just the sum of the two weights: this is one ninth, this is one ninth, so the sum is two ninths. We label the two edges with zero and one in an arbitrary way; I'm just going to use the convention that I always label the left edge with zero and the right edge with one. Now we continue. These two nodes are already taken care of, so we can cross their probabilities out; now we only look at the remaining four probabilities and again pick the two smallest. There are now three nodes tied for the smallest probability, and we can break the tie arbitrarily; I'm just going to break it this way and choose these two. Again we sum up their probabilities, giving four ninths, these go away, and we now have a forest of three trees: one trivial tree and two not quite as trivial ones. We continue: again we look at the two nodes with the lowest weights, and those are now this one and this one.
So we have to introduce a new node that will sit here, like this, and I'm again going to label its edges with zero and one. The sum of two ninths and one third — one third is three ninths — gives five ninths. Finally we have to combine these last two; they are the two nodes with the smallest weights because they are the only two nodes left. We introduce a new node here again, and this will be our root node, and I again label its edges with zero and one. So now we've constructed a tree, the so-called Huffman tree.

Now we can assign a code word to each symbol by using this tree. The way we define the code word is: for each symbol, we walk the unique path from the root to that leaf and collect the labels along the edges. For example, for the symbol two we see the labels zero, zero, zero, so the code word C(x) is 000. For the symbol six we have the labels zero, zero, one, so we get 001. For the symbol three we again start from the root, but now we have to take the other branch, so we start with a one and follow up with a zero, giving 10. For five, again a one and then a one, giving 11. Finally, for four we go back to the root, take the edge with label zero and then the edge with label one, so this is 01. So we can note this down: C-Huffman is 000 for the symbol two, 001 for six, 10 for three, 11 for five, and 01 for four.

You can see again that this is a prefix code: none of the shorter code words is a prefix of any of the longer code words. And that is no accident: whenever you read off code words from a tree whose leaves are the symbols — you can interpret the tree as a prefix tree over the code words — you will never be in a situation where one code word is a prefix of another. You can also see that it's obviously better than the Shannon code: some of the code words have the same length as before, while others — for example these two, and these — are shorter than in the Shannon coding case. It's not quite as easy to see that the Huffman code book is better than our C-tilde, because there is one code word that's actually shorter in C-tilde. But when you evaluate the expected code word length, what I got is 20 over nine, compared to seven over three, which is 21 over nine, for C-tilde — so that one is a little bit bigger. Put differently, 20 over nine is about 2.22, which is slightly better than our C-tilde. We're not going to prove it here, but we will prove on one of the upcoming problem sets that Huffman coding actually always leads to an optimal code book. And this expected code word length again satisfies our bounds: it doesn't beat the entropy, because no uniquely decodable symbol code can, and it obviously also satisfies the sandwich, because it's even better than Shannon coding — or at least not worse.
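As a sketch of how the tree construction just described can be implemented — the problem set will ask you for your own implementation, so treat this only as an illustration, and note that the heap-based bookkeeping and tie-breaking are my own choices:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman coding: repeatedly merge the two nodes with the smallest weights."""
    tiebreak = count()   # makes heap entries comparable when two weights are equal
    heap = [(p, next(tiebreak), {x: ""}) for x, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # the two lowest-weight nodes ...
        p1, _, codes1 = heapq.heappop(heap)
        merged = {x: "0" + c for x, c in codes0.items()}        # ... get a common parent,
        merged.update({x: "1" + c for x, c in codes1.items()})  # with edges labeled 0 and 1
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {2: 1/9, 6: 1/9, 3: 2/9, 5: 2/9, 4: 1/3}
code = huffman_code(probs)
print(code)   # exact bits depend on tie-breaking, but the lengths come out as 3, 3, 2, 2, 2
print(sum(p * len(code[x]) for x, p in probs.items()))   # expected length 20/9 ≈ 2.22
```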
All right, that concludes this video. We've now derived some very fundamental bounds on lossless compression. So far, these bounds are only for symbol codes; but on the problem set, to which you'll find a link in the video description, you will use some very simple arguments to show that once this lower bound applies to symbol codes, it really applies to any code — to any uniquely decodable lossless compression method. The way you will prove that is by looking at something called a block code. You'll also see that, while these block codes allow you to prove such a fundamental lower bound on really any lossless compression, they are not really practical: they would have exponential runtime. So in order to practically come closer to this lower bound, you have to do something smarter than block codes, and we'll introduce that in one of the upcoming lectures. Another problem on the problem set is that you will actually implement this Huffman coding algorithm in Python. I think this is an interesting problem, because Huffman coding — despite its overhead from still being limited to a symbol code — is used very widely in practice; a lot of lossless compression algorithms use Huffman coding, with some additional tricks, internally. So on the problem set you'll implement your first real compression algorithm that is actually used a lot in practice today. With that, see you in the next video.