In this video, we'll prove the important source-coding theorem, which states a fundamental lower bound on the expected bitrate of lossless compression. This is the fourth video of our university course on data compression with and without deep probabilistic models. As always, you can find a link to a playlist with all videos from this course down in the video description. There are also links to lecture notes, problem sets, and solutions.

In the last video, we proved the Kraft-McMillan theorem, which states a bound on the codeword lengths of uniquely decodable symbol codes. Part A of this theorem states that every uniquely decodable symbol code satisfies the so-called Kraft inequality, that is, the sum of the shortnesses of all codewords must not exceed 1. Here, we define the shortness of a codeword as 1 over b to the power of the codeword length, where b is usually 2 if we compress to a binary bit stream. Part B of the Kraft-McMillan theorem goes in the opposite direction: it states that for any target codeword lengths l that satisfy the Kraft inequality, there indeed exists a corresponding uniquely decodable symbol code, a prefix-free one even, that has these codeword lengths.

In this video, we'll address the question of how one should choose these target codeword lengths. To answer this question, we'll recall what we learned at the beginning of the course, namely that data compression fundamentally relies on a probabilistic model of the data source. Now, if we have a given probabilistic model, then we already know how to construct an optimal symbol code: I showed you the famous Huffman coding algorithm in video 1.2, and we'll actually prove that Huffman coding is optimal in next week's videos. But starting with the next video, we'll consider situations where we don't yet have a probabilistic model of the data source. Instead, we'll assume that we only have some finite number of example data points, for example, a collection of images if our goal is to design an image compression method. We'll then want to fit, or learn, a good probabilistic model for this kind of data. That is, we'll need to be able to calculate the bitrate that we would get if we compressed the data with a compression code that was optimal or near-optimal for some candidate probabilistic model, and we'll then want to minimize this bitrate over many candidate probabilistic models.

For this optimization of our probabilistic models, the bitrates that Huffman coding gives us are not that useful, because we cannot differentiate through the Huffman coding algorithm. Instead, we'll need a differentiable expression for the bitrate under an optimal compression code, and that's exactly what we'll derive today. This differentiable expression is called the information content of a symbol or a message under a probabilistic model. Don't worry if this sounds a bit confusing at this point; it will become clear by the end of this video. On the next problem set, you'll already apply the information content in practice, and you'll use it to train a probabilistic machine learning model that you will then turn into a compression method for written natural language.

But we're getting ahead of ourselves. Let's first remind ourselves what we mean by a probabilistic model of the data source. We'll discuss much more expressive models in future videos, but so far we've considered very simple probabilistic models where we have some discrete alphabet and we assign a probability to each symbol in the alphabet.
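Before we look at an example, here is a minimal Python sketch of the Kraft-McMillan recap above: it just sums the shortnesses b^(-l) and checks whether they exceed 1. The specific length lists are made-up illustrative values, not taken from the lecture.

```python
def kraft_sum(codeword_lengths, b=2):
    """Sum of the 'shortnesses' b**(-l) over all codewords."""
    return sum(b ** (-l) for l in codeword_lengths)

def satisfies_kraft(codeword_lengths, b=2):
    """Part A: every uniquely decodable symbol code satisfies this.
    Part B: any lengths satisfying it are achievable by a prefix-free code."""
    return kraft_sum(codeword_lengths, b) <= 1.0

# Hypothetical length lists for illustration:
print(kraft_sum([1, 2, 3, 3]))        # 0.5 + 0.25 + 0.125 + 0.125 = 1.0
print(satisfies_kraft([1, 2, 3, 3]))  # True
print(satisfies_kraft([1, 2, 2, 3]))  # False: 0.5 + 0.25 + 0.25 + 0.125 > 1
```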
For example, in video 1.2, we introduced a toy model that we called the simplified game of Monopoly. Here, we assumed that each symbol x in our message is generated by throwing a pair of fair three-sided dice and then recording the sum, so we can easily calculate the probability of each symbol in the alphabet. Now, if someone gave us a list of target codeword lengths l of x, then we can run the algorithm from the last video to construct a corresponding prefix code C sub l. And this works because the given target codeword lengths satisfy the Kraft inequality: if we sum up 1 over 2 to the power of l of x, we get, in this example, exactly 1, so the Kraft inequality is satisfied.

But you might have noticed that these specific target codeword lengths are not particularly well chosen. And this is finally where the probabilistic model comes into play. Why should the symbol 2 get a shorter codeword than the symbol 4 when the symbol 4 occurs with higher probability? It would certainly be better if we take the same set of target codeword lengths but assign the three shorter codewords to the three most probable symbols. This still satisfies the Kraft inequality, of course, so we can again run our algorithm from the last video to obtain a prefix-free symbol code with these new target codeword lengths.

We claim that the second code is better than the first one. One common way to formally compare two codes is to calculate the expected codeword length under the probabilistic model. Let me denote this simply by capital L sub lowercase l here, because the expected codeword length does not depend on the actual code C sub l, just on the lengths l themselves. And we indeed find that the second code has a shorter expected codeword length than the first one.

So if our goal is to minimize the expected codeword length, shouldn't we then try to make the codewords even shorter? In particular, in this example, the symbol x equals 4 has a higher probability than the symbols 3 and 5, so shouldn't we give it a shorter codeword? Well, if we try to just make this one codeword shorter without changing anything else, then it won't work, because these new target codeword lengths violate the Kraft inequality, so there does not exist any uniquely decodable symbol code with these codeword lengths. But we could imagine that it might be worth making some sort of trade. For example, if we now make the codewords for symbols 3 and 5 one bit longer each, then the Kraft inequality is again satisfied, so we can run our algorithm. And you can indeed easily see that the resulting code is again prefix free, but we've actually hurt ourselves, because it turns out that we've now increased the expected codeword length.

So you can see that the choice of optimal target codeword lengths is not so obvious. Is there maybe a different kind of trade that reduces the expected codeword length without violating the Kraft inequality? It's hard to tell if we just try out some random choices, so let's formalize our goal. We have a constrained optimization problem: we want to minimize the expected codeword length, and the minimization runs over the function lowercase l that specifies a codeword length for each symbol in the alphabet. The minimization is subject to the constraint that these codeword lengths have to satisfy the Kraft inequality, which I'll now write again for a general base b.
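To make these trade-offs concrete, here is a small Python sketch that reproduces the comparison numerically. The distribution is the simplified game of Monopoly (sum of two fair three-sided dice); the particular length assignments are my own guesses consistent with the narration, not necessarily the exact codewords shown in the video.

```python
from fractions import Fraction

# Simplified game of Monopoly: sum of two fair three-sided dice.
probs = {2: Fraction(1, 9), 3: Fraction(2, 9), 4: Fraction(3, 9),
         5: Fraction(2, 9), 6: Fraction(1, 9)}

def kraft_sum(lengths, b=2):
    return sum(Fraction(1, b ** l) for l in lengths.values())

def expected_length(lengths):
    return sum(probs[x] * lengths[x] for x in probs)

# Illustrative length assignments:
code1    = {2: 2, 3: 2, 4: 3, 5: 2, 6: 3}  # a short codeword wasted on the rare symbol 2
code2    = {2: 3, 3: 2, 4: 2, 5: 2, 6: 3}  # short codewords on the three most probable symbols
shorter4 = {2: 3, 3: 2, 4: 1, 5: 2, 6: 3}  # only symbol 4 shortened: violates the Kraft inequality
trade    = {2: 3, 3: 3, 4: 1, 5: 3, 6: 3}  # symbols 3 and 5 lengthened in exchange

for name, lengths in [("code1", code1), ("code2", code2),
                      ("shorter4", shorter4), ("trade", trade)]:
    print(f"{name}: Kraft sum = {float(kraft_sum(lengths)):.3f}, "
          f"expected length = {float(expected_length(lengths)):.3f}")
# code2 beats code1 (about 2.22 vs 2.44 bits), shorter4 violates the Kraft
# inequality (1.25 > 1), and the trade is worse again (about 2.33 bits).
```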
Back to our optimization problem. There's implicitly a second constraint, namely that the lengths of our codewords must be positive integers, because they correspond to the actual counts of bits in a codeword. It turns out that this second constraint is actually the more difficult one to deal with, because discrete optimization is typically harder than continuous optimization: there are no gradients in discrete spaces. So let's make our lives easy and ignore the second constraint for the time being. Don't worry, we'll reintroduce it on the next slide. We thus now have a relaxed optimization problem where the function values l of x are real valued, but still positive. With this relaxed constraint, the minimization runs over strictly more functions than in the original problem, so relaxing the constraint can only make the minimum smaller. That means the minimum under the relaxed constraint 2 prime is a lower bound on the minimum under the original constraint 2.

Fortunately, it turns out that we can solve the relaxed optimization problem quite easily. Let's first make an observation about constraint number one, that is, the Kraft inequality. While the constraint is an inequality, we can easily convince ourselves that a solution to the relaxed optimization problem will satisfy it just barely, that is, with equality. This is simply because if we had a real-valued function l where the left-hand side of the inequality is smaller than 1, then we would have some room to reduce some function value l of x by an infinitesimal amount while still staying within the constraint. And this would lead to a better solution, because it would also reduce the objective function L sub l. So we can replace constraint 1 by a constraint 1 prime that replaces the inequality with an equality.

And now we can enforce this constraint 1 prime in the continuous optimization by introducing a Lagrange multiplier lambda. If you don't remember how Lagrange multipliers work, they are a technique to turn a constrained optimization problem into an unconstrained optimization problem over a higher dimensional space. In our example, it means that we now have to find a stationary point of an objective function that depends both on the function lowercase l over which we were originally optimizing and also on a new real-valued parameter lambda. And the objective function consists of the original objective function plus lambda times a term that will give us back the constraint 1 prime, as we'll see in a second.

Let's find these stationary points. So let's take the derivatives with respect to all parameters, set them to 0, and then solve for the parameters. For the derivative with respect to lambda, we simply obtain the expression in the parentheses. And if we set this expression to 0, then we recover exactly constraint 1 prime. And this is no coincidence, because this is exactly how one enforces constraints with Lagrange multipliers. The point of the Lagrange multiplier method is that we can now treat the function values l of x over which we optimize as if they were independent parameters, because the constraint that ties them together is already taken care of by the Lagrange multiplier. So we now have a separate equation for each symbol x in the alphabet that sets the derivative of the objective with respect to the function value l of x to 0. Let me actually rename the placeholder for the symbol to x tilde so that there's no confusion with the x's that appear in the two sums inside the objective function.
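For reference, here is how I would write out the relaxed problem and its Lagrangian from the description above (a transcription of the spoken derivation; the sign convention for lambda is chosen so that the stationarity condition matches the result derived next):

```latex
% Relaxed problem: minimize the expected codeword length
%     L_l = \sum_x p(x)\, l(x)
% over real-valued lengths l(x) > 0, subject to constraint (1'):
%     \sum_x b^{-l(x)} = 1.
% Lagrangian with multiplier \lambda:
\mathcal{L}(l, \lambda)
  = \sum_{x} p(x)\, l(x)
  + \lambda \Bigl( \sum_{x} b^{-l(x)} - 1 \Bigr)
```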
Let me then write out the objective function in a slightly simplified form. The constant term minus lambda drops out when we differentiate with respect to l of x tilde. And in each of the two sums, only the term with x equals x tilde contributes. The derivative of the first term is simply the probability p of x tilde. The second term is a bit more complicated. I rewrite it here by applying first the natural logarithm and then the exponential function. This doesn't change anything, because logarithm and exponential are inverses of each other, but it allows me to use a logarithm identity and pull the exponent inside the logarithm out in front as a factor. Now it's clear how to take the derivative with respect to l of x tilde. And then we can finally identify that this exponential term is again just b to the negative l of x tilde. At a stationary point, this derivative has to be zero.

We can now solve for l of x tilde, and we find first that b to the negative l of x tilde is p of x tilde divided by lambda times the natural logarithm of b. Thus, l of x tilde is the negative log to base b of this expression, which I'll rewrite as the negative log b of p of x tilde plus a constant alpha, where alpha is log b of lambda times the natural logarithm of b. And recall that this equation holds for all x tilde in the alphabet, so we might as well rename x tilde back to x, because there's no potential for confusion anymore.

We still have this constant alpha hanging around, which depends on our artificially introduced Lagrange multiplier lambda. To eliminate this constant, the final step in the Lagrange multiplier method is to insert the result into the constraint. The constraint 1 prime was that 1 has to equal the sum over b to the negative l of x. So if we insert our solution l of x from above, then we can pull a constant factor b to the negative alpha out of the sum, and the rest simplifies to the sum over all probabilities p of x, which is 1 for a properly normalized probability distribution. So we find that alpha is zero, and therefore the solution to our relaxed optimization problem is that l of x is the negative logarithm to base b of the symbol probability p of x.

This is an important relation to remember, so let me erase the board and summarize the result. In the calculations on the last slide, we found the solution to a relaxed optimization problem. I'm going to give the solution the explicit name l star. So l star is the real-valued function that minimizes the expected codeword length under the relaxed constraint 2 prime, where codeword lengths don't have to be integers. And we found that l star of x is given by the negative logarithm to base b of the symbol probability p of x. This quantity is so important that it has a name: it's called the information content of the symbol x, or, if you want to be more specific, the information content of the symbol x under the probability distribution p.

We can now find the value of the quantity that we set out to minimize. If we insert l star into the expression for the expected codeword length, then we obtain the expected information content, or, more explicitly, the negative sum over x of p of x times log to base b of p of x. This is probably the most important quantity in information theory, and it is called the entropy of the probability distribution p. Remember now that l star was the result of a relaxed optimization problem, so this value doesn't correspond to any actual expected codeword length, because it would require a code with non-integer codeword lengths.
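Here is a small Python sketch that evaluates the information content and the entropy for the simplified-Monopoly distribution used earlier, just to attach numbers to the result we derived:

```python
import math

# Simplified-Monopoly probabilities (sum of two fair three-sided dice).
probs = {2: 1/9, 3: 2/9, 4: 3/9, 5: 2/9, 6: 1/9}

def information_content(p, b=2):
    """l*(x) = -log_b p(x): the optimal real-valued codeword length."""
    return -math.log(p, b)

entropy = sum(p * information_content(p) for p in probs.values())

for x, p in probs.items():
    print(f"x={x}: p={p:.3f}, information content = {information_content(p):.3f} bits")
print(f"entropy H(p) = {entropy:.3f} bits")   # roughly 2.20 bits per symbol
```

Note that the entropy, about 2.20 bits, is slightly below the roughly 2.22 bits per symbol that the best integer-length assignment above achieved, which is exactly the point of the next step.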
So let's now restore the constraint that l must be integer valued. Then the minimization runs over fewer functions, so it will in general result in a larger minimum, as we've discussed on the last slide. With the integer constraint restored, the left-hand side of this inequality is now interpretable: it's the smallest possible expected codeword length over all symbol codes that satisfy the Kraft inequality. And that's a big deal, because by part A of the Kraft-McMillan theorem, we know that all uniquely decodable symbol codes satisfy the Kraft inequality. Therefore, we've derived an important lower bound on the expected codeword length of any uniquely decodable symbol code: the expected bitrate is lower bounded by the entropy of the symbol probabilities.

Now, in and of itself, a lower bound is not necessarily meaningful. After all, I could have told you without any complicated derivations that the expected codeword length is always larger than zero. That's also a correct lower bound, but it is kind of meaningless, because in most cases, we won't even be able to come close to it. But I'm going to claim that the entropy is a more meaningful lower bound, because we can always construct a symbol code that comes close to it in a well-defined way.

So recall from the last slide that all uniquely decodable b-ary symbol codes have an expected codeword length that is larger than or equal to the entropy, which we can calculate from this expression. And we would be able to achieve equality in this expression if we were able to set the individual codeword length of each symbol x to the information content of the symbol, so to the negative logarithm of p of x. But this is typically not possible, because codeword lengths are integers but the information content is not, at least not in general. So if we can't achieve the lower bound exactly, how closely can we approach it?

We can find a simple upper bound by defining target codeword lengths l sub S that simply round up the information content of each symbol to the next integer. By the way, the subscript S is not a variable, it's just short for the name Shannon. These target codeword lengths satisfy the Kraft inequality because rounding up can only make codewords longer. More formally, we can insert the rounded-up information content into the left-hand side of the Kraft inequality. Now, if we remove the rounding-up operation, then this underlined part can only become smaller, but there's a minus sign in front of it, so removing the rounding-up operation can only make the entire expression larger. Now the logarithm cancels with the exponentiation, and we obtain again the sum of all probabilities, which is 1. Thus, l sub S satisfies the Kraft inequality, so by part B of the Kraft-McMillan theorem, there exists a b-ary prefix code C sub S with these codeword lengths.

Let's now calculate the expected codeword length of this code. The length of each codeword is, by construction, given by the function l sub S, which is the rounded-up information content. Rounding up increases the value by less than one, so we can upper bound each term in the sum by the information content plus one. And then we can regroup, and we obtain an upper bound that is given by the entropy plus one bit.

So let's summarize. So far, we've derived theoretical bounds for the expected codeword length of symbol codes.
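Here is a short Python sketch of Shannon coding for the simplified-Monopoly distribution. It constructs only the codeword lengths, not the actual codewords, and then checks both the Kraft inequality and the entropy-plus-one-bit bound:

```python
import math

probs = {2: 1/9, 3: 2/9, 4: 3/9, 5: 2/9, 6: 1/9}

# Shannon code: round each symbol's information content up to the next integer.
shannon_lengths = {x: math.ceil(-math.log2(p)) for x, p in probs.items()}

kraft = sum(2 ** (-l) for l in shannon_lengths.values())
expected = sum(p * shannon_lengths[x] for x, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())

print(shannon_lengths)        # {2: 4, 3: 3, 4: 2, 5: 3, 6: 4}
print("Kraft sum:", kraft)    # 0.625 <= 1, so a prefix code with these lengths exists
print(f"{entropy:.3f} <= {expected:.3f} < {entropy + 1:.3f}")  # entropy <= L_S < entropy + 1
```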
I like to think of these bounds as the bad news and the good news. The bad news is that a uniquely decodable b-ary symbol code cannot have an expected codeword length that is smaller than the entropy of the symbol probabilities. But the good news states that this lower bound is at least meaningful, in that there exists a code, for example the Shannon code that we just introduced, whose expected codeword length is less than one bit larger than the entropy. In practice, we're interested in the expected codeword length of an optimal symbol code, because that's the quantity that we want to minimize over model parameters when we fit a probabilistic machine learning model. And the bad news and the good news allow us to sandwich this quantity from both sides: it's lower bounded by the entropy, because this lower bound applies to all codes, and it's upper bounded by the entropy plus one bit, because the optimal symbol code is for sure not going to be worse than the Shannon code.

There's one caveat, though. So far we've restricted ourselves to symbol codes, so codes that essentially only operate on a single symbol. When we want to compress a message that consists of more than one symbol, then we simply apply the symbol code to each symbol and concatenate the resulting codewords. You might be thinking that this is kind of primitive, and that maybe you could come up with a better lossless compression code if you could somehow consider the entire message as a whole rather than chopping it up immediately into symbols. And you would indeed be correct. Symbol codes have a suboptimal bit rate because they always reserve an integer number of bits per symbol.

Imagine, for example, that you wanted to compress a message that consists of three symbols, x1, x2, and x3. In this figure, the lengths of the green, orange, and blue bars indicate the information content of each symbol, using the scale below. Recall that the information content is in general not an integer, but if you encoded the message, for example with Shannon coding, then each symbol would be mapped to a codeword whose length rounds up the information content to the next larger integer. So it's true to say that an optimal symbol code has at most one bit of overhead, but this overhead applies per symbol, and so overheads add up if you have a long message that consists of lots of symbols, as the little sketch below illustrates. Unfortunately, many modern machine-learning-based compression methods transform the data before entropy coding to a new representation that lies precisely in the awkward regime where one has lots of symbols with often only a fraction of a bit of information content per symbol. So an overhead of up to one bit per symbol would really hurt for these methods. Fortunately, there's a practical solution called stream coding that we'll discuss in lectures five and six.

But before we get to these practical methods, let's quickly generalize our theoretical bounds beyond symbol codes. So let's consider a uniquely decodable lossless compression code that doesn't have to be a symbol code. The code operates on some message space X, which might be the set of arbitrary-length sequences of symbols from some alphabet, as we've discussed so far, but it doesn't need to be. However, note that the lossless compression code maps injectively from the message space to the discrete set of bit strings, so this only works if the message space is discrete. For our theoretical analysis, we can therefore interpret the generic compression code on X as a giant code book for a symbol code with alphabet X.
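To put a number on how the per-symbol rounding overhead mentioned above adds up, here is a small Python sketch. The per-symbol probabilities are purely made-up values in the low-information-content regime, not taken from the lecture:

```python
import math

# Hypothetical message: 100 symbols that each carry only a fraction of a bit.
symbol_probs = [0.9, 0.85, 0.95, 0.9, 0.8] * 20

info = [-math.log2(p) for p in symbol_probs]
total_info = sum(info)                              # information content of the whole message

per_symbol_bits = sum(math.ceil(i) for i in info)   # symbol code: round up once per symbol
whole_message_bits = math.ceil(total_info)          # code over the whole message: round up once

print(f"total information content:  {total_info:.1f} bits")     # about 18.7 bits
print(f"symbol code (e.g. Shannon): {per_symbol_bits} bits")     # 100 bits: one full bit per symbol
print(f"whole-message code:         {whole_message_bits} bits")  # 19 bits
```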
In this symbol-code interpretation, the bit rates for messages become codeword lengths for the symbols. The probability distribution of our messages can now be something very complicated, but as long as it has a finite entropy, the bounds for the expected codeword length of symbol codes that we've derived in this video now give us a lower and an upper bound on the expected bit rate of an optimal code that operates on entire messages. And this is the famous source coding theorem. In practice, it would be prohibitively expensive to construct a truly optimal code for long messages, but we will see that the stream codes that we'll discuss in lectures five and six typically have a really negligible overhead over the entropy.

There's a somewhat subtle caveat that I've been glossing over a bit, because I don't think it's particularly important, but I should mention it for completeness. The lower bound really only holds for uniquely decodable codes, and one could argue that unique decodability is not a crucial property for a code that operates on entire messages, since in practice one will rarely concatenate entire compressed messages without some form of container format or protocol. And it's true that if you remove the constraint of unique decodability, a code can indeed slightly violate the lower bound. The way it can do this is by encoding some of the information content of the message into the length rather than the content of the compressed bit stream. On the other hand, one can argue that at some point, some container must always store the length of the compressed representation. For example, if you save the compressed representation of some message to a file, then your file system must store the file size somewhere on your SSD. So at the end of the day, in expectation, lossless compression cannot beat the entropy.

That's it for this video. On the problem set, which is linked in the video description, you'll walk through some simple examples of Shannon coding, and you'll get some more intuition for the concepts of entropy and information content. This includes a kind of fun example in problem 2.4 that is so trivial that it almost becomes difficult again. I really encourage viewers to have a look at this problem because, in my experience, it has helped students grasp the concepts far better than just learning the equations. In the next video, we'll see how the theoretical bounds that we've just derived manifest themselves in practical situations. We'll quantify the overhead that one gets due to an imperfect probabilistic model of the data source, and we'll see how we can learn probabilistic models. See you there.