Welcome to the second week of the course on data compression with and without deep probabilistic models. In this and the following video we'll prove the so-called source coding theorem, which states a fundamental theoretical lower bound on the expected bit rate of lossless compression.

In the last video we introduced our first class of lossless compression codes, so-called symbol codes. We considered a discrete alphabet X that contains all symbols that we might want to compress, and we assumed a simple probabilistic model where each symbol x occurs with some probability p of x. The messages that we want to encode are sequences of symbols from this alphabet, and we denote such sequences as boldface x. A symbol code is then specified by a so-called code book c that assigns a so-called code word to each symbol in the alphabet, and these code words are variable-length B-ary bit strings. Once we have such a code book, we define a symbol code c star that maps entire messages to bit strings by simply concatenating the code words for the symbols that make up the message. More precisely, c star doesn't introduce any delimiters between code words, and we already saw that this can lead to ambiguities if we're not careful in how we design our code book. So in the following we restrict our discussion to symbol codes where c star doesn't have any ambiguities, and we call such symbol codes uniquely decodable. We also introduced the term prefix-free symbol codes, confusingly also just called prefix codes for short; these are symbol codes where no code word is a prefix of another code word. Prefix freeness is easier to verify than unique decodability, and we already saw on problem set 0 that every prefix-free code is uniquely decodable but that not every uniquely decodable symbol code is prefix free. We then argued that it would be good to assign short code words especially to symbols that appear with high probability, so that we obtain a code c with a small expected code word length, which we denote by capital L sub c. And we introduced our first example of an entropy coder, namely Huffman coding, which takes a probability distribution over symbols as input and generates a prefix-free symbol code that, I claimed, minimizes the expected code word length.

In this video we'll build on the formalism of symbol codes, but we will generalize to arbitrary uniquely decodable lossless compression codes in the next video, where we'll prove the important source coding theorem. Phrased for symbol codes, the source coding theorem relates the expected code word length to an information-theoretical quantity that we will call the entropy of the symbol probabilities p. I like to think of the source coding theorem as two parts: the bad news and the good news. The bad news states that it is not possible to find a uniquely decodable B-ary symbol code whose expected code word length is smaller than some fundamental lower bound, namely the entropy. But the good news states that this fundamental lower bound is at least meaningful, in that we can in fact almost achieve it, with less than one bit of overhead.

Our proof of the source coding theorem takes two steps. In the first step, in this video, we'll put the symbol probabilities p of x aside for a moment and prove a general constraint on the code word lengths of a uniquely decodable symbol code.
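To make this setup concrete, here is a minimal Python sketch of a symbol code; the three-symbol code book is a made-up example, not one from the lecture. Encoding simply concatenates code words without delimiters, and because this code book happens to be prefix free, decoding can proceed greedily as in problem 0.3.

```python
# A made-up prefix-free binary code book: no code word is a prefix of another.
codebook = {"a": "0", "b": "10", "c": "11"}

def encode(message):
    """c*: concatenate code words, with no delimiters in between."""
    return "".join(codebook[symbol] for symbol in message)

def decode(bits):
    """Greedy decoding; correct only because the code book is prefix free."""
    inverse = {word: symbol for symbol, word in codebook.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # a complete code word has been read
            symbols.append(inverse[current])
            current = ""
    return symbols

assert decode(encode("abcba")) == list("abcba")
```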
In the second step, in the next video, we'll analyze this constraint in the context of given symbol probabilities p of x, and we'll derive the promised bounds on the expected code word length. We'll also see that the result actually generalizes beyond symbol codes. So the source coding theorem gives us not only bounds on the expected code word length of a symbol code but also bounds on the expected bit rate of entire messages for arbitrary lossless compression codes, including so-called stream codes, which we'll introduce in later videos.

I should give credit where credit is due: this and the next video follow quite closely the proof presented by Professor Jeff Miller, which is linked here on the slides and also in the video description. But I wanted to include the proof in this video series for completeness, because we'll soon depart from the topics covered in Professor Miller's excellent video series.

So let's start with step one, that is, we'll prove a constraint on code word lengths that holds independently of any symbol probabilities. This constraint is given by the so-called Kraft-McMillan theorem, which has two parts. Part (a) states that every uniquely decodable B-ary symbol code satisfies the so-called Kraft inequality: if you take one over B to the power of the length of a code word and then sum this over all code words, the result cannot exceed one. This Kraft inequality may seem a bit abstract, so let's try to get an intuition for it. The way I like to think about this inequality is that when you design a uniquely decodable symbol code, you have a finite budget of "shortness" for all code words. If you call one over B to the length of a code word the shortness of that code word, then the sum of all the shortnesses must not exceed one. Therefore, if we're given a uniquely decodable symbol code that just barely satisfies the Kraft inequality and we want to make some code word c of x shorter, then its shortness, one over B to the code word length, grows. In order to stay within our shortness budget, we then have to reduce the shortness of some other code word, that is, we have to make some other code word c of x prime longer.

This was part (a) of the Kraft-McMillan theorem. Part (b) kind of goes in the opposite direction. It starts from some function l that assigns a target code word length to each symbol in the alphabet, and the statement is that, as long as these target lengths again satisfy the Kraft inequality, there exists a B-ary code whose code words all have exactly the given target lengths. Moreover, you can make this code not only uniquely decodable but even prefix free, which, as we've seen on problem set 0, is a strictly stronger statement than unique decodability.

We will prove parts (a) and (b) in a second, but before we do so I want to highlight a simple corollary that follows when we combine these two parts. If you don't already see where I'm going with this, then pause the video here, read through statements (a) and (b) again, and try to come up with an interesting corollary that follows when you combine them. Have you paused the video? Then here's the corollary. We've already seen on problem set 0 that every prefix code is uniquely decodable, and while the converse is in general not true, we can conclude something that's almost as good as the converse. If you start from any uniquely decodable B-ary symbol code c, then by part (a) of the Kraft-McMillan theorem its code word lengths satisfy the Kraft inequality.
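For reference, here is the theorem in compact notation; I'm writing |c(x)| for the length of the code word that c assigns to symbol x, and X for the alphabet:

```latex
% Kraft-McMillan theorem:
% (a) If c is uniquely decodable, its code word lengths satisfy
\[
  \sum_{x \in X} B^{-|c(x)|} \;\le\; 1
  \qquad \text{(Kraft inequality).}
\]
% (b) Conversely, if target lengths l(x) satisfy
%     \sum_{x \in X} B^{-l(x)} \le 1, then there exists a prefix-free
%     B-ary code c_l with |c_l(x)| = l(x) for every symbol x.
```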
Therefore, you can apply part (b) and create a prefix code with the same code word lengths. This corollary will allow us to simplify several discussions in upcoming videos, because it basically tells us that there is never a reason to use a uniquely decodable symbol code that is not a prefix code: you can always construct a prefix code with equal code word lengths, and prefix codes are better because we can decode them with a simple greedy algorithm, as we've seen in problem 0.3. But of course we should only trust this corollary if we trust that the Kraft-McMillan theorem is true. So let's prove the theorem now.

We will first prove a simple lemma that will come in handy in a second. Consider a uniquely decodable B-ary symbol code over some discrete alphabet and some integer s, and define the set Y sub s as the set of all messages x that have a bit rate of s, that is, all messages x whose encoding c star of x has length s. It should be easy to convince yourself that the size of Y sub s is at most B to the power s, because there are only B to the s distinct B-ary bit strings of length s. If you want to show this more formally, you can note that the image of the set Y sub s under c star contains by definition only B-ary bit strings of length s, so it is a subset of the product space of the numbers 0 to B minus 1, raised to the power of s. This product space contains B to the power of s elements, so any subset of it can contain at most as many elements. Now, c is uniquely decodable, which is just a fancy way of saying that c star is injective, and for injective functions image and preimage have the same size, so we find that Y sub s has at most B to the s elements, as claimed.

Let's remember this lemma for a second and use it to prove part (a) of the Kraft-McMillan theorem. As a reminder, part (a) claims that if you have a uniquely decodable B-ary symbol code c, then the Kraft inequality holds. To prove this, we refer to the left-hand side of the inequality as r, and we consider the quantity r to the power of k, where k is an arbitrary positive integer. So this is just the sum of the shortnesses, in parentheses, raised to the power of k. Let's write out the power of k as an explicit product of k identical factors, and let's give the summation variable x a different name, x1 to xk, in each of the k factors so that there's no confusion. We can now use the distributive property of multiplication over addition and pull all the sums to the front of the expression. The expression that we're summing over is a product where each factor has the form B to some power, so we can rewrite it as a single factor of B raised to the sum of the exponents. And now we can simplify. First, the k sums go over all possible combinations of symbols x1 to xk, so we can equivalently think of them as a single sum over all messages of length k. (I'll use underlined x here for messages because I can't write boldface in handwriting.) Then, the sum of the code word lengths that appears in the exponent is the same as the length of the concatenated code words c star of the message x, so for a symbol code this is just the bit rate of the message. We're almost there now. Let's assume for a minute that the alphabet is finite. Then there's a well-defined and finite maximum code word length, which we'll call gamma in the following. Thus, if you consider again the expression for r to the power of k that we just derived, then for all terms in the sum the bit rate, i.e. the length of c star of the message x, is at most gamma times k bits, because we only sum over messages of length k and each one of the k symbols in the message contributes at most gamma bits to the bit rate.
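In symbols (my notation, since the slides aren't reproduced here), the lemma and the expansion we just carried out read:

```latex
% Lemma: since c* is injective and maps Y_s into the B^s strings of
% length s, we have
%   |Y_s| \le B^s,  where  Y_s = \{\mathbf{x} : |c^*(\mathbf{x})| = s\}.
%
% Expansion of r^k, writing l(x) = |c(x)| for the code word lengths:
\[
  r^k
  = \Bigl(\sum_{x} B^{-l(x)}\Bigr)^{k}
  = \sum_{x_1} \cdots \sum_{x_k} B^{-\bigl(l(x_1) + \dots + l(x_k)\bigr)}
  = \sum_{\mathbf{x}\,:\,|\mathbf{x}| = k} B^{-|c^*(\mathbf{x})|}.
\]
```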
Let's now be generous: instead of summing only over messages of length k, let's sum over all messages of arbitrary length whose bit rate is at most gamma times k. This includes all the terms that we had before, so it results in an upper bound. Since this is still a finite sum, we can now group the terms by bit rate. So we first sum over all considered bit rates s from 0 to gamma times k, and then for each bit rate s we sum over all messages x whose bit rate is s; conveniently, we defined the set Y sub s of exactly these messages on the last slide. By definition, the bit rate of each such x is simply s, so we get the same term, B to the power of negative s, for each element of Y sub s. Remember now from the lemma that Y sub s has at most B to the s elements because c is uniquely decodable. Therefore, each term in the sum over s contributes at most 1, and so the sum over s is bounded by the number of terms, which is gamma times k plus 1. Thus, in summary, we've shown that r to the power of k is upper bounded by gamma times k plus 1.

What does this mean? Well, if we compare the left-hand side to the right-hand side and solve for gamma, we find that gamma is greater than or equal to r to the power of k, minus 1, divided by k. Importantly, this inequality holds for all positive integers k, and gamma is just the longest code word length, so it is finite and independent of k. Now, if the quantity r that we're interested in here were larger than 1, then this expression would go to infinity as k goes to infinity, so there would be some integer k for which the right-hand side would be larger than the finite constant gamma. Therefore we conclude that r cannot be larger than 1, and if you recall how r was defined up here, you can see that we've just proven the Kraft inequality, at least for the case of a finite alphabet.

So what happens if the alphabet is not finite? Then these arguments don't apply directly, because we cannot define a maximum code word length gamma, but it turns out that we can reduce the problem to the finite case. Remember that we always required the alphabet to be discrete, which we defined as either finite or countably infinite. So for the infinite case we can assume without restriction that the alphabet is the set of natural numbers. Then, in the expression for r, the sum over the alphabet is just a sum over x from 1 to infinity. The mathematicians among my viewers might object here that this notation implies a certain order in which we carry out the sum, but that's not an issue here, because we're only summing over non-negative terms, so the sum is absolutely convergent if it is convergent at all. Such an infinite sum from 1 to infinity is formally defined as the limit of finite partial sums from 1 to some n, where n goes to infinity. And for each partial sum we effectively have a finite alphabet of size n, so case one applies, and we have a limit of expressions that are all smaller than or equal to one, so the limit is also smaller than or equal to one. This concludes our proof of part (a) of the Kraft-McMillan theorem. As a reminder, part (a) starts from a uniquely decodable code and states that the Kraft inequality must hold.
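Putting the steps for the finite case together, the chain of bounds reads (again in my notation):

```latex
% Group messages by bit rate s and apply the lemma |Y_s| <= B^s:
\[
  r^k
  \;\le\; \sum_{s=0}^{\gamma k} \sum_{\mathbf{x} \in Y_s} B^{-s}
  \;=\;   \sum_{s=0}^{\gamma k} |Y_s|\, B^{-s}
  \;\le\; \sum_{s=0}^{\gamma k} B^{s} B^{-s}
  \;=\;   \gamma k + 1.
\]
% Solving r^k <= gamma*k + 1 for gamma gives, for every positive integer k,
\[
  \gamma \;\ge\; \frac{r^k - 1}{k},
\]
% and the right-hand side would diverge as k -> infinity if r > 1,
% contradicting the finiteness of gamma; hence r <= 1.
```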
The Kraft inequality is an expression in all the code word lengths, so you can probably already guess that this inequality will play an important role when we prove a bound on the expected code word length in the next video. But before we do that, let's prove part (b) of the Kraft-McMillan theorem. As a reminder, in part (b) we give ourselves some target code word length l of x for each symbol x, and the statement says that if these target code word lengths satisfy the Kraft inequality, then there indeed exists a B-ary prefix code c sub l with precisely these target code word lengths. In fact, we can even make a stronger statement: not only can we prove that such a code exists, the proof is even constructive, that is, we prove the existence of c sub l by providing an algorithm that explicitly constructs it for any given set of target code word lengths. Moreover, this algorithm is actually useful in practice, and it will give rise to two entropy coding algorithms: Shannon coding, which we'll introduce in the next video, and a more effective entropy coder called arithmetic coding, which we'll introduce in lecture 5.

Let me first present the algorithm somewhat informally and then walk through an example in order to resolve any potential ambiguities. The algorithm takes as input the target code word lengths l of x for all symbols x in the alphabet, and it outputs the code word c sub l of x for each symbol x. Let's assume for now that the alphabet is finite. We start by initializing a variable xi to one. Then we iterate over the symbols in the alphabet in order of descending target code word length l of x: we start with a symbol that has the longest target code word length and end with a symbol that has the shortest. More precisely, I should say that we iterate in order of non-increasing target code word length, because there can be multiple symbols with the same target code word length, and we iterate over those in arbitrary order. For each symbol x in the iteration, we only need to do three simple steps. First, we reduce the variable xi by the base B to the power of negative l of x. Second, we consider the representation of xi in the B-ary positional numeral system; for example, if B equals 2, this would be the binary representation of xi. Because l satisfies the Kraft inequality, you can easily convince yourself that at this point xi is always in the half-open interval from 0 to 1. Therefore, its representation in the B-ary positional numeral system always has the form of a zero, then a point, then some sequence of B-ary bits. The non-obvious step of the proof is that, third, you can now simply set the code word c sub l of x to the string of the l of x bits that come after the point in this representation. If there are fewer than l of x bits after the point, then you have to pad with trailing zeros, and if there are more than l of x bits, then you can ignore them for the purpose of the code word.

Now, of course, I don't expect you to immediately see why this algorithm always constructs a prefix-free symbol code, or even just a uniquely decodable one. The arguments turn out to be fairly simple, and I could walk you through them on these slides, but in my experience, watching someone else analyze an algorithm step by step is kind of tiresome and doesn't really get the points across.
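Still, to make the three steps concrete, here is a minimal Python sketch of the construction. The function name, the dict-based interface, and the use of exact fractions are my own choices; the per-symbol steps mirror the algorithm just described.

```python
from fractions import Fraction

def construct_prefix_code(lengths, B=2):
    """Construct a B-ary prefix code from target code word lengths.

    `lengths` maps each symbol x to its target length l(x). The
    construction requires that the lengths satisfy the Kraft inequality.
    """
    assert sum(Fraction(1, B**l) for l in lengths.values()) <= 1, \
        "target lengths violate the Kraft inequality"
    xi = Fraction(1)
    code = {}
    # Iterate in order of non-increasing target code word length.
    for x in sorted(lengths, key=lengths.get, reverse=True):
        l = lengths[x]
        xi -= Fraction(1, B**l)       # step 1: reduce xi by B^(-l(x))
        # Steps 2 and 3: read off the first l B-ary digits of xi after
        # the point (extra digits are ignored, short tails get padded
        # with trailing zeros because we always emit exactly l digits).
        digits = int(xi * B**l)       # xi < 1, so this fits in l digits
        word = []
        for _ in range(l):
            digits, d = divmod(digits, B)
            word.append(str(d))
        code[x] = "".join(reversed(word))
    return code
```

Using exact fractions instead of floats keeps the B-ary digits of xi exact, which matters because the read-off step depends on individual digits.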
So I've instead prepared a problem on problem set 2, which is linked in the video description, that guides you through the proof that this algorithm really constructs a prefix code; you'll also generalize to the case of a countably infinite alphabet in this problem. Remember that the solutions to the problem set are also linked in the video description in case you get stuck. But before you get started with this problem set, let's quickly run through an example of this algorithm to make sure it's all clear.

We'll consider an alphabet that consists of five symbols, labeled 2, 3, 4, 5 and 6, like in our beloved simplified game of Monopoly from the last video, and let's assume that someone gave us this list of target code word lengths l of x and that we want to construct a binary prefix code, so the base capital B is equal to 2. The first thing that we should do is check whether these target code word lengths satisfy the Kraft inequality, because only then does the algorithm work. In fact, if the Kraft inequality were not satisfied, then we've shown in part (a) of the Kraft-McMillan theorem that a uniquely decodable symbol code with these target code word lengths does not exist. But it's all good here: if you sum up 2 to the negative l of x over all symbols x in the alphabet for these target code word lengths, then you obtain exactly one, so the Kraft inequality is just barely satisfied.

The next step is to initialize the variable xi to 1. Then we iterate over all symbols in the alphabet in order of non-increasing target code word length l of x. So we'll start with the two symbols with the longest target code word length (I'll just sort them here in arbitrary order), and then we'll iterate over the three symbols with the shorter target code word length, again in arbitrary order. Let's now actually do the iteration. In each step, we first reduce the variable xi by 2 to the negative l of x. So in the first step, for the symbol x equals 2, we take xi at its current value of 1 and subtract from it 2 to the negative 3. It's most instructive to write this out in binary: 1.000 minus 0.001 equals 0.111, and we can directly read off the code word 111 for this symbol. The next symbol is x equals 6. We again reduce xi, from its current value of 0.111 in binary, by 2 to the negative 3, which is again 0.001 in binary, and we obtain 0.11 in binary. I'll pad with a trailing 0 and write the result as 0.110, so that we can again read off the code word. The next symbol is x equals 3. Now the target code word length is 2, so we reduce xi by 2 to the negative 2. From its current value of 0.110 in binary we subtract 0.01 in binary, which results in 0.10 in binary, and we can read off the code word, which is now only two bits long, as requested by the given l of x. The last two symbols are analogous, and you can easily convince yourself that the resulting code is indeed prefix free and has the requested target code word lengths.

As a simple exercise, I encourage you to execute this algorithm again with pen and paper, but to leave out the step where we sorted the symbols by decreasing target code word length. Instead, try to simply iterate over this table from top to bottom. You should find that this does not lead to a prefix code, or even to a uniquely decodable symbol code. So it is indeed important that the algorithm iterates over the alphabet in order of decreasing target code word length, and problem 2.1 will help you understand why this is important.
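If you'd rather check the example (or the exercise just suggested) in code, the sketch from above reproduces it. The length table below is reconstructed from the walkthrough: l of 2 and l of 6 are 3, and the remaining three symbols have length 2, which is consistent with the Kraft sum of exactly one.

```python
lengths = {2: 3, 3: 2, 4: 2, 5: 2, 6: 3}  # reconstructed from the walkthrough
print(construct_prefix_code(lengths, B=2))
# -> {2: '111', 6: '110', 3: '10', 4: '01', 5: '00'}
```

Replacing the sorted iteration with a plain top-to-bottom pass over the table generally breaks prefix freeness, which is exactly what the pen-and-paper exercise above demonstrates.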
So we've seen that the algorithm works in one example, and you'll prove on the problem set that it works in general, but there's still something that I've been sweeping under the rug until now, and that is: where do these target code word lengths come from? We know that they have to satisfy the Kraft inequality, but there are many potential choices of l of x that satisfy the Kraft inequality, so how can we know which one we should pick? For example, we could have swapped the last two assignments, that is, we could have set l of 5 to 3 and l of 6 to 2, without changing our calculation down here where we check the Kraft inequality. From the last two videos you should already have some idea of how you might decide which one of two possible assignments of target code word lengths is better: you need a probabilistic model of the data source. If you have such a probabilistic model, then you can compare, for example, the expected code word lengths of the competing assignments; I've included a small sketch of this comparison below.

In the last video we introduced the Huffman coding algorithm, and I stated that Huffman coding constructs an optimal symbol code that minimizes the expected code word length. Many practical compression algorithms indeed use Huffman coding, but for a theoretical analysis of the optimal code word lengths, Huffman coding is not so useful, because it does not give us the code word lengths as a closed-form mathematical expression, and we'll need such a closed-form expression if we want to minimize the expected code word length over parameters of the model, as we'll do with machine learning models starting on problem set 3. This is what we'll address in the next video: we'll derive a simple mathematical expression, called the information content of a symbol, that at least approximates the optimal choice for the target code word length l of x, and which converges to the true optimal bit rate once we go beyond symbol codes. Have fun with problem 2.1, and see you in the next video.
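Here is the promised sketch of the comparison. The probabilities below are made up purely for illustration (the actual distribution of the simplified Monopoly game from the last video isn't restated here); the point is only that, given a model p, the expected code word length L sub c equals the sum over x of p of x times l of x, and that it lets us rank candidate length assignments.

```python
def expected_codeword_length(p, lengths):
    """L_C = sum over x of p(x) * l(x)."""
    return sum(p[x] * lengths[x] for x in p)

p   = {2: 0.1, 3: 0.2, 4: 0.4, 5: 0.2, 6: 0.1}  # made-up probabilities
l_a = {2: 3, 3: 2, 4: 2, 5: 2, 6: 3}            # lengths from the example
l_b = {2: 3, 3: 2, 4: 2, 5: 3, 6: 2}            # l(5) and l(6) swapped
print(expected_codeword_length(p, l_a))          # ~2.2 bits
print(expected_codeword_length(p, l_b))          # ~2.3 bits
```

Under these made-up probabilities, the original assignment wins because it gives the shorter code word to the more probable symbol 5, which is exactly the intuition from the last video.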