Welcome back to the course on data compression with probabilistic models. This video is going to be a treat: we're going to cover a milestone in source coding theory. We'll take a deeper look at the Huffman coding algorithm that we saw at the end of the last video, and we're going to prove that it always constructs an optimal symbol code. So let's jump in.

In the last video, we proved theoretical bounds for lossless compression with symbol codes, both a lower bound and an upper bound. The lower bound states that the expected length of a code word, that is, the expectation value of the code word length under the probability distribution of symbols that might appear in your message, cannot be lower than the entropy of that distribution. The upper bound states that you can indeed come close to this lower bound: we showed that there always exists a uniquely decodable symbol code, a prefix code even, that reaches the lower bound up to at most one bit. It's important to remember that for symbol codes this is one bit per symbol. So you can have an overhead of up to one bit per symbol in your message, and if your message is very long and contains a lot of symbols, then the overhead grows linearly in the number of symbols. But the overhead is less than one bit per symbol if you choose an optimal symbol code.

The way we proved the upper bound was constructive: we gave an algorithm that takes in a probability distribution and constructs a symbol code. That code may not be optimal, and in fact we saw situations where it clearly isn't, where there were obvious ways to make it better, but it does already satisfy the constraint that its overhead over the entropy is less than one bit. Therefore the optimal symbol code cannot be worse than the code we constructed. This method was called Shannon coding.

Then, on the problem set, we went a step further and asked what happens if we go beyond symbol codes, that is, if we drop the restriction that each symbol has to be mapped to an integer number of bits. We did this by considering so-called block codes, which are just symbol codes applied to a whole block of symbols at once. We saw that as you increase the block size, the overhead epsilon, which is now the overhead per original symbol, becomes very small: it goes as one over M, where M is the block size. So if you make the block size large, which you can do if you have a long message (the practical setting anyway), this overhead goes to zero and becomes negligible. This proves that there always exists a lossless compression method that comes very close to the theoretical lower bound. But in contrast to our construction for symbol codes, block codes are not really practical: their runtime complexity grows exponentially in the block size M, so it is not practical to use a block code with very large blocks. Later in the course you will learn about different methods, so-called stream codes, that also come very close to this theoretical lower bound and that are more efficient: they have only linear complexity in the size of the message.
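As a quick reference, here is one way to write those bounds compactly. The notation is my own shorthand for what was said above: p is the symbol distribution, H(p) its entropy, l_C(x) the code word length that code C assigns to symbol x, and M the block size.

```latex
% Lower bound, for every uniquely decodable symbol code C:
H(p) \;\le\; \mathbb{E}_{x \sim p}\!\left[\ell_C(x)\right]
% Upper bound: there exists a prefix code C (e.g. from Shannon coding) with
\mathbb{E}_{x \sim p}\!\left[\ell_C(x)\right] \;<\; H(p) + 1
% Block codes over blocks of M i.i.d. symbols: applying the same bounds to blocks
% gives an expected number of bits per original symbol below H(p) + 1/M.
```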
Then, going back to symbol codes, on the problem set you implemented a method called Huffman coding, which you can think of as an alternative to Shannon coding. It was stated on the problem set that Huffman coding always leads to an optimal, uniquely decodable symbol code. And that leads us to the topic of today: today we will prove that Huffman coding is indeed optimal. This is really a milestone in compression theory and source coding theory; it was actually surprising to people that such a simple algorithm can lead to optimal symbol codes.

But before we can prove that Huffman coding is optimal, we have to be clear about what we actually mean by this statement. There is a complication here, because Huffman coding is not even necessarily completely defined: there are cases where Huffman coding is ambiguous and you have to break ties in some way. So let's remind ourselves how Huffman coding works and, at the same time, see how these ties can occur and what happens when you have one.

Let's look at an example. Say we have an alphabet where the symbols can be a, b, c, or d, with the following probabilities: p(x) = 1/6 for a, 1/6 for b, 1/3 for c, and 1/3 for d. What Huffman coding does is construct a tree whose leaves are these symbols. As discussed at the end of the last video, you always look at the two symbols with the lowest probability, which in this case are a and b, and you introduce a new node that becomes the parent of these two symbols. The weight of this new node is the sum of the weights of the original symbols, so here it is 1/6 plus 1/6, which is 1/3. Then you continue: you take these two symbols and their probabilities out of consideration, because they have already been taken care of, and you again look at the nodes with the lowest probability, or lowest weight. And now you have a tie; I've constructed this example so that you have a tie at this point. You could, for example, take the new node and c and introduce a parent node above these two, which then has weight 2/3. In the last step, the only two nodes left are this one and d, so you create a final node whose weight is one, because it covers all the symbols and their probabilities add up to one. Then Huffman coding assigns a bit to each branch; let's use the convention that the left branch is always zero and the right branch is always one. The code word for each symbol is obtained by starting at the root of the tree, following the unique path down to that symbol's leaf, and picking up all the bits on the branches along the way. For a that would be 0 0 0, for b it would be 0 0 1, for c it would be 0 1, and for d it's 1.
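If you implemented this on the problem set, your code probably looked roughly like the following sketch. This is a minimal illustration, not the reference solution; the function name huffman_code and the tie-breaking by insertion order are choices I'm making here for concreteness.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code book {symbol: bit string} from {symbol: probability}."""
    order = count()  # breaks ties deterministically by insertion order
    # Heap entries are (weight, tie-break counter, subtree); a subtree is either
    # a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(p, next(order), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two nodes with lowest weight ...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))  # ... get a new parent
    _, _, root = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: 0 on the left branch, 1 on the right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: the accumulated bits are the code word
            code[node] = prefix
    walk(root, "")
    return code

# The example from the video. With the tie-breaking used here, the run happens to
# merge c and d in the second step, so every code word ends up with length 2.
print(huffman_code({"a": 1/6, "b": 1/6, "c": 1/3, "d": 1/3}))
```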
Now, when I say that Huffman coding is optimal, I'm claiming that the expected code word length is the lowest possible for any uniquely decodable symbol code. So what is this expected code word length? We calculate it by summing over all the symbols in the alphabet, multiplying the probability of each symbol by the length of its code word. In this specific example we get, for a, 1/6 times 3, because the code word for a is 3 bits long; for b, again 1/6 times 3; for c, 1/3 times 2, because the code word for c is only 2 bits long; and finally for d, 1/3 times 1, because that code word is only 1 bit long. If you multiply all of this out and sum it up, you get an expected code word length of 2 bits.

But we already saw that there was some ambiguity in this process: when we introduced the second new node, we could have done things differently. So let's start again with the same symbols, a, b, c, d with probabilities 1/6, 1/6, 1/3, 1/3. The first node is still uniquely determined, because there are only two lowest probabilities, so we have to combine a and b and add a node with weight 1/3, and then these two are taken care of. But now we could break the tie in a different way and introduce a new node above c and d instead; it also has weight 2/3, the same as before, which is exactly why there was a tie. Finally, you take those two out, introduce the final node, the root node, which has weight 1, and again assign bits to each branch. Here you now get different code words: for a you get 0 0, for b 0 1, for c 1 0, and for d 1 1. So not only are the code words different, even the lengths of the code words differ from the other way of breaking the tie. The code word for a has only 2 bits here, but it had 3 bits before; on the other hand, the code word for d now has 2 bits, whereas it had only one bit before. But you see immediately that, at least in this example, the expected code word length is again 2, right? You don't even have to calculate anything: all the code words have length 2 bits, so their weighted average is also 2.

The claim, and I'm going to state this as a theorem, is that this is always the case in Huffman coding: whenever you have a tie, then no matter how you break it, you end up with the same expected code word length. This is not yet the full theorem that we want to prove today, but it is an important preparation so that the theorem we actually want to prove even makes sense. So let's state it as theorem 1: the expected code word length, i.e. the expectation value of the code word length under your probability distribution, does not depend on how you break ties in Huffman coding. And that's good to know, because only then does it even make sense to say that Huffman coding is optimal. If it depended on how we break ties, it wouldn't make much sense to call Huffman coding optimal, because you would have to specify which Huffman code you mean.
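Just to sanity-check the arithmetic of theorem 1 on this example, here is the expected code word length computed for both ways of breaking the tie. The two code books are copied from above; exact fractions are used so there is no rounding.

```python
from fractions import Fraction as F

probs = {"a": F(1, 6), "b": F(1, 6), "c": F(1, 3), "d": F(1, 3)}
code_1 = {"a": "000", "b": "001", "c": "01", "d": "1"}   # first way of breaking the tie
code_2 = {"a": "00",  "b": "01",  "c": "10", "d": "11"}  # second way of breaking the tie

def expected_length(probs, code):
    # E[l] = sum over all symbols x of p(x) * (length of the code word for x)
    return sum(p * len(code[sym]) for sym, p in probs.items())

print(expected_length(probs, code_1))  # 2 bits
print(expected_length(probs, code_2))  # 2 bits
```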
So let's prove this theorem. The proof is actually very simple; I'm not even going to write it down, I'm just going to walk you through it using this picture. Up here, we calculated the expected code word length as a weighted average of the code word lengths, where the weights are the probabilities. But you could have calculated this expected code word length in a different way: every time we introduced a new node, we could have kept track of how that new node contributes to the expected code word length. We would have started with an expected code word length of zero and then added in the contribution of the first new node. The effect of the first new node is that it adds the last bit to the code words of the symbols a and b, because that last bit corresponds exactly to the bits on the branches of the first node we introduced. How does that affect the expected code word length? It adds one bit with the probability given by the sum of the probabilities of these two symbols, which is exactly the weight of the new node. So it adds one times 1/3, because 1/3 is the weight of the new node, the sum of the probabilities of all the symbols that are affected.

In the next step we introduced the node with weight 2/3, which affects the symbols a, b, and c: for each of them it adds one new bit, the second bit counted from the root. So again we add one bit to the expected code word length, this time with probability 2/3, because the sum of the probabilities of all affected symbols is 2/3; this 2/3 is exactly the weight written on that node (which is by now hardly readable in my picture). Finally, in the last step, we introduced the root node, which adds the first bit for all the symbols. So we add one bit with probability one, because that is the weight of the root node, which is always one. If you do the math, this adds up to 2 bits, just as in our other calculation.

Now you could do the same for the other way of breaking the tie, but I'm not even going to write it down, because you will immediately see that all that matters is the sequence of weights of the nodes we added: each weight is always multiplied by just one bit. And since we broke a tie, the only reason we were able to break it in different ways is that the competing weights were the same for both choices. So here again we first add a node with weight 1/3, then a node with weight 2/3 (a different node, but with the same weight, which is why there was a tie in the first place), and finally the root node with weight 1. No matter how we break the ties, we get the same sequence of weights, and therefore we end up with the same expected code word length. So let me scroll back to the theorem we stated: I hope this convinces you that theorem 1 always holds, and that it therefore makes sense to talk about the optimality of Huffman coding.
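In other words, writing w for the weight of each internal node created during the merges, the bookkeeping for this example reads as follows; since a tie means two candidate merges have the same weight, the sequence of weights, and hence the sum, is the same however the tie is broken.

```latex
\mathbb{E}[\ell]
  \;=\; \sum_{\text{merge steps}} 1 \cdot w_{\text{new node}}
  \;=\; \tfrac{1}{3} + \tfrac{2}{3} + 1
  \;=\; 2 \text{ bits}
```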
Now, just as a remark: you might now think that it doesn't matter at all how we break ties, but that's not quite true. If you implement Huffman coding in an actual compression method, then even though it may not matter for the expected code word length how you break ties, you still have to be consistent: the encoder and the decoder have to break ties in the same way. This may seem trivial, but when you actually implement Huffman coding and you use floating point numbers for the probabilities, you have to be careful to add them up in exactly the same way on both sides. Even slightly different rounding errors can turn something that is a tie on the encoder side into a non-tie on the decoder side, if the two sides add up the numbers in a mathematically equivalent but numerically different order. You can then run into problems where the encoder and the decoder do not agree on the code book they construct. But as long as you follow exactly the same procedure, including the exact same way of breaking ties, you can safely use Huffman coding for encoding and decoding.
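One common way to sidestep both problems, the floating-point sensitivity and the encoder/decoder agreement, is sketched below: work with integer symbol counts instead of floating point probabilities and break remaining ties by a fixed insertion order. The function name and the use of counts are choices I'm making here, not something prescribed by the video.

```python
import heapq
from itertools import count

def huffman_lengths(counts):
    """Return a code word length per symbol, built from integer symbol counts.

    Integer weights avoid rounding differences, and the (weight, order) heap key
    breaks ties identically on encoder and decoder, as long as both feed in the
    symbols in the same order.
    """
    order = count()
    heap = [(c, next(order), [sym]) for sym, c in sorted(counts.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in counts}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:   # every symbol below the new node gains one bit
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(order), syms1 + syms2))
    return lengths

# Counts 1:1:2:2 correspond to the probabilities 1/6, 1/6, 1/3, 1/3 from the example.
print(huffman_lengths({"a": 1, "b": 1, "c": 2, "d": 2}))  # {'a': 2, 'b': 2, 'c': 2, 'd': 2}
```

Note that the way lengths are accumulated here is exactly the counting argument from the proof of theorem 1: every merge adds one bit to every symbol below the new node.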
So, with that out of the way, let's now move on to a very bold statement, theorem 2, which we are going to prove in the rest of this video. Loosely speaking, this theorem states that Huffman coding is optimal: Huffman coding constructs an optimal symbol code. That is, loosely speaking, the message of the theorem. In reality we can make an even bolder statement, so let me write down the precise version. First, two assumptions: we assume that we have an alphabet of size at least two, and we assume that we have a probability distribution on this alphabet under which all symbols have strictly positive probability, p(x) > 0 for all x in the alphabet. Notice that these assumptions really just say that we're not dealing with a trivial system. If the alphabet had fewer than two symbols, so if there were only a single symbol we could ever encode, the situation would be trivial: all we would need to transmit is the length of the message, since we would know that every symbol in it is the one symbol that is even possible. Similarly, if some symbol in the alphabet had probability zero, it could never appear in our message, so we could just remove it from the alphabet. So don't read more into these assumptions than that; they just rule out trivial systems. The statement of the theorem is then: under these assumptions, for all uniquely decodable symbol codes on this alphabet that are optimal with respect to this probability distribution p, there exists a Huffman code with the same code word length for every symbol. To make this a bit more mathematical: for every such optimal uniquely decodable symbol code C there exists a Huffman code C_H such that the length of the code word C(x) equals the length of the code word C_H(x), and this holds for all symbols x in the alphabet. It's easy to see what this buys us: if for every optimal uniquely decodable symbol code there exists a Huffman code with the same code word lengths (not necessarily the same code words, but the same lengths), then in particular the expected code word length is the same. And if a Huffman code has the same expected code word length as any optimal uniquely decodable symbol code, then Huffman coding is optimal.

As a brief remark, I've stated this theorem in the most general form I could come up with, but you may already see that we can make our lives a bit simpler. If you did the problems, you may remember this from problem 2.1 on the second problem set; if you didn't, the result proved there is basically that for any uniquely decodable symbol code you can find a prefix code with the same code word lengths. That means it suffices to show that theorem 2 holds for all optimal prefix codes: instead of saying "for all uniquely decodable symbol codes" we could also say "for all prefix codes". Because if you are given some uniquely decodable symbol code that may not be a prefix code, you can always, as a first step, construct a prefix code with the same code word lengths; and if you know the theorem holds for prefix codes, it then also holds for the original code. So it suffices to prove the theorem for prefix codes, and once we have done that, we know it also holds for all uniquely decodable symbol codes. Let me not strike through the original statement, because it is still correct; we just only have to prove it for prefix codes.

Now, we're going to use these non-triviality assumptions a couple of times, so I'm going to give them a name; let's call them (*). We're going to use them already in a first lemma that we prove on the way to the proof of our theorem. So let's first prove lemma 1. Lemma 1 again assumes that the non-triviality conditions (*) hold, and lets C be an optimal prefix code, where optimal is again with respect to this probability distribution p, which is non-zero everywhere. Given the probability distribution, we can sort the symbols: we give them indices 1, 2, 3, and so on, such that their probabilities are non-decreasing, p(x1) ≤ p(x2) ≤ p(x3) and so on. Obviously there could be ties, i.e. two symbols with the same probability, as happened in our example. If that happens, we break the tie by code word length (we already have a symbol code, so we can talk about code word lengths), and we break it in descending order of code word length. What I mean by that: if two symbols have the same probability, then in this sorting they must have consecutive indices; so if p(xi) = p(xi+1), we resolve the tie by requiring that l(xi), the length of the code word for xi, is greater than or equal to l(xi+1), the length of the code word for xi+1.
After that there could still be ties, but we don't care: if there are still ties after applying this condition, we break them arbitrarily. The statement of the lemma does not depend on how we break these remaining ties.

So what does the lemma say? After sorting the symbols in this way, the statement is twofold. First, if the code is optimal, then the code word lengths are non-increasing everywhere. Remember that we only enforced this condition in the case of a tie; the statement now is that if the code is optimal, and we sort first by increasing probability, then the code word lengths are non-increasing everywhere, not just where we find ties. The second statement, whose proof will use the fact that we have a prefix code, is that the first inequality is actually an equality: the length of the first code word is the same as the length of the second code word in this sorting, where we sort first by probability and then by code word length.

So let's prove this lemma, first statement one and then statement two. For statement one, let's assume the opposite: assume there exist indices i and j with i < j, so xi comes before xj in the sorting we introduced, for which the claim does not hold, i.e. the length of the code word for xi is strictly smaller than the length of the code word for xj, l(xi) < l(xj). Then we can make two observations. First, since i is smaller than j (let me scroll up briefly: xi might be x1 and xj might be x3, say), the chain of inequalities in our sorting tells us that the probability of the earlier symbol is smaller than or equal to the probability of the later one, so p(xi) ≤ p(xj). Now, can they actually be equal? If p(xi) = p(xj), then all the probabilities in between have to be equal as well, so there is a tie, and we broke ties by enforcing that l(xi) ≥ l(xj). But under our assumption that is not the case. So we conclude that equality does not hold, and combining the two observations we find that p(xi) is strictly smaller than p(xj).

So these are the two important facts we found: the probabilities satisfy p(xi) < p(xj), and by assumption the code word lengths satisfy l(xi) < l(xj). And my claim is that therefore the symbol code cannot be optimal: thus C is not optimal.
Why is that the case? Well, if the symbol with strictly lower probability has the shorter code word than the symbol with higher probability, then why wouldn't we just swap these two code words? Swapping two code words doesn't change the fact that we still have a prefix code, but it certainly reduces the expected code word length, because we are then assigning the shorter code word to the symbol that actually occurs more frequently. That is certainly better than assigning it to the symbol that occurs less frequently. So C is not optimal, because we could swap the code word for xi with the code word for xj and thereby reduce the expected code word length.
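To make the swap argument explicit, here is the change in expected code word length if we exchange the code words of xi and xj (same notation as above, with p(xi) < p(xj) and l(xi) < l(xj)):

```latex
\Delta \mathbb{E}[\ell]
  \;=\; \bigl[p(x_i)\,\ell(x_j) + p(x_j)\,\ell(x_i)\bigr]
      - \bigl[p(x_i)\,\ell(x_i) + p(x_j)\,\ell(x_j)\bigr]
  \;=\; \bigl(p(x_j) - p(x_i)\bigr)\bigl(\ell(x_i) - \ell(x_j)\bigr)
  \;<\; 0
```

The first factor is strictly positive and the second strictly negative, so the swap strictly reduces the expected code word length.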
All right, so that proves part one of the lemma: we assumed that C is an optimal prefix code, and we showed (if I may scroll up again to look at the lemma) that if the statement we are trying to prove does not hold, then C cannot be an optimal prefix code.

For statement two we can take statement one as given, because we have just proven it, and we have to show that the first inequality is in fact an equality. Let's again assume the opposite, which in this case is that l(x1) is not equal to l(x2). We already know it is greater than or equal (let me scroll up again: we have already shown this part), so if they are not equal, the only possibility is that l(x1) is strictly larger than l(x2). So let's assume that. We also know from part one that l(x2) is greater than or equal to the lengths of all code words that follow it, so l(x2) ≥ l(x') for all x' that are not x1, including, trivially, itself. Therefore, with the assumption that l(x1) is strictly larger than l(x2), which itself is greater than or equal to all other code word lengths, we conclude that l(x1) is strictly larger than l(x') for all x' that are not x1.

And then, again, C is not an optimal prefix code. Why? Simply because we could drop the last bit of the code word C(x1) (this should be an index 1, not i: C(x1)). Since C(x1) is the only code word of that length, the only way dropping that bit could break the prefix-code property is if the shortened word became equal to, or a prefix of, some other code word, or if some other code word became a prefix of it. Say the code word C(x1) is 0 1 1 0 1 and you drop the last bit, the final 1. If there were some other code word C(x'), say 0 1 1, that is a prefix of the shortened word, then it certainly was already a prefix of the full C(x1); even without dropping the bit it would have been a prefix, and we wouldn't have had a prefix code to begin with. So no clash can occur, because C is a prefix code by assumption.

But this operation, dropping the last bit, reduces the expected code word length by precisely one bit with weight p(x1), which by our non-triviality conditions is strictly greater than zero. So it reduces the expected code word length without violating the conditions of a prefix code, and therefore the original C could not have been an optimal prefix code.

Let's briefly take stock of where we are. We still want to prove theorem 2, which states, in a nutshell, that Huffman coding is optimal. On the way there we have now proven this first lemma, with its two statements: if you sort your symbols first by increasing probability, and then, in case of ties, by decreasing code word length, then in the resulting order the first two code words have equal length (the code words are not equal, but their lengths are), and after that the code word lengths never increase; they can stay constant for a while and then decrease, but they never go up. In particular, there are always two symbols with the lowest probabilities whose code words are longest and have equal length.

We're now going to use this lemma to prove one more lemma, the last one before we get to the proof of the full theorem. Lemma 2 makes a slightly more involved statement. Again we assume that the non-triviality conditions (*) hold: the alphabet has size at least two and all symbols have non-zero probability. And again let C be an optimal prefix code, optimal with respect to this probability distribution p. The statement is then: there exists a pair x, x' in the alphabet, a genuine pair, so the two are not the same symbol, whose code word lengths are equal and greater than or equal to all other code word lengths. So far that shouldn't surprise you, because that is exactly what lemma 1 said: you can always find two symbols, which we called x1 and x2, whose code words have equal length and whose lengths are greater than or equal to all other code word lengths. But lemma 2 says that you can impose an additional condition on this pair: if you have a prefix code, you can choose the pair such that their code words C(x) and C(x') differ only in the last bit.

This is a somewhat lengthy statement, so let me break it into two parts. First there is the part that a pair with longest, equal-length code words exists, which we have essentially already proven; let me call that part (triangle). And then there is the part that says that, additionally, you can choose the pair such that their code words differ only in the last bit; I'll call that part (square). I'm giving them names because we will refer to them a couple of times.

So how do we prove lemma 2? Again by assuming the opposite: assume that no such pair exists. It's the same strategy as before: we assume the pair does not exist and then show that C cannot be an optimal prefix code.
So if "such a pair does not exist", what exactly doesn't exist? We know that a pair satisfying the first condition, (triangle), always exists; we've already seen that in lemma 1. So from lemma 1 we know there exists a pair x ≠ x' that satisfies (triangle). The claim is: if such a pair exists, but no such pair exists whose code words differ only in the last bit, then again C is not optimal. This is not so easy to see, so we're going to walk through it.

Why is C not optimal? Because we can take either of the code words C(x) or C(x'), let's just take C(x), and drop its last bit (I could equally well have taken C(x'); we make no real distinction between the two). The claim is that after dropping that last bit, we still have a prefix code; we have not violated the conditions of a prefix code. This may not be obvious, so let's be a bit more thorough: this is now a proof within the proof, a proof of this claim. First let's give the shortened word a name: let gamma be C(x) with the last bit dropped. To show that the operation still yields a prefix code, we have to show two things: that gamma is not a prefix of any other code word, and that no other code word is a prefix of gamma. So consider any other symbol x̃ that is not x, and let's show the second part first.

We know that C(x̃) is not a prefix of C(x), simply because C is a prefix code by assumption. Now, could it be a prefix of gamma? Think of C(x) as some sequence of bits, say 0 1 1 0 1 0, and remember that gamma results from taking C(x) and dropping the last bit, the final 0. If C(x̃) were a prefix of gamma, say it were 0 1 1 0, then it would certainly also be a prefix of C(x), because C(x) is just gamma extended by one bit. So if C(x̃) were a prefix of gamma, it would also be a prefix of C(x), and we know that is not the case. Therefore C(x̃) is not a prefix of gamma.

That's half of what we have to prove. What about the other direction? We have to show that gamma is not a prefix of any other code word C(x̃). Well, if gamma were a prefix of C(x̃), then the length of C(x̃) would obviously have to be greater than or equal to the length of gamma; otherwise gamma couldn't be a prefix of it. Now, can they have the same length?
Well, no, they can't. If they had the same length and gamma were a prefix of C(x̃), then the two bit strings of equal length would simply be equal, and then C(x̃) would in turn also be a prefix of gamma. But we've already shown that C(x̃) is not a prefix of gamma. So if gamma is a prefix of C(x̃), they cannot have the same length; equality does not hold, and C(x̃) is strictly longer than gamma.

But now remember that gamma resulted from taking C(x), which was a longest code word, and dropping one bit. If C(x̃) is strictly longer than gamma, it can only be longer by exactly one bit, and it has to be a longest code word. Thus C(x̃) is a longest code word; I say "a" longest code word because there can be more than one, but no other code word is longer than it. So what does that mean? We now have a longest code word C(x̃), with x̃ different from x, and C(x) is also a longest code word, and we know that if we drop one bit from C(x), the result, gamma, is a prefix of C(x̃). What does this look like? Take the same example: C(x) is 0 1 1 0 1 0, you drop one bit, and the result is a prefix of C(x̃), which is exactly one bit longer. So C(x̃) has to start with 0 1 1 0 1, and its last bit can't be 0, because it has to be a different code word from C(x), so it has to be 1. That is the situation we're looking at: gamma is a prefix of C(x̃), which is exactly one bit longer. But then exactly the case occurs that we assumed does not happen: C(x̃) and C(x) are two longest code words that differ only in the last bit. Let me scroll up to the lemma we wanted to show, just to remind you: this is exactly condition (square), which we assumed does not hold for any pair. We assumed no such pair of code words exists, but now we have found one (we gave the symbols different names, but it is a pair whose code words are both among the longest and differ only in the last bit, i.e. they satisfy (square) as well as (triangle)), and that is a contradiction. So gamma cannot in fact be a prefix of any other code word, the shortened code is still a prefix code, and dropping the bit reduces the expected code word length by p(x) > 0, so C would not be optimal; either way, the assumption that no pair satisfies (square) cannot stand. This concludes the proof of lemma 2.

So let's briefly recap and see where we are. We are still in the process of proving theorem 2, which basically states that Huffman coding constructs an optimal symbol code. On the way there we have now proven two lemmas. Lemma 1 says that if you sort the symbols by ascending probability and break ties by code word length, then you always find two code words of equal and longest length.
And lemma 2 (I've only abbreviated it here) says something more: in lemma 1 there could be more than just two code words of equal, longest length, and lemma 2 states that among the set of all code words of longest length in the code book, there is always a pair whose code words differ only in the last bit.

We're now going to use this to prove theorem 2, that Huffman coding is optimal. That means: for any optimal uniquely decodable symbol code there exists a Huffman code with the same code word lengths, and we've already seen that for the proof it suffices to consider only prefix codes, because once we know the statement holds for prefix codes, we know it holds for all uniquely decodable symbol codes. So we're going to prove that for all optimal prefix codes there exists a Huffman code with the same code word lengths.

So let's finally come to the proof we've been waiting for, the proof of theorem 2, which as a reminder is the optimality of Huffman coding. The proof works by induction on the size of the alphabet, i.e. the number of symbols in our alphabet. The base case is simple: since our conditions (*) always assume that the alphabet size is at least two, the base case is alphabet size equal to two. That case is trivial, because if the alphabet has two symbols, there exist really only two optimal prefix codes. Let's call the symbols A and B (letters rather than numbers, so we don't confuse them with code words): one optimal prefix code assigns the code word 0 to A and 1 to B, and the other assigns 1 to A and 0 to B. And these are obviously already Huffman codes: you take your symbols A and B, construct the tree, and you can either label this branch 0 and that branch 1, or label them the other way around. Both resulting codes are Huffman codes, so there obviously exists a Huffman code with the same code word lengths, namely the code itself.

The more interesting part is the induction step. Here we assume that the alphabet size is strictly larger than two (strictly larger, not larger or equal), and we assume that the theorem already holds for all smaller alphabet sizes; in fact we only need that it holds for an alphabet that is smaller by one symbol, and we show that it then also holds for the current alphabet. To do this, we first apply lemma 2: from lemma 2 we know that there exist two symbols x ≠ x' whose code words are among the longest and differ only in the last bit. Now there is a slight complication, because we would also like to say that these two symbols have the two lowest probabilities, which is where lemma 1 comes in.
But remember (let me scroll up again to recap): lemma 1 only states that if you sort by probability, the two lowest-probability symbols have longest code words. There could be more than two longest code words, though; the code word for x3 could also be a longest one. And then, when lemma 2 says that some two longest code words differ only in the last bit, it is not guaranteed that this applies to x1 and x2; it could apply to x2 and x3, if those are also longest code words. We, however, would like the pair to consist of the two lowest-probability symbols. And remember that in theorem 2 we are not claiming that the given optimal prefix code is already a Huffman code; we are only claiming that there exists a Huffman code with the same code word lengths. So what we may have to do is reorder some code words of the same length.

In the simplest case, the two longest code words that differ only in the last bit already belong to the two lowest-probability symbols; they must have low probability anyway, otherwise they wouldn't have longest code words in an optimal symbol code. But there could be other symbols with code words of the same length, and those could have even lower probability. If that happens, we reorder some of the code words. Concretely: if p(x) and p(x'), the probabilities of our pair, aren't the two lowest probabilities, then we apply lemma 1, which says that there exist symbols, call them x1 and x2 because they come from our sorting, that do have the two lowest probabilities (that's how we sorted them) and whose code words are also among the longest. We then construct a new prefix code C' by swapping the following two pairs of code words: on the one hand C(x) and C(x'), the pair we got from lemma 2, which have longest code word length and differ only in the last bit; on the other hand C(x1) and C(x2), which also have the longest code word length (about this pair we only know the lengths, not that they differ in just the last bit). We know that all four of these code words have the same, namely the longest, length. So C' assigns the code word C(x) to x1, C(x') to x2, C(x1) to x, and C(x2) to x'. If one of the symbols happens to appear in both pairs, that doesn't matter; that part of the swap is simply a no-op. Swapping doesn't change the expected code word length; in fact it doesn't change the code word length l(x) of any symbol x in the alphabet, because all the swapped code words have the same length. But additionally, about the pair C(x), C(x'), the one we got from lemma 2, we know something about the code words themselves: they differ only in the last bit.
And about the pair x1, x2 we know something about the probabilities: p(x1) and p(x2) are the two lowest probabilities. The swap only changes which code word each symbol gets; it doesn't change the probabilities of the symbols. So that property stays with x1 and x2, and the last-bit property moves over with the code words. Thus, in the new code book C', we have a pair x1, x2 that satisfies both: p(x1) and p(x2) are the lowest probabilities, and the code words C'(x1) and C'(x2) are among the longest and differ only in the last bit.

All right, that was a lot of work, but note that we have constructed a new code book with exactly the same code word lengths, so we haven't lost anything: we are only trying to prove that there is a Huffman code book with the same code word lengths, and we haven't changed them. What we have gained is a code book in which there is a pair of symbols with the two lowest probabilities whose code words have equal, longest length and differ only in the last bit. Why does that help? Because these two code words can now be thought of as resulting from one step of the Huffman coding algorithm. That lets us reduce the alphabet size by one, which is what we want, because we can assume that our theorem already holds for the smaller alphabet size.

So let's define a few things. Define a new alphabet X̃, which is the old alphabet without x1 and x2, but with one new symbol added in their place, which I'm just going to call ⋆; it's just some new symbol that isn't already in the alphabet. That means the size of X̃ is the size of X minus one, which is still at least two, because we started from an alphabet of size strictly larger than two. So we can assume that our theorem already holds for alphabets of this size. Now, to apply the theorem, we also need a probability distribution on this new alphabet; call it p̃. To any element of the new alphabet it assigns the following probability: if the element is one of the original symbols, it just gets its original probability, and if it is the new symbol ⋆, it gets p(x1) + p(x2), the total probability of the two symbols we took out. You can easily convince yourself that this new probability distribution is still normalized: if you sum over all symbols in the new alphabet, you get all the probabilities of the old distribution except those of x1 and x2, but you get their sum back when you evaluate the probability of the new symbol.
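Written out compactly, these are the same definitions as just described, with X̃ for the reduced alphabet and p̃ for the reduced distribution:

```latex
\tilde{\mathcal{X}} \;=\; \bigl(\mathcal{X} \setminus \{x_1, x_2\}\bigr) \cup \{\star\},
\qquad
\tilde{p}(\tilde{x}) \;=\;
\begin{cases}
p(\tilde{x}) & \text{if } \tilde{x} \in \mathcal{X} \setminus \{x_1, x_2\},\\[2pt]
p(x_1) + p(x_2) & \text{if } \tilde{x} = \star .
\end{cases}
```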
Finally, we define a symbol code C̃ that assigns a code word to every symbol in this new alphabet, and the code words are what you would probably expect: if the symbol is one of the original symbols, we just use the code word it had before, more precisely in C', after our reordering; and if the symbol is ⋆, we take C'(x1) with the last bit dropped. (I'm writing C'(x1) here, but I could equally well have written C'(x2), because C'(x1) and C'(x2) differ only in the last bit, so after dropping the last bit it doesn't matter which one we start from.)

Now the claim is that this new C̃ is an optimal prefix code, and since it operates on an alphabet that is smaller by one, this will let us close the induction step. So let me write that out as a claim: C̃ is an optimal prefix code. And now it becomes clear why I was always so pedantic about what I mean by optimal: this one is, of course, optimal with respect to p̃, the probability distribution defined on its own alphabet. Let's give an argument for this claim. If C̃ were not optimal, then there would exist a better prefix code; that's just what "not optimal" means. (By the way, the fact that C̃ is a prefix code at all should be clear: C' was a prefix code, and we only dropped one bit from a pair of code words that differed only in the last bit. If that introduced a clash, then, similarly to what we proved before, the code word it clashes with would already have been a prefix of one of the original code words.) So if C̃ were not an optimal prefix code, there would exist a better prefix code on the new alphabet X̃, call it C tilde tilde, better with respect to p̃. But then I can invert the step above: I can construct a symbol code C'' on X, the original alphabet, by removing the symbol ⋆ from the alphabet, reintroducing the symbols x1 and x2, and giving them the code words C''(x1) equal to the code word of ⋆ under C tilde tilde concatenated with 0, and C''(x2) equal to that same code word concatenated with 1; basically just the opposite of what we did above.

Now let's track the expected code word lengths. The operation above, going from C' to C̃, reduces the expected code word length by one bit weighted by the probability of the symbol ⋆, which is p(x1) + p(x2), because that is the one bit we drop. The inverse operation adds one bit back to these two symbols, so it increases the expected length by p(x1) + p(x2) again. So, using the same notation for expected code word lengths throughout: we started from C', dropped one bit from C'(x1) and C'(x2), and that led us to C̃, where C' is defined on X and C̃ is defined on X̃.
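Here is that bookkeeping written compactly, as we are about to walk through it; L, L', L̃, L tilde tilde, and L'' denote the expected code word lengths of C, C', C̃, the hypothetical better code, and the reconstructed code C''.

```latex
L' = L, \qquad
\tilde{L} = L' - \bigl(p(x_1)+p(x_2)\bigr), \qquad
L'' = \tilde{\tilde{L}} + \bigl(p(x_1)+p(x_2)\bigr),
\\[4pt]
\tilde{\tilde{L}} < \tilde{L}
\;\Longrightarrow\;
L'' \;=\; \tilde{\tilde{L}} + p(x_1) + p(x_2)
\;<\; \tilde{L} + p(x_1) + p(x_2)
\;=\; L' \;=\; L .
```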
So what does that mean for the expected code word lengths? C' has expected code word length L', which is the same as the expected length L of the original code, because we only swapped code words of equal length. And the expected code word length L̃ of C̃ is L' minus p(x1) + p(x2), because dropping that one bit reduces the expected length by exactly that amount. The assumption we want to disprove is that C̃ is not optimal, i.e. that there exists a better code book C tilde tilde on X̃; the assumption is that its expected length, L tilde tilde, is strictly smaller than L̃, because it is better. But then we can invert the process and get a code C'' defined on X again, where we append one bit: we split the code word of ⋆ into the two code words for x1 and x2 by appending a 0 and a 1 to it. This increases the expected code word length again, so L'' equals L tilde tilde plus p(x1) + p(x2). Let's follow this through: L tilde tilde is, by assumption, strictly smaller than L̃, so L'' = L tilde tilde + p(x1) + p(x2) < L̃ + p(x1) + p(x2) = L'. So in total L'' < L' = L, which means that the code book C we started from was not optimal, and that is a contradiction.

All right, so with this loop we have now proven (let me scroll up again) that the C̃ we constructed is indeed an optimal prefix code. Therefore, since C̃ is optimal and operates on the smaller alphabet, we can apply our theorem by the induction hypothesis, and we know that C̃ has the same code word lengths as some Huffman code. And notice what these definitions really did: we took the two lowest-probability symbols out of C' and contracted them into a single symbol, which is exactly the first merging step of the Huffman coding algorithm. So: C̃ is an optimal prefix code on the alphabet X̃ of size |X| minus one; our theorem applies because it is a smaller alphabet; hence there exists a Huffman code on X̃ with the same code word length as C̃ for every symbol. Undoing the contraction, which is nothing but the first step of the Huffman coding algorithm on X, turns this into a Huffman code on X with the same code word lengths as C'. And since C and C' have the same code word lengths, C has the same code word lengths as a Huffman code on X, which is exactly what we wanted to prove.

All right, let me scroll up one last time, just so we can let the theorem we've just proven sink in.
So we have proven that, in short, Huffman coding constructs an optimal symbol code. And in fact we proved even more: for every optimal uniquely decodable symbol code, not only can we construct a Huffman code with the same expected code word length, we can construct a Huffman code with exactly the same code word length for each symbol. But in short: Huffman coding is optimal. This was really a milestone in compression theory, and Huffman coding is still used very widely today, even though it is only an optimal symbol code and not an optimal lossless compression method in general.

This proof concludes our treatment of symbol codes in this lecture. In the next video we're going to take a step back from coding theory. What we've learned is that all these coding algorithms, whether Huffman coding or Shannon coding, need a probability distribution over the symbols in order to construct a code, and the theoretical bounds likewise only make sense with respect to one: you can only compress data if you have a probabilistic model of your data source. But so far the probabilistic models we've used have been extremely simplistic. We've assumed that we have a sequence of symbols that are what's called i.i.d., independent and identically distributed. In reality that's obviously not the case: if you want to compress an image, the pixels are obviously not i.i.d., they are strongly correlated. So starting with the next video we will begin thinking about better probabilistic models of our data source. We will see that we have to strike a balance between how much these probabilistic models can capture about the data source and whether they become prohibitively expensive. And we will see that an important aspect these models have to capture is the correlations between symbols. Interestingly, that will tie back into compression theory: we will show theoretically how much taking correlations into account can help you, and we will also see that there are different ways to model correlations efficiently, and that depending on how you model them, you need different source coding algorithms to actually use these models for compression. To be more precise, for some ways of modeling correlations you need a specific source coding algorithm that wouldn't work for other ways; for each way of modeling correlations you can only use certain source coding algorithms, so you have to choose your source coding algorithm depending on how you model the correlations in your data. But that's for future videos. See you in the next video.