In this video, we'll introduce our first class of lossless compression codes, so-called symbol codes. This is the second video of the course on data compression with and without deep probabilistic models. You can find a link to a playlist with all videos from this course down in the video description. There are also links to lecture notes, problem sets, and solutions. In the last video, we discussed the general problem setup of communication over a channel. We stated that, according to the source-channel separation theorem, data compression, or source coding, can be separated from error correction, or channel coding. In this course, we'll focus on source coding, and in this video, we'll start with lossless compression, that is, data compression where the receiver is able to reconstruct the original message without any errors or distortion. There are two main approaches to lossless compression. First, so-called symbol codes are conceptually simple, but we'll see later in the course that they achieve suboptimal bit rates, in particular when they are used in modern machine learning-based compression methods. So-called stream codes, on the other hand, achieve very close to optimal bit rates without sacrificing computational efficiency, but they are a bit more difficult to understand. So, to keep things simple, we'll start in this video with symbol codes and we'll defer stream codes to lectures 5 and 6. Before we can define what a symbol code is, let's formalize the problem setup of lossless compression. It'll get a bit formal in a second, so in order to not lose track of the goal, you might want to keep a typical practical example in your head. Say you have a data source that generates ASCII-encoded English-language text of variable length, and both the sender and the receiver know ahead of time that that's the type of data that they want to communicate. Then, once sender and receiver have agreed on a compression method to use for this kind of data, the sender picks some specific message, say a short word, and the sender is now tasked with encoding this message in some way so that she can send it over some noise-free channel that, let's say, can only transmit bit strings. Let's generalize and formalize this problem setup. We'll denote by boldface x a message, and we assume that the message is a variable-length sequence of so-called symbols. That's summarized in this blue highlighted equation, and here the gothic letter X is called the alphabet. The alphabet is a discrete set of all possible symbols, and it is known to both the sender and the receiver before the communication starts. In our example above, the alphabet would be the set of ASCII characters. To be more precise, when I say that X is a discrete set, I mean that X is either a finite set or a countably infinite set, and we'll discuss why countability is important on a later problem set. Next, k(x) is an integer that denotes the length of a given message x, and the symbols x_i for i from 1 to k are the symbols that make up the message. The length k may be different for different messages, and we denote the countably infinite set of all possible variable-length messages as X*, so this is the union of the product spaces X^k for all non-negative integers k. X* is called the Kleene closure of X, and the star is sometimes called the Kleene star. Finally, we assume that we have a channel that transmits variable-length bit strings.
For completeness, we allow for general b-ary channels, but in practice we'll mostly assume that b is 2, so the channel transmits strings of conventional bits that can be either 0 or 1. So let's now assume that we have such a discrete alphabet, a discrete message x from the Kleene closure of the alphabet, and a noise-free b-ary channel. And let's now formalize the goal of lossless compression. In order to encode the message x, the sender needs a function that maps from the message space X* to the space of variable-length bit strings. This mapping must be injective, that is, invertible, so that different messages get encoded into different bit strings, and there are no ambiguities when the receiver decodes the bit string. And a good compression method should in general create bit strings that are short, and we'll make this more precise towards the end of this video. We'll call a function from the message space to bit strings that satisfies these two properties a lossless compression code, or just code for short. Now the simplest type of lossless compression codes are so-called symbol codes. What we do in a symbol code is that we break up the message into its symbols x_i, where i indexes the position within the message, and we map each symbol to some bit string C(x_i), which we call the codeword of x_i. We denote the length of a codeword, that is, the number of bits that make up the codeword, by |C(x_i)|, and we then do something very naive. We simply concatenate these codewords for all symbols in the message, and we don't bother about introducing any delimiters between codewords. So the length of the compressed message, which is also called the bit rate, is just the sum of the lengths of the codewords of all symbols that make up the message. Since symbol codes are conceptually so simple, you see them all over the place. The most obvious example of a symbol code that comes to mind is Morse code, which was used in the early days of radio communication. Morse code defines a mapping from letters and numbers to sequences of short and long beeps, which are usually notated as dots and dashes. So you might think that Morse code is a symbol code for a binary channel, because you can think of dots and dashes as zeros and ones. But that's not quite correct, because Morse code actually does introduce delimiters between letters and words, and symbol codes do not account for delimiters. So if you want to interpret Morse code as a symbol code, you have to consider the channel as accepting three, or arguably four, different characters, namely dot, dash, and space. And one might argue that the convention even defines two different types of spaces. Another example of symbol codes is UTF-8, which is the preferred character encoding for the vast majority of text files and websites these days, because it allows you to encode any character defined by Unicode while still resulting in reasonably short encodings most of the time. So here our alphabet is the set of Unicode characters, or more precisely Unicode code points, and I'd argue that the base b is 256, because each symbol is encoded to a sequence of bytes. But don't get a wrong impression, not every compression method is a symbol code. In fact, if you took chemistry classes in high school, then you've probably already used something that one might consider a compression code, but that is not a symbol code.
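Before we get to that counterexample, here is a minimal sketch in Python of what a symbol code does (the codebook and all names here are illustrative, not from the video): the codebook maps each symbol of the alphabet to a codeword, encoding is plain concatenation with no delimiters, and the bit rate is the sum of the codeword lengths.

```python
# A minimal sketch of a binary symbol code. The codebook maps each
# symbol of the alphabet to a codeword; encoding a message is plain
# concatenation of codewords, with no delimiters in between.

codebook = {"a": "0", "b": "10", "c": "11"}  # a hypothetical codebook

def encode(message, codebook):
    """Concatenate the codewords of all symbols in the message."""
    return "".join(codebook[symbol] for symbol in message)

def bit_rate(message, codebook):
    """The length of the compressed message is the sum of the
    codeword lengths of all symbols that make up the message."""
    return sum(len(codebook[symbol]) for symbol in message)

print(encode("abac", codebook))    # "010011"
print(bit_rate("abac", codebook))  # 6
```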
Chemists have various ways of specifying molecules, among others these detailed structural formulas on the left that look like little doodles, and these condensed formulas on the right. Notice that the condensed formulas are shorter, but there's no one-to-one correspondence between constituents of a condensed formula and constituents of the corresponding structural formula. For example, for ethane, we can imagine that the first C in the structural formula gets encoded, so to speak, to a C in the condensed formula. But then the second C in the structural formula gets encoded to something different, namely the subscript 2 in the condensed formula. Coincidentally, a somewhat similar technique, where one identifies repetitions on the fly, actually comes up as the first step in the so-called deflate algorithm, which is used in popular general-purpose compression methods like ZIP and gzip, and also in PNG. So these were a few examples and one counterexample of symbol codes that appear in the wild. To keep the following discussion simple, we're now going to introduce a much smaller toy example that I'll call the simplified game of Monopoly. We consider messages that are sequences of symbols from a finite alphabet, and we assume that the data source generates each symbol by throwing a pair of fair dice and recording the sum, like in a game of Monopoly. In the real game of Monopoly, one can throw 11 different numbers from 2 to 12, and writing out huge tables for all these possibilities gets old very soon. So to keep things simpler, we'll instead assume that the dice are only three-sided, so that our alphabet X only contains the integers from 2 to 6. If we want to specify a binary symbol code for this data source, then we have to define a codeword for each symbol x in the alphabet. A naive codebook might simply take the number x and write it out in binary, as I'm showing here. But you can easily convince yourself that this is not a good idea, because while the codewords are all different, you can run into ambiguities once you concatenate more than one codeword. For example, if you encode the message that consists of the two symbols 2 and 6, then its encoding, C1*(2, 6), is the bit string 10 for the symbol 2, immediately followed by the bit string 110 for the symbol 6. If you transmit this concatenated bit string over a channel, then the receiver might decode it as the correct original message (2, 6), but it might also read it as the codeword 101 followed by the codeword 10, and decode it into the wrong message (5, 2). When a symbol code has ambiguities like this, then we say that the symbol code is not uniquely decodable. A simple way how we could fix this is if we just made all codewords the same length, for example by padding with leading zero bits. I'll call the resulting symbol code C2. This symbol code is uniquely decodable, but it has longer codewords. Already in this simple toy example, it turns out that if you choose the codewords carefully, you can make some of them shorter than others and still end up with a uniquely decodable symbol code. You'll look at a few examples of this on problem set 0, which is linked in the video description. So let's formalize what we've just seen. We say that a symbol code C is uniquely decodable if encoding sequences of symbols cannot lead to ambiguities. Mathematically, this means that the corresponding code C* is injective, i.e., invertible.
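For short messages, we can search for such ambiguities by brute force. The following sketch (my own illustration, not code from the course) uses the naive codebook C1 from above and looks for two different messages that C* maps to the same bit string; any hit, such as (2, 6) and (5, 2), proves that C1 is not uniquely decodable.

```python
from itertools import product

# The naive codebook C1 from the video: each symbol written out in binary.
C1 = {2: "10", 3: "11", 4: "100", 5: "101", 6: "110"}

def encode(message, codebook):
    """Concatenate the codewords of all symbols, without delimiters."""
    return "".join(codebook[symbol] for symbol in message)

# Brute-force search over all messages of up to three symbols for two
# different messages that encode to the same bit string. Any such
# collision proves that C1 is not uniquely decodable.
seen = {}
for length in range(1, 4):
    for message in product(C1, repeat=length):
        bits = encode(message, C1)
        if bits in seen and seen[bits] != message:
            print(f"{seen[bits]} and {message} both encode to {bits}")
        seen.setdefault(bits, message)
```

Among other collisions, this prints that (2, 6) and (5, 2) both encode to 10110, exactly the ambiguity described above.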
We will indeed focus on uniquely decodable symbol codes in the following, because only they are useful for lossless compression. Unfortunately, however, it can be quite difficult in general to analyze whether a given code is uniquely decodable, and you'll see this already on the problem set. This is because unique decodability is really a property of C*, so you have to prove that, among all the infinitely many arbitrary-length messages, there are no two messages that C* maps to the same bit string. Fortunately, there's a property of symbol codes that's easier to analyze, and we'll see that it gets us just as far in some sense. We say that a symbol code is prefix-free if no codeword is a prefix of another codeword. So more formally, for any two distinct symbols x and x' in the alphabet, the codeword for one symbol does not begin with the codeword for the other symbol. Prefix-free symbol codes are also called prefix codes, even though it kind of sounds like that would mean the opposite. The nice thing about prefix-freeness is that it is a property of the codebook C rather than of the code C*, and this makes it much easier to check whether a symbol code is prefix-free. You only have to compare all pairs of codewords rather than all pairs of encoded messages. And you'll also see on problem set 0 that prefix codes are easier to decode than non-prefix codes. So on the one hand, unique decodability is the minimal requirement that we have for a symbol code to be useful for lossless compression. But on the other hand, prefix-freeness is easier to prove, and prefix codes are also easier to work with. Luckily, we can reduce one to the other. You'll first argue on problem set 0 that all prefix-free codes are automatically uniquely decodable, so they definitely satisfy the minimal requirement to be useful. Then in the next video, we'll prove a less obvious theorem that states that for every uniquely decodable symbol code, there exists a prefix code that is just as good in every regard. So there is no reason to use a symbol code that is not a prefix code. Let's now look again at our examples for the simplified game of Monopoly from the last slide. The obvious question that you might ask yourself is, which one of these five symbol codes is the best one? You'll find on the first problem set that all except the first one are uniquely decodable, so they are all in principle usable. You'll further find that the last code is not a prefix code, so maybe we want to avoid this one to simplify decoding. Also, the code C2 has longer codewords than the other two remaining codes, so let's avoid this one too. But for codes C3 and C4, it's not that clear. Both are prefix-free, and both have two codewords of length three and three codewords of length two, just for different symbols. What do you think? Which one of these should we rather use in practice? To answer this question, let's remind ourselves of what it is that we want to achieve. We discussed the general problem of communication over a channel in the last video, and we said that source coding, which is the part that deals with the data compression, requires a probabilistic model of the data source. So let's introduce a probabilistic model for the simplified game of Monopoly.
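Before we bring probabilities into the picture, here is a quick aside: a minimal sketch of the pairwise prefix-freeness check described above (my own illustration, not code from the course). Note that it only ever compares codewords against each other, never whole encoded messages.

```python
def is_prefix_free(codebook):
    """Return True if no codeword is a prefix of another codeword."""
    words = list(codebook.values())
    for i, w in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(w):
                return False  # w is a prefix of v
    return True

# C1 and C2 as described in the video: C1 writes each symbol out in
# binary; C2 pads all codewords to equal length with leading zeros.
C1 = {2: "10", 3: "11", 4: "100", 5: "101", 6: "110"}
C2 = {2: "010", 3: "011", 4: "100", 5: "101", 6: "110"}
print(is_prefix_free(C1))  # False: "10" is a prefix of "100" and "101"
print(is_prefix_free(C2))  # True: all codewords have the same length
```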
We said that each symbol is drawn randomly by throwing a pair of three-sided dice, so there are three times three equals nine equally probable outcomes of the dice throw, and we can read off the probability of each symbol simply by counting how many different outcomes lead to it. And now it's clear that the symbol code C3 is better than C4, because it assigns the three shorter codewords to those symbols that appear with higher probability. To formalize this argument, we can for example look at the so-called expected codeword length, which I'll denote by capital L. That is, we average the codeword lengths over all symbols, weighted by the symbol probabilities: L = sum over all symbols x of p(x) |C(x)|. This is not the only well-motivated metric, but it's the simplest and most commonly used one. And if you do the math, then you'll find that C3 indeed has a lower expected codeword length than C4, so we'd prefer C3. In fact, it turns out that C3 is an optimal symbol code for the simplified game of Monopoly, in the sense that it achieves the smallest possible expected codeword length. And to wrap up this video, I'll now briefly show you how I came up with this optimal symbol code. I used a famous algorithm that is called Huffman coding and that is widely used in practice. This algorithm is our first example of a so-called entropy coder, that is, an algorithm that takes as input a probabilistic model of a data source and that outputs a lossless compression code that is optimized for this probabilistic model. The Huffman coding algorithm builds a binary tree whose leaves are the symbols in the alphabet. Here's how it works for the simplified game of Monopoly. We start by writing out the symbols, and here I'll already sort them in a way that will later turn out to be convenient. The sorting is not crucial for the algorithm, it will just make the resulting tree look nicer, but it won't affect the resulting codebook. Next, we write below each symbol the probability of the symbol. And then we keep repeating the following step. We identify the two vertices with lowest weight, for now that's the two symbols with lowest probability, and we introduce a new vertex in the binary tree that becomes the parent of these two vertices. The weight of the new vertex is the sum of the weights of its children, and we cross out the weights of the children. Then we repeat. So we again identify the two vertices with lowest weight, and we now exclude anything that's crossed out, but we also include the new vertex that we introduced in the previous step. In this particular example, there's now a tie because there are in fact three vertices with lowest weight. You'll think more about ties in Huffman coding in problem 1.2 on problem set 1, which is linked in the video description, but the gist of it is that we can break the tie arbitrarily. I'll just pick the symbols 3 and 5, and I'll introduce a new parent vertex whose weight is the sum of its children's weights, and I'll cross out the weights of the children. Let's repeat again. The two lowest vertices now have weights 2/9 and 3/9, so the new parent vertex has weight 5/9, and we cross out the children's weights. Finally, there are only two vertices left, so they are trivially the ones with lowest weight, so we introduce the root of the Huffman tree, which always has a weight of 1 because the probabilities of all symbols have to sum to 1. We have now constructed a binary tree whose leaves are the symbols in the alphabet. To turn this tree into a symbol code, we'll now interpret this tree as a trie.
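As an aside, here is a compact sketch of the tree-building phase we just walked through, using a priority queue (my own illustration, not code from the course; since ties and the 0/1 edge labels can be broken arbitrarily, the exact codewords may differ from the ones on the slides, but the codeword lengths, and hence the expected codeword length, come out the same). It also already reads off the codewords by walking from the root to each leaf, which is exactly the trie interpretation discussed next.

```python
import heapq
from fractions import Fraction

# Symbol probabilities for the simplified game of Monopoly:
# two fair three-sided dice, alphabet {2, ..., 6}.
probs = {2: Fraction(1, 9), 3: Fraction(2, 9), 4: Fraction(3, 9),
         5: Fraction(2, 9), 6: Fraction(1, 9)}

# Each heap entry is (weight, tie-break counter, tree), where a tree is
# either a bare symbol (a leaf) or a pair of subtrees (an inner vertex).
heap = [(p, i, symbol) for i, (symbol, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    # Merge the two vertices with lowest weight under a new parent
    # vertex whose weight is the sum of the weights of its children.
    w1, _, left = heapq.heappop(heap)
    w2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (w1 + w2, counter, (left, right)))
    counter += 1
root = heap[0][2]  # the root always has weight 1

def codewords(tree, prefix=""):
    """Read off codewords by walking from the root to each leaf,
    labeling left edges with "0" and right edges with "1"."""
    if not isinstance(tree, tuple):  # a leaf, i.e., a symbol
        return {tree: prefix}
    left, right = tree
    return {**codewords(left, prefix + "0"),
            **codewords(right, prefix + "1")}

code = codewords(root)
L = sum(p * len(code[symbol]) for symbol, p in probs.items())
print(code)  # symbols 3, 4, 5 get 2-bit codewords; 2 and 6 get 3 bits
print(L)     # expected codeword length: 20/9, about 2.22 bits per symbol
```

With these probabilities, the expected codeword length comes out to 20/9 bits per symbol, matching the codeword lengths of the optimal code C3 from the slides.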
Don't worry if you don't remember what a trie is, it's very simple. Each non-leaf vertex in a binary tree has two children that are connected to the vertex by two edges, and we arbitrarily label one of these edges with a 0 and the other one with a 1. Then we read off the codeword for each symbol simply by walking along the unique path from the root to the corresponding leaf and picking up all the edge labels along the way. For example, to walk from the root to the symbol 6, we have to go along an edge with label 0, then an edge with label 1, and then another edge with label 1, so the codeword is 011. And you can do the same for each symbol to arrive at an entire codebook for this alphabet. It should be a simple exercise to convince yourself that any codebook that is constructed like this is prefix-free. That's not even anything special about the Huffman algorithm. It's just a property of a trie that no leaf's codeword can be a prefix of any other leaf's codeword, because all proper prefixes are associated with non-leaf vertices. What's not at all obvious is that this Huffman algorithm always constructs an optimal symbol code, that is, a symbol code that minimizes the expected codeword length. We'll prove this in lecture 3, but before we do so, I strongly recommend that you have a look at problem set 1, where you'll explore some more basic properties of Huffman codes. Remember that you can also download solutions to the problem sets in case you get stuck. Then in the next two videos, we'll first take a step back from the specific Huffman coding algorithm, and we'll ask how large the optimal expected codeword length actually is. It turns out that there's a simple mathematical expression from which we can calculate the optimal expected codeword length of a symbol code, or, more generally, even the optimal expected bit rate of any lossless compression code for a given data source. The simple mathematical expression that we'll find in the next two videos leaves room for a small amount of uncertainty, but on the plus side, it is differentiable, because it doesn't require us to explicitly construct an optimal code using some non-differentiable algorithm like Huffman coding. In fact, having such a differentiable expression for the optimal expected bit rate opens the door to modern compression methods that use machine learning models of the data source. In the subsequent videos and problem sets, we'll construct powerful probabilistic machine learning models that involve deep neural networks, and then we'll train these models by directly minimizing our simple differentiable expression for the expected bit rate of a corresponding compression method. In other words, rest assured, we'll do some exciting machine learning applications soon. That's it for now. Have fun with the problem sets, and see you in the next video.