Welcome back to the course on data compression with deep probabilistic models. In this video, you'll learn about a new lossless compression method called asymmetric numeral systems, or ANS for short. This method was presented only in 2015, and it will be our first example of a so-called stream code: a lossless compression method that can outperform even an optimal symbol code. This will be a rather long video because it covers material from two weeks of the course, so get some popcorn ready and enjoy the ride.

One reason I'm so excited about presenting the ANS algorithm to you is that it is a really powerful method: it achieves essentially optimal compression performance while at the same time being the fastest stream code we currently know. Another reason is that this algorithm illustrates very well how important all the information-theoretical concepts are that we've covered so far in the course. When we look at the final algorithm at the end of this video, I bet you'll be surprised by how simple it actually is. The core logic of ANS can be implemented in just a handful of lines of code. And yet it took the compression community until 2015 to come up with it. So while ANS is almost trivial to implement, it was not at all trivial to invent, and you'll see that a lot of information-theoretical concepts are needed to arrive at it. So in this video, rather than just presenting the final algorithm, I'll walk you through some of the steps that may have played a role in coming up with it. I hope this will help you appreciate how important all the information-theoretical concepts are that we've covered so far in this course.

So let's dive in. We're currently at a very exciting inflection point in this course. So far, we've introduced a lot of theoretical concepts, and now we've acquired enough background to finally reap the fruits of our labor and use these concepts to build bigger things on top of them. Before I go deeper into stream codes, let me briefly recap where we are, and then I'll give you a brief preview of how we are going to use what we've learned so far in this and the upcoming videos.

If you've been following along with the course, you may have noticed that we've been going back and forth between information theory and source coding theory on the one hand, and probabilistic models and machine learning on the other. The reason is that these two things really go hand in hand when you think about compression. We started the course with the general setup of communication over a noisy channel, and we saw that this setup can be split into two parts: source coding and channel coding. We then looked deeper into source coding and studied our first class of very simple source coding algorithms, called symbol codes, with two particular examples: Shannon coding and Huffman coding. We saw that Huffman coding always leads to an optimal symbol code, and that Shannon coding may have some overhead, but we were able to bound this overhead. We also saw that Shannon coding is a lot simpler than Huffman coding, and that simplicity allowed us to derive theoretical bounds for symbol codes.
We could then take these theoretical bounds for symbol codes and apply them to settings that were somewhat artificial, namely the block codes we discussed. These block codes were not useful in practice, but they allowed us to derive theoretical bounds for lossless compression in general. From these bounds, we saw that in order to do compression, or source coding, you always need a model of your data source; that's why it's called source coding. Without such a model, you cannot make any statements about theoretical bounds on compression performance, and you cannot derive any good compression method.

That motivated us to look deeper into probability theory, because these models of the data source have to be probabilistic models. We discussed in particular the concepts of random variables and entropy. Then we saw that good probabilistic models need to capture correlations between the symbols in your message, and that capturing correlations can be computationally difficult. So an important challenge in compression is to come up with a model that can capture the relevant correlations efficiently.

We then discussed several methods for capturing relevant correlations. The first method was autoregressive models. On problem set 2, in problem 3.2, you implemented your first deep-learning-based compression method, which was built on an autoregressive model. In that example, you saw that autoregressive models are indeed capable of capturing correlations between symbols, but they have some problems: first, they cannot capture very long-range correlations well, and second, they cannot be parallelized, which means they run very slowly on modern hardware. That motivated an alternative way of capturing correlations between symbols: latent variable models. By marginalizing over some latent variables, we can also generate, or mediate, correlations between symbols. But then we saw that using these latent variable models for compression is challenging, and we had to come up with a new compression method that can deal with them in an efficient and effective way. That was bits-back coding, which we discussed in the last video and which you implemented on the last problem set. We saw that in order to do bits-back coding, you have to do what's called Bayesian inference, which is basically inverting a probabilistic model.

So that's where we are now, and we've introduced enough concepts to build more interesting, higher-level things on top of them. In this and the upcoming videos, we will build both source coding algorithms and machine learning methods on these foundations. On the machine learning side, the first problem we will tackle is that while Bayesian inference can always be done in principle, in practice, if you have big models, such as models parameterized by deep neural networks, exact Bayesian inference is typically computationally infeasible. Therefore, you have to come up with approximate methods that remain feasible within a realistic computational budget.
And that's exactly what we will do in one of the upcoming videos: we will introduce the concept of approximate Bayesian inference. We will see that one of these methods, so-called variational inference, can be motivated directly by minimizing the net bitrate of bits-back coding, so it is very well motivated from a compression perspective. We will then use this method to perform inference in really deep-learning-based models, which will lead us to deep latent variable models, that is, latent variable models generalized to deep neural networks. An important class of these are variational autoencoders, or VAEs for short. That's the machine learning side.

On the source coding side, we will start today by discussing so-called stream codes, and the first stream code we will discuss is asymmetric numeral systems, or ANS for short. Stream codes are in some sense an alternative to symbol codes, but they can reach the fundamental lower bound of lossless compression more closely than symbol codes can; they have a lower overhead. We will see that we can think of asymmetric numeral systems as an application of bits-back coding. At the same time, you may remember that when you implemented bits-back coding on the last problem set, you already used asymmetric numeral systems through a library. So not only is ANS an application of bits-back coding, it is also what you would typically use for bits-back coding; there's a kind of cyclic dependency here. Then in the next video, we will discuss an alternative kind of stream code, called arithmetic coding, along with a variant of it called range coding. There we will see that you can think of arithmetic coding and range coding as basically a way of doing Shannon coding. So we'll come back to this idea of Shannon coding, except that arithmetic coding finds a way of doing Shannon coding on a really huge alphabet without sacrificing computational efficiency. That's what we will discuss in the next video.

Once we've covered these important topics in both source coding theory and machine learning, we will be able to combine the two and talk about how these concepts are applied in machine learning research today to build the next generation of neural-network-based compression codecs: neural compression. You will work on this in your group projects, and you'll also hear from experts in both academia and industry about what they think the future will bring in this field. With this brief preview, I hope I've convinced you that there are still exciting things to come, but also that all the theoretical background we've built so far will really be important for understanding everything that builds on top of it.

So let's now jump into stream codes. To understand what stream codes do, we have to remind ourselves of the theoretical bounds that we've derived. As a reminder, we actually derived two pairs of theoretical bounds for lossless compression. The first pair of bounds concerns symbol codes: for an optimal symbol code, we derived bounds on the expected codeword length, which is what we denoted by capital L in the course.
This expected codeword length can also be viewed as an expected bitrate per symbol, and it is never smaller than the entropy of a symbol under the probability distribution of the data source. For an optimal symbol code, we showed that we can always reach this lower bound within less than one bit of overhead. It is important to note that this is per symbol: we have an overhead of at most one bit per symbol. We derived this upper bound by explicitly constructing a symbol code, the Shannon code, for which we could prove that it has less than one bit of overhead per symbol.

The second pair of bounds came from taking these symbol codes and applying them, at least in theory, to a setup where we consider the entire message as a single symbol. That won't work in practice; as we discussed, it would have exponential runtime and would be prohibitively expensive, but it allows us to derive theoretical bounds. For any lossless compression method, we derived that the expected bitrate of the entire message is never smaller than the entropy of the entire message. And again, at least in theory, we can construct a lossless compression algorithm that reaches this entropy lower bound within less than one bit of overhead, but now the overhead is per message. These are really important theoretical bounds that will matter throughout the rest of the lecture.

But I already explained that while we can actually reach the bounds of optimal symbol codes in practice, the per-message upper bound is hard to achieve in practice; it is exponentially expensive. Let me state that as a problem: constructing a practical code that has less than one bit of overhead per message, that is, satisfying the per-message upper bound, is prohibitively expensive.

Now, is this really a problem? Can't we just use symbol codes and live with the one-bit-per-symbol overhead? It turns out that for many traditional compression methods, this is not that big a deal, because the entropy per symbol is high enough that one additional bit per symbol doesn't matter much. Another way to think about it: these traditional methods often encode a message in only very few symbols. This is a sweeping generalization, but just to give you an idea: they map messages to very few symbols, so the entropy per symbol is quite high, and one bit of overhead per symbol is not a big deal. But in machine-learning-based compression, the entropy per symbol, H of one symbol, is often much smaller than one bit. For example, in work I've been involved in, we developed compression methods for images that were state of the art at the time we published them. In these methods, the symbols were latent variables in a deep probabilistic model, and their entropy per symbol was much smaller than one bit over a wide range of quality settings.
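To keep them handy, here is a compact restatement of the two pairs of bounds recalled above; this summary block is mine, with L denoting the expected codeword length per symbol and R the bitrate of an entire message X = (X1, ..., Xk), following the course notation:

```latex
% Symbol codes, per symbol: achievable by Huffman; upper bound shown via the Shannon code
H(X_i) \;\le\; L \;<\; H(X_i) + 1
\qquad \text{(at most 1 bit of overhead \emph{per symbol})}

% Any lossless code, per message: theoretical construction only
H(\mathbf{X}) \;\le\; \mathbb{E}[R(\mathbf{X})] \;<\; H(\mathbf{X}) + 1
\qquad \text{(at most 1 bit of overhead \emph{per message})}
```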
So if you use a symbol code in this regime, you always need at least one bit to encode any symbol, and you'd have a very high overhead: symbol codes have high overhead here. If you're having trouble picturing what an entropy much smaller than one bit means, I encourage you to think about a toy example, the bent coin; this is more of a side remark. Imagine you flip a coin, but it's not a normal coin: it has been manipulated so that the probability of it coming up heads is some value alpha, where alpha lies in the open interval from 0 to 1, and the probability of tails is therefore 1 minus alpha. Then you can just write down the entropy per symbol, which is alpha times the negative logarithm to base 2 of alpha, plus 1 minus alpha times the negative logarithm to base 2 of 1 minus alpha. Since this is such an important quantity, it is often denoted h2 of alpha, where the 2 stands for the two states of the coin, heads and tails.

If you plot h2 as a function of alpha, you'll see the following. If alpha is one half, the entropy is 1 bit, because that's just a regular coin: it's either heads or tails, one bit of entropy. The curve is symmetric, because the equation is symmetric, and obviously it makes no difference which side we call heads and which we call tails. And it goes to 0 at both ends of the interval, because if the coin is so bent that it always yields the same result, there's no information in any outcome; you can predict the outcome without even flipping the coin. And you can see, for example, that for alpha equal to 0.1, so if you already know with 90% certainty that the coin will come up tails, h2 of alpha is something like 0.47 bits (a quick numerical check follows below). So if you used a symbol code, which always needs at least one bit for any codeword, you'd have more than a factor of 2 overhead: the true entropy is less than half a bit, but you need at least one bit to encode each symbol with any symbol code. And this is exactly the regime in which a lot of machine-learning-based compression methods operate.

So we want to get rid of this one-bit-per-symbol overhead of optimal symbol codes. We want to derive some lossless compression method that comes closer to the overhead of one bit per message, or at least closer to it than symbol codes. At the same time, we don't want to sacrifice computational efficiency. Let me state that as a formal goal. Our goal is twofold: we want to alleviate the overhead of symbol codes, while maintaining computational efficiency. And let's be bold here: we want to maintain linear computational cost, where k is the length of the message. That is, the cost of compressing a message of k symbols should grow only linearly in k, which is the best you can do, because you have to look at every symbol at least once.

And this is exactly what stream codes do. There are various stream codes, but they all follow the same strategy, and that is to amortize.
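As a quick numerical check of the bent-coin side remark, here is a minimal sketch of the binary entropy function; the name h2 simply follows the notation above:

```python
import math

def h2(alpha):
    # Binary entropy of a bent coin with P(heads) = alpha, in bits.
    return -alpha * math.log2(alpha) - (1 - alpha) * math.log2(1 - alpha)

print(h2(0.5))  # 1.0 bit: a fair coin
print(h2(0.1))  # ~0.47 bits: less than half a bit per flip
```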
You may have heard the word amortized in your algorithms and data structures lecture, where you amortize computational cost over operations. Here we're going to amortize something different: we're going to amortize the bitrate, so compressed bits, over symbols. That means we will no longer assign a codeword to each symbol, and when we look at the compressed bit string, we will no longer be able to say which symbol each bit corresponds to, because each bit will in a sense take care of several symbols. The information content will be spread out over the bits. We will see this in an example very soon, but let me just state it here: there are no codewords anymore.

We will see how this works in real stream codes in a minute, but I want to give you some intuition first. Let's say you have a message that consists of three symbols, x1, x2, and x3. We have a model of the data source, so each symbol has an information content, and we can plot these information contents. Let's introduce a scale: say the distance between two of these gray lines is one bit of information content, so this picture has room for four bits. Let's say the first symbol in our message has an information content of 0.4 bits, the second symbol has an information content of 1.1 bits, and the third has an information content of 0.3 bits. Summing them up, you get a total information content of 1.8 bits.

Now, what would a symbol code do? I'm going to use Shannon coding as the example here. (A Huffman code might give slightly better bitrates, depending on the exact probability distribution, by distributing the rounding overhead across symbols differently, but the rough idea is the same.) A symbol code takes these information contents and assigns each symbol in our message a codeword whose length is an integer number of bits. In Shannon coding in particular, you take the information content and round it up to the nearest integer. So the first bit here is completely reserved for the first symbol. Then you take the second symbol, round its information content up to an integer number of bits, and encode it in a codeword of length 2 bits. For the third symbol, you again round up and get a codeword of length 1 bit. In total, you get a bitrate of 4 bits: the compressed message has length 4 bits, more than twice as long as the lower bound says it needs to be.

What stream codes do instead, and here the details differ from stream code to stream code, is roughly this: they take these information contents and pack them as closely as possible. They take the first symbol, then pack the second symbol neatly against it, and the third symbol follows right after.
At the very end, you still need to seal the stream, so you still have to round up to the nearest integer, but in total you now have a message length of only 2 bits (the short computation after this paragraph makes the rounding difference explicit). How precisely this works depends on the specific stream code, and this picture is a simplification. In the stream code that we're going to discuss today, the information contents will actually get somewhat scrambled: we may end up cutting the information content of, say, the second symbol into two halves, with the first half ending up at the beginning of the total stream and the second half more towards the end; it really depends on how you set some parameters. But generally, the idea is that you pack the information contents of all the symbols as closely as possible. That means that if you look, for example, at the first bit here, you can no longer say that this bit corresponds to a specific symbol: there exists no single symbol that this bit is responsible for. And that's what I mean when I say amortization.

All right, this was just a quick preview, so I don't expect you to understand how exactly this process works at this point. But I think it's very instructive to keep this picture in mind when you think about the difference between symbol codes and stream codes. Graphically speaking, symbol codes assign an integer number of bits to each symbol, whereas stream codes try to pack the information contents as closely as possible. At the same time, they do it in such a way that you can still construct these bit streams efficiently: you process one symbol at a time and update the state of your stream coder for each symbol; you don't have to look at all the symbols at once and solve some complicated optimization.

In this and the following video, we will discuss two different kinds of stream codes. So what are some important stream codes? In this video, we will discuss a stream code that was proposed very recently, called asymmetric numeral systems, or ANS for short. One reason I want to discuss this method first is that, at the moment at least, you will find less material about it, since it's a relatively recent method, but I think it's important to know because it's much more efficient than the methods that were proposed before it. It was proposed, to give you the reference, by Duda and collaborators in 2015. It is actually the simplest stream code to implement, and it directly builds on the bits-back coding algorithm that we discussed in the last video. The other group of important stream codes is arithmetic coding, together with a variant called range coding. These were developed at the same time, apparently independently, by Rissanen in 1976 and in the same year by Pasco. Arithmetic coding and range coding are almost the same algorithm, basically the same algorithm with a different base. I believe there were some legal issues about patents, which may be why you sometimes hear the method referred to as arithmetic coding and sometimes as range coding.

One important difference between these two families is that asymmetric numeral systems operates as a stack, that is, a last-in-first-out data structure.
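To make the packing picture concrete, here is the arithmetic for the toy example above, comparing per-symbol rounding (Shannon coding) with rounding up only once at the very end (the idealized stream code):

```python
import math

info = [0.4, 1.1, 0.3]  # information contents of the three symbols, in bits

# Symbol code (Shannon): round each information content up separately.
print(sum(math.ceil(i) for i in info))  # 1 + 2 + 1 = 4 bits

# Idealized stream code: pack first, round up once to seal the stream.
print(math.ceil(sum(info)))             # ceil(1.8) = 2 bits
```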
And that means that it is useful for bits-back coding: if you want to do bits-back coding, I suggest you look at asymmetric numeral systems. You probably want to use this algorithm if the probabilistic model of your data source is a latent variable model. Arithmetic coding and range coding, by contrast, operate as a queue, that is, a first-in-first-out data structure. Therefore, they are useful if you have an autoregressive model, as in the problem set where you implemented a compression algorithm for natural language. With an autoregressive model, it is easier to implement a compression method if the encoder and the decoder can process the data in the same direction.

All right, this was an overview of stream codes. Let's now dive in and actually look at asymmetric numeral systems. So what is this asymmetric numeral systems algorithm, ANS for short? Here I'll use a slightly unusual strategy to present it: I'll mostly let you derive the algorithm on your own, and I'll guide you along the way. By deriving the algorithm yourself, you will practice the theoretical concepts we've learned so far in this course, and I believe it will also help you understand how the algorithm works. It will turn out that once you've derived the full algorithm, its core logic can be implemented in just a few lines of code. But if you just looked at the final code without the derivation, I believe it would be hard to understand why it actually works.

Before you derive this algorithm on your own, how should you think about it? Well, it's called asymmetric numeral systems, and that has a reason: ANS generalizes positional numeral systems. Now, what are positional numeral systems? That's just a fancy way of referring to, for example, the decimal system (as opposed to, say, Roman numerals, which are much more complicated), or the binary system. Both of these are positional numeral systems, and ANS generalizes them; we will see in a second how it does that.

So I'd like to encourage you now to have a look at the following exercises that I've prepared for you. The first exercise is about the decimal system. The setup is a data source that generates finite-length sequences of symbols, as we've been used to so far, and the probabilistic model of the data source is very simple: it generates each symbol independently, the symbols come from a finite alphabet that contains just the digits from 0 to 9 inclusive, and the probability distribution for each symbol is uniform. That means any digit from 0 to 9 appears with equal probability, independent of all the other symbols that you've already seen or may see in the future.

The first two questions here are questions that you should be able to answer just by applying things you've learned in the lecture, and they should be really simple.
If you have trouble answering these questions, I really encourage you to go back: for question (a), to the video that recaps probability theory, and for question (b), to the video that proves the optimality of Huffman coding. The first question is: what is the entropy of each symbol in our message? The second question asks: if you have an optimal symbol code, so a code that assigns a codeword to each symbol, what is its expected codeword length? There's no catch here; these questions really just apply what you've learned in the course so far.

For the third question, question (c), I want you to think a bit outside the box, and I actually encourage you to ignore what you've learned so far in the course. Forget everything you've learned here. The question is: can you do better than the optimal symbol code? In part (b), you derived the expected codeword length of an optimal symbol code, and part (c) asks whether you can do better than that. Again, forget anything you've learned so far in the course; in particular, if you start thinking about things like Huffman coding or Shannon coding or block codes, you're already thinking in much too difficult terms. The answer I'm looking for is really very simple. I encourage you to first describe your approach, which should outperform symbol codes, just in words; you should be able to describe it in one or two sentences. But then I also encourage you to implement your approach in Python or in some pseudocode, because implementing it will reveal an important property of this method. It should be a very simple method, something like four lines of code with no special function calls, just ordinary arithmetic. If you get stuck, there are some hints; you can ignore them if you already know the answer, but if you get stuck, they may point you in the right direction.

Finally, in part (d), now that you've come up with this better method in part (c), the question is: what is the expected bitrate of this method? It should be better than the expected codeword length of a symbol code, but what exactly is it? For simplicity, ignore any effect of having to round up to a full integer number of bits at the very end; I'm only interested in the expected bitrate for long messages, where any final rounding plays no role. Then I want you to compare these results to parts (a) and (b). Obviously, this expected bitrate per symbol cannot be lower than the entropy per symbol, because this is still a lossless compression method, so it cannot beat the entropy. But it should be better than the expected codeword length of an optimal symbol code, and you should calculate that explicitly. That's exercise one. Then I added another exercise, which is actually extremely simple once you've solved exercise one: do exactly the same thing, just now for a ternary system.
So now the alphabet of symbols is no longer the digits from 0 to 9 but the digits 0, 1, 2, and the question is just: what do you have to change to arrive at the same results? So pause the video at this point, try to answer all these questions, and then start it again and compare with my results.

All right, here's what I arrive at. Part (a): what is the entropy per symbol in this simple example? You really just have to calculate it. Since all symbols are distributed with the same uniform probability distribution, it doesn't matter which symbol we look at. The entropy is the expectation of the information content, where the information content is the negative log base 2 of the probability of the symbol. We take the expectation over all symbols in the alphabet from 0 to 9, but the probability is always the same, one tenth, because they are uniformly distributed. So it's just the expectation of the constant minus log2 of one tenth, and the expectation of a constant is the constant itself, so we can drop the expectation. Using the properties of the logarithm, the negative logarithm of one over something is just the logarithm of that thing, so we get log2 of 10, which, if you plug it into a calculator, is about 3.32 bits.

Part (b): what is the expected codeword length of an optimal symbol code? To calculate that, you first have to construct an optimal symbol code, and we know how to do this: we use Huffman coding. So let's construct a tree for the ten symbols. There are various ways to build it, but they all lead to the same expected codeword length. If you run the Huffman coding algorithm, the first step always gives you a forest of five independent trees, which you then have to combine; one valid way of combining them gives the tree I've drawn here, with the root at the bottom. The codeword length of each symbol is then the distance of its leaf node from the root: six of the leaves are at depth 3, and the four leaves further from the root are at depth 4. Averaging over all of them, the expected codeword length is 3.4 bits (a short Huffman sketch below verifies this number). This makes sense: it's not lower than the entropy, but it's within one bit of overhead, exactly what we would expect. So far, this should have been just an application of what you've learned in the course.

Now, how can you do better than an optimal symbol code? With an optimal symbol code, you need 3.4 bits per symbol in expectation; how can you reduce that? Well, I already gave you a lot of hints. When I introduced asymmetric numeral systems, I said that they generalize positional numeral systems such as the decimal system, and the title of the problem was "decimal system". Also, I could have labeled the symbols with letters like A, B, C, up to J, but instead I chose labels that are actually the digits from 0 to 9.
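If you want to check part (b) numerically, here is a small heap-based Huffman sketch; this helper is my own, not code from the course. For ten equiprobable symbols it yields six codewords of length 3 and four of length 4, hence 3.4 bits on average:

```python
import heapq

def huffman_lengths(probs):
    # Repeatedly merge the two least probable subtrees; each merge adds
    # one bit to the codeword length of every symbol inside them.
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, syms1 = heapq.heappop(heap)
        p2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, syms1 + syms2))
    return lengths

lengths = huffman_lengths([0.1] * 10)
print(sorted(lengths))                # [3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
print(sum(0.1 * l for l in lengths))  # expected codeword length: ~3.4 bits
```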
With all of these hints, you hopefully arrived at the idea: the sequences of symbols here are just sequences of digits. Why not simply concatenate them, read off the result as a number, and then convert that number into binary? So you interpret a sequence of symbols, say 3, 2, 7, 5, 6, as a decimal number and convert it into binary, which results in some bit stream. There's only one slight complication: if you don't know the length of the sequence, there can be clashes. For example, another sequence with a leading zero followed by the same symbols would lead to the same bit stream under this naive approach. One very simple way to overcome this is to always prepend a 1, which is introduced artificially. Then the two bit strings encode different numbers, and when you decode, you keep reading off digits until you end up with just the 1, which you then ignore. It should be very easy to see why this always works, that is, why decoding recovers exactly the sequence of digits that was encoded.

It should also be clear why this is an optimal algorithm: it is a bijective mapping from the sequences of symbols to the bit strings, and since all sequences of symbols have the same probability, there's nothing better you can do. When we discuss part (d), we will actually calculate the expected bitrate, and we will see that it is indeed optimal, in the sense that the expected bitrate per symbol is the entropy, log2 of 10, about 3.32 bits. But you shouldn't really need to calculate this to see it, and there's a problem on the current problem set that discusses this in a more general setup: if you have a bijective mapping from your data source to all bit strings, and the distribution is uniform, there's nothing better you can do.

So let's now actually implement this. Through most of the rest of this video, I will keep this Jupyter notebook open on the right-hand side, and I will implement the things we're talking about in Python. This may seem very trivial at this point, because we're just implementing how to parse a decimal number, which shouldn't be too hard. But we will see that we can keep adding extra tricks on top of this very simple implementation, and with every additional trick we will come closer to the full asymmetric numeral systems algorithm, until at the end we really arrive at an ANS entropy coder, a very-close-to-optimal method for compressing data.

So let's first implement this very simple coder for decimal numbers. We want to implement a function called encode that takes a message and maps it to a bit stream.
At the very end, it will return some compressed representation, and this compressed representation will just be a number. In a computer, numbers are represented in binary, so we can interpret the returned number as a bit stream; we will also think in a second about more efficient ways to represent these numbers if you have very large messages. As we've seen here on the left-hand side, we always have to prepend the message with a leading 1 digit so that we can detect leading zeros. So we initialize our compressed representation with 1, and then we append digits as they come in. We iterate over the message, and for each symbol we update the compressed representation: first we multiply it by 10, to make room for the next digit, and then we add in the digit. And that should be it.

How do we decode a compressed representation? We just invert these steps: we start from the compressed representation and shave off symbols until it is just 1. So while compressed is not equal to 1, we invert the encoding steps. First, we have to extract the symbol: the rest of the number is a multiple of 10, so the symbol is just the remainder when we divide by 10, that is, compressed modulo 10. Since we're going to return several symbols, we yield them; yield is a Python statement that is basically like return, but you can yield several times from the same function. Then we have to invert the multiplication: compressed becomes compressed integer-divided by 10, rounding down, which is written with a double slash in Python, or in shorter notation, compressed //= 10. We don't return anything; we just yield elements.

Let's try this out and encode some messages. (By the way, there's still a bug here that we'll see in a second.) Let's encode, for example, the message 3, 8, 7, just a sequence of symbols. We get out the number 1387: the digits 387 prepended with a 1, exactly as we expected. If we decode that 1387, we get a generator, because we yield elements, so we have to collect it into a list to see what we get. And we see we get the same symbols back, but in reverse order. That's one reason I wanted you to write this out: if you write such a simple decoder for the decimal system, the easiest way to implement it emits the symbols in reverse order. If we don't want to get them back in reverse, one simple way to deal with that is to encode them in reverse in the first place, so we iterate over the message in reverse order when encoding. If we do that, the number we get from encoding is 1783, the digits the other way around, and if we decode it, we get the original digits out in the right order. So we can also write decode of encode of 3, 8, 7 (the complete functions are collected in the sketch below).
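For reference, here is a runnable sketch of the two functions narrated above, already including the base parameter that is introduced next; the names encode, decode, and compressed follow the notebook narration:

```python
def encode(message, base=10):
    # Start from a sentinel 1 so that leading zeros survive; iterate in
    # reverse so that decoding emits the symbols in their original order.
    compressed = 1
    for symbol in reversed(message):
        compressed = compressed * base + symbol
    return compressed

def decode(compressed, base=10):
    # Shave off one digit at a time until only the sentinel 1 remains.
    while compressed != 1:
        yield compressed % base
        compressed //= base

print(encode([3, 8, 7]))                # 1783
print(list(decode(encode([3, 8, 7]))))  # [3, 8, 7]
print(f"{encode([3, 8, 7]):b}")         # the compressed bit string
```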
And that's exactly what we expected. Just as a hint: if you want to look at the actual bit string, you can format the output in binary in Python using format strings with a b flag for binary, and you'll get out a bit string. It's actually interesting to look at this now, so let's print it for various messages. Let's print it for our message, and then let's change some of the symbols; that will let us see exactly what I meant by amortization. In the first example, we keep everything else the same but change the last symbol from a 7 to a 6. In the third example, we again start from the original message but change the second symbol from an 8 to a 9.

What's the result? For our original message, we get some bit string. If we change the last symbol from a 7 to a 6, quite a few of these bits change: these two bits changed, and also this bit changed. All of those bits apparently have something to do with the last symbol, since they changed when we changed only the last symbol. But now look at the other direction, where we again start from the original message and change only the second symbol: you'll see that, for example, this same bit also changes. So in some sense, this bit corresponds to both the last symbol and the second symbol. That's exactly what I meant by amortization: you can no longer attribute a single codeword to a group of bits, as we could do with symbol codes. We really amortize the compressed bit string over all the symbols, and we'll see that this is what allows us to outperform symbol codes.

Right, before we calculate the bitrate, let's have a brief look at exercise two, because we can modify our example very easily to also handle a ternary system. If you look at the encode and decode functions, the only thing that has to change for a ternary system is that the 10 becomes a 3. So instead of hard-coding that, I'll introduce a new parameter that I'll call base, add it to both functions, and replace every 10 with base. Then you always have to provide the base, which in our example so far is 10, both for encoding and for decoding. Everything still works, but now you can do the same with base 3; of course, you then have to make sure that all your symbols are actually valid base-3 digits, maybe something like this. You should get the same symbols out that you put in, and this works very naturally. So it's very easy to change the base.

All right, with this implementation, let's go back to the left-hand side and briefly discuss part (d). As I mentioned, what we really do in each encoding step is multiply the compressed representation by the base. If you think of it in log space, when you multiply a number by something, the length of that number grows, roughly speaking, by the logarithm of the base.
So in each step, we increase the length by the logarithm of 10, which, here on the left-hand side, was exactly the entropy per symbol. If that already convinces you, you can skip the formal derivation, but for completeness, let me also derive it more formally.

So, the solution to part (d). The easiest way to think about this is to calculate an upper and a lower bound on the compressed bitrate, for any message, without even thinking about averages. Let's first think about the smallest possible number we can arrive at. The smallest possible number occurs when all the symbols following the initial obligatory 1 are 0s. What is that number? We have k zeros following a 1, so it's just 10 to the k. What is the length of this number encoded in binary? The length is the logarithm base 2 and, to be precise, you'd have to round that up to an integer; but since we're interested in the limit of large k, this rounding won't play a role, so I'll just capture it with some epsilon smaller than 1. So the bitrate of this method for any message x is at least log2 of 10 to the k, which, using the properties of the logarithm, we can write as k times log2 of 10, plus epsilon. If we take the limit of large k and look at the bitrate per symbol, R of x divided by k, that is at least log2 of 10 plus epsilon over k, and since epsilon is bounded by 1, the second term goes to 0. So for large k, the bitrate per symbol is at least log2 of 10, which, if you remember, was exactly the entropy per symbol. That is the lower bound.

We're probably more interested in the upper bound. What is the largest number we could end up with? The largest number is a 1 followed by k nines, which we can write more formally as 2 times 10 to the k, minus 1, because a 1 followed by k nines is one smaller than a 2 followed by k zeros. The bitrate is the log2 of that, plus some small epsilon. Dropping the minus 1, this is smaller than log2 of 2 times 10 to the k, plus epsilon, which is log2 of 2, that is 1, plus k times log2 of 10, plus epsilon. Taking the limit again, the bitrate per symbol is at most log2 of 10, the entropy per symbol, plus 1 plus epsilon over k, and since epsilon is bounded by 1, this again goes to just log2 of 10 as k goes to infinity, which again is the entropy.

So in the limit of long messages, which is exactly what we were interested in for this question, both the upper and the lower bound converge to the entropy, and therefore the bitrate per symbol itself converges to the entropy per symbol. This method is optimal. That was a very formal derivation of why converting from the decimal system to the binary system is the best thing you can do if all sequences of decimal symbols have the same probability.
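You can also watch both bounds converge numerically; this little check (my own, relying on Python's arbitrary-precision integers) reproduces the limit of log2 of 10, about 3.32 bits per symbol:

```python
# Bit lengths of the smallest (1 followed by k zeros) and largest
# (1 followed by k nines) possible compressed numbers, per symbol.
for k in (10, 100, 1000, 10000):
    lo = (10**k).bit_length()          # lower bound on the bitrate
    hi = (2 * 10**k - 1).bit_length()  # upper bound on the bitrate
    print(k, lo / k, hi / k)           # both approach log2(10) ~ 3.3219
```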
But really, the simpler argument is: what else could you possibly do better than a bijective mapping into the binary numbers if all your input data has the same probability? Again, this is discussed in a more formal setup in the first problem of the current problem set, which is linked in the video description.

Let us now use what we've learned from this very simple exercise and draw some conclusions. What have we observed? A couple of things. First, this conversion between different positional numeral systems operates as a stack, a last-in-first-out data structure. Second, it amortizes compressed bits over symbols: that's what we saw on the right-hand side when we looked at which bits change if we change only a single symbol, and we saw that different changes sometimes affect the same bits. This amortization tells us that the method goes beyond what symbol codes can do. It obviously doesn't prove that it's better than symbol codes, but it tells us that it could be, and in fact we saw that it does compress a sequence of symbols optimally.

But it compresses optimally only under certain conditions, so it is a very limited method. The symbols have to satisfy three very restrictive properties. First, they all have to come from the same alphabet: in our two examples, all of them were digits from 0 to 9, or all of them were digits from 0 to 2; we haven't compressed symbols from mixed alphabets in the same message. Second, they have to be uniformly distributed over this alphabet, meaning that every symbol from the alphabet has the same probability of appearing in a message at any given point. And third, they have to be statistically independent: we were not able to capture any correlations, so our model of the data source doesn't capture correlations between symbols. In other words, the symbols we've compressed are i.i.d., and on top of that, their distribution is the very simplest one, a uniform distribution. These are obviously very strong constraints, and in the rest of this video, we will gradually remove them.

So let's remove these constraints, and let's first look at constraint one: that the symbols are all from the same alphabet. Is that really necessary? All the examples we've looked at so far used symbols from a single alphabet: here, all symbols were decimal digits; here, all symbols were ternary digits. But is that really necessary? Let's just try to encode symbols with mixed alphabets. Our current functions cannot do this, because they expect a single base. So let's rewrite the encoder as a class that encapsulates some state, where the state is just the compressed message encoded so far. So let's write a class Coder, which has some constructor.
The constructor takes no arguments and just initializes the compressed message to 1; the state lives in self.compressed, since it's encapsulated by the coder. Then we write a method encode which, instead of taking a whole message, takes just a single symbol, so we no longer iterate over symbols. It sets self.compressed to self.compressed times base, where base is another parameter it takes, plus the symbol, just as before. It doesn't return anything, because it mutates self. Finally, we write the decoder. It also takes a self argument so that it can mutate and access self.compressed. What does the decode method do? It doesn't need to take a compressed message, because that's already included in self, but it takes a base. It decodes a single symbol: it stores the symbol in a temporary variable, divides self.compressed by the base, and returns the symbol.

Does this still work on the example data we've looked at so far? Let's try it out. We've looked, for example, at the message 3, 8, 7 in the decimal system, so let's try that. We construct a new Coder and encode the symbols in reverse order: coder.encode of 7 with base 10, and the same for the symbols 8 and 3. Then, just for the fun of it, we print the compressed message, coder.compressed, as a binary string, to see that in principle we could get at the compressed bits. Then we decode it: we call coder.decode, where we only have to provide the base 10, three times, printing each result. What do we get? This is the compressed message, and then we get 3, 8, 7 when we decode, exactly the message we expected. So our Coder class works; nothing new so far.

But now we can try something new: we can encode a message where we change the base mid-stream. We'll again encode the symbols 7, 8, 3 with base 10, but then we encode a symbol 1 with base 3, a symbol 0 with base 3, and then maybe another symbol 5 with base 10. Let's just see what happens. Now we have to decode these symbols, again in reverse order: once with base 10, then twice with base 3, then another three times with base 10. We expect to get out the symbols we encoded, in reverse order, so we expect 5, 0, 1, 3, 8, 7. Let's see if that's true: 501387, it worked precisely. (A compact sketch of this Coder class follows below.)

This was just an example, but if you look at the code, I hope you'll realize that it should be obvious why this works: the decode method simply inverts the encode method. If you call encode and then decode, the coder is afterwards in the same state as before, no matter what that initial state was. So this always works, regardless of which bases were used before. Going back to our notes on the left-hand side, we can simply observe that this is not really a constraint: mixing symbols with different alphabets just works. Now that we see it, it may seem somewhat trivial, but I find it quite remarkable. We use the decimal system, and therefore positional systems, all the time.
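Here is a compact sketch of the stateful Coder class narrated above, together with the mixed-base example; the class and method names follow the notebook narration:

```python
class Coder:
    def __init__(self):
        self.compressed = 1  # sentinel leading 1, as before

    def encode(self, symbol, base):
        # Append one digit in the given base; mutates the internal state.
        self.compressed = self.compressed * base + symbol

    def decode(self, base):
        # Shave off one digit in the given base and return it.
        symbol = self.compressed % base
        self.compressed //= base
        return symbol

coder = Coder()
for symbol, base in [(7, 10), (8, 10), (3, 10), (1, 3), (0, 3), (5, 10)]:
    coder.encode(symbol, base)
print(f"{coder.compressed:b}")  # the compressed bit string

for base in [10, 3, 3, 10, 10, 10]:
    print(coder.decode(base), end="")  # prints 501387, as in the video
print()
```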
And we know from high school that there are other positional systems, like the binary system, and that you can build one for any base. But at least for me, I had never even thought about the idea that not only can you use different positional systems with different bases, you can actually change the base for every single symbol in your sequence, as long as you know which symbol corresponds to which base. That also works.

So let's look back at the constraints we identified for this very simple compression method. We've now seen that the first constraint can be lifted very easily; in fact, it's not even a constraint. Let's now look at the second constraint: that the symbols all have to be uniformly distributed. To be precise about what I mean: obviously, we could use this method to compress data that's not uniformly distributed. If, say, the symbol 3 or the symbol 5 appears with higher probability than all the other symbols, there's no reason why we couldn't still convert the sequence into binary and, on the decoder side, convert it back into the decimal system. What I mean with this constraint is that if the probabilities are not all equal, the method is no longer optimal. The reason is simply that, as we derived up here, the bitrate of this method is always log2 of 10 per symbol, and that is only the entropy if the distribution is uniform. For a different distribution, the entropy is lower, but the bitrate is still log2 of 10 per symbol, so you have a higher bitrate than the actual entropy, and you're wasting something. The method still works, but it is no longer optimal.

So how can we make a method that's still optimal, or at least very close to optimal, if the symbols are not uniformly distributed? Let's look at constraint two. We're now going to consider a non-uniform p of x_i; we're still assuming that all symbols are independent, for simplicity, but we're no longer assuming that every symbol in the alphabet has the same probability. We will see that we can solve this by applying our bits-back coding trick, which we discussed in the last video and on the last problem set. In order to do this, we first have to introduce an approximation of this probability distribution p. We will approximate p by some probability distribution that I'm going to call p_ANS, and this p_ANS has the property that it represents all probabilities in fixed-point arithmetic, or fixed-point precision. What does that mean? You choose a parameter that I'm going to call n, which is an integer, typically a large integer like 2 to the 32, and typically a power of two; we will see why in a second, but technically it doesn't have to be a power of two. This n is a parameter of your entire compression pipeline, so it stays fixed throughout the entire cycle of encoding and decoding.
And what I mean by fixed point arithmetic is this: consider the probabilities that p_ANS assigns to all the symbols in your alphabet. In the original probability distribution, these were just arbitrary real numbers between zero and one. In our approximation, they are all rational numbers, and moreover rational numbers that share a common denominator: the denominator is always this fixed parameter n, and the numerator is some integer m, which obviously depends on the symbol. If you choose n large enough, then this approximation can match all probabilities of the original distribution with high accuracy. So what is the advantage of this approximation? Let me first state it in words and then draw a picture to make it clearer. This approximation partitions the range from zero to n minus one into pairwise disjoint subranges, one for each symbol in the alphabet. What do I mean by that? Let's draw this range from zero to n minus one. On this range there are n points, including both endpoints. You should think of it as a kind of dense grid, not dense in the mathematical sense, but dense in the sense that there are a lot of points in this interval, so that distances are, in some sense, small. That's just how you should picture it. Now, these m over n are still probabilities, so they still add up to one, and since they all have the same denominator, that means the numerators m add up to n. Take, for example, an alphabet that contains three symbols, and let's give them Greek letters, alpha, beta, and gamma, so that we don't confuse them with any other definitions. The approximation assigns an integer to each of these three symbols, and these integers add up to n. So we can take the first m of alpha points on this line and assign them to one region, then take the next m of beta points and define a subrange for them, and finally take the next m of gamma points. If we do this, we cover the entire range without leaving anything out and without any overlaps, and that is always possible precisely because the probabilities add up to one. Now, to make the notation explicit, let me define these subranges. The subrange for a symbol x starts at some lower bound a of x and contains exactly m of x points, so it ends at a of x plus m of x minus one. That's what you see again in this picture: here the lower bound is, in this case, a of alpha, because this is the symbol alpha, and the upper end is a of alpha plus m of alpha minus one.
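Collecting this in formulas: for every symbol $x$ in the alphabet,

$$p_\text{ANS}(x) \;=\; \frac{m(x)}{n} \quad\text{with integer } m(x) \ge 1, \qquad \sum_{x} m(x) \;=\; n,$$

and each symbol $x$ owns the subrange

$$\{\, a(x),\; a(x)+1,\; \ldots,\; a(x)+m(x)-1 \,\} \;\subseteq\; \{0, \ldots, n-1\}, \qquad a(x) \;=\; \sum_{x' < x} m(x'),$$

where the sum for $a(x)$ runs over all symbols that come before $x$ in the alphabet.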
For the next symbol, the lower bound could be written as a of alpha plus m of alpha, but more generally we just call it a of beta, and the upper bound is then a of beta plus m of beta minus one, and so on; you can do the same for the third symbol as well. So you can always define subranges of this interval, each containing exactly m of x points for its symbol x. But why is this useful? Well, now that we have these subranges, we want to apply the bits-back trick. With this probability distribution we can do that, because we can interpret it as the marginal distribution of a latent variable model. So I'm going to make a claim: we will interpret p_ANS of x as the marginal distribution of a latent variable model. To specify a latent variable model, I specify its joint distribution. The model I'm going to write down, which I'll also call p_ANS, is a joint distribution over the symbol x and some latent variable z, and it has the structure of some prior times some likelihood. To be precise, I'm going to use the following prior and likelihood. Actually, let me first say what z even is: z takes values in the range from zero to n minus one, and the prior over z is just uniform, i.e., p of z taking any particular value in this range is one over n. And now the likelihood. I'm still making a claim here; I haven't yet proven that you can write p_ANS as the marginal of this model, but we'll do that in a second. So what is the likelihood? It's the probability of the symbol taking some value, given that the latent variable has some value, and it will be a very simple likelihood: it only takes the values one and zero. To understand it, it's best to look again at this picture up here. Think of z as some number on this interval; for example, z could sit here. Then the likelihood is deterministic: it assigns to z the symbol that corresponds to the subrange in which z lies. If z is here, the symbol is alpha; if, on the other hand, z were somewhere over here, the symbol would be beta. Let me write that in mathematical form: the likelihood is one if z lies in the subrange from a of the symbol to a of the symbol plus m of the symbol minus one (that's why I gave these subranges names), and it's zero otherwise. So now, to make it explicit, the claim I'm making is that p_ANS of the symbol taking some value x is exactly the marginalization over z, with z going from zero to n minus one, of the joint p_ANS of x and z. That's what we need for this to be a consistent definition. And just as a reminder, since it's off screen now, let me scroll up for a minute: the p_ANS that we defined was m of x over n. So the left-hand side of the claim is just m of x over n, and the right-hand side is what we defined up here.
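Written out as formulas, the claimed latent variable model reads:

$$p_\text{ANS}(X = x,\, Z = z) \;=\; \underbrace{p(Z = z)}_{=\,1/n}\,\cdot\,\underbrace{p(X = x \mid Z = z)}_{=\,\mathbb{1}\left[\,a(x) \,\le\, z \,\le\, a(x)+m(x)-1\,\right]},$$

and the claim is that marginalizing over the latent variable recovers our approximation:

$$\sum_{z=0}^{n-1} p_\text{ANS}(X = x,\, Z = z) \;=\; \frac{m(x)}{n} \;=\; p_\text{ANS}(x).$$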
And it should be easy to see why the claim holds: if you look up here, the prior is always just one over n, and the likelihood is one for exactly m of x values of z and zero otherwise. So when you evaluate the sum, every term carries a factor of one over n, and you pick up a one for exactly m of x values of z, which gives you exactly the term on the left, m of x over n. So we've seen that you can indeed interpret this fixed-point approximation of our probability distribution as a latent variable model, and that means we can now apply bits-back coding. So let's apply it. If you remember from the last video, to do bits-back coding you need the posterior distribution. So what is this posterior? Well, it's the distribution of z taking some value, given the symbol. And that's just the joint, i.e., the prior times the likelihood, divided by the marginal distribution of the symbol. And it turns out this will be very simple here. The likelihood in the numerator is either zero or one: it's zero if the values of z and x are not compatible with each other, that is, if z is not in the subrange that corresponds to x, and it's one otherwise. The prior is always just one over n. And the marginal in the denominator, we know, is just m of x over n. So we get two cases, exactly the same cases as we had above, so I'll copy them: we get zero if z is not in the subrange. And what do we get if z is in the subrange? We get one over n divided by m of x over n, which, if you multiply out these two parts, is one over m of x in total. So that's precisely the posterior. And what's nice about it is that it's again just a uniform distribution, which means we know how to encode it optimally. Remember our simple coder here on the right in the notebook: we know how to encode data that is uniformly distributed. Well, both the prior and the posterior are uniform distributions; they just have different alphabet sizes. The posterior has alphabet size m of x, whereas the prior has alphabet size n. Let me state that explicitly, because I think it's important to remember: both prior and posterior are uniform distributions, just with different alphabet sizes. But that's not a problem, because we already saw in the example on the right-hand side that we can encode and decode symbols with changing alphabet sizes. So that should not be a problem for us.
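For reference, here is the posterior we just derived, written out:

$$p(Z = z \mid X = x) \;=\; \frac{p(Z = z)\, p(X = x \mid Z = z)}{p_\text{ANS}(x)} \;=\; \begin{cases} \dfrac{1/n}{m(x)/n} = \dfrac{1}{m(x)} & \text{if } a(x) \le z \le a(x)+m(x)-1, \\[4pt] 0 & \text{otherwise,} \end{cases}$$

i.e., a uniform distribution over the subrange that belongs to the symbol $x$.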
So how does the bits-back coding method now work in total? Let's first write it up in general terms, and then we'll see that once we implement it for this simple setup, it boils down to just a handful of lines of code. So what does bits-back coding do? On the encoder side, we do the following. First, we decode. This is important: we are in the encoder, but we decode. As you remember from the last video, we decode a value for z using the posterior, p of z given that the symbol equals x. So the encoder takes a symbol x, and as its first step it decodes z using this posterior, which, as a reminder, is uniform over a shifted interval: the interval from a of x to a of x plus m of x minus one. Let me move this on screen. We know how to do this using our simple positional-systems coder. Now, in a general bits-back coding algorithm, the next step would be to encode the symbol x using the likelihood, p of x given z. But we can actually skip this step, so I'm going to cross it out. The reason is that this likelihood always equals one for the z we just decoded: we already know that z lies in the subrange from a of x to a of x plus m of x minus one, so the symbol has probability one given z, and its information content is zero. That's the very formal argument; but if you allow me to scroll up, you can also see it directly: our likelihood is deterministic. Once we know z, we know x precisely, so we don't have to encode any additional information. You can also see this up here in the pictorial representation: once you know that z is, say, somewhere here, you know the symbol is alpha; and if z is over here, the symbol is beta. So in this setup, where the likelihood is deterministic and assigns only probabilities one and zero, we don't even have to encode x, because encoding z, which we have to do anyway, is enough to identify the symbol. And that's the last step: we now encode z using the prior, which again is a uniform distribution, only now uniform over the range from zero to n minus one. So what is the total bit rate we get out of this method, the net bit rate per symbol for this combined operation? First we decode z using a uniform distribution over an alphabet of size m of x, so we save bits: that's minus the information content, minus log two of m of x. Then we encode z using the uniform distribution with alphabet size n, so that's plus log two of n. In total we get exactly the negative log two of m of x over n, which is exactly the information content of x under our approximated probability distribution, the negative log two of p_ANS of x. Let me move this on screen. So the net bit rate is the information content under our model, exactly what we expect from an optimal coding algorithm.
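In one line, the net bit rate argument reads:

$$\underbrace{-\log_2 m(x)}_{\text{bits saved by decoding } z \text{ from the posterior}} \;+\; \underbrace{\log_2 n}_{\text{bits spent encoding } z \text{ with the prior}} \;=\; -\log_2 \frac{m(x)}{n} \;=\; -\log_2 p_\text{ANS}(x).$$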
The decoder then just inverts these steps, and we have to invert them in reverse order. So we start at the end: instead of encoding z, we now decode it using the prior. After decoding z, we can identify the symbol, because the likelihood is deterministic: once we know z, we know the symbol. And then we use the posterior to encode z back. Written out in this general form, it may look a bit confusing, but once you put it into code, you'll see that it's actually a very simple algorithm. So let's actually implement this method, and you'll see that it turns out to be just a few lines of code. Our implementation will initially build on the code we already have. As a reminder, this coder can only encode data optimally if the data comes from a uniform distribution, so let's rename it to UniformCoder so that our namespace isn't polluted. Then let's define a class that we'll call an ANS coder. Before we write it, let's sketch out how we want to use it. The usage will be very similar to the uniform coder. We first initialize a coder, an AnsCoder. This AnsCoder has a parameter, the value n, that controls how finely we approximate the probability distributions. Let's actually assume that n is always a power of two, n equals two to the power of some precision, and we only provide the precision here. For this simple example, let's set the precision to, say, 4, so n will be 16. In practice you would want to set the precision higher, so that you can approximate the true probability distribution more precisely, but here, for debugging, it's easier if we can understand what's going on and still do all the calculations in our head. So we initialize a coder, and then we want to encode some data: we call coder.encode with some symbol. And now we have to provide more than before. As a reminder, in the uniform coder we only had to provide the base, and that was enough, because once we know the base, we know that the probability distribution over it is uniform; providing the base fully specified the probabilistic model the coder should use. But now we no longer assume a uniform distribution, so we actually have to provide the probabilities. For simplicity, let's not discuss how we obtain these approximate probabilities; that's a nontrivial task in itself. Let's just say we already have them, and we provide them as a list of scaled probabilities. What are these scaled probabilities? They are the numerators m: roughly speaking, each probability multiplied by n. They should therefore add up to n, so to 16. Let's say they are three, seven, two, and four; that adds up to 16. So we have four symbols with probabilities three over 16, seven over 16, and so on. By providing these scaled probabilities, we mean that the true probabilities are exactly these fractions every time. We then encode some symbols, say a symbol 1, followed by a symbol 0, followed by a symbol 3. We could also use different probabilities for every step here; that wouldn't change much. And at the end we decode; for the decoder, we assume we only have to provide the scaled probabilities, and then we want to print out what we get. So this is roughly how we want to use this AnsCoder. Let's type it up. It needs an initializer, a constructor, and the constructor takes this precision.
The constructor sets self.n equal to two to the power of precision (we'll optimize this in a later step), and it also constructs a uniform coder, which we will use for all the bits-back coding steps: self.uniform_coder is set to a new UniformCoder, whose constructor has no parameters. Then we define an encode method, which takes self and the scaled probabilities; and obviously the encode function also needs to take a symbol. So what does the encode function do? Let's just go through the steps. First it decodes z using the posterior distribution. What is this posterior? It's a uniform distribution over some subrange, where the subrange typically starts at some non-zero value and has size m of x, and m of x is exactly the scaled probability of the symbol. So what do we do? Let's work up to this step. We first use our uniform coder to decode a value that's uniformly distributed, not over this subrange directly but over a shifted variant of it that starts at zero and goes to m of x minus one, because that's what we can already do. So we call self.uniform_coder.decode, and the base is just the scaled probability of the symbol; we interpret the symbol as an index into the list of scaled probabilities. The symbols, by the way, since we have four of them, range from zero to three inclusive; let's say we encode 1, 0, 3. So we first decode and assign the result to z. Then, once we have this shifted z, we have to shift it by the left end of the subrange. Let me scroll up: what is this left end? If the symbol is, say, the third symbol, then the starting point is just the number of points that were already consumed by everything that came before it; if the symbol is the second symbol, then again the starting point is just the number of points for z that were already consumed by earlier symbols. So we can shift z by mutating it: for each m of an earlier symbol, taken from the scaled probabilities from index zero up to, but excluding, the symbol itself, we increase z by that m. So here we're shifting: we start from the value counted from zero within the symbol's own subrange, add up the earlier m's, and end up at exactly the right position. This is not the nicest way to do it, obviously, but I think it's the easiest way to understand it. OK, so that was the first step of the encode function. The second step is to encode z with the prior, and that's actually easier: we just call self.uniform_coder.encode, and the base for the prior is self.n, because the prior always has probability one over n. Sorry, one correction: in this encode call we obviously also have to provide the value that we want to encode, which is z, and then the base. That's the encode method. Now let's define the decoder. What does the decoder take in our example? Just the scaled probabilities, which makes sense, because we don't know the symbol yet. Now we have to invert the steps of the encoder.
For decoding with the prior, we only have to provide n, and that gives us z. That's the easy part. Now, in the decoder, we have to encode with the posterior distribution, which, as a reminder, is uniform over the shifted interval. So, to invert the shifting step, we have to subtract scaled probabilities until we find the right interval. How do we do this? One somewhat silly way that works: loop over each scaled probability. Again, scrolling up: we are given a value for z, maybe this one, and we have to find the symbol whose associated subrange contains this target value; if the target value were over here, we'd have to find this other symbol. So one way to do it: we start from the z we just decoded using the prior distribution, and we keep subtracting scaled probabilities until subtracting the next one would give a negative number. If the scaled probability is larger than what's left of z, subtracting it would overshoot, so we break; otherwise we reduce z by that scaled probability. We also have to enumerate the scaled probabilities, because we want the index as well: that index is the symbol. Once we've found it, we store the symbol and break. If you didn't fully understand this part, don't worry; it's not the crucial part. Given a target value, it just searches for the subrange that includes it. And this way of implementing the search is definitely not the best one; there are better ways, but here we're just trying to understand what's going on. So where are we in the bits-back algorithm? We've decoded z using the prior, and we've identified the symbol. Now we have to encode using the posterior. So let's do this: self.uniform_coder.encode, and note that we don't encode the symbol itself, we encode the value z, shifted down by the earlier scaled probabilities, which is exactly what our search loop has left in z. The base is the scaled probability of the symbol, which is exactly the m we have here. And then, obviously, we return the symbol. I'm sure there are still some bugs in this, but let's see how far we get. Actually, it worked! That's a surprise to me; I was sure I'd made a typo somewhere. You can see we encoded the symbols 1, 0, and 3, and then we decoded them, and since it's a stack, we get back the symbols 3, 0, and 1. And now we can play around with this: let's use a different probability distribution for every symbol. So we define a first, second, and third list of scaled probabilities; maybe we set this entry to five, this one also to five, and this one to two, completely different values, and fill in whatever is missing so that each list still adds up to 16. Then we encode each symbol with its own probabilities, and we have to decode in reverse order with the matching probabilities. Does it still work? It still works. And if we use the wrong probabilities somewhere, it will likely fail; indeed, it produces a different output, so the probabilities really do matter.
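Here, for reference, is a sketch of what the whole construction plausibly looks like in code at this point, including the little demo we just ran (the exact names are my guess at the notebook code):

```python
class UniformCoder:                       # as sketched earlier
    def __init__(self):
        self.compressed = 1
    def encode(self, symbol, base):
        self.compressed = self.compressed * base + symbol
    def decode(self, base):
        symbol = self.compressed % base
        self.compressed //= base
        return symbol

class AnsCoder:
    def __init__(self, precision):
        self.n = 2 ** precision           # common denominator of all probabilities
        self.uniform_coder = UniformCoder()

    def encode(self, symbol, scaled_probs):
        # Bits-back step 1: *decode* z from the posterior, i.e., uniformly over
        # the symbol's subrange of size m = scaled_probs[symbol] ...
        z = self.uniform_coder.decode(scaled_probs[symbol])
        # ... shifted so the subrange starts at a = sum of all earlier m's.
        z += sum(scaled_probs[:symbol])
        # Bits-back step 2: encode z with the uniform prior over {0, ..., n-1}.
        self.uniform_coder.encode(z, self.n)

    def decode(self, scaled_probs):
        # Invert the encoder's steps in reverse order: decode z with the prior,
        z = self.uniform_coder.decode(self.n)
        # identify the subrange that contains z (the deterministic likelihood),
        for symbol, m in enumerate(scaled_probs):
            if z < m:
                break
            z -= m
        # and encode the leftover offset within the subrange with the posterior.
        self.uniform_coder.encode(z, m)
        return symbol

coder = AnsCoder(precision=4)             # n = 16, small enough to debug by hand
scaled_probs = [3, 7, 2, 4]               # scaled probabilities; they add up to 16
for symbol in [1, 0, 3]:
    coder.encode(symbol, scaled_probs)
print([coder.decode(scaled_probs) for _ in range(3)])   # stack order: [3, 0, 1]
```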
So we've now implemented almost everything you need to know for an ANS coder. There are just two additional points: one very trivial one that I'll do right now, and another that you will do on the next problem set, which comes out next week and will then be linked in the video description. The first thing I want to do is a small refactoring. We're building on this uniform coder, and the uniform coder is really trivial, so let's just inline its methods; then we'll see that we can do a slight optimization. What do we need from the uniform coder? We call the constructor, and then we call encode and decode. The constructor basically just sets compressed equal to one, as you remember from here. The encode function just multiplies by the base and adds the symbol. So wherever we call encode, for example here, we can instead set self.compressed equal to self.compressed times self.n plus the symbol; and the same further down where we encode with the posterior: instead of calling encode, we set self.compressed equal to self.compressed times the scaled probability plus z. That's the encoder. Now the decoder: wherever we call decode, we do the following. For example here, we set z equal to self.compressed modulo the base and then integer-divide self.compressed by the base; in this case the base is our scaled probability. Let me make this bigger. And we also use the decode function here: we set z equal to self.compressed modulo self.n, and then integer-divide self.compressed by self.n. So far nothing should have changed; let's test that everything still works. Yes, it does. You may be wondering why we're doing this. Here's why I always set n to a power of two, and this is something that I think a lot of people don't know. First of all, in typical compression setups, when we think about computational efficiency, we're mostly interested in decoding efficiency, because in most setups you decode data more often than you encode it. The only exception to this rule that I can think of are backups, which you expect never to need; in most other cases, you encode data once and then decode it several times. So we want decoding to be fast. And what many people don't know is that integer division is a surprisingly slow operation, much slower than integer multiplication, for example, so you want to avoid it whenever you can. What we can do here, since n is a power of two, is replace the division by n with a bit shift by self.precision, where we now store the precision itself. And the modulo operation becomes a bitwise AND with a mask, self.mask, which we set to one shifted left by the precision, minus one. The other place where we used self.n was here: again, there's a multiplication, which you could also write as a bit shift, but the multiplication is not that big a deal; multiplications are fast, it's the divisions that are really slow. So let's see if this still works. It still works too.
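Here is a sketch of what the coder might look like after this refactoring, with the uniform coder inlined and the prior-related division and modulo replaced by bit operations (usage is exactly as before):

```python
# The refactored coder: since n = 2**precision, the decoder no longer
# needs any integer division at all.
class AnsCoder:
    def __init__(self, precision):
        self.precision = precision
        self.n = 1 << precision       # n = 2**precision
        self.mask = self.n - 1        # for bitwise AND instead of modulo n
        self.compressed = 1

    def encode(self, symbol, scaled_probs):
        m = scaled_probs[symbol]
        # Decode z from the posterior. An integer division remains here,
        # but that's fine: this is the encoder side.
        z = self.compressed % m + sum(scaled_probs[:symbol])
        self.compressed //= m
        # Encode z with the prior; the multiplication by n becomes a shift.
        self.compressed = (self.compressed << self.precision) | z

    def decode(self, scaled_probs):
        # Decode z with the prior: a bitwise AND replaces the modulo, and a
        # bit shift replaces the (slow) integer division.
        z = self.compressed & self.mask
        self.compressed >>= self.precision
        # Find the subrange containing z; its index is the symbol.
        for symbol, m in enumerate(scaled_probs):
            if z < m:
                break
            z -= m
        # Encode the leftover offset with the posterior (a multiplication,
        # which is much cheaper than a division).
        self.compressed = self.compressed * m + z
        return symbol
```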
And again, if we change, for example, this value, we get a different result, so this isn't a trivial test. So this all still works. Note that in the encoder we actually still have an integer division, because the scaled probabilities are obviously not limited to powers of two. But in the decoder, which is usually the more performance-critical part, we now get away without any integer division at all. So let's go back to our notes. I've copied and pasted the implementation that we just wrote into these notes, and if you download the lecture notes, which are linked from the video description, this will be part of them. This implementation will also be included on the new problem set, problem set seven, which will also be linked in the video description. And the reason it will be included there is that on that problem set, you will improve this method by implementing the one thing that's still missing. To see what's missing, look at the implementation more closely: you will figure out that it does not yet achieve all the goals with which we started this endeavor of developing stream codes. As a reminder, we had two goals. The first was that we wanted to develop a lossless compression method with lower overhead than symbol codes; we wanted to reduce the compression overhead, the overhead in bit rate, compared to symbol codes. And indeed we achieved this. If you allow me to scroll up briefly: we derived, for example, here, the net bit rate for encoding a symbol, which is really the relevant quantity, because in practice we want to encode a lot of symbols, so the initial-bits problem is not an issue and you really only care about the net bit rate. And we derived that this net bit rate is in fact exactly the information content under our model. The only overhead comes from the fact that this model is not exactly the original model; it's a fixed point arithmetic approximation of the original model. But if we choose the denominator n large enough, we can make this approximation extremely close to the true model, so the overhead will be very low. So we did achieve that. That's really nice; let's mark it. But we also had a second goal: we wanted to achieve this reduced overhead without sacrificing runtime efficiency. In particular, we wanted to maintain linear runtime cost, O of K, where K is the number of symbols that you want to encode. It may seem like we've already achieved this, but if you look more closely, you'll find that if you call this encode method many times, so that you accumulate more and more compressed data, then in total this method is actually not linear in the number of symbols; the runtime cost grows faster. So I encourage you to pause the video at this point and think about why it grows faster, and what the asymptotic runtime cost actually is. All right. If you've paused the video, you can now compare with my argument for why this is not linear runtime. It may seem linear because, to encode a long stream of symbols, we can just call the encode method on each symbol individually, and the same on the decoder side; we don't have to consider the symbols together or do some complicated joint optimization over all of them.
But it turns out the cost of a single encode call actually grows: the more times you've called the encode function, the more expensive the next call becomes. The reason is that every call to encode increases the amount of data in the integer self.compressed, which stores all the compressed bits. If you think, for example, about compressing an image that has a compressed size on the order of a megabyte, then by the last few calls to the encode method, self.compressed will be an extremely large integer, with on the order of a million bits. And it turns out that operations like this integer division here, or the integer multiplication and addition on the encoder side, actually scale linearly with the number of compressed bits already accumulated: if you remember how you do long division by hand, you'll see that to carry out the division or multiplication, you have to loop over all these bits. In fact, in most programming languages this integer would simply overflow at some point; the only reason this works in Python is that Python seamlessly switches to a bignum implementation instead of overflowing. So it works in Python, but as you call the encode method more and more often, each call becomes increasingly expensive. Since the amount of compressed data grows linearly with the number of symbols encoded so far, each of these operations is itself O of K, and the total runtime cost of calling them K times is therefore O of K squared. So this goal is not yet achieved: we have O of K squared, and that's not what we want; for practical purposes, this would be prohibitively expensive. That's what you will solve on problem set seven. Let me briefly, for completeness, explain how this can be solved; the idea is actually very simple. The idea for improving the runtime cost of ANS, and in fact you'll be able to reduce it all the way back down to O of K, i.e., linear cost, which is what we wanted, is easiest to explain with a metaphor, and the metaphor here is money. Think about how you would typically deal with money. When you have bank accounts, many banks actually encourage you to set up two accounts: one is called the savings account and the other the spending account. Or, if you live in Germany, you may have noticed that a lot of people are still obsessed with cash; then the analogy would be that you use cash. And you use these two for different purposes. Your spending account, or your cash, you usually use for smallish (depending on what counts as small for you) and usually odd amounts. I don't mean odd as the opposite of even; I mean odd as the opposite of round. Say you buy something in a store and pay something like 70 euros and 34 cents.
Or, in the other direction, you may use it for odd incomes: if you go out for dinner with a friend and they pay you back what they owe you, it will typically be some smallish, odd value. That's what you use your spending account for. Your savings account, in contrast, is usually something you don't access directly. Instead, you just use it to transfer money back and forth between it and your spending account, and these transfers are typically bulk transfers: they usually occur in bigger and also round amounts. You wouldn't, I suppose, take those 70 euros and 34 cents and transfer them to your savings account; that would be kind of silly. You'd wait until you have, say, 100 euros, and then transfer that as a nice round amount. And it turns out you can do pretty much the same thing for ANS coding, and that's what makes it performant: if you make these bulk transfers in round amounts, you can make their cost independent of what's already in your savings account. So what is the analogy for ANS? Let me put it on the next page. This is sometimes called streaming ANS, although the naming convention is not very widely used. In ANS, you do something very similar: instead of splitting up your money, you split up the compressed bits that you've already accumulated on your coder. You split them into one part that holds only round amounts of compressed bits; the most important property is that it holds an integer number of bits. Remember, the important reason why stream codes work better than symbol codes is that they amortize the compressed bits over symbols: the number of bits per symbol can be a fractional number, it doesn't have to be an integer. So as you add one symbol at a time, you should think of the amount of compressed data on your ANS coder as a fractional amount of information content, not an integer number of bits. But that's exactly what makes it expensive to move this information content around. So what you do is take an integer part of it and store that in a kind of bulk storage. This will be an integer number of bits; in fact, to make it fast on real hardware, it will be an integer number of machine words, where the word size is maybe 32 or 64 bits. And you keep all the words on this bulk storage completely filled with valid compressed information. On top of that, you have what I would call the head, because what we have is a stack, and a stack has a head on top. Think of the head as a small storage; in practice it will actually be just a register (or two) on your CPU. Since the total amount of information content on your coder will almost always be some odd, fractional amount, this head keeps the fractional part. So maybe it's filled to this point, where one word is completely full and a second word is partially full. And when you encode or decode, you operate on this end.
So you encode and decode at this end, and every time you encode, you push this boundary a bit to the left, and when you decode, you move it back to the right; that's the picture you should have. That's also why the cost of encoding and decoding grows with the amount of data on the coder: you effectively have to move all these bits over, and since you typically move them by a fractional number of bits, you have to recompute how every single one of these bits is set. That's exactly what the multiplication and the integer division do. Now, after encoding several symbols, you will be in a state where the head is almost full: say the head has a capacity of two words, and at some point, if you encode enough symbols, this finite capacity will be nearly exhausted. Imagine you now want to encode a symbol whose information content, if I draw it here, would overflow your head. What you do, before it even overflows, is take the lower part of the head, which is completely filled, and move it onto the bulk. That part is a completely filled word, so you uphold the invariant that the bulk only ever holds an integer number of fully filled words. After this operation, the head has a lower filling level, and you can encode onto this position: the new symbol's information content will still fit. That's what you do on encoding. On the decoding side, you have to invert this process: after decoding, you take off some information content (maybe this is hard to distinguish, so let me use this color), and then you should realize: now I can refill something. You take this part, a full word, and move it back from the bulk into the head, so that you arrive at the earlier situation again. That's the life cycle of this streaming version of asymmetric numeral systems. And this version really does have linear cost, because the cost of a single encoding operation consists of two parts. One part is the cost of encoding onto the head, and that is bounded, because the head can never grow beyond its capacity. The other part is that every once in a while, you have to transfer some integer number of words onto your bulk; in fact, you'll find that it's never more than one word. But since it's an integer number of words, the transfer leaves all the words that are already on the bulk completely unchanged, so its cost is independent of how much is already stored there. Therefore you achieve O of one cost per encoding or decoding operation, and thus linear cost, O of K, for a message of K symbols, whereas the original implementation had O of K cost per symbol. So that's what you will do on problem set seven: implement streaming ANS.
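To give a rough idea of where this is heading, here is one way the transfer logic might look. Be warned that everything here is an illustrative assumption on my part: the word size, the precision, and in particular the exact overflow and refill conditions are precisely the details you will work out carefully on problem set seven; this sketch follows one common convention and glosses over edge cases such as the initial state.

```python
# A rough sketch of streaming ANS (all constants and conditions here are
# illustrative assumptions; problem set 7 works out the real details).
WORD = 32          # the bulk holds completely filled 32-bit words
PRECISION = 16     # n = 2**PRECISION; PRECISION <= WORD so one flush suffices

class StreamingAnsCoder:
    def __init__(self):
        self.bulk = []    # stack of fully filled words (valid compressed bits)
        self.head = 0     # bounded integer: always fits into two words

    def encode(self, symbol, scaled_probs):
        m = scaled_probs[symbol]
        a = sum(scaled_probs[:symbol])
        # If this symbol's information content would overflow the head's
        # capacity, first move one completely filled word onto the bulk.
        if self.head >> (2 * WORD - PRECISION) >= m:
            self.bulk.append(self.head & ((1 << WORD) - 1))
            self.head >>= WORD
        # The same ANS update as before, but now on a bounded integer.
        self.head = (self.head // m << PRECISION) | (a + self.head % m)

    def decode(self, scaled_probs):
        # Decode z with the prior and identify the symbol, as before.
        z = self.head & ((1 << PRECISION) - 1)
        self.head >>= PRECISION
        for symbol, m in enumerate(scaled_probs):
            if z < m:
                break
            z -= m
        self.head = self.head * m + z
        # Refill: if the head dropped below one full word, pull one back in.
        if self.head < (1 << WORD) and self.bulk:
            self.head = (self.head << WORD) | self.bulk.pop()
        return symbol
```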
In particular, on that problem set you will figure out exactly which settings are useful for the head's capacity, and when exactly you have to transfer from the head to the bulk and in the other direction: what the lowest and highest filling levels are, and whether you do the transfer before or after you encode. You'll work out all these details on the problem set, and it will again turn out to be just a handful of lines of code, but you have to be very careful about when exactly you do what. That's what you will do on problem set seven. On problem set six, just to mention it briefly, you will have another look at why the positional numeral systems coder, on which all of this is built, is optimal. As a reminder, we started this whole discussion with the exercise where I asked you to think about how to encode numbers from the decimal system, and the answer was: just convert them into binary. We then proved very formally that the bit rate of this, in the limit, at least if we ignore boundary effects, really is just the entropy of the symbols. But we proved this in a very formal way, and the argument is actually much simpler; you will derive the simpler argument on problem set six. So with that, have fun with the problem sets!