Welcome back to the course on Data Compression with Deep Probabilistic Models. In this video, you'll learn about the so-called bits-back coding method. This method became popular in the neural compression community only very recently, and it is a prime example of the intricate interplay between, on the one hand, declarative aspects around probabilistic models, and on the other hand, imperative aspects of source coding algorithms. So let's dive in.

In the last video, I gave an overview of the topics from probability theory that we will need for this course, and I introduced random variables. We then used these concepts to quantify what we call the modeling error: how many bits do you lose in expectation if your model of the data source doesn't capture all properties of the data source completely correctly? The quantity we defined for this modeling error is called, in the literature, the KL divergence.

From these observations, we saw that our models of data sources need to be able to capture correlations between different parts of a message, for example between pixels in an image or between words in a document. Here I'd like to advertise the previous problem set, which is problem set 4. In problem 4.2, you looked at different kinds of entropies, in particular the marginal entropy H_P(X_i) of a single random variable. If you ignore correlations, then the lower bound on the bitrate you can achieve is the sum of all the marginal entropies over your message. You then compared this to what we called the joint entropy, denoted H_P((X, Y)) for the tuple (X, Y). On the problem set that will come out in about one and a half weeks, in problem 6.2, you will show that

    H_P(X) + H_P(Y) >= H_P((X, Y)),

so the left-hand side is never smaller than the right-hand side, and typically it is strictly larger. So if you ignore correlations, you pay a price in bitrate. From that you can see that it is really important for your model to be able to capture correlations. But we also saw in the last video that this can be difficult, because capturing arbitrary correlations can become extremely expensive.

So we looked at different ways to make modeling correlations less expensive, and the first model class we arrived at was so-called autoregressive models, depicted here. In an autoregressive model, you generate symbols, for example words or characters in a document, from a probabilistic model, a stochastic process, that is conditioned on some hidden state. This hidden state evolves as you generate more and more symbols. Importantly, in an autoregressive model the hidden state evolves in a deterministic way, which allows us to use these models relatively easily for compression: as long as the decoder generates the same sequence of symbols, it can follow exactly the same sequence of hidden states as the encoder did. With these models we can capture relatively long-range correlations, and we looked at how the hidden states mediate correlations not only between neighboring symbols but even over longer distances.
Another nice property of autoregressive models is that, despite being able to capture long-range correlations, they can be stored in a compact way. On the flip side, I said that they can model longish correlations, but depending on your precise architecture, autoregressive models will still struggle with very long-range correlations, because all dependencies have to be squeezed through the bottleneck of the one-dimensional chain of hidden states. Another problem with autoregressive models is that, in order to encode or decode data, you really have to process one symbol at a time, because each symbol depends on everything that has happened before. Compression methods based on autoregressive models are therefore not very well parallelizable, which makes them poorly suited for modern hardware: modern hardware is very good at doing the same kind of operation on a lot of data in parallel, but it is not so good at doing an operation over and over again where each step depends on the outcome of the previous one. Both of these are problems for which we are going to discuss solutions in this video.

So in this video, I will introduce latent variable models, and we will briefly talk about the concept of Bayesian inference, because this concept will be important for understanding the bits-back coding mechanism, which is a compression mechanism for latent variable models.

Let's start with latent variable models. To begin, I want you to think about the following hypothetical news headlines: "Parliament votes on new labor bill", "Labor union votes to extend strikes", "Soccer player scores first goal since joining new team", and "Guest team is leading by one goal". When you think about these headlines in the context of correlations, one thing you will immediately notice is that the words in them do seem to be correlated. What do I mean by that? Look, for example, at the word "votes": it appears in two of these headlines, and always in conjunction with the word "labor". Not necessarily in the same order, but these two words tend to appear together; they cluster. Similarly, the word "goal" tends to appear together with the word "team", and the other way around. So we can state as an observation that words within a headline, at least in this small sample of headlines, appear to be correlated.

Let me make this very precise, because this was actually a point of confusion in the Zoom lecture. Consider two positions i and j in these headlines, so position 1, position 2, position 3, and so on. The claim I'm making is that X_i and X_j, by which I mean the words at these positions, are not statistically independent. To show that two random variables are not statistically independent, you only need to exhibit one example where the product of their probabilities differs from the joint probability.
You can find such an example easily, for instance the probability that one word is "goal" and the other word is "team". If you imagine this kind of newspaper headline collection continuing into a larger set, you would expect the joint probability to be strictly larger than the product of the marginal probabilities:

    P(X_i = "goal", X_j = "team") > P(X_i = "goal") * P(X_j = "team").

You can already see this in our sample. There are two sentences in which the word "team" appears and two in which the word "goal" appears. So even if you ignore positions, out of these four sentences half of them contain the word "goal", so the first factor on the right-hand side would be about one half; the same holds for "team", giving one quarter on the right-hand side. But on the left-hand side, half of the sentences contain both words together, so ignoring positions, the left-hand side would be about one half. And if you take positions into account, you will again find that the left-hand side is strictly larger than the right-hand side. So these words cannot be statistically independent, which means they are correlated.

Now, how would we explain these correlations if we talked about these headlines loosely? It is somewhat hard to explain them in an autoregressive way, by saying that once we see the word "votes", that changes the probability of "labor". That's certainly one way to think about it, but I would argue it's a difficult way; in particular, you see that the order changes between headlines, so the order is apparently not even that important. I would argue that an easier way to explain how these correlations come about is to think about how these headlines were created. They were created by a journalist who was working on a certain part of the newspaper, maybe the politics section. The first two headlines come from a topic we would call "politics", whereas the other two headlines belong to the topic "sports". And once you know that you are in the sports topic, it is reasonable to assume that words like "team" and "goal" will appear more frequently in headlines, whereas in the politics topic they will not appear that frequently.

So we can draw a picture of how we think about this generative process. "Generative process" is a technical term, not something I made up for this particular video; you will hear it a lot, and we will actually see many pictures of this kind. How do you draw it? You have some random variable Z, the topic. Then, conditioned on this topic, the generative process, which is how you model the generation of these newspaper headlines, draws words from a vocabulary distribution that depends on the topic. Now, obviously, in reality there would also be a positional dependency, so the word distribution would depend on the position within the headline, but let's keep it simple for now.
The drawn words are the words in the headline. The topic, on the other hand, is not really part of the message; it is what's called a latent variable, a variable that is in some sense not observed. It is not in the message, but we can think of a process in which the words are generated based on a topic. In many papers you will see that variables which are observed, i.e., part of your message, are shaded, and I will denote this by drawing some shading here, while latent variables are left unshaded. That is one way to draw it. An abbreviated depiction is as follows: you have your topic Z, you generate observed words X_i conditioned on the topic, and to denote that multiple words are drawn from this topic, you draw a plate around the words, where i runs from 1 to some k.

Whenever you see such a pictorial representation of a latent variable model, you should think of it as just illustrating an equation. You could equivalently write down the joint probability distribution; both of these pictures denote a joint distribution that factorizes as follows:

    p(x, z) = p(z) * p(x | z).

The joint of the message x and the latent topic z is the probability of the latent topic times the probability of the message conditioned on the latent topic. Specifically in this case, we also assume for simplicity that the probability of the message conditioned on the topic is just a product of probabilities of each word conditioned on the topic:

    p(x | z) = prod_{i=1..k} p(x_i | z).

Now, this may seem like a very simplistic model, and it somewhat is, but models of this kind are actually very powerful, and they are used a lot to sort large collections of documents into interpretable categories. These are called topic models. Just as a side remark: we're using this here just as an example, but topic models are widely used in practice. One very popular model is called latent Dirichlet allocation, or LDA, which was first used for natural language by Blei and collaborators in 2003; before that, a model of the same structure was used in genetics by Pritchard and collaborators. Models of similar structure are also used to describe, for example, images or videos, so this is really a very generic form of model, and all the methods we're going to discuss will apply to any kind of data.

Now, we're claiming that a model of this structure can mediate correlations between the symbols, because that was our goal, to model correlations between symbols, and that may seem surprising at first sight. The important point to keep in mind is that the model we are ultimately interested in for compression is the marginal probability distribution of the message. The marginal distribution of the message X follows directly from this factorization; it is just the sum over all values of Z:

    p(x) = sum_z p(x, z) = sum_z p(z) * prod_{i=1..k} p(x_i | z),

where in our specific case the sum runs over all topics z.
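To make this factorization concrete, here is a minimal sketch of such a toy topic model in Python. The topics, vocabulary, and all probability values below are made up for illustration; only the structure, a prior p(z) times conditionally independent word likelihoods p(x_i | z), matches the model above.

```python
import numpy as np

# Hypothetical toy topic model; all numbers are made up for illustration.
topics = ["politics", "sports"]
vocab = ["votes", "labor", "goal", "team"]
prior = np.array([0.5, 0.5])        # p(Z = z)
likelihood = np.array([             # p(X_i = w | Z = z), one row per topic
    [0.4, 0.4, 0.1, 0.1],           # politics: "votes"/"labor" likely
    [0.1, 0.1, 0.4, 0.4],           # sports:   "goal"/"team" likely
])

def p_message(words):
    """Marginal p(x) = sum_z p(z) * prod_i p(x_i | z)."""
    return sum(
        prior[z] * np.prod([likelihood[z, vocab.index(w)] for w in words])
        for z in range(len(topics))
    )

# The marginalization couples the words: p("goal", "team") exceeds the
# product of the single-word marginals, even though the words are
# independent *given* the topic.
p_both = p_message(["goal", "team"])
p_goal = sum(p_message(["goal", w]) for w in vocab)  # marginal of position i
p_team = sum(p_message([w, "team"]) for w in vocab)  # marginal of position j
print(p_both, p_goal * p_team)  # 0.085 > 0.0625
```

We'll reuse `prior`, `likelihood`, `vocab`, and `p_message` in the sketches below.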
I'm now claiming that a model of this kind (let me highlight this marginalization, because it is important) can indeed capture the kinds of correlations that we discussed above, for example that words are correlated in this way. Let me formulate this as a formal claim: this kind of model can capture correlations like the one we just found. And I'll leave the proof as an exercise, but it shouldn't be too hard to see. Essentially, just think of the simplest model that you can come up with, in this case maybe a model whose vocabulary consists of only the four words "goal", "team", "votes", and "labor", and which has only two topics, "politics" and "sports". Then make up some probabilities for the conditional models, and choose them in such a way that you can explicitly verify that the inequality holds and that it is strict, so that the words really are correlated. The proof is really just to come up with an instance of this model class where you can explicitly show that this is the case, like the toy sketch above.

So now we've seen two classes of models that can model correlations. One was autoregressive models; we've seen how to use them for compression, and we even implemented a compression mechanism that used an autoregressive model. The new class of models is these latent variable models. But here's the point: it is not that obvious how to use a latent variable model for compression. The problem is that the probability distribution of the data, the marginal p(x), which is really the only thing we're interested in for compressing the message x, is only implicitly defined. If you assume a model with many topics, this marginalization is a complicated expression, so how do you actually use it for compression? That will be the topic of the rest of this video: data compression with latent variable models.

Again, to keep it self-contained: we assume that we have a model of a message and some latent variable that we're not really interested in transmitting, but that allows us to formulate the model in a compact way. The model is given as what we call a prior distribution p(z) on the latent variable (the topic, in our example) times what we call the likelihood p(x | z), the probability of the message once we know the topic. Ideally, we would want to compress x with this model. What I mean by that is, for example, we would like to build a code book that is optimized for the conditional distribution p(x | z) and then encode our message with that code book. But obviously the problem is that we don't know the value of z. Maybe on the encoder side we do; maybe we know which part of the newspaper the headline appeared in, so we know the topic. But we assume that the decoder, the receiver of our message, doesn't know z, so it wouldn't know what to condition on here. We don't know the value of Z, and that brings us to a couple of different ways you could compress data with latent variable models. I'll discuss several of them in this video, but I would also like to point once again to the new problem set.
That is problem set 5, and it's linked in the video description, because on this new problem set you will literally implement three different approaches to compressing data with a model of this kind. The model will have a structure very similar to our topic model from the previous page, just not for natural language; to keep things simple, it will be a more toyish model, so that you can easily try out different approaches and don't get distracted by the intricacies of natural language.

So on the problem set, you will implement and compare three compression methods for latent variable models. The first one is very simple; that will be problem 5.2 (problem 5.1 is just setup, some technicalities). Here you simply ignore all correlations: you treat the symbols X_i as independent and therefore ignore the correlations that are mediated by the marginalization over the latent variable. That is something you can always do, but as we saw, you will pay a price for it in compression performance. Since you treat the symbols independently, the expected bitrate is just the sum of the marginal entropies, sum_i H_P(X_i), which, as we saw in the recap from the previous video, is typically strictly larger than the true lower bound, unless you really happen to have no correlations at all.

The next method, in problem 5.3, is what we call the MAP estimate, and we will discuss it below. I will discuss in detail what this method does, but suffice it to say for now that it leads to an expected bitrate given by the information content -log p(z) of some topic z that you make up and have to transmit, plus the entropy of the message conditioned on that topic.

And then finally, in problem 5.4, you will implement bits-back coding, which is the new algorithm you'll learn today. It turns out that this method leads to an expected bitrate (I should call it the net bitrate, for reasons that will become clear later) that is really just the entropy of the data. So unlike the MAP method, you no longer have an overhead; at least the net bitrate will be exactly the entropy of the data. You will be able to understand this from theory, but you will also evaluate it empirically, and here are the results you should obtain. You will evaluate the methods on messages of different lengths, and you will find that for very short messages the MAP estimate has a very large overhead and is a very bad method; for very long messages it becomes similar to bits-back coding. The naive method, which completely ignores correlations, will always be worse than both, even for long messages. And you will be able to understand where all of this comes from. So with this interlude, let's now actually get into these methods.
I will refer back to this picture a couple of times, because we will come to understand more and more parts of this graph; but you will really only understand it completely once you implement the methods on the problem set yourself.

I will skip the first method, because it is exactly what we've done so far: up to now we've always considered symbol codes where, for simplicity, we assume that symbols come in with some probability that doesn't depend on anything that happens with the other symbols. You are already familiar with that.

Now let's start discussing the MAP estimate method. It is a naive approach, and we will learn in this chapter why it's called "MAP". Since the model has scrolled out of view, let me restate it: the joint probability is again a prior p(z) times what we call a likelihood p(x | z). The idea behind this naive approach is very simple. We just make up a value for the latent variable, transmit it, and then encode the message conditioned on that made-up latent value. So the idea is: encode some value z for the latent variable Z (just some value) using a lossless compression method whose entropy model is the prior distribution p(z), and transmit it. Once we've encoded z, we can then encode the message x using a compression code that is based on the model p(x | Z = z). Once we condition on a specific value, this is just a probability distribution over the sequence of symbols, so we can encode as we're used to. In particular, in our case this is especially simple, because given a value for the latent variable, the symbols are what is called conditionally independent, so the conditional model is just a product over the symbols.

If we do this, then obviously we can decode the message: the decoder simply first decodes z using the prior (this is part of our model definition, and we assume both the encoder and decoder know the model), and once it has decoded z, it can use z to decode the message x using p(x | Z = z).

I actually encourage you, when you think about composite compression methods like this (this is a very simple one, but once you get to more complicated ones), to think about the decoder first; in my experience that is easier. In the decoder you can check, for example: this second step needs the value z, and that works because it got z from the first step. You can verify more easily that things work, because the decoder side is the more constrained one: the encoder has the entire message, so it can encode things in arbitrary order, but the decoder couldn't simply swap these two steps; that wouldn't work. All right, so now you know how to implement this.
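As a sketch, assuming a hypothetical entropy-coder object with `encode(symbol, distribution)` and `decode(distribution)` primitives (this interface is a stand-in, not the problem set's API), the two-step code looks like this:

```python
def encode_map(message, z, coder, model):
    coder.encode(z, model.prior)                 # step 1: transmit z with the prior
    for word in message:                         # step 2: transmit each word with
        coder.encode(word, model.likelihood(z))  #         p(x_i | Z = z)

def decode_map(coder, model, length):
    z = coder.decode(model.prior)                # must recover z first ...
    return [coder.decode(model.likelihood(z))    # ... because decoding the words
            for _ in range(length)]              #     requires conditioning on z
```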
In fact, on the problem set you will be guided through the implementation; you just have to fill in a couple of steps that were left out.

What is the bitrate that you will get from a method like this? Well, the bitrate is certainly a function of the message x that you encode (I should use a lowercase letter here), but it will also depend on the value z that you use. What will it be? In the first step you encode z, so if you use an optimal compression method, that costs the information content of the latent variable; to this you add the conditional information content of the message, conditioned on the latent value that you transmitted:

    R(x, z) = -log2 p(z) - log2 p(x | z) = -log2 p(x, z).

Using the properties of the logarithm, you can combine the two terms, and by the definition of conditional probability this is just the joint information content of the message and the latent variable. So, again, this depends on the latent value z that you made up.

Now the question is: which value z should you choose? Obviously, the one that minimizes this bitrate. So let's choose the z* that minimizes it:

    z* = argmin_z R(x, z) = argmax_z p(x, z).

For any given message, the value of z that you choose will depend on that message, and you just minimize the bitrate over the latent variable; due to the minus sign, this means you maximize the joint probability. Just a word of warning: I left out the logarithm here because it is a monotonically increasing function, but if you did this in practice, you would actually want to maximize the log joint distribution and not the joint distribution itself. For long messages the joint probability is exponentially small in the number of symbols, so you would encounter numerical underflow; in practice, it is easier to maximize the log joint.

This z* that maximizes the joint probability for some given message is called the maximum a posteriori estimate, or MAP estimate. We'll see in a second why it is called "maximum a posteriori". For now, just keep in mind: if you maximize the joint probability over the latent variable for some given message, the result is called the MAP estimate.

So this is the bitrate that you get. Now, is this a good bitrate or a bad bitrate? To answer that, you have to calculate the overhead over the theoretical bound, which is the information content of the marginal distribution, -log2 p(x):

    R(x, z*) - (-log2 p(x)) = -log2 p(x, z*) + log2 p(x) = -log2 ( p(x, z*) / p(x) ) = -log2 p(z* | x).

The two minus signs make a plus, you can simplify to the negative log of the fraction of the two, and that fraction is nothing else but p(Z = z* | X = x). This distribution, p(z | x), is called the posterior distribution.
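Before moving on, here is what the maximization from above might look like in log space, reusing the toy model arrays from the earlier sketch (a minimal illustration, not the problem-set code):

```python
def log_joint(words, z):
    """log p(x, Z=z) = log p(z) + sum_i log p(x_i | z)."""
    return np.log(prior[z]) + sum(
        np.log(likelihood[z, vocab.index(w)]) for w in words)

def map_estimate(words):
    """z* = argmax_z log p(x, z); log space avoids numerical underflow."""
    return max(range(len(topics)), key=lambda z: log_joint(words, z))
```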
This posterior will appear again later, and we will understand better why it is called the posterior. But for now, at least you can see why this z* is called the maximum a posteriori: by maximizing the joint, you also maximize the posterior, because the only difference between them is the marginal distribution of the data, which is a constant, since we assume the message is given. Whether you maximize the joint or the posterior, you get the same result; it's just that the joint is easier to calculate than the posterior, so in practice you maximize the joint.

So that is the overhead that you have. In practical situations, this overhead will typically be positive. It is never negative, and it will typically be strictly larger than zero: if it were zero, that would mean the posterior puts probability one on a single value of z, i.e., a degenerate distribution. For any typical posterior that you will get out here, there is some nonzero overhead. And you can actually see this on the problem set. If I paste the image again, in the final result you will see that the overhead of the MAP method, the orange line, is very large for very small messages. The plot shows bits per symbol, the normalized bitrate; the overhead becomes lower and eventually negligible for long messages, but for short messages it is substantial. So this is what you find even empirically. And you can see in this graph that over a wide regime of message lengths, this method is maybe not such a good idea.

Let's think about a different idea. Before we get to it (it will be the bits-back coding algorithm), we have to take a closer look at this posterior distribution, because it will play a central role in the bits-back coding algorithm. The process of calculating the posterior distribution is called Bayesian inference, and that will be our next chapter.

Again, just to keep it self-contained: we have a model whose joint distribution is p(x, z) = p(z) * p(x | z). We assume that we know x, at least on the encoder side, and let's assume we don't know the value of z; that is why we get this overhead. If we knew z, then the posterior would have only a single peak with probability one, and the overhead would be zero. If we don't know z, then the posterior is broader, with values that are not exactly one, and the negative logarithm of something smaller than one is positive; that is why the MAP estimate method has an overhead. But even if we don't know z precisely, just observing the message x does typically tell us something about the latent variable: we may not know z precisely, but we may know some information about z.
You can even see this if I scroll up again to our motivating examples, the newspaper headlines. If you consider those examples, we were able to deduce that the first two sentences probably came from the politics topic, even though nobody told us; just by looking at the words, we were able to deduce what topic they came from. And for the other two sentences, we were able to deduce that they probably came from a sports topic. So even in this toy example, we were able to deduce some information about these topics.

But it is not always this easy. For example, look at the following sentence, which I've pasted here: "Parliament votes on aid for community sports teams". Above, we were able to find out what the value of the latent variable was; here, however, there can still be some ambiguity about z, even after you know x. We see the word "votes", which points to the politics topic, but we also see the word "teams", which we would expect to find in the sports topic. So this sentence could appear in a newspaper either in the politics or in the sports section, and we are somehow uncertain about it.

That means that in order to make a statement about the latent variable, we shouldn't flatly say "the topic is politics" or "the topic is sports"; we should only make probabilistic statements about z, or rather, we can only make probabilistic statements. And these probabilistic statements are precisely the posterior probability distribution: p(Z | X = x), the distribution over all latent variables given that the message has a certain value. It is called the posterior distribution because it looks at the distribution of Z not in general (that would be p(z), the prior, which describes the probability of latent variables before we see any message; that is why it is called the prior) but after we have seen a specific message, at which point we typically know more about z. By the definition of conditional probability,

    p(Z = z | X = x) = p(Z = z) * p(X = x | Z = z) / p(X = x),

where you typically have to calculate the marginal in the denominator explicitly by summing, i.e., marginalizing, over all topics, which I'll call z' so that there is no confusion:

    p(X = x) = sum_{z'} p(Z = z') * p(X = x | Z = z').

This step is called Bayesian inference, and the left-hand side is called the posterior. In this context it should be easier to understand why p(z) is called the prior: it is what we know about the latent variable before we see any data. The term p(x | z) is called the likelihood. Bayesian inference is something that we will be concerned with a lot throughout the rest of the course. A small sketch of the computation follows below, and after it, two remarks about Bayesian inference that you should understand.
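In the toy model, Bayesian inference is a direct application of this formula. The sketch below works in log space and normalizes at the end, which is the numerically safer route (it reuses `log_joint` and the model arrays from the sketches above):

```python
def posterior(words):
    """p(z | x) via Bayes' rule, computed in log space for stability."""
    log_p = np.array([log_joint(words, z) for z in range(len(topics))])
    log_p -= log_p.max()          # shift before exponentiating
    p = np.exp(log_p)
    return p / p.sum()            # normalizing implicitly divides by p(x)

print(posterior(["goal", "team"]))   # ~[0.06, 0.94]: clearly sports
print(posterior(["votes", "team"]))  # [0.5, 0.5]: genuinely ambiguous
```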
The first remark is that we haven't really defined anything new here; this just follows the usual definition of conditional probabilities. So in principle, the posterior is fully determined once you know the joint distribution p(x, z), which is often simply called "the model", and the data x on which you condition. In principle you can always calculate it, because it is defined by the equation above, and you know all of its parts. In practice, however, calculating the posterior is often prohibitively expensive. For that reason, just as a remark, we will introduce in lecture 7 approximate Bayesian inference methods that calculate this posterior more efficiently for large models. And interestingly, one way to derive these approximate inference methods is, again, simply to minimize the bitrate; so they are very well motivated from a compression perspective.

Okay, so much for the remarks on Bayesian inference. We'll now use Bayesian inference to come up with a better compression method that has a lower overhead, or that in some sense actually has no overhead. And to understand it, I think it is a good idea to first understand, in an intuitive way, where this overhead really comes from. For that, it is very instructive to look at our example sentence again: "Parliament votes on aid for community sports teams". We said that for this example sentence, we cannot really say with certainty whether the topic is politics or sports. So if we were to compress this sentence (call it x*) with our naive method, we have a choice: we could encode x* in two different ways. Way (a): we decide that the topic is politics, so we encode z = politics, and then we use the model p(x | z = politics) to encode the actual message. Way (b): we encode z = sports, and then we use the other conditional model, p(x | z = sports), which is now conditioned on the sports topic.

What is important to realize now is that these two different ways of compressing the same message lead to two different compressed bit strings, two different compressed representations of the same message. Now, your encoder may well be deterministic; maybe it always chooses the politics topic, because for this particular headline it assigns a slightly higher posterior probability to politics than to sports. But even if your encoder only ever chooses one way to encode the message, the very fact that these two different compressed bit strings exist for the same message shows that it is not an optimal compression method, because assigning two different bit strings to the same message is wasteful. And the reason is simply that the other bit string, the one you get from way (b), even if you never use it, could otherwise have been used for some other message.
But now that bit string is not available for any other message anymore; it is wasted. Some other message will now probably have to be mapped to a longer bit string, because this one is no longer available. This is how we can understand the overhead of the MAP estimation method: the overhead exists because there are different ways to encode the same message. And if you remember, the overhead was actually exactly the information content of z* under the posterior. The broader your posterior, i.e., the less certain you are about z, the more options you have here, the higher this information content, and the larger your overhead. So the intuitive picture, that the more ways you have to encode the same message the bigger your overhead, is exactly reflected in that equation: the choice between (a) and (b) carries some amount of information, this information can be measured, and it is exactly the overhead that you have.

So I formulated this in a negative way. But now that you understand that the overhead is exactly due to this freedom of choice, you can also think of it in a positive way, and that leads us to bits-back coding. The idea is: if I have this choice, and this choice is exactly what leads to the overhead, why can't I use this choice to my advantage? Why can't I piggyback some additional information, not part of the message, but something else that I may also be interested in transmitting, into this choice of the latent variable? I will make this more concrete in a second, but the rough idea is: piggyback some additional message into the choice of z.

This is a very general idea. We will actually see in the next video that if you take this idea to an extreme, you can already derive a very simple general-purpose lossless compression method that outperforms Huffman coding. But here we are applying it to these larger latent variable models, which could also be deep probabilistic models, or could represent large messages. This important idea came up as early as 1990 with Wallace, and it was presented again by Hinton and van Camp in 1993. These were the early approaches to bits-back coding, but at that point it was more of a theoretical idea; it was not really clear how you would piggyback information into this choice in practice, in a scalable way. A practical implementation of bits-back coding really only came around with the so-called BB-ANS method ("bits-back with asymmetric numeral systems") by James Townsend and collaborators in 2019, so a very recent method. You will learn about this ANS method in the next video.
For a long time after that, it was believed that bits-back coding was only useful for lossless compression. But, full disclosure, this is a paper I was involved in myself, though I think it is important to highlight: we showed, with the student Yibo Yang, that you can also use bits-back coding for lossy compression if you add some additional tricks.

So let's make this a bit more concrete. What do I mean by piggybacking some additional message into the choice of z? Well, let's consider a more realistic setup where you want to transmit some data: a setup where you want to communicate not only a single message but a whole sequence of messages over the same channel. When I say multiple messages, you could think of a web page that contains several images, all of which you want to communicate. But even within a single image or video, it may make sense, for computational-efficiency reasons, to treat patches of that image as individual messages. So in a practical setup, these multiple messages could also just be multiple image patches, and you want to communicate them over a single channel.

How would you do this? Usually as follows. In the usual setup, you take your messages, message 1, message 2, and message 3, and in a first step you encode each one of them, i.e., compress them into some compressed bit string (I'm just going to make up some bits). Then you concatenate them, so you end up with one long bit string: 0, 1, 1, 0, 0, 1, 1, 1, 0, 0. In practice you would have to think about delimiters here, but we will actually see in the next lecture that this is easily taken care of; so don't worry too much about how to preserve the message boundaries. So this is the concatenation, on the encoder side. On the decoder side, you can take the concatenation, split it up again, and decode, so you get back message 1, message 2, message 3. That is the usual setup.

Now, the idea in bits-back coding is that you do things a bit differently and think about the whole sequence in a more holistic way. In bits-back coding, you again start with your messages, and you encode the first message like in the usual setup. But then, remember: when you encode the second message, you have to make this choice of what value to use for the latent variable Z. So wouldn't it be nice if you could encode some part of the earlier compressed bit string into this choice of the latent variable? Wouldn't it be nice if you could chop off some bits here and take them into account when you choose the value of the latent variable for encoding message 2? Then you get some different bit string out. And then, once you encode the third message, you again have to make a choice for the latent variable (what the topic was, for example, in our newspaper-headline example). So again, you just take some bits from what is already there.
Depending on the values of these bits, you make your choice of which latent value to use to encode the message, wherever there is ambiguity. And then, again, you get some bit string out. Once you have done this, you can concatenate the compressed bit strings as before, but now you only have to concatenate the leftover parts. (And again, don't think too much about the delimiters; we'll take care of them in the next lecture, in the next video.) So remember, for message 2 these remaining bits are only the part that is not already encoded in the latent-variable choice for message 3.

So why does that work? Why is that enough; why don't we have to include the shaved-off bits with the next message? Well, because now a decoder can come along and start at the end of the concatenation. This is an important property of this method: it has to decode in reverse order. The decoder can just decode the last chunk, and it gets back message 3. But what it will also get back is the choice of the latent variable: remember, the decoder first has to decode the latent variable, because only once it has the latent variable can it decode the actual message. So it will also get back the bits that were used to choose the latent variable for encoding message 3. As the decoder, you can now take those recovered bits and concatenate them back onto the remaining bit string, so you have the bit string 0110011100 again. In the next step, you can take these bits and decode them into message 2. But again, you will also get some bits back, and those will be exactly the bits that were shaved off when message 2 was encoded. You can again concatenate them with the remaining part, and you will have back the original encoded representation of message 1, which you can now decode into message 1.

So, to recap: you start with your sequence of messages, and you have to encode them in order, message 1, then message 2, then message 3, because when you encode message 2, you shave off some bits from the part that has already been compressed, and when you encode message 3, you again shave off some bits from the already available bit stream. You end up with a shorter bit string in total, because it doesn't include the shaved-off parts. In order to decode this, you have to start at the end, because only the last message was fully encoded. But once you decode the last message, you get those bits back; we will see in a second how this is done technically, but intuitively, you can think of them as being encoded in the value of the latent variable that you get back. Once you get that part back, you can then decode message 2; that gives you back the remaining saved bits, with which you can decode message 1. So bits-back coding operates as a stack. That is important to remember: a stack, i.e., a last-in-first-out data structure.

Okay, so that is the overall idea. How do you actually do this? Well, let's just write down an algorithm that does precisely this.
So let me write out the algorithm for bits-back coding right next to the picture, so that you can compare the two. Let's first write out the encoder. (But when you go through these notes, I actually encourage you to read the decoder first, because it is probably easier to understand why things are done on the decoder side; from that follows what you have to do on the encoder side.)

We have a subroutine encode, which takes a message x, a part of the already compressed bit string (I'm going to call it `compressed`; it is the bit string of the things we have already encoded, so if we are encoding the second message, it is what has already been encoded, and if we are encoding the third message, it is the concatenation so far), and a model, which it needs in order to construct the optimal compression codes as usual.

Step 1: we are in the encoder, but the first thing we do is shave off some bits from `compressed`, and we use them to decide which latent value we will choose. How do you shave off bits? You just decode them. So we are in the encoder, but we are actually decoding now; that is the first step. We set z to the result of decoding from `compressed` (`compressed` is a variable name here), using a coder that is built on, and I'm just going to state it now, we'll see in a second that it is really the optimal model to use, the posterior distribution p(z | x). Remember, the encoder knows the message, and it also knows the model, so it can, in principle at least, calculate the posterior. It can then build a compression method from that probability distribution, maybe a Huffman code or something, and use that code to read a code word off the bit string, which it decodes into a value z.

Step 2: we now do precisely what we did in the MAP approximation as well: we encode x using the model p(x | Z = z), where z is precisely the value from step 1.

Step 3: we encode z, using the prior distribution; you'll see in a second why we do that at the end.

All right. So far I haven't really argued much; I'm just saying that these are steps you can do, and at every step you certainly have all the information needed to do them. In order to show that this is actually a useful compression method, I now have to present you with the decoder, and I have to convince you that it decodes back to the same original message. So let's look at the subroutine decode, which takes a compressed bit string and a model, the same model. (And, obviously, the encoder has to return the compressed bit string: in step 1 we decode from `compressed`, in steps 2 and 3 we encode onto `compressed`, and at the end we return it.)

What do we do in the decoder? Well, we have to reverse these steps. Since the whole setup is a stack, we assume that encoding appends to the end of the compressed bit string,
and that when we decode, we read off from the end again. (In the next lecture, we will see a method that can very easily encode and decode in such a stack manner, i.e., with stack semantics.) Therefore, if we want to invert all these steps, we have to do them in reverse order, because we can only pop things off the stack in reverse order.

First, we invert step 3: we take our compressed message and decode z from the end of it. We have to use the same model, so we use the prior to construct our decoder.

Then we invert step 2: now that we know z, we can decode x from `compressed` (again, `compressed` is just a variable name for the bit string, which shrinks every time we decode something from it), using the model p(x | Z = z); we can do this because we know z.

Finally, we have to invert step 1. So we do Bayesian inference again, and, even though we are in the decoder, we now encode z onto `compressed`, using again the posterior p(z | x). These will be exactly the same bits that we decoded from it in the encoder's step 1, because we use the same model and the same value.

And then we return, and this is important, two things: we return the decoded message, but we also return the compressed bit string, because it will actually be interesting to the caller; this compressed bit string now contains the additional bits that we encoded back onto it.

All right. With this, half of the argument is done: I have shown you that you can implement a routine that does roughly what is depicted here pictorially. On the problem set, you will actually implement this in practice, and you will see that it is very easy to implement, as long as you have encoding and decoding primitive routines with stack semantics; a sketch of both subroutines is shown below.
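This sketch assumes a hypothetical coder object with stack (last-in-first-out) `encode`/`decode` primitives (we will construct a real one, ANS, in the next video), and `model.posterior(x)` stands for exact Bayesian inference. One practical detail appears as a comment: with a stack, the encoder pushes the words in reverse so that the decoder pops them back in the original order.

```python
def bb_encode(x, compressed, model):
    # Step 1: "decode" z from bits already on the stack, using the
    # posterior. (Assumes `compressed` already holds enough bits,
    # e.g. from previously encoded messages.)
    z = compressed.decode(model.posterior(x))
    # Step 2: encode the message given z; reversed so that decoding
    # pops the words back in their original order.
    for word in reversed(x):
        compressed.encode(word, model.likelihood(z))
    # Step 3: encode z with the prior.
    compressed.encode(z, model.prior)
    return compressed

def bb_decode(compressed, model, length):
    z = compressed.decode(model.prior)            # inverts step 3
    x = [compressed.decode(model.likelihood(z))   # inverts step 2
         for _ in range(length)]
    compressed.encode(z, model.posterior(x))      # inverts step 1: the
    return x, compressed                          # "bits back" reappear
```

Notice how the decoder ends by pushing the shaved-off bits back onto the stack, where the caller can continue decoding the previous message.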
But now I have to convince you that this is actually a good idea, by which I mean that it has better compression performance than our simple MAP estimation method. So let's look at that. What is the compression performance of this method? What is the net bitrate of bits-back coding? Why do I call it the net bitrate? Well, bits-back coding, in this scheme, maps a message to a bit string, but that bit string also contains some bits from the previous message. So if we are interested in encoding a lot of messages, and all we care about is producing a short bit string in total, we should discount those piggybacked bits from the length attributed to this message. The net bitrate for message 3 is the length of its encoded representation minus the number of bits that we piggybacked into it, because that is additional information that we will get back out at the end when we decode.

The net bitrate for a message x is therefore quite similar to the MAP estimation method: we encode the message conditioned on z, and we encode the value of z using its prior probability. But to get the net bitrate, we have to deduct from that what we got back: we obtained z by decoding some bits from the existing compressed bit string, and we did that with the posterior distribution, so that decoding consumed a number of bits that is precisely the information content of z under the posterior. (You'll notice that I am leaving out all the rounding up to whole bits, and that actually has a reason: in the next video we will learn about so-called stream codes, which can effectively deal with fractional numbers of bits, so it really makes sense to say that we got a fractional number of bits back here.)

So let's calculate it. Using the properties of the logarithm, the first two terms combine into the joint, as in the MAP estimation method, and the deducted term is the conditional distribution, which I can write as the joint normalized by the marginal distribution of the data:

    R_net(x) = -log2 p(x | z) - log2 p(z) + log2 p(z | x)
             = -log2 p(x, z) + log2 ( p(x, z) / p(x) )
             = -log2 p(x).

Everything cancels except the marginal: this simplifies to just the information content of the data under the marginal distribution. And this, I think, is a really exciting result. Our goal was to encode the message x, and in order to do it, we did a lot of gymnastics with latent variables; we decoded latent values from seemingly unrelated bit strings that were given to us from previous messages, and things like that. But in the end, what we get is that the net bitrate is just the information content of the data, which is exactly what an optimal lossless compression method would achieve. Therefore, bits-back coding is optimal: it reaches the optimal compression performance, at least net. For a sequence of many messages, the total compressed bit string will approach the sum of the lower bounds.

And, to advertise the problem set one last time: when you implement this, you will indeed see that bits-back coding, which is the green line here, empirically has the lowest bitrate. This is the real number of bits, not some hypothetical net bitrate: you encode a sequence of messages, and at the very end you really just count the number of bits in the result. You will see that bits-back coding outperforms both other methods across the whole spectrum of message lengths. In extreme cases, one of the other methods we discussed is good enough, but overall, bits-back coding is always optimal.
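As a quick sanity check, here is the cancellation verified numerically in the toy model from before. The identity holds for every z, which is the crux: the net rate does not depend on which latent value the bits-back step happens to select. (Values are in nats here, since the identity is independent of the base of the logarithm.)

```python
x = ["votes", "team"]                  # the ambiguous headline fragment
post = posterior(x)
for z in range(len(topics)):
    net = -log_joint(x, z) + np.log(post[z])  # -log p(x,z) + log p(z|x)
    print(net, -np.log(p_message(x)))         # equal for both topics
```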
And this, I think, is a really exciting result. Our goal was to encode the message x, and to do so we did quite a lot of work with latent variables, decoding them from unrelated bit strings left over from previous messages, and so on. But in the end, the net bit rate is just the information content of the data, which is exactly what an optimal lossless compression method would achieve. So bits-back coding is optimal: it reaches the optimal compression performance, at least net; for a sequence of many messages, the total compressed bit string will approach the sum of these lower bounds.

And, to advertise the problem set one last time: when you implement this, you will indeed see that bits-back coding, the green line in the plot, empirically achieves the lowest bit rate, and this is the real number of bits, not some hypothetical net bit rate: you encode a sequence of messages and, at the very end, simply count the bits in the resulting bit string. You'll see that bits-back coding is the best method across the whole spectrum of message lengths; in extreme cases, one of the other methods we discussed can be good enough, but overall, bits-back coding is optimal.

This result, that bits-back coding is optimal, also justifies, if you allow me to scroll up again, why we used the posterior here: we showed that by using the posterior distribution to decode values of z from the compressed bit string, we obtain optimal compression performance. Had we used some different distribution, we would, in general, not get optimal compression performance, and that is indeed something we will be concerned with later.

So what are the next steps? This result concludes our discussion of bits-back coding for now, but there are some interesting things left to think about. One is: how do we actually do this encoding and decoding in a stack manner, and how do we encode and decode fractional numbers of bits? All along I was talking about getting some information content back, which is certainly not an integer; how can that actually work? That is what we will learn in the next video: a code that can encode and decode fractional numbers of bits, with stack semantics, so last in, first out. The reason I haven't shown you this yet is that the method we will learn, asymmetric numeral systems, can itself be interpreted as an application of the bits-back trick. You will see that applying this trick in a certain extreme case very naturally leads to a method that can encode and decode very easily with stack semantics, and for which it makes sense, in a certain way, to think about fractional numbers of bits, so you no longer have to round up as you did in symbol codes.

Another question that will be interesting to answer: to implement bits-back coding, we needed the posterior distribution, but I told you that in practice you often can't get the exact posterior. So what if we don't know it? We will see that you can actually use any distribution in place of the posterior; it just will not lead to equally good compression performance. Therefore, what you can do is take some parametrized model and optimize it for compression performance. That was quick, but in lecture seven we will make this concrete and introduce approximate Bayesian inference methods, and you will see that one very popular method, called variational inference, falls out directly from minimizing the net bit rate of bits-back coding, with no additional argument required.
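To make this plausible already now, here is a sketch of the corresponding calculation; it anticipates lecture seven and assumes that we decode $z$ using some approximate distribution $q(z\mid x)$ in place of the exact posterior, so that the decoded $z$ is effectively distributed according to $q$:

$$
\mathbb{E}_{q(z\mid x)}\!\bigl[-\log_2 p(x\mid z) - \log_2 p(z) + \log_2 q(z\mid x)\bigr]
= -\log_2 p(x) + D_{\text{KL}}\bigl(q(z\mid x)\,\big\|\,p(z\mid x)\bigr).
$$

The overhead over the optimal rate is exactly the KL divergence from $q$ to the true posterior, so minimizing the expected net bit rate over a parametrized $q$ amounts to making $q$ as close as possible to the true posterior.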
And finally, we will be concerned with learning. So far, I motivated everything with rather toyish models, and the model on the problem set will also be toyish; in particular, we have assumed that the distributions p(z) and p(x|z) are given, that they have a given mathematical form. But in machine learning, you are often dealing with complicated data for which you can't really write down these probability distributions, so we want to learn them. The question we will then be concerned with is how we can efficiently train such models, and in particular deep latent variable models. This will lead us to two important concepts, again in lecture seven and the subsequent ones. On the algorithmic side, it will lead to an algorithm called variational expectation maximization. And on the modeling side, we will discuss various so-called deep generative models. These models are quite different from what you may be familiar with from, say, supervised machine learning, where you are typically interested in the opposite direction: you get some image, and you want to detect something in that image. In deep generative models, you are concerned with the generating direction: how can you generate an image, or a video, or some text?

These are all topics that we'll discuss later, but a lot of them will follow directly from considering this net bit rate of bits-back coding and then minimizing it, adding constraints, or asking how we can do things faster or better.

Before I wrap up, let me highlight one last time: if you were a bit confused by this derivation, you're in good company. In my experience, whenever bits-back coding comes up at a conference or a seminar, people seeing it for the first time generally get quite confused by it. So I really encourage you to do the problem set (the link is in the video description), where you only have to fill in a couple of key steps; most of it is already implemented for you. You will really see how it works, and I think it will help you a lot in understanding this algorithm to get your hands dirty, actually implement it, and see that it works. Then you can play around and ask: what if I did some steps differently, would it still work? You'll probably see that it will not, and then you can understand why every single step in this algorithm really has to be done the way it is done. So with that, have fun with the problem set.