Welcome back to the course on data compression with deep probabilistic models. This video marks a pivotal point in the course. So far, we've covered mostly topics from source coding and from information theory. But starting with this video, we'll also think about probabilistic machine learning methods. This video will cover the basics of the connection between source coding and probability theory, and I'm aware that these basics may be a repetition for some viewers. But in subsequent videos, we will build upon these basics, and we will go back and forth between source coding theory and probabilistic machine learning, because we will see that these two sides of the coin interact in intricate ways. So let's jump in. In order to understand how source coding and probabilistic machine learning play together, let's remind ourselves of the bigger picture. In one of the first lectures, we looked at the general setup for communication over a noisy channel, where we have a sender and a receiver, and the sender has some data source that's not under our control. The data source generates a message, and we want to send that message over a noisy channel so that the receiver can reconstruct it. We saw that we have to encode and decode the message. And then I told you, and we will show this in a later part of the course, that we can always split the encoder, at least in theory, into two parts: one called source coding and one called channel coding. The names of these two parts reflect that, for example, the channel coding part only needs to know properties of the noisy channel, so it can completely ignore any properties of the data source.
The source coding part, in turn, can assume that if channel coding is already done optimally, then all properties of the noisy channel are already taken care of, and the source coding part itself only has to know properties of the data source. When we looked at symbol codes, we already saw that the properties we need to know about the data source are encoded in a probabilistic model, so we need a probabilistic model of the data source. For example, with symbol codes, we had to look at the probabilities of symbols so that we could construct an optimal symbol code using a Huffman tree. On a higher level, the idea of source coding is basically that you have a probabilistic model that makes some predictions of what's going to come next. These predictions aren't necessarily deterministic, they can still have some uncertainty, but they're still predictions, and the idea is: you don't transmit what you can predict. That's a good slogan to keep in mind when you think about source coding. It's: don't transmit what you can predict. So if you can predict that the next letter in your natural language model is very likely going to be an E, then the probability of the next letter being an E is very high, which means that the information content of that letter being an E will be very low, so you will need very few bits to transmit the next letter. Now, when I say we need a probabilistic model of the data source, that model could be good or bad, and the models that we've looked at so far are actually quite simplistic. So we would assume that if we make these models better, if we can model the data source better, then we get better compression performance.
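As a quick illustration of this slogan, here is a minimal sketch (with made-up probabilities) of how a symbol's predicted probability translates into its information content, and hence into the number of bits needed to transmit it:

```python
import math

def information_content(p: float) -> float:
    """Information content ("surprisal") of an outcome with probability p, in bits."""
    return -math.log2(p)

# A highly predictable symbol carries little information, so it is cheap to transmit:
print(information_content(0.5))    # 1.0 bit
print(information_content(0.125))  # 3.0 bits: a rarer symbol costs more bits
```

The higher the model's probability for the next symbol, the fewer bits an optimal code spends on it.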
So qualitatively, we have the idea that better probabilistic models lead to better compression performance. I think this should be qualitatively clear, but the first goal of this video is to make this statement quantifiable. That leads us to our first topic for this video: quantifying the effect of a modeling error. So what happens if the probabilistic model is not correct? By quantifying, I mean I really want to measure how many bits we lose from having an incorrect model. The quantity that will turn out to be relevant here is the so-called Kullback-Leibler divergence. That's our first topic for this video. In order to quantify the modeling error, let's consider a very general lossless compression setup. What I mean by that is that we're no longer restricted to symbol codes, so we no longer restrict ourselves to the goal that every symbol has to be mapped to an integer number of bits. We've talked a lot about symbol codes in the previous videos, and although we now relax this constraint, we will see that we can still use a lot of what we've learned about symbol codes in this more general setup. So we have a general lossless compression setup. That means we have some data source that generates messages, which I'm again going to call X, and it generates these messages with some probabilities. We don't know what this data source generates; if we knew, then we wouldn't have to transmit the data at all. And I'm going to call the probability p_data, because I want to be explicit that it's the probability of the data source.
Again to be explicit, this is now a probability distribution over entire messages, not just over symbols. So far we've always looked at probabilities of symbols, but now we're really thinking about the probability of an entire message. And then, since we're doing compression, we're obviously interested in what I will call the bit rate. So I'm going to define the bit rate R of a message X, and that's just, as you would expect, the total number of bits in the compressed representation of X for some given lossless compression method. So we assume that we have some given lossless compression method, we want to see how well it performs, and we can just apply it to some message and count how many bits it spits out. This bit rate is similar to the codeword length that we had in symbol codes, but now I'm really interested in the bit rate of the entire message, not just of a single symbol in that message. In many cases, what we're interested in is the expected bit rate; we were interested in building a compression method that minimizes the expected bit rate. So we're interested in the theoretical lower bound: the lowest theoretically possible expected bit rate. That's not necessarily always the best metric to look at; sometimes you may be interested in a general upper bound over all messages that you could encounter, but in many cases you are interested in the expected bit rate. Now what is that optimal expected bit rate? Well, it's the expectation, over X coming from the data source, which I'm going to denote as X sampled from p_data, of the bit rate, which I'm going to call R_opt, because I'm now assuming we have an optimal lossless compression method.
Now what can we say about this optimal bit rate? We haven't really discussed this so far because we've only been considering symbol codes, but we can actually use what we've learned about symbol codes: at least in theory, you could construct a symbol code over the whole set of possible messages. So you're considering the entire message as a single symbol, and that means the entire set of possible messages that your data source could generate becomes the alphabet, which will probably be a giant alphabet. If you think about image compression, then your whole alphabet is the set of all possible images. So this is not really a practical symbol code, you would have to construct an enormous codebook, but at least it means you can apply all the theory that we've learned about symbol codes. The theory tells us that, with the message X now being a single symbol from this giant alphabet, the expected codeword length, which is the expected number of bits for that symbol, is in the optimal case essentially the entropy of p_data plus some epsilon. So we get the same result here: the expected bit rate in this interpretation is the entropy of the data distribution plus some epsilon, where we learned that this epsilon is less than one bit. In symbol codes we looked at this overhead quite a bit, and we compared it between, for example, Shannon coding and Huffman coding, and we saw that it really makes a difference for symbol codes. That is because for symbol codes we would here have the entropy per symbol, and that entropy per symbol could be very low, so even an overhead of less than one bit could make a difference.
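To make this lower bound concrete, here is a small sketch that treats each whole message as one symbol of a giant alphabet and computes the entropy of an invented message distribution, the theoretical lower bound on the expected bit rate:

```python
import math

# Invented distribution p_data over four possible (whole) messages:
p_data = {"msg_a": 0.4, "msg_b": 0.3, "msg_c": 0.2, "msg_d": 0.1}

def entropy_bits(probs):
    """Shannon entropy H(p) = E[-log2 p(x)], in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H = entropy_bits(p_data.values())
print(H)  # about 1.846 bits: no lossless code can beat this expected bit rate
```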
But now that we're looking at the bit rate of an entire message, which for an image could be on the order of megabytes, one bit is really irrelevant. So this epsilon can now typically be neglected. All right. This is an important relation to keep in mind. It just follows from what we've learned about symbol codes, reinterpreted in a different way. It tells us that there exists an optimal compression method that, in expectation, can compress data to its entropy in bits. In what follows, I'm actually going to leave out this epsilon because it is so small. Now, of course, we've also seen that these symbol codes would not really be practical, but we will learn in later videos that there are so-called stream codes with which, in many situations, you can actually construct a lossless compression method that comes very close to this lower bound and that can be implemented efficiently. Another thing that we learned when we looked at symbol codes, and that directly generalizes to this more realistic setup, is that, as a reminder, this is only a statement about the optimal value of the expected bit rate. But we also saw, when we derived this for symbol codes, that in order to reach this optimal bound, we actually have to satisfy this equation not only in expectation but for every value. What do I mean by that? Well, as you remember, the entropy is defined as the expectation over p_data of the information content, and I typically mean the log with base 2 when I just write log. What we saw is that if you want to build a lossless compression method that reaches this lower bound on the expected bit rate, then the only way to do this is to satisfy this equation for every message that you can have.
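The pointwise version of this statement can be sketched as follows (with an invented message distribution): if each message is assigned a bit rate equal to its own information content, the expectation of these pointwise bit rates recovers the entropy exactly:

```python
import math

# Invented distribution over four possible messages:
p_data = {"m1": 0.5, "m2": 0.25, "m3": 0.125, "m4": 0.125}

# Pointwise optimal bit rate: R_opt(x) = -log2 p_data(x) for EVERY message x
r_opt = {x: -math.log2(p) for x, p in p_data.items()}

# Taking the expectation of the pointwise bit rates recovers the entropy:
expected_bitrate = sum(p_data[x] * r_opt[x] for x in p_data)
entropy = -sum(p * math.log2(p) for p in p_data.values())
assert abs(expected_bitrate - entropy) < 1e-12

print(r_opt)  # {'m1': 1.0, 'm2': 2.0, 'm3': 3.0, 'm4': 3.0}
```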
So to reach this optimal expected bit rate, a compression method, and that's what we saw when we derived this lower bound by taking derivatives in the optimization problem over real numbers, has to be close to this lower bound, with only a small rounding error on top, which in our setup is negligible. So in order to reach this optimal expected bit rate, a compression method has to satisfy the relation that R_opt of x, and I'm not talking about the expectation value, I'm talking about the individual bit rate for every individual message, has to be the negative log, so the information content, of the probability of that message, plus, again, some small epsilon. This is also an important relation to remember. And we saw this in the problem sets: you can actually beat this bit rate for individual messages, but if you do that, then you can no longer reach this optimal bit rate in expectation. So if you have the optimal bit rate in expectation, then you also have this relation for every data point, so for every message x; that should also be highlighted. Now we want to look at how this picture changes if you actually don't know p_data. That's what I mean by the modeling error: how much does it cost if you have a wrong model of the data? So we want to remind ourselves about the problem here, which is that in practice we don't know the data distribution. We have some data source, maybe a camera, and if you just look at a camera, I bet you won't be able to write down a probability distribution over all images that it can produce. You don't know p_data. So we have to distinguish two probability distributions.
One is p_data, which we just looked at, which is the true probability distribution of data coming from the data source. We have to distinguish that from what we use in order to construct a compression method, which is just a model of the data source, and in general that model will not be perfect. So again, the true probability distribution is something that we don't know. But in many cases, we will have some samples from the data distribution. If you think about image compression, we may have a directory of images that have been taken with the same camera, or with a similar camera. So we may have a set of samples from p_data. What that means is, since we don't know this probability distribution, we cannot evaluate it anywhere, so we cannot actually calculate the information content of any image or any message that we get from that data source. But since we have a data set, what we can do is evaluate what I'm going to call empirical averages. We can take these samples, then for each sample calculate the bit rate; that's something we can do if we have some compression method. We can just apply the compression method to all these samples, count how many bits we get, and then average over the samples. That would be an empirical average, and it will help us estimate true expectation values. So if you allow me to hide this for just a second: here we were looking at the expectation value of the bit rate under the data distribution. We cannot evaluate this expectation value, because for that we would have to know the probability of every single message under the data distribution, and we don't know that. But we can take some samples from our data set and average the bit rate, and that gives us an estimate of the left-hand side.
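The idea of estimating an expectation by an empirical average can be sketched like this (the data source and the per-message bit rates here are purely hypothetical stand-ins):

```python
import random

random.seed(0)

# Hypothetical data source we cannot write down, but can sample from:
samples = [random.choice("aab") for _ in range(10_000)]  # 'a' twice as likely as 'b'

def bitrate(message: str) -> int:
    """Stand-in for running a real compressor and counting the bits it emits."""
    return 1 if message == "a" else 2

# Empirical average over samples estimates the true expectation E_{x~p_data}[R(x)]:
empirical_avg = sum(bitrate(x) for x in samples) / len(samples)
print(empirical_avg)  # close to the true value 1*(2/3) + 2*(1/3), about 1.333
```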
The reason why I'm being so explicit here is that I want to highlight these words: empirical averages and estimating the true expectation values. Those are words that will appear a lot in the machine learning context. On the other hand, if you look at the model distribution, for now we're going to make a very simplifying assumption: we're going to assume that we can explicitly evaluate p_model of x for all x. We will later see some scenarios where we can't even do that, where even the model distribution has to be estimated in some way, but for now, for simplicity, we're going to assume that we can always evaluate it. So when we now build a compression method, we can only make it optimal with respect to this p_model. The compression method will now be optimal with respect to this probability, and its bit rate for every data point will be, up to some small rounding error, the information content under the model distribution, not under the data distribution. So an optimal lossless compression code will now be optimal with respect to p_model, and it will therefore have the bit rate R of x equal to the information content of that message under the model distribution. So now we may again be interested in: what is the expected bit rate that we get if we have some model that might be a bit wrong? Well, the expected bit rate, and this will build the bridge to machine learning, because you will now see a quantity that appears often in machine learning, is the expectation over the messages of this bit rate R of x, which is exactly this R.
But now, importantly, when we are interested in the expected bit rate, we are not interested in the expectation under the model distribution, because in the end this model is just something that we build as a proxy for the data distribution, which we don't have. What we're really interested in is: once we have this compression method and we apply it in the wild and get new samples from the true data distribution, what is the expected bit rate that we get? So here we have to take the expectation under the data distribution. In total we get the expectation of x from the data distribution, but of the information content under the model distribution. If you have trained some machine learning models before, this term will look familiar to you, because it is nothing else but the so-called cross entropy between the data distribution and the model distribution, which in many machine learning models is the loss function that you minimize anyway. So this is again an important relation: if you look at how many bits it costs to compress some data with a not quite correct model, it's really the cross entropy. And that's really great, because now you can train a machine learning model that you maybe want to use for compression, and the nice thing is you don't even have to build a compression code for it. You can separate the task of training the machine learning model from the task of compression, because you don't need to build a compression method in order to see how many bits you need in expectation to compress some data under the model. You can just evaluate or estimate the cross entropy, and the way to estimate it is to calculate the log probability of your data points under the model distribution, which we assume we can calculate, and then average that over the empirical samples from the data set that we assume to have, which in machine learning we would call a training set.
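Here is a sketch of this relation with invented distributions: the expected bit rate under a mismatched model is the cross entropy, and it can be estimated from samples without ever building the actual compressor:

```python
import math
import random

random.seed(1)

p_data  = {"a": 0.7, "b": 0.2, "c": 0.1}  # true distribution (unknown in practice)
p_model = {"a": 0.5, "b": 0.3, "c": 0.2}  # imperfect model we can evaluate

# Exact cross entropy H(p_data, p_model) = E_{x~p_data}[-log2 p_model(x)]:
cross_entropy = -sum(p_data[x] * math.log2(p_model[x]) for x in p_data)

# In practice we estimate it by averaging -log2 p_model(x) over a training set:
training_set = random.choices(list(p_data), weights=p_data.values(), k=10_000)
estimate = sum(-math.log2(p_model[x]) for x in training_set) / len(training_set)

print(cross_entropy)  # about 1.280 bits/message
print(estimate)       # close to the exact value
```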
So you've now seen that the loss function that you minimize anyway in many machine learning setups can actually be motivated from a compression perspective, because many machine learning methods are actually minimizing the cross entropy, and therefore the expected bit rate that we would get out of it. Let me state that explicitly. If you have some free parameters in your model and you want to tune them, then the way to do this for compression is to minimize the cross entropy over the parameters of p_model, for example the neural network weights, because that's the only thing we can change; we cannot change the data distribution, that's given. So this is the bit rate that we get. Let me give it a name: let's just refer to it as the cross entropy. So this is the bit rate that we get, and we also know that it cannot be lower than the theoretical lower bound, which, if you allow me to scroll up, we argued was the actual entropy of the data. With this lower bound and the actual practical bit rate that we will get, we can now calculate the overhead, and that will allow us to quantify how much it costs if our model does not reflect the data distribution, which it never will perfectly. So the overhead due to p_model not being exactly the data distribution, which in practice it will never be, is called the Kullback-Leibler divergence.
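This overhead can be sketched numerically with invented distributions: the KL divergence is exactly the cross entropy minus the entropy of the data distribution, and it is never negative:

```python
import math

p_data  = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical true distribution
p_model = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical imperfect model

def kl_divergence(p, q):
    """D_KL(p || q) = E_{x~p}[log2(p(x) / q(x))], in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

entropy       = -sum(p * math.log2(p) for p in p_data.values())
cross_entropy = -sum(p_data[x] * math.log2(p_model[x]) for x in p_data)

overhead = kl_divergence(p_data, p_model)
assert abs(overhead - (cross_entropy - entropy)) < 1e-12
assert overhead >= 0  # zero only if the model matches the data distribution

print(overhead)  # about 0.123 bits/message wasted due to the modeling error
```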
It is often denoted as D_KL; you will sometimes also just see it written as KL, but the notation is fairly consistent in the literature: you first write the data distribution and then the model distribution, separated by a double bar. The way this is usually pronounced is that you say this is the Kullback-Leibler divergence from the model distribution to the data distribution, so it's pronounced from right to left. This is the actual bit rate that we get, which is the cross entropy, minus the theoretical lower bound. There are several ways to write this more explicitly. One way that I think is worth keeping in the front of your head is that you can write it as an expectation under the data distribution: in both of these terms we're taking the expectation under the data distribution, and the only way they differ is in what we're taking the information content of. You can then rewrite this as the expectation, under p_data, of the log of the fraction between the data distribution p_data of x and the model distribution. Of course, we cannot actually calculate p_data of x, so we cannot calculate this KL divergence in practice, but we will need it to derive some theoretical statements. So this is the Kullback-Leibler divergence; both of these forms are important to remember. Since this is the actual bit rate minus the theoretical lower bound, and the actual bit rate cannot be lower than the theoretical lower bound, we know that the KL divergence cannot be negative. But to convince you of this in a more direct way, in problem 3.1c on problem set number 3, which is linked in the video description, you will show this directly: you'll prove that D_KL for any two distributions p and q, so from q to p, is never
negative, and this is called Gibbs' inequality. So now that we know how much it costs to have a model that doesn't precisely reflect your true data distribution, let's look at the models that we've considered so far, when we discussed symbol codes. So far we've actually looked at very simplistic models, which will therefore in most cases be very bad. The models that we've looked at so far were on messages where we assumed that these messages are sequences of symbols from some alphabet, and we have assumed that these models are just a product, over the symbols, where K is the length of the message, of the probabilities of all these symbols. (As a side remark, you would also need a p of K here, but the main part is that it's just a product of the probabilities of all these symbols.) In statistics, these kinds of probabilistic models are called i.i.d.: we have assumed that the symbols, so this is the probability of a single symbol, are i.i.d., which is short for independent and identically distributed. What does that mean?
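Before unpacking those two terms, the product structure of such an i.i.d. model can be sketched as follows (the per-symbol probabilities are invented); note that the information content of the whole message is just the sum of the per-symbol information contents:

```python
import math

# Invented per-symbol distribution of an i.i.d. model:
p_symbol = {"a": 0.5, "b": 0.25, "c": 0.25}

def iid_information_content(message: str) -> float:
    """-log2 of the product of per-symbol probabilities = sum of surprisals."""
    return sum(-math.log2(p_symbol[s]) for s in message)

print(iid_information_content("aabc"))  # 1 + 1 + 2 + 2 = 6.0 bits
```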
Well, these are two restrictions that we have put on ourselves so far. "Identically" is easy to understand, and it's also just a minor point here: it means that so far, for every symbol in the sequence, we've used the same probability distribution. But that's actually something we could easily drop. We could instead say that p_model of the message should again be a product, over i from 1 to K, of probabilities p_i, so a different probability for every symbol. The new part is that this probability distribution over the symbols now depends on the position where you are in the message. If you go through the proofs, you'll see that at least for prefix codes everything will still work out, and we've seen that prefix codes are really all we need. The simple way to think about this is: if you have a prefix code, then for decoding you just look at the message and read off bits until the bits you've read match one of the codewords in your codebook, and then you chop those bits off from your message. That can be done in a greedy way, so the decoder doesn't have to look forward at the next symbol, which will be distributed with a different distribution and therefore use a different codebook. So this restriction is easy to drop; the only reason I had it to begin with is that I wanted to keep the notation simple and didn't want these i's floating around everywhere. The more important restriction is the independence. This is really the difficult part. What it means, in technical terms, is that we have not been able to model correlations between symbols. And of
course, I haven't told you yet what correlations mean, in case you haven't had a course on probability theory yet. Loosely speaking, you can think of correlations like this: if you know some of the symbols, does that tell you something about the other symbols? For example, if you think about some English text and you know that one of the symbols is a Q, then you know that the next symbol will very likely be a U, because in English text a Q is usually followed by a U. That is a correlation: the probability of the next symbol changes depending on what the value of the previous symbol was. So that's an intuitive understanding of correlations. This topic of modeling correlations is so important that it will haunt us through basically the rest of the course. So I will now introduce an interlude where I recap some of the important topics of probability theory, so that we can properly define what we mean by correlations and also quantify how correlated certain symbols or certain parts of your message are. I'm aware that these topics may be a repetition for some viewers, so if you're confident that you already know enough about probability theory, please feel free to skip this interlude by jumping directly to the 1 hour and 51 minute mark. Just keep in mind that one goal of this interlude will also be to introduce the precise notation that I'm going to use throughout the rest of the course. So if you do skip this interlude and later find yourself confused by some of the notation, you may want to come back to it to see exactly what the notation means. So let's go ahead and start with an interlude on probability theory and what we are going to call random
variables. So again, the goal through much of the rest of this course will be to efficiently model correlations between parts of the message. There are really two important words here: "correlations", and we will learn what exactly these are, but also "efficiently". We will see that the theory of how you model correlations is, in a sense, completely understood, but it can become prohibitively expensive if you want to do it in the most general way. So throughout the rest of the course, we are going to learn different techniques that allow us to model the most important correlations in an efficient way, and we will see that each of these techniques will only work together with certain source coding algorithms. Here, probability theory, probabilistic models, and machine learning methods will go hand in hand with very specific source coding algorithms. Correlations are what we learn about in the next step. In order to do that, we first have to fix some notation and introduce some fundamental concepts from probability theory. One way to introduce probability theory that I find the most intuitive is to construct it from the bottom up, and the way you do this is through measure theory. In this theory, you have what is called a sample space. And I should say, this will not be a full-fledged course on probability theory; if you have never heard anything about probability theory, I really encourage you to attend a course on it or to read a book on it. I will just introduce the concepts that are most important for this course in a very pragmatic way. So we start from what we call a sample space, typically denoted by capital Omega, which you can think of as a typically abstract space, though it could also be something you can really picture, that is some space of states of
the world. Now, this is obviously not really a mathematical definition; mathematically it's just a set, but I'm giving you these intuitions as guidance for what you should think of when you see this Omega and how you can interpret it. Then we define a subset of this space that we call an event: an event E is a subset of the sample space. Again, as guidance for how to think about events: we say that an event E occurs, so something specific happens, if the world is in a state omega from this event, from this subset of the sample space. The world is always in some state of Omega, because Omega is the space of all states of the world (maybe I should make this clear: all states of the world), but it's not necessarily in a state from the subset. If it is in a state from the subset, then we say that event E occurs. Once we have these events, we can talk about the probability that an event occurs. In order to talk about that, we introduce a probability measure; the word "measure" here is really a technical term. A probability measure, which I will denote by capital P, is a function from some set Sigma, which I'll get to in a second, to the interval from zero to one, so real numbers in the interval [0, 1]. This set Sigma is called a sigma algebra. I'm not going to go into detail about what a sigma algebra is; in brief, it's a set of subsets of Omega that also satisfies some properties, which basically just make sure that everything I'm going to say next is well defined. What am I going to say next? This probability measure P has to have certain properties in order to really be a probability measure; these are the axioms of probability theory. First of all, the probability of the event Omega, which is a subset of Omega, always has to be one. And that, I
hope, makes sense as an axiom for a probability measure, because Omega is the space of all states of the world, so the world is always in a state in Omega, and this event should therefore occur with probability one. Part of the properties of a sigma algebra on Omega is that this full set Omega is itself an element of the sigma algebra, so that this statement is even meaningful. The next statement is the opposite extreme: the probability of the empty set, which would mean that the world is in some state omega from the empty set, obviously has to be zero. Again, another property of the sigma algebra is that the empty set is an element of the sigma algebra. Finally, the last property that we require from the probability measure is the more interesting one: if you have a union of events, and it could even be a countably infinite union, then the probability of this union has to be the same as the sum of the probabilities, provided that all of these events E_i are pairwise disjoint, which just means that there is no state omega that appears in more than one of these events. This is also intuitive: if we have events that exclude each other, where it can never be the case that more than one of them occurs, then the probability of their union should be the sum of the probabilities. Just as a quick remark: it follows immediately that if you have the more typical case of a finite union, over i equals 1 to K, of events E_i, then its probability is also the sum of the individual probabilities, provided they are pairwise disjoint, simply because you can always extend a finite union to an infinite union by padding it with empty sets, which are disjoint from everything else and have probability zero, so they drop out of the sum. So that's not an additional requirement; it immediately follows. With that, we
We have now defined how a probability measure works. It may seem odd at first that we define probabilities only on these subsets and not directly on the single states omega of the world. Why don't we just say that each state of the world happens with some certain probability—why do we only define it on subsets? The reason is that in some cases it doesn't make sense to talk about the probability of a single state, and that typically happens when we have a continuous probability distribution. So, as a remark: for continuous state spaces, the probability of a single state typically doesn't make much sense.

Why is that the case? Think of an example: say you are modeling the arrival time of a bus. We can model times by real numbers, so omega is the real numbers, and a state omega ∈ Omega is the arrival time of the bus. Now you can ask yourself: what is the probability that the bus arrives at exactly 4 p.m. today? That is the probability of the event consisting of only that one number—4 p.m. today, translated somehow into a real number, maybe as the number of seconds since the Unix epoch, expressed according to some standard. You can easily convince yourself that this probability should really be 0—and not just for 4 p.m., but for any time. Why? Draw the possible arrival times t (which play the role of omega) on a line; you have the time 4 p.m. somewhere on it, but you have a continuum of times around it. If you assigned some positive probability to this specific time, then you should probably also assign some positive probability to the time 4 p.m. plus 1 second—and pretty much the same probability, because whether the bus arrives at 4 p.m. or one second later should be about equally likely. But then you can go finer: what about 4 p.m. plus half a second? It should probably also have pretty much the same probability, somewhere in between the other two. You could continue this process and find an infinite number of times in between, and if you want to be consistent you have to assign to each of them a probability close to the probabilities at the ends. So unless these probabilities are all 0, at some point you will run out of probability budget—you will exceed the total budget of 1, because all these single-point events are disjoint, so their probabilities have to sum to at most 1 and cannot sum to something larger. Therefore it doesn't make sense to assign anything other than probability 0 to an event that contains only a single point of a continuous spectrum.

But what we can do—and this is why we defined our probability measure on sets and not just on single values—is define, for example, the probability of the interval from 4 p.m. to 4 p.m. plus 1 second. Even though every single point inside this tiny interval has probability 0, the whole interval has some probability, and there is no reason why that has to be 0. And there is no inconsistency here: even though all of these individual point probabilities are 0 and the points are all disjoint, we only assumed (if I may scroll up) that the probability of a union equals the sum of the probabilities when the union is countable—and a continuous interval on the real axis is certainly not a countable set.

So this is just to explain why we introduce probabilities on events, which are subsets of the sample space, and not on single items from the sample space: for continuous spaces, we really always have to think about extended regions, otherwise it doesn't make much sense. For all intents and purposes in this course, though, whenever we think about continuous probabilities, we will always consider distributions that admit what's called a probability density (I'll come to that in a second). With a probability density you can assign to individual points a value which is not a probability but a probability density, and that will make these cases much easier—I'll come back to this when I define expectation values.

After this brief remark, let's go back. We have defined our probability measure, with which we can measure the probabilities of events, i.e., subsets of our sample space. Now, I told you that our real goal was to quantify correlations between parts of a message, and in order to do this we have to treat the message as a sequence of random variables. So let me define what a random variable is.
Mathematically, the definition is very simple: a random variable is just a function. I will denote random variables by capital letters from the end of the alphabet—X, Y, Z. They are functions from the sample space omega to some space of values that the random variable takes; for example, a real-valued random variable is a function from omega to the real numbers. That's really all a random variable is. But what do we mean when we talk about a random variable? I think that's again best understood with an example, so let's think again about our simplified game of Monopoly (and if you joined this series of videos late, or if you've forgotten what this means, it's actually very simple). You have two dice, you throw them, and you're mostly interested in the sum of the two. The sample space is the space of pairs (a, b), and—so that we can talk about the two dice separately—let's assume the dice have different colors: a is the value of the red die and b is the value of the blue die. Also, for simplicity, so that we don't have to write out huge tables, we assume these are not standard six-sided dice but fair three-sided dice, so a and b can only take values from {1, 2, 3}.

Let's go through all the steps of defining a probability measure. We need a sigma algebra, which is a set of subsets of the sample space, and for discrete sample spaces there's really no reason not to make the sigma algebra the set of all subsets, which is also called the power set. This is sometimes denoted P(Omega), but since P is already used so much I'm going to avoid that notation and write it as 2^Omega, which is also a common notation. So our sigma algebra is the set of all subsets of omega; in particular it includes omega itself and also the empty set, as it has to for a sigma algebra. And what's the probability of any of these events? We assumed these are fair dice, so all individual combinations should appear with the same probability, and we can say that the probability of an event E, a subset of the sample space, is just the number of states in that event divided by the number of states in our sample space, which is 9: P(E) = |E| / 9.

Now we can define some random variables. The obvious ones are the values of the two dice. The value of the red die I'm going to call X_r (r for red). As a reminder, a random variable is a function from the sample space—in our case even to just a subset of the real numbers, but we can treat it as real-valued. It takes a sample from our sample space, which is always a pair (a, b), and in this case it projects onto the value of the red die, which was the value a: so X_r maps (a, b) to a, which is a real number—in fact just an integer from {1, 2, 3}. Then obviously we can also define another random variable, the value of the blue die: X_b maps (a, b) to b. So far we haven't really gained anything from defining these random variables; we could have just talked about a and b from the start. But another random variable that we are actually more interested in in this simplified game of Monopoly is the total value of the throw, the sum of the red and blue dice. I'm going to denote this random variable X_s (s for sum): it takes a pair (a, b) from the sample space and maps it to a + b, which lies in {2, 3, 4, 5, 6}.
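The setup above can be sketched in a few lines of code (my own illustration, not the course's code; the names `Omega`, `P`, `X_r`, `X_b`, `X_s` mirror the notation on the board):

```python
from itertools import product

# Simplified Monopoly: two fair three-sided dice.
# The sample space Omega is all pairs (a, b); each state has probability 1/9.
Omega = list(product([1, 2, 3], repeat=2))

def P(event):
    """P(E) = |E| / |Omega| for the uniform measure on the 9 states."""
    return len(event) / len(Omega)

# Random variables are just functions Omega -> R:
def X_r(omega): return omega[0]               # value of the red die
def X_b(omega): return omega[1]               # value of the blue die
def X_s(omega): return omega[0] + omega[1]    # their sum

# The event "X_s takes the value 3" is the preimage X_s^{-1}(3):
sum_is_3 = {omega for omega in Omega if X_s(omega) == 3}
# sum_is_3 contains exactly (1, 2) and (2, 1), so its probability is 2/9.
```

Representing events as Python sets makes the "event = subset of the sample space" picture literal.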
With these random variables in hand, we can think about some of their properties—properties of any single random variable, but then, importantly, also about how random variables interact. We will see, for example, that the first two, X_r and X_b, are what we will define as statistically independent, while the third, X_s, is not independent of either of the previous ones.

Let's first cover the easy part: properties of a single random variable. The most important property we will define is the expectation value. I'm going to define it here mainly for a discrete random variable—one whose set of possible values is discrete—and for simplicity we will also assume that the sample space is discrete. The expectation value E_P[X] of a random variable X is, as we are used to, the sum over all states in the sample space of the probability of the single-state event, times the value of the random variable:

E_P[X] = Σ_omega P({omega}) X(omega).

Since these probabilities add up to one, this is a weighted average of the values the random variable can take. This assumes that the probability measure is really defined on all of these single-state events. An equivalent way to define the expectation value is to sum over all the values that the random variable can take: those are all the lowercase x that I get by evaluating the random variable on all samples from the sample space. For each such value x, I weight it by the probability of the event that includes all the states that would lead to this value. You can write that event using the inverse of the random variable—the function that maps back from a value to a subset of the sample space—so it is the preimage X^{-1}(x), the set of all omega in the sample space such that X(omega) = x:

E_P[X] = Σ_x P(X^{-1}(x)) x.

If you define it like this, the probability measure doesn't have to be defined on all single-state events; it only has to be defined on these events of consistent values of the random variable. And you can convince yourself, using the axioms of the probability measure, that the two definitions are equivalent whenever the measure is defined on both types of events.

The reason I went through this in such detail is the continuous case. For the expectation value of a continuous random variable—a genuinely real-valued random variable, not an integer-valued one—the definition requires measure theory. I'm not going to go into the details; I'll just write down how it is defined: E_P[X] is formally the integral of X(omega) under what's called the integration measure,

E_P[X] = ∫ X(omega) dP(omega).

Intuitively, it's based on the second form of the expectation value above: you allow values to differ slightly, make the interval of allowed values smaller and smaller, and take that limit. If you're interested in this, I encourage you to take a course or read a book on measure theory. But for all the continuous-valued random variables in this lecture—in this course—we will always assume that we have random variables that admit what's called a probability density function. Then this integral is just the usual Riemann integral that you might be used to, from negative infinity to positive infinity:

E_p[X] = ∫ x p(x) dx,

where this function p is called a probability density function. It has the property that p(x) is non-negative for all x, and also that the integral of p(x) dx from negative infinity to infinity is one—but at any individual point, p(x) can be larger than one, which is important to remember. I'm not going to go into much more detail here; we will discuss this more once we actually arrive at real-valued random variables. But keep in mind that these densities can be larger than one, because they are only probabilities normalized by the size of some small interval—the size of this dx, essentially.

Before we move on, let me give you an example of an expectation value for discrete random variables, since that's the easier case to think about. The expectation value of the throw of our red die—which is obviously the same as the expectation value of the throw of the blue die, simply because they are identical dice that just happen to have different colors—is, if you follow either of the sums above, 2: you get the values 1, 2 and 3 with equal probability, so on average you get 2. And the expectation value of the sum, if you do the math, is 4.
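The two quoted expectations can be checked directly from the first definition (a sketch in my own code, using exact fractions; the helper name `expectation` is invented):

```python
from itertools import product
from fractions import Fraction

# Dice example: E[X] = sum over states omega of P({omega}) * X(omega).
Omega = list(product([1, 2, 3], repeat=2))
p_state = Fraction(1, 9)  # uniform measure: each single state has prob 1/9

def expectation(X):
    """Expectation value of a random variable X: Omega -> R."""
    return sum(p_state * X(omega) for omega in Omega)

E_red = expectation(lambda o: o[0])          # expectation of the red die
E_sum = expectation(lambda o: o[0] + o[1])   # expectation of the sum
# E_red == 2 and E_sum == 4, matching the values quoted in the lecture.
```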
That was the definition of the expectation value, for both discrete and continuous random variables—and again, this is a property of a single random variable. Let's now make one final definition for a single random variable, and then we will move over to definitions for multiple random variables, which is where it becomes interesting, because then we can define what we mean by correlations. The final single-variable property is the probability distribution of a random variable, and this is mostly just fixing notation. I will introduce two notations here. The first is P(X = x): I will always denote the values of a random variable by lowercase letters and the random variable itself by uppercase. P(X = x) is just the probability of the event that includes all samples from the sample space on which the random variable takes that value—as you would probably expect it to be. The second notation, a shortcut that will be used a lot and that is also standard in the literature, is to write just p(X): what we mean by that is the function that maps each value x to P(X = x). So it's a function from, for example, the real numbers (if you have a real-valued random variable) to [0, 1], mapping x ↦ P(X = x). You can think of it as P(X = ·), where the dot is the argument of the function. So don't get confused if you see this notation—p of some random variable without any value—it's then considered as a function that takes an argument and maps it to the corresponding probability.

So we've introduced properties of single random variables: the expectation value and the probability distribution. Now let's get to the more interesting part, which is properties of multiple random variables. We'll start with properties of two random variables, which covers most of what we want to discuss; there will be one additional concept that will require us to think of three random variables. With two random variables, we will really see how random variables interact, and that will allow us to define what we mean by correlations.

The first thing is a generalization of the probability distribution: for two random variables, you can directly generalize it to what's called the joint distribution. The main reason I introduce this is to introduce the name "joint", because it will come up a lot. The joint probability distribution of two random variables X and Y is denoted P(X = x, Y = y), and it is, as you would expect, the probability of the event of all states in the sample space for which both X takes the value x and Y takes the value y. Similarly, if you just write p(X, Y)—capital X, capital Y—that is a function from, for example (if they are both real-valued), R × R to the interval [0, 1], which maps the pair of values (x, y) to this joint probability. Nothing unexpected here; just fixing notation.

Now we wanted to define what it means to look at correlations, so let's first think about what it does not mean. So far, we've always considered messages where each symbol follows its own individual probability distribution—we said these could be different from symbol to symbol, which doesn't really change the picture much—but no symbol depended on the symbols we've seen before or might see afterwards. Whenever you have something like this, the random variables are called statistically independent. Let me phrase that as a definition, because it is an important one.
Two random variables X and Y are said to be statistically independent—or just independent, if you want to be brief—precisely if the joint probability distribution factorizes: in brief,

p(X, Y) = p(X) p(Y).

What does that mean? This is an equality between functions: the functions are pointwise equal, i.e.,

P(X = x, Y = y) = P(X = x) · P(Y = y)   for all (x, y).

This is a very important definition: two random variables are statistically independent if the joint distribution is the product of what are called the marginal distributions. You will see on the problem set why they are called marginal: if you write a table of these distributions, the marginals naturally have a place at the margin of that table.

Let's look at some of our examples and see whether they are independent or not, and whether that matches our intuition. Back to our simplified game of Monopoly: you can easily convince yourself that the value of the red die and the value of the blue die are independent—which matches our intuition—and you can also verify it explicitly by literally checking that this equation holds for all combinations of x and y from {1, 2, 3}. But if you look at X_r and X_s (or, equivalently, X_b and X_s), you will find that these two are not independent. ("Not independent" is a double negative, but it's the usual way this is expressed.) The way to prove this is that you only have to find a single example where the equation is violated. For example, take P(X_r = 1, X_s = 3). What is that probability? It's the probability of the event containing all states where both conditions are satisfied: clearly the red die has to be 1, and if the red die is 1 and the sum is 3, then the blue die has to be 2. So it's the probability of the event containing the single state (1, 2), which is 1/9, since the probability of an event was always the number of states divided by 9. That's the left-hand side. Now for P(X_r = 1) · P(X_s = 3): X_r = 1 has probability 1/3 (it's a fair die), and P(X_s = 3) is the probability of all states that give a sum of 3, which are (1, 2) and (2, 1), so 2/9. In total, the right-hand side is 1/3 · 2/9 = 2/27, which is not equal to 1/9. So the equation is violated for this example, and the two random variables are not statistically independent.

That's what we expected: how can we think about their not being independent? If nobody tells us the value of the red die, then we only know the overall probability distribution of the sum. But once somebody tells us that the red die has, for example, value 1, then we know the sum is probably a low value. So by learning the value of the red die, we learn something—at least statistically—about the value of the sum. And that directly leads us to the next definition: how does the probability distribution of one random variable change once we know the value of a different random variable? That is called the conditional probability distribution, and it is an important definition.
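Both the exhaustive check for the two dice and the single counterexample for the die-and-sum pair can be verified mechanically (a sketch in my own code; `pmf` and `joint` are invented helper names):

```python
from itertools import product
from fractions import Fraction

# Dice example: check statistical independence exhaustively.
Omega = list(product([1, 2, 3], repeat=2))

def P(event):
    return Fraction(len(event), len(Omega))

def pmf(X, x):
    """Marginal P(X = x), the probability of the preimage X^{-1}(x)."""
    return P({o for o in Omega if X(o) == x})

def joint(X, Y, x, y):
    """Joint P(X = x, Y = y)."""
    return P({o for o in Omega if X(o) == x and Y(o) == y})

X_r = lambda o: o[0]
X_b = lambda o: o[1]
X_s = lambda o: o[0] + o[1]

# Red and blue die: the joint factorizes for every value pair -> independent.
red_blue_indep = all(joint(X_r, X_b, a, b) == pmf(X_r, a) * pmf(X_b, b)
                     for a in (1, 2, 3) for b in (1, 2, 3))

# Red die and sum: the counterexample from the lecture.
lhs = joint(X_r, X_s, 1, 3)        # 1/9
rhs = pmf(X_r, 1) * pmf(X_s, 3)    # 1/3 * 2/9 = 2/27
# lhs != rhs, so X_r and X_s are not statistically independent.
```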
We can define it first for events, and I'll write out how it is pronounced: we're talking about the conditional probability of event E_2 given event E_1, notated P(E_2 | E_1), where the bar is pronounced "given". It is defined as

P(E_2 | E_1) = P(E_2 ∩ E_1) / P(E_1).

What is this probability? The numerator is the probability that both events happen: if both happen, then the world is in some state that appears in both events, i.e., in their intersection. And since we already know that event E_1 happens anyway, we normalize by the probability of E_1. It is a definition, so in principle it could be anything, but one way to make sense of why it is defined this way is the following. Consider P(not E_2 | E_1), where "not E_2" is really just omega without E_2; by this definition, it equals P(E_1 \ E_2) / P(E_1), the probability of E_1 without the event E_2, divided by the probability of E_1. If you add the two up, you can convince yourself that

P(E_2 | E_1) + P(not E_2 | E_1) = 1,

as you would expect from a sensible definition: if you already know that event E_1 happens, then given E_1, either event E_2 happens or it doesn't, and the sum of those conditional probabilities has to be exactly one. As an exercise, you may want to prove this relation using only the axioms of the probability measure.

So this is the important definition of the conditional probability, and I've tried to motivate why it makes sense. In practice, we will not deal that much with individual events; we will think more about random variables, but you can immediately see how the definition translates. The conditional probability for random variables—let me call them X and Y, so it's less confusing, since we used X and Y before—is

p(Y | X) = p(X, Y) / p(X).

Again, this is an equality between functions, so in the shorthand notation it means: P(Y = y | X = x) is the joint probability P(X = x, Y = y) divided by the marginal P(X = x). Because it's so important, let me spell out how we interpret this—I tried to motivate the interpretation with the calculation above. We read it as: what is the probability of Y taking the value y if I already know that some other random variable X has the value x? How can you answer this question? Well, it's given by this definition. But one thing you can immediately answer: if X and Y are independent, then we know by the definition of statistical independence that p(X, Y) = p(X) p(Y), and therefore the conditional probability p(Y | X)—the joint divided by the marginal—becomes (and this is important: only if you have independent random variables) p(X) p(Y) / p(X), which is just p(Y). So if they are independent, conditioning on X changes nothing; and if they are not independent, then we know that this is not the case.

Before I move on, let me clear up one common source of confusion. When I say "the probability of Y having some value y if I already know that X has some value x", a lot of people read this as meaning that X has to happen first—has to be set first. Say, think about this game of Monopoly where we have two dice, and X is one of the dice: first I have to throw the dice, and then I can take their sum.
It sounds like X is the cause and Y is the effect—but it is really important to keep in mind that that's not true. Let me make that an important remark, because we will actually encounter the opposite fairly often in the rest of the course: writing p(Y | X) does not imply any causality, i.e., it does not mean that X is the cause of Y or that Y is the effect of X. The conditional is simply defined by the ratio above, and you can form this ratio any time. In particular, even if it should be the case that Y is the effect of X, you can always invert the relation and calculate the opposite, p(X | Y). For example, in our simplified game of Monopoly: say somebody tells us that the sum of the two throws is 6; then we know that both dice have to be 3, the only way you get to 6. Or somebody tells us that the sum is 5; then we know that with probability 1/2 the red die takes the value 3, with probability 1/2 it takes the value 2, and with probability 0 it takes the value 1, because a red 1 could not lead to a sum of 5 (the blue die only goes up to 3). We can always calculate this conditional, which is, as always, p(X | Y) = p(X, Y) / p(Y). Explicitly, if somebody gives us p(X) and p(Y | X), the way we calculate it is to use the definition of the conditional probability and invert it: bring the denominator to the other side to write the joint as p(X, Y) = p(X) p(Y | X)—this is called the chain rule of probability—and for the denominator p(Y), evaluate this joint at all values of X (let me call them x′, so there is no confusion) and sum over them:

P(Y = y) = Σ_{x′} P(X = x′) P(Y = y | X = x′).

This procedure is called Bayesian inference. This was just very brief—if you haven't followed this part, we'll go into more detail on how you actually do Bayesian inference in practice later. But I think this is a good point to first introduce you to the idea that even if you have some causal relationship—even if X is the cause of Y and Y is the effect of X—you can always, at least in principle, calculate what's called the posterior distribution, which in a sense inverts this causal relationship. This is called Bayesian inference.

All right, one last remark about these conditional probabilities, and then we'll briefly look at a first way you can actually use all these concepts to model complicated probability distributions for compression methods. The final remark is actually just what I already did in the step above: another way of writing the definition of the conditional probability, called the chain rule of probability theory. It really just follows directly from the definition of conditional probabilities. The chain rule tells you that whenever you have a joint probability p(X, Y), you can always write it as

p(X, Y) = p(X) p(Y | X);

to verify this, just insert the definition of the conditional probability and it will directly fall out. But as I said, you can also always write out the opposite direction of the conditional: p(X, Y) = p(Y) p(X | Y). And if you have three random variables, you can, for example, always write

p(X, Y, Z) = p(X) p(Y | X) p(Z | X, Y),

and it should be clear now how the last factor, which conditions on both X and Y, is defined: you take the joint of X, Y and Z and divide by the joint of X and Y.
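The Bayesian inversion just described can be sketched numerically on the dice example (my own code; the function name `posterior_red_given_sum` is invented):

```python
from itertools import product
from fractions import Fraction

# Bayesian inference on the dice example: invert the "causal" direction
# to get the posterior over the red die given the observed sum.
Omega = list(product([1, 2, 3], repeat=2))

def P(event):
    return Fraction(len(event), len(Omega))

def posterior_red_given_sum(a, s):
    """P(X_r = a | X_s = s) = P(X_r = a, X_s = s) / P(X_s = s)."""
    joint = P({o for o in Omega if o[0] == a and o[0] + o[1] == s})
    marginal = P({o for o in Omega if o[0] + o[1] == s})
    return joint / marginal

# As in the lecture: if the sum is 5, the red die is 2 or 3 with
# probability 1/2 each, and 1 with probability 0.
```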
permutations all three possible permutations here are valid this is called the chain rule of probabilities so the reason why I introduced this is because now we can take a step back and kind of go back to compression so let's conclude this interlude so here we come back now to compression back to compression source coding and in particular on problem set on the problem set three I want to really advertise this problem because I think it's a very instructive thing to actually walk through this problem in any exercise whether you will implement your first or if you've already done it you've implemented your first really deep learning based compression method for and this will be a compression method lossless compression method for a natural language text and you will see that with very simple tools you can already implement something that in kind of a niche application for just this method will be able to outperform anything that you can use off the shelf any compression method that you may have used before and so in problem what is the problem 3.2 you will implement and I really encourage you to do this because it should be a very simple problem actually because most of the code is already given but it will be very implemented but I left some really critical steps out where you have to fill in how you then use this model for compression so you will implement a compression method for natural language written natural language and it will actually be quite performant even though it's a very simplistic model but it will already achieve this performance it will already exploit correlations between so it will use a probabilistic model that explicitly models correlations between characters in your message it will exploit correlations and I should have maybe noticed this so correlations whenever we have defined what an independent random variables are if two random variables are not independent then we say they are correlated symbols where the symbols in this model are 
characters. So how does this model work? In order to exploit these correlations, you need a probabilistic model that can capture them, so it cannot be one of the simplistic models we've used so far. Instead, it will be a model that uses the chain rule. I'm now going to model the message in general as a random variable x (underlined), which is a sequence of random variables x1, ..., xk up to some length k. The model of the message is a probability distribution that assigns a probability to every possible message, and it does so using the chain rule: p(x1, ..., xk) = p(x1) p(x2 | x1) p(x3 | x1, x2) ... p(xk | x1, x2, ..., xk-1). This is always possible in principle; it is exactly correct for any model. But here we encounter a fundamental tradeoff that will come up a lot in this lecture: these conditional probability distributions become more and more complicated. The factor for x3 is already a function of three arguments (the value of the symbol x3 and the values of the symbols x1 and x2), and the factor for the last symbol is a function of k arguments. So you somehow have to restrict these functions. The issue is that while this factorization is always possible in general, we cannot actually use it in practice: the exact factorization of the joint distribution is not computationally feasible, because the last factor, p(xk | x1, x2, ..., xk-1), would in general have to be an extremely complicated function. In the exercise (where the model is mostly given already) you will see one trick to deal with this, and in the following lectures you will see many ways to
construct models that can still capture the important correlations while remaining compact. So we need ways to, on the one hand, capture the relevant correlations while, on the other hand, keeping the model in a form that can be stored compactly and evaluated efficiently. What I mean by that: if we actually store this model on a computer, it shouldn't take up gigabytes of RAM, and we also have to be able to evaluate these probability distributions in an efficient way. We need general ways to do this, and that will be a major goal of the next parts of the lecture. The general strategy is to enforce conditional independence. What is conditional independence? It is a property of three random variables. For random variables X, Y and Z, we say (and this is a definition) that Y and Z are conditionally independent given X if the following holds. This is very similar to the definition of ordinary independence, but I'm going to write it in a slightly different way. The most natural way to think about it is: P(Z | X, Y) = P(Z | X). In words, if you already know X, then additionally knowing Y doesn't give you any further information about Z. As an exercise, you may prove that this is equivalent to a formulation that looks more like the definition of independence: P(Y, Z | X) = P(Y | X) P(Z | X). This looks very similar to regular independence of Y and Z (the joint probability of Y and Z is the product of the probability of Y and the probability of Z), just with everything conditioned on X.
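As a quick numerical illustration (all numbers are made up, and this is not part of the problem set), the following sketch constructs a joint distribution in which Y and Z are conditionally independent given X, and then checks both equivalent formulations from above:

```python
import numpy as np

# Build a joint p(x, y, z) over binary variables where Y and Z are
# conditionally independent given X, by construction:
# p(x, y, z) = p(x) * p(y | x) * p(z | x). All numbers are hypothetical.
p_x = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1],   # p(y | x=0)
                        [0.2, 0.8]])  # p(y | x=1)
p_z_given_x = np.array([[0.5, 0.5],   # p(z | x=0)
                        [0.6, 0.4]])  # p(z | x=1)
p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]

# Formulation 1: p(z | x, y) == p(z | x), i.e. once x is known, y adds nothing.
p_xy = p_xyz.sum(axis=2)
p_z_given_xy = p_xyz / p_xy[:, :, None]
assert np.allclose(p_z_given_xy, p_z_given_x[:, None, :])

# Formulation 2: p(y, z | x) == p(y | x) * p(z | x).
p_yz_given_x = p_xyz / p_x[:, None, None]
assert np.allclose(p_yz_given_x, p_y_given_x[:, :, None] * p_z_given_x[:, None, :])
print("both formulations of conditional independence hold")
```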
For the context we're interested in, these two formulations are equivalent, but the first one is actually easier to think about, because you can now go back to the model where we decomposed our probability distribution as a chain. It probably makes sense to write this in a more compact, pictorial notation. In the general setup of the chain rule we have the random variables X1, X2, X3, X4, and so on up to Xk, which I'm going to draw as circles. We decomposed the probability over all of them by saying: on the left-hand side we have p of all the X's, and on the right-hand side we have p(X1), times the probability of X2 conditioned on X1. We notate this conditioning by drawing an arrow that points from the variable we condition on to the variable we are modeling. Then, times the probability of X3 conditioned on both X1 and X2, which pictorially introduces two more arrows. In the next step we introduce yet more dependencies: times p(X4 | X1, X2, X3), which introduces another three arrows, and this goes on. Now you can ask yourself: are all these dependencies really always necessary? Not thinking about natural language for a second, you could for example know that the process that transitions from X1 to X2 to X3 doesn't have any memory. If that is the case, we arrive at our first simplified model. I will now introduce three possible simplifications, and this will be a general theme: you have to think about what you actually expect in your data set, then you have to model it, and then you have to measure, by estimating
the cross entropy between a data set and your model, whether your assumptions were actually good. So the first assumption you can make is: (a) there is no memory. If the Xi are generated in sequence and the generating process has no memory, we speak of a Markov process (let me capitalize it, it's a technical term). This will certainly not be a good model for natural language, but for many physical processes it is. A Markov process assumes conditional independence of Xi and all Xj with j < i - 1, given Xi-1. What does this mean? It assumes that all the links that jump a distance of more than one don't exist, so you can draw a model that has only the direct connections between neighbors. The joint probability of that model is just the product over i = 1 to k of p(Xi | Xi-1), where for the first factor X0 doesn't exist, so it is just the marginal distribution of X1. That might be a good model for memoryless processes, but it is certainly not a good model for natural language: if you model text character by character, then even though knowing this letter may give you some hint about what the next letter is, knowing the letters before it as well will certainly help you predict the next one; it removes some of the ambiguity. You can now take this Markov process and make it somewhat more expressive.
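To make the Markov factorization concrete, here is a minimal sketch (made-up numbers; the problem set uses a learned model instead) of a first-order Markov model over a toy three-letter alphabet. It computes the information content of a message, i.e. its ideal code length -log2[ p(x1) * prod_i p(xi | xi-1) ] in bits:

```python
import numpy as np

# A first-order Markov model over the toy alphabet {a, b, c}. The initial
# distribution and transition matrix are hypothetical; a real model would
# estimate them from data.
alphabet = {"a": 0, "b": 1, "c": 2}
p_init = np.array([0.5, 0.3, 0.2])    # p(x_1)
p_trans = np.array([[0.1, 0.6, 0.3],  # p(x_i | x_{i-1} = a)
                    [0.7, 0.1, 0.2],  # p(x_i | x_{i-1} = b)
                    [0.4, 0.4, 0.2]]) # p(x_i | x_{i-1} = c)

def log2_prob(message: str) -> float:
    """Information content under the Markov factorization:
    -log2[ p(x_1) * prod_i p(x_i | x_{i-1}) ], the ideal code length in bits."""
    ids = [alphabet[ch] for ch in message]
    bits = -np.log2(p_init[ids[0]])
    for prev, cur in zip(ids, ids[1:]):
        bits += -np.log2(p_trans[prev, cur])
    return bits

print(f"'abab' needs {log2_prob('abab'):.2f} bits")
```

A quick sanity check is that the model probabilities 2^(-bits) over all messages of a fixed length sum to one, as they must for a valid distribution.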
The next way to model things is what's called a hidden Markov model, which has the following form. You say: okay, maybe there is some memoryless process, but it doesn't operate on the things I directly observe; instead, that memoryless process operates on some other random variables, which I'm going to call H for "hidden". This hidden process is still assumed to be memoryless, but what you actually observe are symbols that are only a result of the hidden state: you don't observe the whole hidden state, only some result of it, which could even be a stochastic function of the hidden state. So knowing the symbols you observe tells you something about the hidden state, but not everything. As an exercise, you should convince yourself, maybe by constructing a small example, that this model can actually capture long-range correlations. What I mean by that: in a model like this, x1 and x3 do not have to be conditionally independent given x2. Knowing the thing in between doesn't tell you everything you could know about x3; if you know x1 in addition to x2, it can tell you more about x3 than x2 alone. This is different from the plain Markov models, where knowing x2 tells you everything you could know about x3: knowing x1 in addition doesn't help, because it only explains where x2 came from, and doesn't help you any further in understanding where x3 came from. Now, these hidden Markov models are actually a bit difficult to use for compression. They are very popular and very important models, which is why I wanted to introduce them, but they are difficult for compression because, in order to encode something with this model, you would have to transmit the hidden states so that the decoder can follow along; in a hidden Markov model the transitions between hidden states are stochastic processes, so they are not deterministic.
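The exercise suggested above can be checked numerically. The following sketch (all parameters made up) enumerates the hidden paths of a tiny two-state HMM to get the exact joint p(x1, x2, x3), and then shows that p(x3 | x1, x2) differs from p(x3 | x2), so x1 and x3 are not conditionally independent given x2:

```python
import numpy as np
from itertools import product

# A tiny hidden Markov model: 2 hidden states, 2 observable symbols.
# All numbers are hypothetical. "Sticky" hidden dynamics carry information
# from x1 past x2 to x3, creating long-range correlations.
p_h1 = np.array([0.5, 0.5])                   # initial hidden state
p_trans = np.array([[0.9, 0.1], [0.1, 0.9]])  # p(h_{i+1} | h_i)
p_emit = np.array([[0.8, 0.2], [0.3, 0.7]])   # p(x | h)

# Exact joint p(x1, x2, x3), summing over all hidden paths (h1, h2, h3):
p_x = np.zeros((2, 2, 2))
for h1, h2, h3 in product(range(2), repeat=3):
    w = p_h1[h1] * p_trans[h1, h2] * p_trans[h2, h3]
    for x1, x2, x3 in product(range(2), repeat=3):
        p_x[x1, x2, x3] += w * p_emit[h1, x1] * p_emit[h2, x2] * p_emit[h3, x3]

# Compare p(x3=0 | x1=0, x2=0) with p(x3=0 | x2=0):
p_x3_given_x1x2 = p_x[0, 0, 0] / p_x[0, 0, :].sum()
p_x3_given_x2 = p_x[:, 0, 0].sum() / p_x[:, 0, :].sum()
print(p_x3_given_x1x2, p_x3_given_x2)  # the two values differ
```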
So in order to be able to reconstruct a message on the decoder side, you would have to transmit these hidden states using some symbols, and since the emission step is typically also non-deterministic, you would have to transmit that too, so you actually have to transmit even more data. We will discuss a method for this in an upcoming lecture, called bits-back coding. So it is possible, but it's a bit difficult. But there is a third option, and this is the one you follow in the problem set: what's called an autoregressive model. An autoregressive model is somewhat similar to a hidden Markov model (let me actually copy parts of this drawing), but it deviates from a hidden Markov model in two ways. First, the transitions between the hidden states become deterministic. That means a decoder that wants to follow along only has to know the initial state, which could be fixed as part of your probabilistic model, and it can then follow the deterministic transitions; the emissions will still be non-deterministic. But if you only make the transitions deterministic, it's not really a very powerful model anymore, right? Because then the hidden states are always the same, regardless of the message being encoded, so you can't do much with that: all you get is that each symbol follows a different probability distribution, but the symbols are still independent of each other, i.e. completely uncorrelated. So the second new part of an autoregressive model is that you additionally condition each hidden state on the symbols that you produce, specifically on the previous
symbol. This transition is still deterministic; the connection from the symbol into the next hidden state is also deterministic. So the whole transition is deterministic (what I mean by that is: hi+1 is a deterministic, not stochastic, function of the pair (hi, xi)), while the emission step remains non-deterministic, so that the model can describe not just a single message but a whole plethora of messages. These are very powerful models; some of the highest-performing compression methods, including for images, actually use autoregressive models. As I've alluded to, in the problem set you will implement a compression method, and you will see how you can use such an autoregressive model to compress a message character by character, in such a way that the decoder can follow along exactly, perform each of these steps, and decompress the message: it can recover all the hidden states, and it then has a different model for each symbol, which depends on all the past symbols, so it can capture complicated correlations. You will implement this in the problem set, and the nice part is that the model can capture long-range correlations, again mediated by the hidden states.
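Here is a minimal sketch of this structure (toy alphabet, made-up parameters; the problem set's model is different and learned from data). The hidden state evolves deterministically from the previous state and the previous symbol, so encoder and decoder reproduce exactly the same per-symbol distributions:

```python
import numpy as np

# A toy autoregressive model over the alphabet {a, b, c}. W and V are
# hypothetical parameters of the deterministic state update.
ALPHABET = "abc"
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.5]])  # mixes the previous symbol into the state
V = np.array([[0.7, 0.1],
              [-0.3, 0.6]])       # mixes the previous state

def next_state(h, symbol_id):
    """Deterministic transition: h_{i+1} = f(h_i, x_i)."""
    onehot = np.eye(3)[symbol_id]
    return np.tanh(V @ h + W @ onehot)

def symbol_distribution(h):
    """p(next symbol | hidden state): a softmax readout (hypothetical choice)."""
    logits = np.array([h[0], h[1], h[0] - h[1]])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def information_content(message: str) -> float:
    """Ideal code length in bits: sum_i -log2 p(x_i | x_1, ..., x_{i-1})."""
    h = np.zeros(2)  # fixed initial state, known to both encoder and decoder
    bits = 0.0
    for ch in message:
        p = symbol_distribution(h)
        bits += -np.log2(p[ALPHABET.index(ch)])
        h = next_state(h, ALPHABET.index(ch))  # decoder repeats this exact step
    return bits

print(f"'abca' costs {information_content('abca'):.2f} bits")
```

Because the state update is deterministic, the decoder only needs the initial state and the symbols decoded so far to recover the exact distribution the encoder used at each step.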
X2 and X4 are in general not conditionally independent given X3 (let me use X2 and X4 rather than X1 and X3, so that we don't have any boundary effects; so here, somewhere, is X4, and the chain goes on to some hidden state Hk, which produces a symbol Xk and gets its input from the previous state, state by state, symbol by symbol). These symbols are not conditionally independent given X3 in general: even if you know X3, you don't know everything about the next hidden state, because the next hidden state also depends on the previous hidden state, and knowing X3 alone doesn't tell you everything about that hidden state. Knowing X2 as well tells you a little more about it (still not everything, but more), so knowing both tells you more about the next symbol than knowing just one of them. So the model can capture long-range correlations. But you will also already see that it is hard to parallelize, and this is more of a practical point: if you want to apply this, for example, to images, where you walk through every pixel, it would run extremely slowly on actual hardware nowadays, because modern hardware is very good at doing the same thing on a lot of data, but not very efficient at doing a sequence of things where every step has to wait until the previous step is finished; that would mess with your pipelines and maybe also with your multi-core setup. So this is a downside of the autoregressive method: while it can capture very complicated probability distributions, it is in practice computationally not very attractive. In the next video you will learn about a method that fixes this, using what are called latent variable models, which have a structure that looks like this: you have your symbols x1, x2, and so on up to xk, and
they no longer depend on each other in a chain; instead, they all depend on one higher-level random variable z, which is not part of the message. You will learn about two nontrivial aspects here. The first is that a model like this really can generate correlations between the symbols: even though each symbol is generated by an independent conditional distribution conditioned on z, once you marginalize over z (since z is not part of the message), the symbols become correlated, or at least can become correlated, in general. So this structure can capture correlations between the xi, and it can obviously be parallelized. The difficult part, however, is how you actually compress data with such a model; it's not at all obvious how you transmit data with it. For that I'm going to draw a face that is neither smiling nor frowning: it's the kind of thing that gives you a headache and you have to think about it, but once you figure out how it works, the answer is actually very neat and surprising. The method you will learn for this is called bits-back coding. So again, I'd really like to encourage you to do the problems: in problem 3.2 you implement this (the autoregressive model is given for you, but you implement how to actually use it for compression). I would also really encourage you to do problem set 4, which is also linked in the description. It should be a very simple problem set (I actually got the feedback that it was perceived as very simple), but it really practices a lot of the notation we introduced for probability theory, and it introduces new concepts such as conditional entropies, which will become important for understanding how the bits-back coding mechanism works. I hope to see you in the next video, where we will go over the bits-back method. With that, have
fun with the problems!
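(A small numerical footnote to the latent variable picture above, with made-up probabilities: even though x1 and x2 are conditionally independent given z, they are correlated once z is marginalized out.)

```python
import numpy as np

# Latent variable model with two symbols: p(x1, x2) = sum_z p(z) p(x1|z) p(x2|z).
# All probabilities are hypothetical.
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.1],   # p(x | z=0)
                        [0.2, 0.8]])  # p(x | z=1)

# Marginalize over z to get the joint over the observed symbols:
p_x1x2 = np.einsum('z,zi,zj->ij', p_z, p_x_given_z, p_x_given_z)

# If x1 and x2 were independent, p(x1, x2) would equal p(x1) * p(x2):
p_x1 = p_x1x2.sum(axis=1)
p_x2 = p_x1x2.sum(axis=0)
print(p_x1x2)
print(np.outer(p_x1, p_x2))  # differs from p_x1x2, so x1 and x2 are correlated
```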