Welcome back to the course on data compression with deep probabilistic models. In this video, you'll learn about a very popular class of deep probabilistic models called variational autoencoders. You may have heard about variational autoencoders before, but stay tuned nevertheless, because, at least in my experience, a lot of people have misconceptions about variational autoencoders, especially when it comes to their application to data compression. If you've never heard about variational autoencoders before, then stay tuned too; we'll derive everything from the ground up. So let's dive in.

The course is called "Data Compression with Deep Probabilistic Models" because these two aspects really go hand in hand. In the past few videos, we've focused mostly on the data compression side, and we've learned several modern methods of compressing data. But we've also seen that all of these methods need a probabilistic model of the data source in order to compress the data. So in this video, we'll focus more on machine-learning-based methods to parameterize and learn complicated probability distributions over data.

As a reminder, we've already discussed something like this before in this course. For example, in Problem 3.2 on problem set 3, you developed a compression method that uses a deep probabilistic model; there you used a so-called autoregressive model to perform compression. This autoregressive model has the structure depicted down here: it generates symbols one after the other, and it always generates them conditioned on some hidden state that evolves with each step. What is important to realize is that, for each symbol, these arrows represent a non-deterministic relationship: the hidden state always parameterizes a probability distribution. Therefore, these models parameterize probability distributions over messages, and therefore they can be used for compression. On the problem set, you used these probability distributions to define a codebook with Huffman coding, but now that you know about stream codes, you could also use the same probability distributions to compress the data with a stream code. When you apply this for compression, the decoder side looks roughly as follows: you perform the same transitions for the hidden state, but rather than sampling from the probability distribution at each step, or just evaluating it at some given symbols, you generate a deterministic output for each symbol. You do that by first using these probability distributions to define, for example, a codebook, and then using that codebook to decode some bits from the compressed representation of your message.

So these autoregressive models parameterize a complex probability distribution as follows. As a reminder from when we discussed autoregressive models: they parameterize a model of the data, which in this case is a sequence of symbols, and the structure of the model is that you start with the first symbol; then, after encoding the first symbol, you parameterize the probability distribution over the second symbol conditioned on the first symbol; then you parameterize the probability distribution over the third symbol conditioned on the first two, and so on.
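As a compact reference, the autoregressive factorization just described can be written as follows (I'm using $x_1, \dots, x_k$ for the symbols of the message; the notation is mine, chosen to match the densities used later in this video):

```latex
% Autoregressive factorization of the model distribution over a message (x_1, ..., x_k):
p_\theta(x_1, \dots, x_k)
  = p_\theta(x_1)\, p_\theta(x_2 \mid x_1)\, p_\theta(x_3 \mid x_1, x_2) \cdots
  = \prod_{i=1}^{k} p_\theta(x_i \mid x_1, \dots, x_{i-1})
```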
This specific architecture is just a computationally and memory-efficient way to represent a subset of this class of models. What's important here is that all of these probability distributions depend on some parameters, because they are parameterized by neural networks. I'm going to denote the neural network weights of these models as theta; in general, these are just model parameters, but in this specific case they are neural network weights. Rather than hand-crafting these parameters so that you can use them for compression, on the problem set you actually learned them, and you did this by optimizing the thetas to minimize the so-called cross entropy H between the data distribution and the model distribution, which is now p_theta; you optimize this over theta. In practice, we don't have access to the data distribution, and we discussed this, but we do have access to a training set, so we can at least minimize an empirical estimate of this cross entropy. This was fairly straightforward with autoregressive models, because these models literally parameterize the probability distribution over the data.

Now, the question I'd like to answer in this video is: can we do the same thing with latent variable models? By "the same thing" I mean: can we pose latent variable models that have some free parameters, for example neural network weights because they are parameterized by neural networks, and then learn these parameters so that we can perform optimal compression with them? Your first reaction might be: of course we can, why shouldn't we? But you will see in a minute that latent variable models are a bit more complicated, because they parameterize a probability distribution not only over your message, but also over latent variables, which are not actually part of your message.

This brings us to today's topic of deep latent variable models and scalable approximate Bayesian inference. This video will derive a fairly involved series of modern methods for probabilistic machine learning. To give you an idea of where all these derivations will lead, let me start with a brief spoiler. What we will derive at the end of this video is the concept of so-called variational autoencoders, which you may or may not have heard of already. These are probabilistic machine learning methods that are a form of representation learning: you have some data, and you want to map this data to some semantic representation space that captures important features of that data. Now, when you hear about variational autoencoders, these models are often introduced with an explanation that I sometimes find a bit misleading. The explanation that is often given is that a variational autoencoder learns to map the data to itself, i.e., to learn the identity function in some sense; but since the identity function is trivial to model, there is an additional constraint, and this constraint is often introduced by saying that, to make the task non-trivial, you squeeze the data through a bottleneck. Pictorially, the model architecture would then look something like this.
You have some input X, and you map it with a neural network to some output, and part of your training objective is to make the output look as similar as possible to the input, so you try to learn the identity function. But somewhere in the middle of this architecture you have a bottleneck, and what people usually mean by that is simply that it has a lower dimension than the data X and the reconstruction X prime, which both have the same dimension. The hand-wavy idea is then that, since the model is forced to squeeze this high-dimensional input through some lower-dimensional bottleneck, the two parts of the network are forced to only learn those aspects of the input data that are specific to that particular input; all the parts that are the same across all inputs are instead absorbed into the network architecture, i.e., they end up represented by the neural network weights rather than by the activations in the intermediate layer. In this setup, the two parts of the network are often called the encoder network and the decoder network, and you can already guess that these networks will play the role of the encoder and the decoder if the model is used for compression.

So how can these models be used for compression? There are two use cases of VAEs (variational autoencoders) for compression. First, they can be used for lossless compression, which is what we've focused on so far, and they are indeed used for this. Here, the idea is that you first map the input data to what's called the latent representation Z; you can think of it, roughly speaking, as the activations in this inner layer, but we'll see a better picture for it in a minute. So you map the input to this semantic representation and then you encode it. Then you can map Z to its reconstruction, and in practice this reconstruction will not be perfect: since you squeeze the data through a bottleneck, you will not be able to reconstruct the exact original input data. So you will then also have to encode the residual, and the hope is that this residual is rather small, or, thinking in terms of compression, that it has low entropy, because most of the data is already captured by the reconstruction itself, so that encoding this part only takes up a few bits. An important second application of variational autoencoders is lossy compression, which we will cover in the next video. Here, the idea is that you just leave out the residual: you only encode the representation Z, and you live with the fact that on the decoder side, when you decode Z and map it to the reconstruction, the reconstruction will be similar to the input but not exactly the same.

In both of these methods you have two training objectives. The first is that, in both lossy and lossless compression, you want the reconstruction to be close to the original data. In lossy compression you want the residual to be small because then your compression method has a lower distortion, i.e., it doesn't change the data as much; but also in lossless compression you want the residual to be small, because then you can expect it to have low entropy, so that you can encode the residual in fewer bits.
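To make the lossless use case concrete, here is a minimal sketch of the pipeline just described. The helper functions (`encoder`, `decoder`, `entropy_encode`, `entropy_decode`) are hypothetical placeholders, not part of any particular library, and I'm assuming integer-valued data for the residual; the point is only the order of operations: encode the latent, then encode the residual relative to the reconstruction.

```python
import numpy as np

def lossless_compress(x, encoder, decoder, entropy_encode):
    """Sketch of VAE-based lossless compression (hypothetical helper functions)."""
    z = encoder(x)                             # map input to its latent representation
    x_prime = decoder(z)                       # approximate reconstruction from the latent
    residual = x - np.round(x_prime)           # what the reconstruction misses (assumed integer data)
    bits_z = entropy_encode(z)                 # encode latent under a model of p(z)
    bits_residual = entropy_encode(residual)   # hopefully low entropy, hence few bits
    return bits_z, bits_residual

def lossless_decompress(bits_z, bits_residual, decoder, entropy_decode):
    z = entropy_decode(bits_z)
    residual = entropy_decode(bits_residual)
    return np.round(decoder(z)) + residual     # exact reconstruction of x
```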
So in both cases, and I'm being a bit vague here on purpose, we want the decoder network, which maps from Z back to X prime, to reconstruct the data well, i.e., to make the residual X prime minus X either small or of low entropy. But the second objective is that we want the encoder network to decorrelate the data. What do I mean by that? When you map the input data to this semantic representation Z and then encode that representation, you need a probabilistic model of the representation, and we've seen that compression methods for strongly correlated data become complicated. It is much easier to compress data if each dimension of the latent representation is statistically independent of all the others. We don't have this property on the input: we assume some input data, maybe images, where all the pixels are strongly correlated; but we want a mapping such that, in this latent or semantic representation, all the dimensions are decorrelated, so that we can easily compress them. That is the second objective: the encoder network should decorrelate. And for that we really need a probabilistic model; the naive picture of describing the autoencoder as just a sequence of neural networks will not be enough. When I say "decorrelate", what I mean is that we want the probability distribution over Z to ideally be a product, i.e., we want the dimensions to be independent.

Now, I started this discussion of variational autoencoders with the somewhat hand-wavy argument, or intuitive picture, that we want to map the data to itself while squeezing it through a bottleneck. This idea carries the intuition that squeezing the data through some low-dimensional representation should somehow help us compress it. But I want to make it really clear that I think this is a common fallacy; it is not actually true. In fact, there are more modern compression methods that use so-called normalizing flows, or even more modern variants such as integer discrete flows, and they map the input data to a representation that is then compressed; this representation has exactly the same dimension as the input data, and yet it can be compressed more easily than the input: applying simple compression methods to this representation, which in these models has the same dimension as the input, still leads to a lower bit rate than if you were to just compress the input directly. So it's not really about reducing the dimensionality. In fact, the mere fact that the bottleneck has a lower dimension doesn't even necessarily mean that you compress the data. There actually exist mappings that take arbitrary input data from some high-dimensional space and map it to a lower-dimensional space such that they are exactly invertible; the exact same information that's in the input space is also in the output space. So the idea of reducing the dimension, by itself, really has nothing to do with compression. And I realized when I gave this lecture live that this was actually obvious to some of the students.
So maybe, if you've been following along with these videos, this is completely obvious to you: if you want to think about compression, you have to think about information-theoretic concepts like the entropy of this representation. But when I hear people talk about variational autoencoders and related concepts, I quite often hear this wrong intuition that merely squeezing the data through a bottleneck would lead to some sort of compression. So I want to make it really clear that squeezing data through a lower-dimensional bottleneck has nothing to do with compression per se. Let me state that formally. The important note is: just squeezing data through a lower-dimensional bottleneck does not in itself imply compression. Instead, we have to think about information-theoretic measures rather than the dimension. In fact, I should say that we don't have just two training objectives but really three. The third training objective is that we want to keep the entropy of each of these symbols low. So we want them to be decorrelated, but for each of these symbols we also want the entropy to be low; or, more generally, we want the entropy of these latent or semantic representations to be low, to enable effective compression. And again, in order to do that, we need a probabilistic model; the naive picture of just stitching two neural networks together will not be enough. That's what we'll discuss in the rest of this video.

Before I get to that, just a brief remark: I mentioned mappings that map injectively between real-valued spaces of different dimensions. One of these mappings is called the Hilbert curve. It is a kind of fractal curve: you start with a curve like this, which is a mapping from one dimension (the curve itself) into the two-dimensional space, and then you continuously refine the curve so that it becomes more and more complicated. All I'm saying here is that, even if you start with this higher-dimensional, say two-dimensional, space, you can fully cover it with a one-dimensional curve. So just reducing the dimension does not compress anything here. And as a final additional remark: these curves exist in arbitrary dimensions, and you can even get a knitted version of them. This is what former colleagues of mine worked on; they worked on compilers for knitting machines, and as a kind of extreme case they showed that you can even knit a Hilbert curve in a single piece, which then looks like this if you manage to push it all together.

So that was the spoiler for this video. At the end of this video, you will have a better understanding of how these variational autoencoders work and how you can really enforce these information-theoretic objectives. Let's now finish up this preview and actually dive into how we can model these deep probabilistic models. So let's get to the topic of deep latent variable models. This will be our first step towards what will end up being a variational autoencoder. In these models, we are going to look only at the decoder side of the VAE, i.e., at the decoder network only. So let me copy the model architecture; the decoder side is the green part.
So this is the part of the architecture that goes from Z to X. I called it the decoder network in the variational autoencoder architecture, but if we just look at it by itself, let's not call it the decoder network; let's just think of it as some process that goes from Z to some output, typically a higher-dimensional output. And let's therefore also not call the output X prime, because we don't think of it as a reconstruction of some input; we just think of it as data X generated by this process. So rather than interpreting this as taking some semantic representation Z and mapping it to a specific data point, we are going to construct a probabilistic model that takes some Z and constructs from it a probability distribution over the data. We interpret it as a latent variable model that factorizes as follows: it has some data X and some latent variables Z, and the joint distribution factorizes into a prior and a likelihood. Each of these parts depends on some model parameters, which I've already called theta here, and these could for example be, and in variational autoencoders they usually are, neural network weights. So these are learned model parameters, for example a neural network. We are now interested in finding a way to learn optimal values for these thetas.

To make this more concrete, let me give you the most common example. In this example, the prior is a fully factorized prior. When I write down specific examples for the model, I'm going to write down the probability density functions for these parts, because I'm now going to assume that Z, and in many cases also X, is continuous; I'll always denote densities by lowercase letters. "Fully factorized" just means that, under the prior, all components of Z are statistically independent: the density p_theta(z) is a product of the densities of the individual components. The likelihood is parameterized by a neural network, and in many cases it is just a normal distribution around some mean produced by the neural network: the density of the likelihood, p_theta(x | z), is a normal distribution N over the value x whose mean is some function f_theta(z) of the latent variable on which we condition. The covariance of this normal distribution is diagonal in many cases, so that it is again easy to sample from it or to evaluate it. The function f_theta is a neural network, and the parameter sigma squared can be either fixed or learned; in this notation, the first parameter after the value being modeled is always the mean, and the second parameter is the variance of the normal distribution. Whether sigma squared is fixed or learned basically depends on whether you do lossless or lossy compression. A normal distribution, sometimes called a Gaussian, in case you're not familiar with it, is just a class of parameterized probability distributions whose density function has this bell shape. So this is our normal distribution.
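As an illustration, here is a minimal sketch of this example generative model in PyTorch: a fully factorized prior and a Gaussian likelihood whose mean is produced by a small neural network. The layer sizes and the choice of a standard-normal prior are assumptions made for illustration, not prescribed by the lecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of p_theta(x, z) = p_theta(z) * p_theta(x | z), with assumed sizes."""
    def __init__(self, latent_dim=16, data_dim=784, sigma=0.1):
        super().__init__()
        # f_theta: neural network mapping a latent z to the mean of the likelihood.
        self.f_theta = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, data_dim),
        )
        self.sigma = sigma  # fixed here; could also be a learned parameter

    def log_prior(self, z):
        # Fully factorized prior: here a standard normal on each coordinate of z.
        return torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)

    def log_likelihood(self, x, z):
        # Gaussian likelihood with mean f_theta(z) and diagonal covariance sigma^2 * I.
        mean = self.f_theta(z)
        return torch.distributions.Normal(mean, self.sigma).log_prob(x).sum(dim=-1)

    def log_joint(self, x, z):
        return self.log_prior(z) + self.log_likelihood(x, z)
```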
So this is just a common example, but in general you have this prior probability distribution over the semantic representation, which we can now also call a latent representation because it's a latent variable, and a likelihood which, differently from the models we've looked at so far, for example on the problem set on bits-back coding, is no longer a fixed probability distribution: it has parameters, and these parameters are the weights of a neural network that defines the mean conditioned on the latent value.

Now, our goal is to use these probability distributions for compression, so again we want to minimize the cross entropy between the data distribution of X and the model distribution of X, which is just the expectation, under the data distribution, of the negative log probability, i.e., the information content of the data under the model. The problem here is that this distribution, p_theta(x), evaluated at some point x, is only defined implicitly. The only parts of the model that we can really write down as mathematical functions are the prior and the likelihood. We cannot write the marginal distribution over the data in closed form; we can only write down how it is defined, i.e., how we could in principle calculate it. Since we have continuous random variables, we have to integrate over the density functions. This is in principle how we can calculate this distribution, but in practice, even though the latent space is typically lower-dimensional than the data space, it is still quite high-dimensional, and high-dimensional integrals are extremely expensive. In most practical applications, this integral would be prohibitively expensive to calculate. So not only can we not minimize the cross entropy of this distribution, we can't even calculate the distribution in practice. And that is the problem that we're going to solve with a couple of additional tricks.

Before we do that, let me write down our goal again. We want to minimize the cross entropy, which, due to the minus sign, means that for the rest of this video we will be concerned with maximizing the log of the marginal distribution evaluated on the data. This log of the marginal distribution of the data is called the evidence, or sometimes the model evidence: the evidence is log p_theta(x), evaluated on data x from the training set. That is our goal, but we cannot even calculate this probability distribution in practice. So how can we proceed? We already saw a first step towards a trick we can apply in one of the previous videos about compression methods; these ideas about compression will now help us train a deep probabilistic model. The idea that comes in here is from bits-back coding. Recall that in bits-back coding, the net bit rate for some message is exactly the negative log of the marginal distribution, i.e., exactly the negative evidence. So exactly the quantity that we want to maximize, but cannot calculate, is the net bit rate of bits-back coding.
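Written out, the quantities just discussed are (in the density notation introduced above):

```latex
% Marginal likelihood (evidence) of a data point x: a typically intractable integral.
p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, \mathrm{d}z

% Goal: maximize the log-evidence on training data; this is the same as minimizing
% the net bit rate of bits-back coding,
R_{\text{net}}(x) = -\log p_\theta(x).
```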
So a first idea you might come up with is: can we then maybe minimize, due to this minus sign, the net bit rate of bits-back coding? In fact, that would even be very well motivated from a compression perspective, because we want to minimize bit rates. Let's see how we actually arrived at this result that the net bit rate of bits-back coding is exactly the negative evidence, i.e., the negative log of the marginal data distribution. We saw that bits-back coding has three steps (not exactly in this order): we encode the latent variable, which costs a bit rate equal to the information content of the latent variable under the prior; then we encode the data using the likelihood; but before we even do all of that, we obtain the value Z by decoding it from the posterior, so we get back bits, exactly the information content of Z under the posterior. That is how we arrived at the net bit rate of bits-back coding.

Now, the problem is that there's really no free lunch: the fact that we cannot calculate the marginal has to be reflected in some problem with this equation. And the problem is that we cannot calculate this posterior. The posterior p_theta(z | x) is just the joint divided by the marginal, so in order to compute it we again need the marginal; it is again intractable. But now we can think one step further. When I introduced the bits-back coding method in one of the earlier videos, we used this posterior distribution to obtain the latent variable, but at that point this was kind of an arbitrary choice. It was only justified after the fact, when we saw that if we make this choice, the net bit rate becomes optimal: it becomes exactly the information content of the data under its marginal distribution. So choosing the posterior turned out to be optimal. But in principle, bits-back coding would also work with any other distribution over the latent variable; it would just not work quite as well. It would have a net bit rate that, first of all, depends on Z. Here we saw that, somewhat magically, even though the equation seems to depend on the value that you choose for Z, all the dependence on Z drops out when you write it out; that was a particular property of using the posterior. If you use some other distribution, then suddenly your bit rate will depend on Z. And you will also see that, at least in expectation, the net bit rate will then be higher, because the posterior gives the optimal net bit rate: anything else cannot make it lower, so in practice it will actually be higher. But we can still do it. So the idea is: just use a different distribution. Replace the posterior, which is prohibitively expensive to calculate, with some other distribution. I'm going to call this distribution q_lambda(z), and we have the freedom of choosing a different distribution for every data point. These lambdas are parameters of this distribution, in the same sense that the thetas are parameters of our model.
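The bit-rate accounting described above, written out:

```latex
% Net bit rate of bits-back coding: encode z under the prior, encode x under the
% likelihood, and get bits back by decoding z from the posterior.
R_{\text{net}}(x)
  = -\log p_\theta(z) \;-\; \log p_\theta(x \mid z) \;+\; \log p_\theta(z \mid x)
  = -\log p_\theta(x) \qquad \text{(independent of } z\text{)}

% The catch: the posterior itself requires the intractable marginal,
p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}.
```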
So we can choose these lambdas differently for every data point. For example, and this is actually a common choice, we could say that q_{lambda_x}(z) is a normal distribution that is even fully factorized: a product over i = 1 to K of normal distributions for z_i, each with some mean and some standard deviation, where these two parameters, accumulated over all i from 1 to K, make up lambda_x. So lambda_x is just a vector of parameters that contains all the mu's and all the sigmas for all coordinates i = 1 to K. That is a common choice, but it should be obvious that it's just one class of probability distributions that we could insert here instead of the posterior. And obviously, in almost all cases you can think of, no matter which parameters you choose, you will not be able to parameterize the exact posterior with this family. So no matter which parameters lambda you take, this replacement of the true posterior by q_lambda will not lead to the same bit rate; it will lead to a different net bit rate.

This modified net bit rate, which I'm going to write with a tilde, will now depend on Z, because the magic cancellation where all the Z terms conspire to drop out no longer happens in general; and obviously it also depends on the message X. It is basically the same equation as before, except that the posterior is now replaced by q. Let me combine the first two terms into the joint probability distribution: we get the information content of the joint, i.e., the negative log of p_theta(X = x, Z = z) for the value z that we choose, plus, because those are the bits that we get back, log q_{lambda_x}(Z = z).

And we know that this method will no longer be optimal in practice. When we decode data using this probability distribution, that is, if we decode random bits, that is equivalent to sampling from this distribution. So in expectation, if we sample z from q_{lambda_x}, this new net bit rate will in practice be larger (it could in principle be equal if we happened to choose exactly the posterior) than the net bit rate of the true bits-back algorithm, which, as a reminder, was the negative evidence, i.e., the negative log marginal distribution of the data. And as a reminder, this evidence is what we want to maximize; with the minus sign, it's what we want to minimize when evaluated on data from our training set. We also know that in this inequality we have equality if q_{lambda_x}(z) is exactly the posterior distribution for the given data point that we're interested in. So you can already see where this is going: if we want to minimize a quantity and we have some other quantity that is never smaller than it, then the idea will be to minimize that other quantity, and then at least we can be sure that whatever we get out is an upper bound on the quantity we care about.
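In formulas, the modified net bit rate and the bound just described read:

```latex
% Net bit rate when the (intractable) posterior is replaced by q_{\lambda_x}:
\tilde R_{\text{net}}(x, z) = -\log p_\theta(x, z) + \log q_{\lambda_x}(z)

% In expectation over z ~ q_{\lambda_x}, this can only be worse (larger) than the
% optimal net bit rate, with equality iff q_{\lambda_x} is the true posterior:
\mathbb{E}_{z \sim q_{\lambda_x}}\!\big[\tilde R_{\text{net}}(x, z)\big] \;\ge\; -\log p_\theta(x)
```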
But before I make that more formal, let me introduce some notation and naming conventions, so that when you read papers on variational inference, you will understand what all these terms mean. So, some remarks on notation and naming conventions. I already mentioned that log p_theta(x), the log of the marginal distribution of the data, is called the evidence, and we want it to be high, because that makes the information content of our data low. Another very common piece of notation arises if you look at the negative of the left-hand side of this bound. Introducing this bound via bits-back coding is actually a very unusual way to introduce it; I'm doing it this way because we can build on what we've already learned about compression, but most of the literature introduces these models without thinking about compression. What is typically defined is the negative of this left-hand side, i.e., the negative expectation, under this replacement distribution, of the modified net bit rate. Writing it out as an expectation over z drawn from q_{lambda_x}, I have to flip the signs, so this becomes positive log p_theta of the joint minus log q_{lambda_x}(Z = z). This is called the evidence lower bound. It's called the evidence lower bound because, if you run through all the minus signs and flip them, the greater-or-equal sign becomes a less-or-equal sign. The evidence lower bound is abbreviated ELBO, an abbreviation you'll see a lot. The ELBO, which is a function of both the model parameters theta and the parameters lambda_x of your distribution, is therefore less than or equal to log p_theta(X = x), which is the evidence. And that's where the name comes from: it's a lower bound on the evidence.

These are important quantities to remember, because they appear a lot if you read papers on this method, which is called variational inference. The reason for that name is that these parameters lambda_x (the subscript x is not standard notation, by the way) of the distribution q_{lambda_x}(z) are called variational parameters. The idea is that, instead of taking the true posterior distribution, you make a guess for your posterior distribution; this guess has some free parameters, and you vary these free parameters until you find the best fit, i.e., the one that makes the gap between the left-hand side and the right-hand side as small as possible. So these are called variational parameters, q_{lambda_x}(z) is called the variational distribution, and the idea behind this method is called variational inference, or VI for short. The idea is simply to approximate the evidence log p_theta(X = x) by the ELBO evaluated at theta and lambda_x star, where lambda_x star maximizes the ELBO; let me write it as an equation: lambda_x star is the argmax over lambda_x of the ELBO, for the given model parameters theta. This is called variational inference because you can think of this variational distribution as an approximate posterior distribution; after all, we introduced it as a replacement for the posterior distribution.
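Written out, the definition and the bound are:

```latex
% Evidence lower bound (ELBO): the negative expected modified net bit rate.
\mathrm{ELBO}(\theta, \lambda_x)
  = \mathbb{E}_{z \sim q_{\lambda_x}}\!\big[\log p_\theta(x, z) - \log q_{\lambda_x}(z)\big]
  \;\le\; \log p_\theta(x) \quad \text{(the evidence)}

% Variational inference: approximate the evidence by the ELBO at the maximizing
% variational parameters,
\lambda_x^\star = \arg\max_{\lambda_x} \mathrm{ELBO}(\theta, \lambda_x).
```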
What you will see empirically, and will also understand better on the problem set, is the following, which I'll phrase here just as an empirical observation: this typically leads to a variational distribution q_{lambda_x star}(z) which ends up being close to the true posterior p_theta(z | x). You'll see on the problem set more precisely what I mean by "close", but qualitatively this shouldn't surprise you, because we already know that the inequality becomes an equality if q is the exact posterior distribution. So if we maximize, or in this case minimize, the left-hand side, which still has the minus signs in it, which is the same as maximizing the ELBO, then the bound will not become an exact equality but approximately an equality, and it shouldn't surprise you (it's not a proof) that this goes hand in hand with finding a variational distribution that is, in some sense, similar to the posterior. After all, the net bit rate of this modified variant of bits-back coding can only be a good approximation of the optimal bit rate if we replace the posterior with something that is actually similar to it. This idea of maximizing the ELBO, or equivalently minimizing the net bit rate of our modified bits-back algorithm, is called variational inference, and I should give you the references here. If you want to learn more about variational inference, I recommend two reviews: one by David Blei and collaborators from 2016, and another by Cheng Zhang and collaborators from 2018. At the end of these lecture notes, which you can download from the link in the video description, you will find the precise references.

So this was a first idea that allows us to solve one of our problems. Let me briefly scroll up and remind ourselves of the problems we're trying to solve. We started from the idea of minimizing the cross entropy over the model parameters, because that's what we want to do for compression. But a simpler way to think about the cross entropy is that we want to maximize the evidence, log p_theta(x), i.e., drop the minus sign and maximize. The problem was that we cannot even calculate this evidence in practice, because the integral would be prohibitively expensive. With variational inference, we've solved this first step: we still cannot calculate the evidence exactly, but we can estimate it, namely by the net bit rate of our modified bits-back coding variant, which replaces the true posterior by a stand-in over which we then optimize. So we solved the first problem, that we could not even calculate the evidence; but we still have to solve the second problem, that we have to maximize over the model parameters. How are we going to do this? Here is the next idea. Let me first state what we have: we can now approximate the evidence, log p_theta(X = x).
That was this part: we want to approximate the evidence by the ELBO via this variational inference procedure, but we still have to maximize it, this time over theta. The idea now is to simply maximize our approximation over theta, i.e., to maximize ELBO(theta, lambda_x star). This may seem like the obvious thing to do at first, but it's actually not that easy, because the value of lambda that we find really depends on theta: we only get an approximation of the evidence after an optimization over lambda, and once we change theta in order to optimize it, we have to find new lambdas. So how would this algorithm look in practice? Let's write out some pseudocode. We have an outer training loop where we iterate over training steps: "for t in training steps". Within each training step, we first sample a minibatch B from the training set. Then we initialize the variational parameters lambda_x randomly, for all x in the minibatch. Then we have to maximize the ELBO over lambda in order to find our approximation, i.e., we have to do variational inference, and that means maximizing over lambda. So (let me use a different color here) we now have an inner loop, "for t prime in inner training steps", in which, since in many cases you can't do anything smarter, we perform gradient-based optimization: we perform a gradient step for lambda_x, for all x in the minibatch. This inner loop is variational inference, which finds our lambda_x stars. Once we have found them, we perform a gradient step for theta on the ELBO, which we evaluate at theta (to take the gradient) and at the optimized lambdas.

Obviously, we want to take a lot of gradient steps for theta, but every time we change theta, we also have to sample a different minibatch, because we don't want to optimize theta for just one specific minibatch: theta is shared across all data points because it is a parameter of the model. But every time we sample a new minibatch, we have to find the optimal lambdas again, because these depend on the data point: they parameterize the posterior, which depends very closely on the data point. We could perhaps remember all the lambdas, all the variational parameters, for all the training points, but typically we have quite a large training set, so by the time we happen to sample the same data point x again, the model parameters theta will typically have changed a lot, and the stored lambda will no longer be correct; we would have to optimize again to find the correct lambda. Let me write that down as well. Remember: the model parameters theta are shared; they are not a property of any individual data point, they are shared across all data points, i.e., global, the same for all data points. The variational parameters lambda_x, on the other hand, parameterize an approximate posterior, or variational distribution, which approximates the posterior p_theta(z | x) specialized to this particular data point. Thus they are local, i.e., different for each data point.
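A compact sketch of this nested training loop in Python; the functions `sample_minibatch`, `init_variational_params`, and `elbo` are placeholders I'm assuming for illustration, and automatic-differentiation details are simplified.

```python
import torch

def variational_em(model, train_set, num_steps, num_inner_steps,
                   batch_size=32, lr_theta=1e-3, lr_lambda=1e-2):
    """Sketch of variational expectation maximization (nested optimization)."""
    opt_theta = torch.optim.Adam(model.parameters(), lr=lr_theta)
    for t in range(num_steps):                           # outer loop over training steps
        batch = sample_minibatch(train_set, batch_size)  # placeholder helper
        lambdas = init_variational_params(batch)         # fresh lambda_x for each x in batch
        opt_lambda = torch.optim.Adam(lambdas, lr=lr_lambda)

        # Inner loop (variational inference): fit lambda_x for every x in the minibatch.
        for t_prime in range(num_inner_steps):
            loss = -elbo(model, batch, lambdas).mean()   # maximize the ELBO over the lambdas
            opt_lambda.zero_grad(); loss.backward(); opt_lambda.step()

        # Outer step: one gradient step on theta, evaluated at the optimized lambdas.
        loss = -elbo(model, batch, lambdas).mean()
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()
```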
What we really want to do is maximize the evidence, but really the average, i.e., the expected evidence over all data from the training set. So we have to sample many minibatches, a new minibatch in each iteration of the outer loop, and this invalidates the lambda_x stars from the previous iteration of the outer loop. If you go back to the pseudocode: in the outer loop we start by sampling a minibatch of training points, and then we do variational inference for their lambdas (you could also count the initialization of the lambdas as part of this variational inference step). Only once that is done can we perform the gradient step for theta; but in the next iteration of the outer loop we sample a new minibatch of training points, and for those, the lambdas from the previous iteration don't make any sense, so we have to run the entire inner training loop again, and that is expensive. We have a nested loop, and this is extremely expensive in practice, because each step of the outer training loop is not just a simple gradient step: it involves a full optimization within each step of the outer optimization. So sampling a new minibatch invalidates the lambda stars from previous iterations of the outer loop.

We've seen that this method is not quite what we want to end up with, because it is still too expensive, but it is a step in the right direction. Let me first give you the name for this method. This method of performing approximate inference and then, within this approximate inference, optimizing the model parameters, is called expectation maximization; and our particular variant of it, which uses variational inference as the approximation, is called variational expectation maximization. The references here are: expectation maximization in general was originally proposed by Dempster and collaborators in 1977, and the reference for the approximate variant that uses variational inference is Beal and Ghahramani, 2003. Again, you'll find the full list of references at the end of the lecture notes.

So how can we avoid this setup with nested training loops? One idea is the following: instead of explicitly optimizing the local variational parameters lambda for every new training point, can't we learn how these variational parameters should depend on the input? That is, can we learn a function that takes the data point x as input and outputs an estimate of the optimal lambdas that we want? And this is exactly the final step that we will add. The final step, the additional trick, is to learn how to do inference, and in particular to learn how to do variational inference. More precisely, we learn a function g, which again has some parameters; these will again be neural network weights, but of a different network. This function takes a data point x as input and maps it to variational parameters lambda_x. We then set lambda_x = g_phi(x) in the ELBO.
What we get out of that is also called the ELBO, so the notation is somewhat overloaded here. The ELBO still depends on the model parameters theta, but instead of individual variational parameters for each data point x, it now depends only on these global parameters phi. So phi, just like theta, is now a global parameter; both are global parameters. And it is just the same as the ELBO we had before: it is the expectation where we sample z from q, but q no longer has the per-data-point parameter lambda; instead, lambda is now g_phi(x). And since the notation is getting somewhat ridiculous with the double and triple indices, I'm going to switch to the standard notation; apologies for the notation change. We now call this distribution q with parameters phi; it is still a probability distribution over the latent variables, and it is typically written as conditioned on x, which makes sense, because it approximates the posterior, which is also a probability distribution over Z conditioned on X. So the notation q_phi(z | x) is reminiscent of the fact that this distribution approximates the posterior, but it really means nothing other than q_{lambda_x}(z) with lambda_x = g_phi(x). The ELBO then becomes: we sample from q_phi(z | x), and we evaluate the log joint minus log q_phi(z | x). This is the ELBO, and now that both theta and phi are global parameters, it is still the evidence lower bound, i.e., still a lower bound on log p_theta(x), which is still the evidence. And we are still interested in maximizing this evidence when evaluated on data from the data distribution. Since we have a lower bound on the evidence, we can approximate it as well as possible: the two sides will not be equal, but we can make the left-hand side as close as possible to the right-hand side simply by maximizing it over both theta and phi. By doing so, we both close the gap and, effectively, also maximize the right-hand side, because maximizing over theta gives us more leeway on the left-hand side. So we now maximize, to be more precise, the expectation under the data distribution (in practice, over our training set) of the ELBO. And to make it even more confusing, if you look into papers, this whole term, including the expectation over the data, is itself also often called the ELBO; it is usually obvious from context which one is meant, but both of these can be called the ELBO. So we maximize this over both the model parameters theta and the variational parameters phi. This method of learning a function that maps data to variational parameters is called amortized variational inference, because it amortizes the inner loop, i.e., the inference process, over all possible data points: rather than learning the variational parameters anew for each data point, we share statistical strength across data points and learn a single function that maps data points to their variational parameters. So this is now called amortized.
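In formulas, the amortized objective reads:

```latex
% Amortized ELBO: the variational parameters are produced by a learned
% inference network, lambda_x = g_phi(x), written q_phi(z | x).
\mathrm{ELBO}(\theta, \phi)
  = \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]
  \;\le\; \log p_\theta(x)

% Training objective: maximize the expected ELBO over the data distribution
% (in practice, the training set), jointly over theta and phi:
\max_{\theta, \phi}\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\mathrm{ELBO}(\theta, \phi)\big]
```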
The method of doing variational inference with such a learned mapping is called amortized variational inference; and now that we're doing expectation maximization with it, we can call the whole thing amortized variational expectation maximization. This term, amortized variational expectation maximization, is what follows from the naming conventions, but you will very rarely see it in papers, and the reason is that there exists a simpler name for it: variational autoencoders. And that is exactly what we started with. So this is VAEs, and that closes the loop. Why is what we've now arrived at a variational autoencoder? Let me first give you the reference and then we'll see. This was proposed by Kingma and Welling in 2013; they called it auto-encoding variational Bayes, but nowadays it's mostly called variational autoencoders.

So why is this a variational autoencoder? Let's go back to this picture, copy it, and understand how we can view it with a probabilistic interpretation. Here is our amortized variational expectation maximization; why is it the same as a variational autoencoder? In a variational autoencoder we have an encoder network and a decoder network, and they parameterize mappings from the input to some latent representation and from the latent representation to some output. We now have a probabilistic view of both of these mappings. The decoder network, which goes in this direction, is simply our function f_theta: if you remember, I gave the likelihood in the example as the density p_theta(x | z), a normal distribution over x whose mean is the function f_theta(z) and whose covariance was, in the simplest case, just diagonal. So this function f_theta is a neural network that maps from the latent representation to means that live in the output space. If you were to sample a reconstruction, for example if you use this for lossy compression and you try to reconstruct an image, a sensible thing to do is to take this likelihood, this probabilistic way of creating some output, and just take its mean. So the decoder network is f_theta.

And we've learned that, to make this method fast, we have to learn how to parameterize a probability distribution that is an approximate posterior over Z given some data: given some output, what is the probability distribution over Z? If we now follow the data flow rather than the generative process, we start from some data, and we first perform inference using this inference network, which is g_phi. Recall that the output of the function g_phi is the variational parameters, so it parameterizes our variational distribution, which I'm now going to write as a density, q_phi(z | x), using the conventional notation; but we saw that it is really nothing other than the variational distribution q_{lambda_x}(z) where we set lambda_x to the output of the encoder network.
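To connect the two views, here is a minimal sketch of the encoder and decoder as probabilistic components (PyTorch, with layer sizes and a fully factorized Gaussian variational family assumed for illustration; this is not a specific architecture from the lecture). The encoder g_phi outputs the variational parameters (mu, sigma) of q_phi(z | x); the decoder f_theta outputs the mean of the Gaussian likelihood p_theta(x | z).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """g_phi: maps a data point x to variational parameters (mu_x, sigma_x)."""
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.log_sigma = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_sigma(h).exp()   # parameters of q_phi(z | x)

class Decoder(nn.Module):
    """f_theta: maps a latent z to the mean of the Gaussian likelihood p_theta(x | z)."""
    def __init__(self, latent_dim=16, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, data_dim))

    def forward(self, z):
        return self.net(z)   # mean of p_theta(x | z); variance fixed or learned separately

# Sketch of one ELBO evaluation for a single data point x (noise injected by sampling z);
# log_prior, log_likelihood, and log_q are assumed helper functions, not defined here:
#   mu, sigma = encoder(x)
#   z = mu + sigma * torch.randn_like(sigma)          # sample z ~ q_phi(z | x)
#   elbo = log_prior(z) + log_likelihood(x, z) - log_q(z, mu, sigma)
```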
So in this amortized variational inference setup, we have a function g_phi that takes some input and outputs variational parameters. I gave you a common example for these variational distributions, where they were simply Gaussians: q_{lambda_x}(z) is a Gaussian over z with some mean and some diagonal covariance matrix, and in this example these means and covariances comprise the variational parameters lambda_x. The way to read this now is: you take an input x, you apply the encoder network to it, which gives you some variational parameters; in the simplest example these are just the mu's and sigmas, and they therefore parameterize a distribution over z, i.e., a distribution in this latent space. So we now have a very similar picture to the one from the beginning, with two neural networks that parameterize mappings, on the one hand from the input to z-space and on the other hand from z-space to the output. But in both cases they no longer parameterize deterministic mappings; instead they parameterize probability distributions: the encoder side parameterizes the variational distribution, and the decoder side parameterizes the likelihood.

With this probabilistic formulation of an autoencoder architecture, we can now address the issue I highlighted several times at the beginning of this video, where I said: watch out, the argument that is often made, that you just have to squeeze the data through some low-dimensional bottleneck, doesn't really lead to compression. I argued that we should instead look at information-theoretic quantities. So let's not care so much about the fact that the bottleneck has a lower dimension; let's instead care about information-theoretic quantities, and let's minimize the entropy, which is something we can now do, now that we have a probabilistic interpretation. We want to minimize entropy, and in fact, since we started this whole derivation by minimizing the expected bit rate, that is exactly what our objective already does. We have now minimized the entropy of this latent representation; more precisely, the objective minimizes a KL divergence from the prior to the approximate posterior distribution, and you will see that in a second when we write out the ELBO explicitly. Another thing you can see here is that, now that we have this probabilistic model, we're not actually mapping X to some deterministic Z; instead we sample: when we write down the ELBO, you will see that it contains an expectation over z, so we actually sample, i.e., we inject noise at this point. We inject noise here, since we sample z from the probability distribution q_phi(z | x): we no longer choose z as a deterministic function of x; when you write out the ELBO, z is sampled from this distribution. But there is no hand-wavy argument about how much noise to inject: you are actually learning how much noise to inject.

So we now have this probabilistic understanding of the variational autoencoder architecture, and we know that we train it by maximizing the ELBO. Let's have a closer look at the ELBO and interpret what this objective function means: interpretation of the ELBO, which is our objective function, i.e.,
the thing that we want to maximize. I say "interpretations", plural, because there are actually several ways, all correct, to look at it, and they highlight different aspects of the ELBO. Let's first write it out. The ELBO is now a function of the model parameters theta and the amortized variational parameters phi, and it is an expectation over the latent variable z drawn from the approximate posterior distribution. (I'm leaving out the expectation over the data, assuming that in order to optimize the ELBO we already use a stochastic optimization procedure like stochastic gradient descent, where we sample different data points in each step.) Inside the expectation we still have the negative modified bit rate of bits-back coding: the log of the prior over z, plus the log of the likelihood (and I should really write densities now, since we're dealing with continuous latent variables; if you want to implement this, you have to implement it with density functions), minus log q_phi(z | x), where, again, q_phi(z | x) is our shorthand for q with the variational parameters g_phi(x).

One way to rewrite this, and this will also be on the problem set, but it's really just a one-liner, is to pull out the likelihood term and interpret it as the expectation of log p_theta(x | z), with a plus sign. The remaining terms, the prior and the variational distribution, are both probability distributions over z, the variable we average over, and they combine into an information-theoretic quantity that we've introduced before: the KL divergence, the Kullback-Leibler divergence, between q_phi(z | x), which is the distribution we average over, and the prior of the model. And remember, we want to maximize the ELBO; since the KL term comes with a minus sign, that means we minimize the KL divergence while at the same time maximizing the expected log likelihood.
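Written out, the decomposition discussed here (with the KL divergence as defined earlier in the course) is:

```latex
% ELBO written out, then regrouped into a reconstruction term and a KL term:
\mathrm{ELBO}(\theta, \phi)
  = \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(z) + \log p_\theta(x \mid z)
      - \log q_\phi(z \mid x)\big]
  = \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
    - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)

% where the Kullback-Leibler divergence is
D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{z \sim q}\!\big[\log q(z) - \log p(z)\big] \;\ge\; 0.
```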
So how can you think about that? Well, there are a lot of machine learning methods that just maximize a log likelihood; in fact, maximum likelihood estimation is a very common way of training models. If you had only the likelihood term in your objective function, i.e. if you were to ignore the KL term and maximize only the expected log likelihood, that would essentially be maximum likelihood estimation. To be a bit more precise: think of the log likelihood, for any given x, as a function of z with some shape, and remember that you maximize the objective over both the model parameters theta and the parameters of the variational distribution. The best thing you could do then is to find the point where this function has its maximum and put all the mass of the variational distribution on that point, so that you always sample it. Since you are free to learn any variance for this distribution, you would learn variance zero, i.e. a variational distribution that collapses. So maximizing only this part would amount to maximum likelihood estimation; more precisely, it would make q_phi(z|x) collapse to a delta function, a distribution that is infinitely sharply peaked at a single point with infinitesimally small width, and that point would be the maximum likelihood estimate (MLE), i.e. the argmax over z of log p_theta(x, z). And of course you would not only learn the variational parameters, you would also learn the model parameters: you would learn a model whose maximum likelihood estimate puts the highest possible probability on the data on which you evaluate it, and at the same time you would learn how to find this peak of the likelihood given some data, because the variational distributions would just become delta functions at this argmax.

But luckily this is not the only part of the objective function. We also have the KL term, and since it enters with a minus sign, we actually minimize it (the minus sign simply comes from the definition of the KL divergence, which has a plus sign in front of log q and a minus sign in front of log p). You can think of this additional term as something like a regularization term, a regularizer. We proved on the problem sets that the KL divergence is never negative: it is a measure of distance between probability distributions, it is zero if the two distributions are equal, because then the two log terms cancel everywhere, and it is strictly positive otherwise. So by minimizing it, which we do because of the minus sign, we try to make these two distributions as similar as possible: it tries to make q_phi(z|x) similar to the prior. That is one way to think about this second term, the KL term between the approximate posterior and the prior, and this interpretation really concerns the training procedure.
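For the diagonal Gaussian encoder from the earlier sketch, and assuming a standard normal prior (an assumption for this illustration; the prior could be a different distribution), this regularizer even has a well-known closed form, so no sampling is needed for it:

```python
import torch

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    Closed-form expression for the regularizer term when both q_phi(z|x)
    and the prior are Gaussian."""
    return 0.5 * torch.sum(sigma ** 2 + mu ** 2 - 1.0 - torch.log(sigma ** 2))
```

Note that the expression is zero exactly when mu is zero and sigma is one, i.e. when the approximate posterior equals the prior, which matches the interpretation above.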
Now, once you deploy this model for compression, you can also interpret this KL term in a second way. When you use this method for compression, you would typically map some input to a latent by using the variational distribution; for example, you might choose the mode of the variational distribution as the latent variable that you want to encode, and then you would encode this latent variable with a lossless compression scheme that needs a probability distribution in order to define its codebooks, or simply in order to work. Now, the probability distribution that you use to compress this value z has to be one that the decoder knows. It cannot be the variational distribution, because that depends on the data, which the decoder does not have; so in order to compress the latent you actually want to encode it with the prior, because that is something the decoder knows. At compression time, then, we encode a value z using the prior, and this KL term makes sure that the z we get out of the encoder actually has a high probability under the prior. The precise way this plays out depends on how exactly you choose z given the variational distribution, but generally speaking, this term helps to make sure that the things you want to encode have a high probability under the prior, which means they have low information content, so they can be encoded into a short bit stream.

So that is one way to think about the ELBO. Another way to think about it, which you will also derive on the problem set (this one takes a few more steps, but it is not too hard either), is to group the terms differently: you can show that you can pull out the marginal likelihood of the data, log p_theta(x), and the rest is just the negative of a Kullback-Leibler divergence between the variational distribution and the true posterior. This is in fact how variational inference is often introduced: you start from this KL divergence and say, I want to minimize it. Why would you want to minimize it? Well, it is minimized because we maximize the ELBO and it enters with a minus sign, so effectively we minimize this KL divergence, and by minimizing it we make sure that the variational distribution, which we have always said approximates the posterior, really does so; here we literally see it, because we minimize a distance measure between the approximate posterior and the true posterior. So this is the term in the objective that tries to get the two close: minimizing it makes the approximate distribution q_phi similar to the true posterior, and this is why you will often see q_phi called the approximate posterior.

If you were given a fixed model and were only interested in performing Bayesian inference in that model, which is also a very important application of variational inference, then you would only have to consider this KL part, because the evidence term would be a constant, and you would literally just try to make the variational distribution mimic the posterior distribution. But in a variational autoencoder, and more generally in a variational expectation maximization setup, we also say that the model has parameters that we optimize over. So we again have two competing terms: one term tries to make the approximate posterior as close as possible to the true posterior, and the other term tries to maximize the evidence, log p_theta(x). And the evidence term concerns the compression performance directly: the negative log evidence is the information content of x under our model, i.e. the theoretical lower bound on the bit rate for lossless compression, so maximizing the evidence minimizes the information content of x under our model. Now you may think, well, that is all we are really interested in for compression, but that is not quite true, because there may be models out there under which the data we want to compress has a very low information content, but for which we cannot actually carry out the inference part for computational reasons.
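The second grouping above, ELBO equals log evidence minus the KL divergence from the approximate to the true posterior, is easy to check numerically in a toy model where everything is tractable. Here is a small sketch with a three-valued latent; all the numbers are made up for illustration:

```python
import numpy as np

# Tiny discrete toy model: z takes three values, x is a fixed observation.
# Everything is exactly computable, so we can verify the identity
#   ELBO(q) = log p(x) - KL( q(z) || p(z|x) )
p_z = np.array([0.5, 0.3, 0.2])            # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])    # likelihood p(x|z) at the observed x
q_z = np.array([0.2, 0.6, 0.2])            # some (suboptimal) variational distribution

p_x = np.sum(p_z * p_x_given_z)            # evidence p(x)
posterior = p_z * p_x_given_z / p_x        # true posterior p(z|x)

elbo = np.sum(q_z * (np.log(p_z) + np.log(p_x_given_z) - np.log(q_z)))
kl_q_posterior = np.sum(q_z * (np.log(q_z) - np.log(posterior)))

print(elbo)                          # the two printed values agree
print(np.log(p_x) - kl_q_posterior)
```

The gap between the ELBO and log p(x) is exactly this KL divergence to the true posterior.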
That is why we also have this overhead term: it accounts for the fact that the evidence is only a theoretical lower bound on the bit rate. And it is theoretical not only because of the particulars of different coding algorithms, but also because we still have to find a method to encode the data at all, and the typical method is to first map it to a latent representation. There we may not be able to find the true posterior; for example, if we want to use bits-back coding, we may not be able to compute the true posterior that bits-back coding would ideally use. So this KL term measures the overhead we incur if we use a slightly wrong posterior, the approximate posterior, for bits-back coding.

These are two equivalent interpretations of the ELBO. What you actually have to do now is the following: if you have a model architecture given by an encoder and a decoder network (the functions we called f and g), which parameterize the approximate posterior and the likelihood, and you also have a model for the prior, then you just have to implement this objective, average it over samples from the approximate posterior, and maximize it. The goal is simply to maximize the ELBO over theta and phi. That, however, is actually a non-trivial task, and this is the final point I will make here.

There is one final issue that has to be resolved, and you will resolve it on the problem set. The ELBO is a function of theta and phi in which you average over samples z. You can approximate this average by sampling points z from the approximate posterior q_phi(z|x), but this distribution itself depends on phi, the parameter over which you want to optimize. If you implement this in practice, you would not write out the expectation as a high-dimensional integral; you would just sample points from this distribution and estimate the expectation with them. So the distribution from which we have to sample depends on the parameter phi with respect to which we want to differentiate, because we want to calculate gradients of the ELBO with respect to both theta and phi. And this is the final complication: how do we actually differentiate with respect to phi? You can think of the derivative as measuring how the ELBO changes if we change phi slightly. If we change the phi that appears inside the expectation, we can measure that by just taking derivatives of everything inside. But if we change the phi that appears in the sampling distribution, that will actually lead to different samples, and we somehow have to take into account that changing phi slightly changes which samples we draw. That is what you will do on the problem set, but to make these notes complete: there are two mainstream ways to do this. One is the so-called reparameterization gradients; they are usually the better method to go for if they work, because they usually lead to lower variance, so the optimization converges faster. They are due to Kingma and Welling, from the VAE paper in 2013, but they do not always work; in particular, they do not work with discrete latent variables. The more generic alternative is the so-called REINFORCE gradients, which are called like this because they come from reinforcement learning, where a similar trick is often used.
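Here is a minimal sketch of the first of these, the reparameterization trick, for the diagonal Gaussian case; again PyTorch is just an assumed framework for illustration:

```python
import torch

def reparameterized_sample(mu, sigma):
    """Reparameterization trick: draw eps from a fixed N(0, I) and transform it,
    so that the sample z = mu + sigma * eps is a differentiable function of the
    variational parameters (mu, sigma)."""
    eps = torch.randn_like(mu)     # noise from a parameter-free distribution
    return mu + sigma * eps        # same distribution as N(mu, sigma^2), but differentiable in mu and sigma

mu = torch.tensor([0.0, 0.0], requires_grad=True)
sigma = torch.tensor([1.0, 1.0], requires_grad=True)
z = reparameterized_sample(mu, sigma)   # gradients w.r.t. mu and sigma flow through z
```

The sample has the same distribution as before, but all the randomness is pushed into a parameter-free noise source, so ordinary backpropagation gives gradients of the ELBO with respect to phi.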
The REINFORCE gradients were first introduced into the variational inference literature by Rajesh Ranganath and collaborators in 2014, and again I will list the precise references at the end of the lecture notes.

Once you look at the problem set, you may ask: in the end, after all this theory, implementing these models is actually very simple, so why did we even bother? I think it is really important to remember where all of this comes from, because if you want to do research on these things, you do not just want to re-implement what is already there; you want to come up with new ideas, and understanding where these methods come from will allow you to improve upon them. So let me briefly make a few comments on that. Why all the fuss, why do we need all this theory? Well, now that you understand where variational autoencoders come from, and how they relate to variational inference and approximate Bayesian inference, you can look into that literature and notice that there is a lot of ongoing research on variational inference and related methods that may be applicable to compression. Or it may not be, because compression has additional constraints and different objectives than the settings in which people usually think about variational inference. So look into that literature and try out whether it helps to improve compression methods.

As a few concrete examples: we have seen that variational autoencoders work by optimizing the ELBO, the evidence lower bound, which, as the name says, is a lower bound on the model evidence. There is a lot of research in the variational inference community on new bounds that are tighter than the standard ELBO, i.e. closer to the model evidence. A lot of the time that helps for inference, or for other tasks such as representation learning, but it is not clear for all of these methods whether they actually help for compression, because we derived the ELBO precisely as the bit rate of bits-back coding; so if you want to encode with bits-back coding, then maybe the plain ELBO is exactly the thing you want. But there are still improvements to be had here. For example, there is a method called importance weighted variational inference, and there is recent work applying it to compression, by Lucas Theis and Jonathan Ho from this year, which you could call importance weighted compression; I will give the reference at the end of the lecture notes. This was actually non-trivial: the importance weighted bound had been known for a couple of years, but it is a different bound than the bit rate of bits-back coding, so it was not even clear that optimizing it helps for compression. What this paper comes up with is a new compression method, essentially a variant of bits-back coding, whose bit rate corresponds to this importance weighted bound, and they show that this actually helps for compression.
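To show what is meant by the importance weighted bound itself (this is not the Theis and Ho compression method, just the bound, sketched for a made-up one-dimensional toy model):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    # log density of a univariate Gaussian
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((v - mean) / std) ** 2

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(mu, sigma^2); numbers are made up.
x, mu, sigma, K = 1.3, 0.6, 0.8, 16
z = rng.normal(mu, sigma, size=K)                    # K samples from q(z|x)
log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma)

elbo_estimate = log_w.mean()                         # standard (K=1) ELBO, averaged over the K samples
iw_bound = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()   # log-mean-exp of the weights
print(elbo_estimate, iw_bound)                       # the importance weighted bound is typically larger, i.e. tighter
```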
Another example: you now know that what the encoder network of a VAE really does is perform approximate Bayesian inference, and you now know that the fact that we are using a neural network for that is not because a neural network is the only thing that could do this job. If you allow me to go back to the model: the encoder network really just performs inference over the decoder network, and the only reason we use a neural network for it is to speed up the training process. But if you have a lot of time or a lot of computational resources at compression time, say you are a streaming service compressing a video, and you have a large budget for compressing it because it will then be streamed many times, so paying a bit more at compression time is worth it if you can save a few bytes on every single stream, then instead of using a learned inference network, you can go back and do iterative inference over the decoder network at compression time. That is, you run the inner optimization loop that we discussed at compression time rather than at training time, and you see whether you can get a better compression rate that way. That was indeed tried: this is iterative inference, and you can also use hybrid methods, where you use the learned encoder network to predict a good initialization of the variational parameters and then perform iterative inference on top of that. This is called iterative amortized inference; it was first proposed in the variational inference literature by Joe Marino et al. in 2018, and it was applied to compression, in a simplified version, by Campos et al. in 2019.

And finally, you have now seen where variational autoencoders come from, and you can understand what the term variational means: they do variational inference, and variational inference is really just a method that performs approximate Bayesian inference in a way that is computationally feasible, i.e. it approximates true Bayesian inference in order to save computational cost. But there are other methods that also save computational cost; alternatives to variational inference exist, in particular so-called sampling methods, or Markov chain Monte Carlo (MCMC) methods. These sometimes work better than variational inference, but it is not at all trivial how you could use them for compression in a way that is still scalable. There is some initial, pioneering work here by Havasi and collaborators from 2018, but so far this approach either does not scale very well, or, if you make it scalable, it gives up a lot of the advantages of Markov chain Monte Carlo methods. So in this area there is still a lot of very interesting research to be done on how these competitors of variational inference can be used for compression.

So again, if you found the derivations in this video a bit overwhelming, then I highly encourage you to look at the current problem set, which is linked in the video description. It will show you that what comes out of these derivations, at least the plain vanilla variant of it, is very easy to implement, and these models are relatively easy to train. But in order to then do research on top of that, it is really important to understand where everything comes from and what it actually means. So with that, have fun with the problem set.