So what would we like to do for learning? Well, we'd like the probability of the actually observed data to be high. So we want to compute the probability of an item x_i, or rather its log probability, log p(x_i). To get that, we need to integrate out the latent variable, which is z here, the latent space: the integral over z of p(x_i | z) times p(z) dz. Now, this is hard. It is an integral that runs over all possible values of the latent state. Keep in mind that for most positions in the latent space, most values of the latent variable z, it is essentially impossible to produce the particular image x_i we are looking at. So this integral, which asks us to integrate through everything, is mostly just zero, and only in a very small number of places is it different from zero.

That is why, in this context of density modeling, we often use a trick to rewrite it into two terms. The first term is the integral over z of p(z | x_i) times log p(x_i | z) dz. Because this term is weighted by p(z | x_i), we only need to evaluate it at a relatively small number of places: these are basically the z's that are compatible with the x_i we have. So we might hope that we can approximate this integral well. Then we have a second term which says that the probability distribution over z given x_i should be similar to the probability distribution over z. Okay. So some algebra now allows us to focus on the relevant z's for each x_i. For people who have maybe taken a Bayesian course before, this is basically the expectation-maximization algorithm.

Now, what are the ingredients of the variational autoencoder? We replace the encoder with a so-called recognition model. It now outputs q(z), a distribution over z, and q(z) will be learned to approximate p(z | x_i). The decoder is a density network. And the loss function we use is an approximation: it is a lower bound on the log likelihood log p(x), again the probability under a model specified by parameters w. This loss function requires Monte Carlo estimation. Why? Because, as we saw before, we can rewrite that integral, and we will be approximating it with Monte Carlo samples.

Now, what does the variational autoencoder process look like? We take an image and pass it through the encoder network, which gives us a mean vector and a standard deviation vector. Together, those let us draw sample latent vectors, and we can produce a number of them. Each one then goes through the decoder network and ultimately provides us with an estimate back in the original image space. So in this case we will be sampling, and what you will now try to do is implement the sampling for an isotropic Gaussian q(z).
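To make that sampling step concrete, here is a minimal sketch, assuming PyTorch as the framework; the function names, the `log_sigma` parameterization, and the decoder interface (assumed to return a `torch.distributions` object over x) are hypothetical choices for illustration, not the only way to set this up. The sketch draws z = mu + sigma * eps with eps ~ N(0, I), the reparameterization trick, so the Monte Carlo estimate of the lower bound stays differentiable with respect to the encoder outputs.

```python
import torch

def sample_isotropic_gaussian(mu, log_sigma, num_samples=1):
    """Draw samples from q(z) = N(mu, diag(sigma^2)) via the reparameterization trick."""
    # mu, log_sigma: (batch, latent_dim) tensors produced by the recognition model.
    sigma = torch.exp(log_sigma)
    # eps ~ N(0, I); z = mu + sigma * eps keeps the sample differentiable w.r.t. mu, log_sigma.
    eps = torch.randn(num_samples, *mu.shape)
    return mu.unsqueeze(0) + sigma.unsqueeze(0) * eps  # (num_samples, batch, latent_dim)

def lower_bound_estimate(x, mu, log_sigma, decoder, num_samples=1):
    """Monte Carlo estimate of the two-term objective: E_q[log p(x | z)] minus the KL term."""
    z = sample_isotropic_gaussian(mu, log_sigma, num_samples)
    # Reconstruction term, averaged over the drawn latent samples.
    # (Hypothetical interface: decoder(z_k) returns a torch.distributions object over x.)
    log_px_given_z = torch.stack(
        [decoder(z_k).log_prob(x).sum(dim=-1) for z_k in z]
    ).mean(dim=0)
    # KL between N(mu, sigma^2) and the standard normal prior N(0, I), in closed form.
    kl = 0.5 * (mu.pow(2) + torch.exp(2 * log_sigma) - 2 * log_sigma - 1).sum(dim=-1)
    return (log_px_given_z - kl).mean()
```

In practice a single latent sample per image is often enough, since the estimate is already averaged over the whole minibatch; drawing more samples just reduces the variance of the Monte Carlo approximation.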