So in auto-encoders we have an encoder and we have a decoder, and we have no uncertainty. But in reality a given input x could relate to many z, and every output could be produced by many z. In that sense, if we want to produce meaningful probability distributions, we need the probabilistic equivalent of auto-encoders. So we want to sample good z: a mapping from an input x to a probability distribution q(z | x). Why? Because we want to produce good reconstructions of x for all the z that we may sample, all the z that we might want to consider. And we also want to correctly sample x given z. In this view we also have a distribution over outputs for a given z, p(x | z).

Now what's the idea of the VAE? Let us make both the sampling operation and the generation of x be neural networks. We have q(z | x) being one neural network, parameterized by phi, and p(x | z) being another neural network, parameterized by w. So how can we measure how good they are? And when I say measure how good they are, I mean we need something that we can do gradient descent on. Ideally we would maximize the evidence, the marginal likelihood p(x_i) = integral of p(x_i | z) p(z) dz, but that integral generally has no good solution. It's very hard to solve; in most cases we can't. So instead we use another quantity called the evidence lower bound, the ELBO. The ELBO can be shown to be a lower bound on the thing that we would ideally want to optimize.

The ELBO of x_i, phi and w is the expected value, under the distribution q(z | x_i; phi) produced by the encoder network parameterized by phi, of the log of the ratio p(x_i, z; w) / q(z | x_i; phi).

Now it has two intuitive terms. The first term of the ELBO is a reconstruction error, which becomes the mean squared error when p(x | z) is Gaussian. What's the idea here? We want to make the observation x_i that we have be probable, and in the exercise that's log p(x). There's also a regularization term, which happens to be the KL divergence from q to p.
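As a concrete illustration, here is a minimal NumPy sketch of these two terms in one forward pass. The linear encoder and decoder, the toy dimensions, and the single-sample estimate are all illustrative assumptions, not the actual architecture; the reparameterized sampling step (z = mu + sigma * eps) is the standard trick for keeping the sampling differentiable, which the lecture does not spell out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
x_dim, z_dim = 4, 2

# "Encoder" q(z | x; phi): here just linear maps to the mean and
# log-variance of a diagonal Gaussian over z.
W_mu = rng.normal(size=(z_dim, x_dim)) * 0.1
W_logvar = rng.normal(size=(z_dim, x_dim)) * 0.1

# "Decoder" p(x | z; w): a linear map back to x-space; with a Gaussian
# likelihood the reconstruction error is (up to constants) the MSE.
W_dec = rng.normal(size=(x_dim, z_dim)) * 0.1

def elbo_terms(x):
    # Encode: parameters of q(z | x).
    mu = W_mu @ x
    logvar = W_logvar @ x

    # Sample z ~ q(z | x) via the reparameterization trick, so the
    # sampling operation stays differentiable in phi.
    eps = rng.normal(size=z_dim)
    z = mu + np.exp(0.5 * logvar) * eps

    # First term: reconstruction error (MSE for Gaussian p(x | z)).
    x_hat = W_dec @ z
    recon = np.sum((x - x_hat) ** 2)

    # Second term: KL(q(z | x) || p(z)) with prior p(z) = N(0, I),
    # which has the well-known closed form below.
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

    return recon, kl

x = rng.normal(size=x_dim)
recon, kl = elbo_terms(x)
# Minimizing recon + kl corresponds to maximizing the ELBO (up to constants).
print(f"reconstruction: {recon:.3f}  KL: {kl:.3f}")
```

Training would then do gradient descent on recon + kl with respect to the encoder parameters phi and the decoder parameters w.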
Here you can say that what we want is for the network's estimated q(z | x) to be similar to the actual prior p(z); in the exercise that's written KL(q, p).
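For the Gaussian case this regularizer can be checked numerically: the closed-form KL(q || p) should agree with a Monte Carlo estimate of E_q[log q(z) - log p(z)]. A small sketch, where the particular mu, logvar values, the standard-normal prior, and the sample count are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameters of the encoder's diagonal Gaussian q(z | x) (made up here).
mu = np.array([0.5, -1.0])
logvar = np.array([0.2, -0.3])

# Closed form of KL(q || p) when the prior is p(z) = N(0, I).
kl_closed = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# Monte Carlo estimate of the same quantity: average of
# log q(z) - log p(z) over samples z ~ q(z | x).
n = 200_000
z = mu + np.exp(0.5 * logvar) * rng.normal(size=(n, mu.size))
log_q = -0.5 * np.sum(
    logvar + (z - mu) ** 2 / np.exp(logvar) + np.log(2 * np.pi), axis=1
)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}  Monte Carlo: {kl_mc:.4f}")
```

The two numbers agreeing is a useful sanity check that the regularization term really is the KL divergence from q to the prior, and not the other direction.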