Hello, and welcome to lecture 24 of probabilistic machine learning. The last two lectures have been focused on a very interesting insight. Here is a new way of phrasing it that I haven't used before: the relationship between a probability distribution P over any variable X and any other probability distribution Q over any other variable Z can be described in terms of the equation at the top of this slide, which states that the log marginal over X can be written as the sum of two terms. One term we have come to call either the expected complete-data log likelihood, the negative variational free energy, or the ELBO, the evidence lower bound, all words for the same thing; the other is the KL divergence between Q and the conditional distribution for the variable Z under P. So this is an abstract relationship between the distributions P of X and Q of Z, or, arguably, between Q of Z and P of X and Z, if you like, a generative model for X and Z, and we have used it in two different ways so far. Two lectures ago we first encountered it, also in an application to the Gaussian mixture model, as a lever to help us find maximum likelihood estimates for any parameters that P of X might have. The idea there is to say: if we have access to the full conditional distribution P of Z given X and theta, then we can simply set Q to that function, to P of Z given X. When we do that, the KL divergence is minimized and in fact becomes zero, because its two arguments are the same, and that means that the log marginal likelihood of X given theta is equal to the ELBO. Then we can maximize the ELBO as a function of theta instead of the log marginal likelihood. This is easier, occasionally or frequently, because, simply speaking, the ELBO has the sum outside of the logarithm: instead of having to compute the logarithm of an expected value, we have to compute the expected value of the complete-data log likelihood, and then we can maximize that with respect to theta. This algorithm is called the EM algorithm, for expectation maximization, because we are maximizing a function that is actually the expected complete-data log likelihood rather than the marginal likelihood. So EM is a framework for parameter inference, inference on theta, if we have access to the exact conditional distribution P of Z given X. In the last lecture we realized that we can use the same mathematical relationship between Q of Z and P of X and Z to also construct probabilistic approximations: not point estimates, not maximized likelihoods, but approximate distributions in the case where the conditional distribution P of Z given X, which we can also think of as a posterior over some latent variables Z given some data X, is not available. In such situations, where X can be thought of as data, or has the role of data, and Z are the latent variables that we care about, we can use this relationship to construct approximations to this posterior; in fact, we also motivated this by saying we can take theta and move it into Z.
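Just to have it in one place before we go on, here is that identity written out in the notation we have been using (x for data, z for latent variables, theta for parameters), together with the two EM steps it gave us in that earlier use:

$$
\log p(x \mid \theta)
\;=\;
\underbrace{\int q(z)\,\log\frac{p(x,z\mid\theta)}{q(z)}\,\mathrm{d}z}_{\text{ELBO } \mathcal{L}(q,\theta)}
\;+\;
\underbrace{\mathrm{KL}\big(q(z)\,\|\,p(z\mid x,\theta)\big)}_{\;\geq\,0}
$$

E-step: set $q(z) = p(z\mid x,\theta_{\text{old}})$, so the KL term vanishes and the ELBO touches the log marginal likelihood. M-step: maximize $\mathbb{E}_{q(z)}[\log p(x,z\mid\theta)]$ with respect to $\theta$.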
In such situations we typically do not have access to P of Z given X; it is intractable. So instead we consider an arbitrary approximation Q of Z, a probability distribution that we want to be as close as possible to P of Z given X, and we can quantify "as close as possible" through this same relationship: on the right-hand side, the KL divergence between Q and P will now in general be non-zero, and it provides one possible measure of similarity between Q and the posterior P of Z given X. Because the left-hand side is now a constant (X is just the data, it is given to us, so it is just a number), we can treat the ELBO, the evidence lower bound, as a quantity to maximize in order to simultaneously minimize the KL divergence. When we do that, we are equivalently minimizing the KL divergence, and that is maybe a good thing in itself, in the sense that it minimizes some divergence. Of course, there are many different measures of similarity between probability distributions, and this particular KL divergence, in this particular direction, is maybe not the unique smart thing to do, but it is a very interesting approach to constructing probability distributions that approximate the true posterior. Now, one way to do this would be to decide that Q of Z should be some parametrized probability distribution, for example a Gaussian. Then we can hope that we might be able to write down the ELBO as a closed-form expression; we won't always be able to, but maybe we're lucky, and then we just maximize it as a function of the parameters of that Gaussian. But we actually realized last lecture, through a derivation that maybe went by a bit quickly, so we have to revisit it today, that we don't always have to take this drastic step immediately. We don't have to assume that Q has a parametric form. Instead, we can get away with weaker constraints on what Q is. If we ask for the optimal choice of Q without any constraints, then we know what that optimal choice is: it is just P of Z given X, and we know by assumption that we can't compute that; if we could, we would just do it and we wouldn't need this whole spiel. But assuming we can't compute P of Z given X, we might still be able to say: I would like to find the optimal probability distribution Q of Z, not in a parametrized form, just the optimal function, under the sole restriction that it factorizes over certain subsets of my variables. This is going to be helpful, because when we impose this factorization, the ELBO is affected in the sense that the integral inside it might become significantly simpler. We discussed this in abstract form in the last lecture. I mentioned that this approach is connected, in physics, to the idea of mean field theory, and so it is sometimes presented under this name; it is also sometimes called variational message passing. And it boils down, I'm not going to do the derivation again, to an iterative process in which we impose our factorization.
So we assume that the Q of Z that we want to use for our approximation separates into individual factors. When we do that, we arrive, through some derivations that are on this slide and that I'm not going to redo, because we went through them last lecture, at an algorithm that says: in an iterative optimization process, repeatedly cycle through the individual subsets of variables introduced by the factorization we have decided we want. And then, amazingly, there is a closed-form operation (closed form in the sense that we get an explicit, tractable expression) that tells us what the optimal choice for the approximate distribution Q j over a subset Z j is, without imposing a parametric form on it. The framework tells us that the logarithm of this optimal choice is given by the expected value, under the approximations Q for all the other variables that are not the ones we are currently looking at, of the log of the joint, plus a constant, which we need for normalization. This is possible because, again, of the relationship between the ELBO and the KL divergence: maximizing the ELBO corresponds to minimizing a particular KL divergence relative to what you might call an implicit joint distribution, and we know how to minimize a KL divergence; we just set the two arguments on its left- and right-hand sides equal. I said at the end of the last lecture that this approach will become the final piece in our toolbox of machine learning algorithms for probabilistic inference. It is called variational inference, and it is relevant whenever you want to compute an approximate posterior distribution over some quantity Z under some generative model P for X and Z, for data X and latent quantity Z, and you don't want to construct a point estimate, you don't want to do Markov chain Monte Carlo; you want an explicit representation of a probability distribution that you can optimize, because optimization tends to be easier than sampling. Why? Because it converges after a finite number of steps, and then you know that you have an approximation that is as good as it can be within this framework. And here is a summary slide that roughly describes what we are supposed to do. First, we introduce a factorization: we say that we want to construct an approximation Q to the distribution P of Z given X, and the only thing we want to impose externally, in addition to having P to begin with, is that Q factorizes in a particular form. So we identify subsets of the variables, say that each subset should be independent of the other subsets under the approximation, and ask how close we can get to the actual posterior, in terms of KL divergence, with this factorizing distribution. Then, once we've done that, we construct an iterative algorithm which in each step computes the expectation under all the other approximations: in every step it cycles through the subsets of variables j, from one to however many subsets there are, and for each one computes the expected value, under all the other subsets, of the log joint.
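Written as a formula, this is the update the slide states, with $Z_{\setminus j}$ denoting all subsets except the j-th:

$$
\log q_j^\star(Z_j) \;=\; \mathbb{E}_{q(Z_{\setminus j})}\!\big[\log p(X, Z)\big] \;+\; \mathrm{const.},
$$

where the constant is fixed by the requirement that $q_j^\star$ normalizes.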
Then, if we're lucky, we find that this expression, which is a function of Z j (a whole function, not just a number), actually has the form of a particular log probability distribution for which we know the normalization constant. And that is our approximation for this step of the iteration. If we keep doing that several times across the subsets of variables Z j, then eventually this iteration converges, because it can be shown to be a convex operation in the space of functions Q, and when it has converged we call the result the variational approximation. Now, I know that this presentation, even though it really is the entire algorithm, is perhaps too abstract; it is very difficult to understand what it means in practice, because it leaves out many details. It leaves out how we should initialize the probability distributions, and what we mean by looking at this function and recognizing it as a tractable approximation. To understand how this works in practice, we have to look at some concrete examples, and today we'll do that. We'll look first at the Gaussian mixture model again and see whether we can find a probabilistic treatment of the Gaussian mixture model using the variational approach, and then we'll return to our large-scale example, the topic model, and see whether we can find a good variational approximation for it. Of course we can, and we'll do it. I can already tell you at the beginning of this process that while the individual results will be different for these two models, the process will be very similar. So we'll go through the Gaussian mixture model first, and then, when we do the topic model, maybe you have the mental space to notice more of the structure and less of the concrete results. So let's get going. Here is our Gaussian mixture model again. To remind ourselves: we are observing just these individual samples, which are locations in some input space, and the assumption that defines the model is that these data points are generated from several different Gaussian distributions. Each of these Gaussians has a mean and a covariance, and we select one of them by assigning a membership to a particular cluster for each individual datum. To make this a Gaussian mixture model, we define mixture probabilities pi: a vector of length K containing numbers that are non-negative and sum to one, so they are probabilities. To draw a datum x, the generative process is to first draw a cluster membership assignment z from the discrete distribution over cluster memberships; that gives us, for datum number n, a vector z n with K entries, exactly one of which is one while the others are zero. And then, knowing which cluster we are in, we draw a Gaussian random variable from the Gaussian distribution that defines this cluster, with its mean and covariance. We saw back then that it is a good idea to introduce this auxiliary variable z n, because it allows us to write the joint distribution as a double product rather than a sum over probabilities without z: the marginal distribution over x is a product over a sum, while the joint distribution over x and z is a double product, and that is convenient.
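As a sanity check on that generative story, here is a minimal sketch of the sampling process in code. The specific values for the number of clusters, the dimension and the number of points, and all the variable names, are just illustrative choices of mine, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N = 3, 2, 500                                 # clusters, dimensions, data points (illustrative)
pi = rng.dirichlet(alpha=np.ones(K))                # mixture weights, non-negative and summing to one
mus = rng.normal(scale=3.0, size=(K, D))            # one mean per cluster
Sigmas = np.stack([np.eye(D) for _ in range(K)])    # one covariance per cluster

X, Z = [], []
for n in range(N):
    z = rng.multinomial(1, pi)                      # one-hot cluster assignment z_n
    k = z.argmax()
    x = rng.multivariate_normal(mus[k], Sigmas[k])  # draw x_n from the selected Gaussian
    X.append(x)
    Z.append(z)
X, Z = np.array(X), np.array(Z)                     # X is (N, D), Z is (N, K)
```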
More abstractly, we can write it like this: there is a probability for z given pi, and then a probability for x given z and mu and Sigma, and this conditional independence structure is reflected by this graph. To generate z, we draw from the distribution parameterized by pi, and to draw x we need z and mu and Sigma, but not pi. To perform Bayesian inference in this model, to build a probabilistic version of it, means that we want to replace those parameters mu, Sigma and pi, which are the parameters of the EM version of Gaussian mixture modelling (and are closely related to k-means), with random variables, variables to which we assign a probability distribution. This means that our likelihood for mu, Sigma and pi, which is a distribution over the data and the assignments, turns into a joint distribution: a full generative model for the data x and the assignments z, but also for the variables pi, mu and Sigma. I call them variables now, rather than parameters, because they have taken on this new role. And what is that joint distribution? It is the likelihood that we already have from above, times priors for those variables. Now we get to choose what those priors should be, and as in the example of the topic model, a good design idea is to choose priors that are not just exponential families, but the exponential families that are the conjugate priors for the kind of observations we make in this likelihood. So: z is a discrete assignment, a draw from a discrete probability distribution, so the conjugate prior for the unknown probability vector pi that z is drawn from is the Dirichlet distribution. We've encountered this many times by now, so let's just use it; here is the PDF of the Dirichlet again. The remaining two variables are the mean and the covariance of a Gaussian distribution, and we would like to do joint inference on both. The conjugate prior for this is the so-called Gauss-inverse-Wishart prior: a product, for every single cluster, of a Gaussian distribution over the unknown mean, with a hyperparameter m, the mean of this Gaussian, and a precision parameter beta scaling the unknown covariance, times a Wishart distribution over the inverse covariance, with another two parameters: W, a positive definite matrix, and nu, a scalar called the degrees of freedom. At first encounter this might seem like a very complicated distribution, but you have actually met the Wishart distribution on an exercise sheet, if you've done the theory exercises. I could at this point do a big detour to show that this is in fact the conjugate prior for this combination of unknown mean and covariance.
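Collecting all of that in one formula, the full generative model has roughly this shape. I am writing it in the standard Gauss-inverse-Wishart parameterization just described, with the hyperparameters shared across clusters as on the slides:

$$
p(x, z, \pi, \mu, \Sigma)
= \prod_{n=1}^{N}\prod_{k=1}^{K}\big(\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big)^{z_{nk}}
\cdot \mathcal{D}(\pi \mid \alpha)
\cdot \prod_{k=1}^{K}\mathcal{N}\!\big(\mu_k \mid m,\; \beta^{-1}\Sigma_k\big)\,\mathcal{W}\!\big(\Sigma_k^{-1} \mid W, \nu\big).
$$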
Maybe your tutors have already told you, in one of the tutorials, the story of William Sealy Gosset, the brewer for the Guinness company who essentially invented this combination of conjugate priors; the marginal distribution for x under this prior is named after his pseudonym, the Student-t distribution. But we're not going to do that here. It is maybe simpler, so that you can pay attention to the variational inference, if I just tell you that this combination, for every single cluster k separately, simply is the conjugate prior for observations drawn from a Gaussian distribution with unknown mean and unknown covariance. Well, actually there should be a subscript k here, so that you can learn a different covariance for every single cluster; let me just fix that. Here we go, this is our generative model. What we've just done is a very rapid-fire tour over stage two of the design process for a probabilistic model: we've essentially just written down a generative model by saying there are all these variables that I want to use to describe what's going on, and then assigned priors to them, standard exponential-family conjugate priors. So far we haven't done any variational inference yet; we've just turned a maximum-likelihood-type problem into a problem of full Bayesian inference. As we did that, we had to introduce those priors, and as always, those priors annoyingly have parameters, because those exponential families of course have natural parameters themselves, and they just show up: the Dirichlet distribution has a parameter called alpha, which contains pseudo-counts; the Gaussian prior on the mean introduces two new parameters, let's call them m and beta, which are the hyper-prior mean and hyper-prior precision or scale; and the Wishart distribution has two new parameters, W and nu. Here I have not indexed those hyperparameters with a subscript k for every single cluster. We could have done that; it would have made the graph more complicated. I'm just going to assume that we have one shared value of those hyperparameters for all the clusters. Maybe this is also not so bad, because when we want an algorithm that applies to a general data set, maybe we don't really have a reason to prefer one of the clusters over the others and say, a priori, that we expect one of the clusters to be larger or more central or to have a higher precision than another. So we can use the same hyperparameters for every single cluster. Now, our goal will not be to estimate those hyperparameters; instead we want to find the distribution over the new variables mu, pi and Sigma, and maybe also z, but z is really a nuisance variable, we mostly care about pi. What would be the standard way of doing that? Well, it would just be Bayesian inference: we multiply the prior by the likelihood and divide by the evidence to get a posterior over pi and mu and Sigma, where the evidence is this expression integrated over z and pi and mu and Sigma. But we can already see from the graph that this is not going to be easy. Why? First of all, even if we had a way of integrating out z (and we actually do, we know what the marginal looks like, it is this product over sums), the graph tells us that there is a collider structure: pi and mu and Sigma are all parents of x, so when we condition on x, the parents become conditionally dependent on each other.
And because we can also see that the likelihood, once you integrate out z, is not an exponential family, we cannot expect to have a conjugate prior for these observations x, or rather a joint conjugate prior for all the variables mu and pi and Sigma simultaneously given x, and we can't expect closed-form inference. So instead we'll want to construct an approximation, and we've decided to use the variational approximation. Let's remind ourselves how that is supposed to work. We would like to construct this posterior, which is intractable, so instead we will hope to construct an approximate distribution q over z and pi and mu and Sigma, and we would like this approximation to be as close as possible to the posterior, which is a conditional distribution given x. "Close", in the context of variational approximations, means that we want the KL divergence between q and p to be as small as possible, and we'll achieve that not by minimizing the KL divergence directly but by maximizing the ELBO, the lower bound on the evidence, where the gap between the lower bound and the evidence is exactly that KL divergence. So when we maximize this lower bound, we minimize the KL divergence. And as discussed in the abstract setting before, we are not going to impose in a hard form that q of z and pi and mu and Sigma should have a particular parametric shape; we're not going to say, I want q of pi to be a Dirichlet distribution, or q of mu and Sigma to have the form of a Gaussian or a Wishart distribution. Instead, the only thing we are going to impose is that the approximation factorizes in a certain way. A maximal thing to do would be to say that every single variable in here, every single datum's z n, every cluster's mu k and sigma k, and pi, should be independent of everything else under the approximation (well, not the individual entries of pi, that wouldn't make sense, but pi should be independent of mu, and mu independent of sigma, for every single cluster k). That would be a maximal factorization, the extreme form of what we might call mean field theory. But we're actually not going to do that; we'll try to get away with as few constraints as possible, because every additional constraint we put on q makes the lower bound, the ELBO, less tight. The KL divergence we can achieve with a more strictly constrained distribution will tend to be larger, further away from zero, so the approximation will have lower quality; at least we have to be wary of that. The only factorization I am going to impose is this one: I want a distribution over this whole set of variables such that z, under this approximation, is independent of the other parameters. Why would I do that? Maybe it is just an educated guess, but given that we've seen that this indicator variable z is really helpful for doing k-means and maximum likelihood inference on the parameters of the Gaussian mixture model, maybe this is the smart choice. Let's see whether it works; and indeed, it turns out to be exactly the right thing to do. Now that we've made this assumption, we can turn to our cooking recipe for how to compute variational approximations; that was a few slides ago, so let me go back to remind you.
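In our concrete case, the factorization we just chose means the recipe is going to produce two coupled updates, which I'll write down here so we know where we are heading:

$$
q(z, \pi, \mu, \Sigma) = q(z)\, q(\pi, \mu, \Sigma),
$$
$$
\log q^\star(z) = \mathbb{E}_{q(\pi,\mu,\Sigma)}\!\big[\log p(x, z, \pi, \mu, \Sigma)\big] + \mathrm{const.}, \qquad
\log q^\star(\pi, \mu, \Sigma) = \mathbb{E}_{q(z)}\!\big[\log p(x, z, \pi, \mu, \Sigma)\big] + \mathrm{const.}
$$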
The general expression on that slide says: to construct the mean field approximation, go through the individual subsets of variables which you assume to be independent. In the notation of that slide, the z now corresponds to our entire set of variables, so z and pi and mu and Sigma. For every subset, for every term in the factorization, consider the function log q star of z j, which is the logarithm of the probability density of our approximate distribution up to normalization, and that function should be given by the expectation of the log of the joint under all the other variables that are not currently being considered. It is a function of z j, and therefore we can interpret it as a log probability distribution. So let's see what this looks like in our concrete probabilistic generative model. We've decided to factorize into a distribution on z and a distribution on pi, mu and Sigma; let's first consider the distribution on z. This is our first subset of variables z j, and all the other ones are the z i with i unequal to j. We want log q star of z, the approximating distribution of this subset, to be equal, up to normalization constants, to the expected value under the other distribution, q of pi, mu and Sigma, of the log joint. Of course, at this point in the derivation we don't yet know what that approximate distribution q is, so we'll just have to assume that we'll find it at some later point, keep it around as an abstract object, and simply note that we will have to compute some expected values. Now we can look at the actual shape, the actual factorization properties, of our joint distribution. Let me go back one slide; here is the joint, and remember that we have to compute expected values of it under mu and pi and Sigma. Actually, this expression maybe isn't all that helpful yet; let me go one step further up. We know that the likelihood factorizes into a term that depends on pi and a term that depends only on mu and Sigma, and the prior also contains a term that depends only on pi and a joint prior on mu and Sigma. That means that, naturally, just from the structure of the generative model and without further assumptions, this expected value separates into a part that contains an expected value over pi of the log of p of z given pi, and an expected value, under the approximation on mu and Sigma, whatever it may be, of the log of p of x given z, mu and Sigma, plus constants. And now we plug in the actual form of those distributions. The p of z given pi, remember, is just a product over n and over k of pi k to the power z n k; let me show you, here it is. So if we take the logarithm of that (ah, I just went two slides forward, sorry, now we're back), the z n k comes down and we get z n k times the expected value, under q of pi, of the log of pi k; notice that z n k doesn't depend on pi, so it just moves outside of the expected value. The other term: those are the Gaussian distributions, a Gaussian over x n given mu k and sigma k, raised to the power z n k, so we can write it as a product over n and over k of the Gaussian of x n given mu k and sigma k, to the z n k. If we take the logarithm of that part, the two products again turn into a double sum, the z n k comes down as before, and we're left with the log of a Gaussian.
The log of a Gaussian is a quadratic form with a minus one half in front, plus one half times the log of the determinant of the inverse covariance matrix, plus constants; that constant would be minus one half times d, the dimensionality, times the log of two pi, but those constants don't matter, they are additive and we can push them into the overall constant. So now we know that the log of the approximating distribution we are looking for, as a function of z, looks like this. We notice that this expression here does not contain any z's; the z n k only shows up out front. That means the functional shape of our distribution on z, which is the exponential of this expression, will be a double product over n and k of some complicated expression raised to the power of z n k, and that is a discrete distribution, an independent discrete distribution, for every single datum, over the different clusters. We just have to give a name to that complicated expression, so we introduce a helper variable (an algorithmic variable, not a random variable, just something that will show up in our code) whose log is exactly this expression; we'll call it rho n k. It is not yet a probability, because a constant is missing, the normalization constant, but that is going to be trivial to find, because we know that this is a discrete distribution: for every single datum n we have a separate probability distribution that just has to normalize, so the sum over k should be one. We can simply enforce that by defining another variable r n k, which is rho n k divided by the sum over j of the entries rho n j. That tells us exactly what the probability distribution q over z is. Actually, this should be an equals sign here, let me just fix that; now we have an equals sign, because this really is a probability distribution, the r's sum to one, this really is our approximating distribution. So this is only the first step of constructing our approximation; we may not even have done the hard part yet, because we just have an approximating distribution over z, and we still need one over mu and pi and Sigma, the things we arguably care about more. But before we move on, there are already a few interesting things to point out about this approximation. The first is that we are reminded of, and are already using the notation r of, the responsibilities that we know from k-means, or from soft k-means, from EM on Gaussian mixture models. These are pretty much the same kind of variables, but the quantities that show up inside are different, so we're going to get an algorithm that is not exactly the same as before; it will be a bit more elaborate, but the structure is very similar. Our update for the distribution on z gives us a discrete distribution, which contains probabilities (what else could it contain), and those probabilities are computed in a form that now sits here, a form that is more complicated, and it will determine in which sense the probabilistic formulation of Gaussian mixture models is more useful, more powerful, than the EM-style update.
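In code, this normalization is all there is to the q(z) update once the log rho's have been assembled from the expectations we compute below; a small sketch, working in log space for numerical safety (the array names are mine):

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(log_rho):
    # log_rho: (N, K) array of the unnormalized log probabilities log rho_{nk},
    # assembled from the expected log weights, expected log determinants and
    # expected quadratic forms discussed in the lecture.
    log_r = log_rho - logsumexp(log_rho, axis=1, keepdims=True)  # normalize per datum
    return np.exp(log_r)  # rows sum to one: one discrete distribution q(z_n) per datum
```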
Another thing to observe is that even though we didn't make any assumption about q of z beyond wanting an approximate distribution for z, we didn't just get a discrete distribution; we got an independent discrete distribution, where every single datum has its own probability vector, with a product over n in front. This is of course convenient, because now we get a nicely factorizing, maybe even parallelizable, structure, and we'll keep that in mind, because a moment later we'll encounter a similar, maybe even more interesting, structure in mu and pi and Sigma. The final thing to notice at this point is that in the next step, when we complete the inner loop of the variational update, we will need to do the opposite thing: we'll have to compute an expected value under q of z of the joint distribution, and that will typically require the expected value of z, that is, which entry of z is one. Because this is the expected value of a discrete distribution, it is simply given by r n k, the probability for z n k to be one. So we can take this property of the distribution along into our next step. Now we consider the harder part: mu and pi and Sigma. I'll use the simple notation up here, but we'll come back to it in a moment; let me first look at what we actually have to compute. To take the other step, as I just said, we want a variational approximation over pi, mu and Sigma. We have not said that it should factorize further, we just want an approximation over pi, mu and Sigma, and to get it we have to compute the expected value, under the other distribution q of z, of the log joint. So we can go back and look at what that joint is; by now we probably also know it a little bit by heart, because we've talked about it so much. The joint is: a prior, the arrow coming down from above into the graph, over pi, which is a Dirichlet with parameter alpha, and we need the log of that; then a product in the joint over the individual components of the priors over mu k and sigma k, the log of that, and those are our Gauss-inverse-Wishart conjugate priors; then, in the likelihood, a product that in the log becomes a sum, over the probability of being in the individual cluster k given pi, which is a discrete distribution for z with probability vector pi; and finally, times (or plus, in the log) the probability of the individual datum given the cluster membership and the cluster parameters, which is the Gaussian distribution. We need the expected value of this object under our approximating distribution for z. Where does z show up? It doesn't show up in the two priors, only in the likelihood, here and there, so the two prior terms can be dragged outside of the expected value, and we are left with an expected value under q of z of the log of the probability of z given pi, plus an expected value under q of z of the log of the Gaussian probabilities. We notice that, because of the conditional independence structure in the generative process, we now have two separate terms in which we have to compute expected values. In the second of them I have already plugged in what we know: without the log, that probability is the product over n and k of the Gaussian raised to the power z n k.
If we take the log of that, we get another sum with z n k in front, and z n k is the only place where z shows up, so we can take the expected value directly inside the entire double sum, because we know that q of z is a discrete distribution over the z n k. And what are those expected values of z n k? Conveniently, we have already decided what they are; they showed up before: they are our r n k. So we know we are just going to get our responsibilities r n k here; we can plug them in, and we already know almost everything we need at this stage. There is also an expected value under q of z up here: the expected value, under q of z, of the log of p of z given pi. Let me go back up and find an expression for p of z given pi. Here it is: the log of p of z given pi is just z n k times the log of pi k (before, we took a different expected value of it), so its expected value just has the expected value of z n k in it, and we get one more r n k in here as well. So we will be able to compute this exact expression as a function of pi and mu and Sigma, and this update will tell us what to set those quantities to. Before we really plug in the actual values, though, let's first notice that something very convenient has happened when we look at the structure of this expression for our approximating distribution. To do that, I'll just copy the final result back up here; this is a copy of what we've just seen. What you may notice is that if you consider where in this expression we see pi and where we see mu and Sigma, then we can rearrange the terms: there is a term in pi here, and a term in pi over here, and pi does not show up in the other two terms. Those two terms only depend on mu and Sigma, and each of them has a sum over k in front. So what we are seeing is that even though we only asked for a factorization, in our approximating distribution q, between z and the block pi, mu, Sigma, the optimization step, finding the best possible approximation, has told us that even this weakly restricted optimal approximation has a further factorization: the resulting optimal distribution separates pi from mu and Sigma, and within the group of mu and Sigma it again provides independent distributions for each mu k, sigma k pair, for each cluster k. This is called an induced factorization. It is basically variational inference telling us that if you separate z from pi, mu and Sigma, then everything else becomes easy; it just separates into sub-distributions. Now, what are those distributions? Since they factorize, let's first consider the distribution on pi. To do that, we just pick out the terms in the sum up here in which pi shows up; that's this term and this term, and everything else is just a constant from the point of view of pi. Variational inference tells us to set log q star of pi to this expression plus a constant. The first part is the log of a Dirichlet prior; that is what the log of a Dirichlet looks like, up to constants. Remember, the Dirichlet is a normalization constant times a product over k of pi k to the power alpha k minus one, but here, for the prior, we've decided to use a scalar parameter alpha: every single cluster has the same prior pseudo-count.
So all the alphas can go outside of the sum. Plus this term here, which is the expected log of p of z given pi: remember that p of z given pi factorizes over all the data points, it is a product over n and k of pi k to the power z n k, so if we take the log of that we get a sum over n and k of z n k times log pi k, and we need to take the expected value of this under q of z. We know from before that the expected value of z n k under this discrete distribution is just r n k, the responsibility of cluster k for datum n. Now we can look at this expression and ask what kind of probability density function it is the logarithm of. It is a function of pi k, and we see that log pi k shows up several times, so we can rearrange the terms, take the sums outside, and get a sum over k of log pi k times alpha minus one plus the sum over n of r n k, and this is the logarithm of the distribution we're looking for, up to normalization. So the distribution will be proportional to the product over k of pi k raised to the power alpha minus one plus the sum over n of r n k, and that is, up to normalization, a Dirichlet distribution. Our optimal variational approximation to the distribution on pi, the one that gets as close as possible to the full posterior, is a Dirichlet distribution, with a parameter value that is the prior parameter plus the sum of the responsibilities over the data points: the pseudo-count of the number of data points softly assigned to cluster k. This is super convenient: we have just inferred an optimal distribution, and it turns out that this optimal distribution does not have some complicated intractable form; thanks to the factorization that we imposed, it has the form of one of our standard data types, the exponential family distribution for discrete probability distributions, the Dirichlet. Now, what is the corresponding result for the parameters mu and Sigma? Here I'll have to ask you to suspend disbelief a little; if you really wanted me to do this derivation it would take quite some time, and it is very tedious. But if you look up on Wikipedia what the Gauss-inverse-Wishart prior is, actually plug it in here, and check what kinds of expected values you have to compute (the expected values here are easy, because they are again just r n k), then you will notice that the log of this prior plus the log of the likelihood give a closed-form update which, very similarly to the Dirichlet above, shows that the log of the approximating distribution has the form of the log of another Gauss-inverse-Wishart distribution with updated parameters. The updates are the ones shown here, and if you have ever done conjugate prior inference on Gaussians with unknown mean and covariance, they will look very unsurprising, very much expected: they are the typical updates of the sufficient statistics under pseudo-observations, for observing a total of N k data points softly assigned to cluster k, with a sample mean given by x bar and a sample covariance given by this expression. This part may be confusing at first sight, but what is really happening is just that we are updating the parameters that define our data type, the Gauss-inverse-Wishart prior.
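Since I am not reproducing the slide's exact symbols here, the following sketch writes this half of the loop in the standard textbook (Bishop-style) parameterization, which is what such updates usually look like; all function and variable names are mine, and the slides may arrange things slightly differently:

```python
import numpy as np

def update_q_pi_mu_sigma(X, R, alpha0, beta0, m0, W0, nu0):
    """Given data X (N, D) and responsibilities R (N, K), update the Dirichlet
    parameter of q(pi) and the Gauss-inverse-Wishart parameters of q(mu_k, Sigma_k)."""
    N, D = X.shape
    K = R.shape[1]
    Nk = R.sum(axis=0) + 1e-10                    # soft counts per cluster
    xbar = (R.T @ X) / Nk[:, None]                # soft sample means
    alpha = alpha0 + Nk                           # Dirichlet pseudo-counts
    beta = beta0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        Sk = (R[:, k, None] * diff).T @ diff / Nk[k]   # soft scatter around xbar_k
        dm = (xbar[k] - m0)[:, None]
        W_inv = (np.linalg.inv(W0) + Nk[k] * Sk
                 + (beta0 * Nk[k] / (beta0 + Nk[k])) * (dm @ dm.T))
        W[k] = np.linalg.inv(W_inv)
    return alpha, beta, m, W, nu
```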
Now, are we done yet? Well, almost. Think about the structure: we wanted to build a loop that in every single iteration first estimates a q of z, then a q of pi, mu and Sigma, and then repeats. We have just written down how we get our q of pi, mu and Sigma: it factorizes into a Dirichlet distribution on pi and a product of Gauss-inverse-Wishart distributions on the mu k and sigma k. Now let's go back and check that we haven't forgotten anything for q of z. Ah, there was a complication: when we found our log q star of z, we only knew that it would be a discrete distribution, but we didn't know yet what the parameters of that discrete distribution are, because at that point we didn't know yet what our approximating distributions for pi and for mu and Sigma would be. To close the loop we now have to fill that in, and this is the interesting, the wonderful thing about these variational bounds: we could derive them while only abstractly assuming that we have an approximating distribution, and use the functional form to discover that it is a discrete distribution. Now that we know what our approximate distributions for pi and for mu and Sigma are, we can go back and ask what those quantities actually are. What you usually do in practice is look up some standard properties of the approximating distribution. I'll mostly do this for the Dirichlet, because it is easier; for the Gaussian term, it is really just the log of a Gaussian, something we can look up. So: we now know that our approximating distribution for pi is a Dirichlet distribution, and what we need is the expected value, under a Dirichlet, of the log of the variable that the Dirichlet is a distribution over. What is that? If in doubt, let's go to Wikipedia and check what we might find about the expected value of the log of pi. In the little info box on the side there is something about the entropy of the Dirichlet; that's promising, because the entropy is something like an expected value of a log, of the log pdf rather than of the variable itself, so let's go to the corresponding section, and lo and behold, there is an expression we can just read off: the expected value of the log of the individual entries of the Dirichlet is given by the difference of two things, and those things are called the digamma function. I actually have those on my corresponding slide as well; here we go. Here are some standard properties of the Dirichlet that you can look up; I literally copied them from Wikipedia, just using nicer symbols than Wikipedia does. The expected value, under a Dirichlet distribution, of the log of the variable it governs is given by this thing called the digamma function, which is the derivative of the log of the gamma function. You don't need to know how to implement it, because our nice toolboxes conveniently provide it for us; someone else has already done the hard work of implementing this function. It might seem like magic that we can do that, but you are also constantly using little library operations that evaluate exponential functions, or error functions, or various other non-trivial functions for which someone has found a really good floating-point approximation. In particular, there is a function in the scipy.special library that computes the digamma function for us.
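As a small sketch of what "look it up and let the library do the work" amounts to (the function names are mine; the formulas are the standard Dirichlet expectation just described, plus the analogous Wishart expectation discussed in a moment):

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(alpha):
    # E[log pi_k] under q(pi) = Dirichlet(alpha): psi(alpha_k) - psi(sum_j alpha_j)
    return digamma(alpha) - digamma(alpha.sum())

def expected_log_det_precision(W_k, nu_k):
    # E[log |Sigma_k^{-1}|] under the Wishart factor of q(mu_k, Sigma_k):
    # sum_i psi((nu_k + 1 - i) / 2) + D log 2 + log |W_k|, with D the dimensionality
    D = W_k.shape[0]
    return (digamma(0.5 * (nu_k - np.arange(D))).sum()
            + D * np.log(2.0) + np.linalg.slogdet(W_k)[1])
```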
The expected value of the log of pi d under a standard Dirichlet is the digamma function of the corresponding parameter entry alpha d, minus the digamma function of the sum over all the entries of the parameter vector. So we know what that expected value of log pi k, when we go back, is: it is just something a computer knows how to evaluate. Similarly, we need to evaluate the expected value of the log of a Gaussian under a Gauss-inverse-Wishart distribution, and again we can go to Wikipedia, or somewhere else, and look up the corresponding quantity: on the Wikipedia page for the Wishart distribution you will find, somewhere below, a somewhat complicated expected value which again uses a version of the digamma function, the multivariate digamma function, again something available in a toolbox. So we get to use standard tools provided by powerful libraries, first because someone has written those libraries for us, and second because we made the smart choice of using exponential family priors in our model. Here, again and again, as in Gaussian modelling, we get a handsome payoff in this generalization of probabilistic modelling for being willing to use standard probability distributions, exponential family distributions. If your question was ever why you should use exponential families, here is the answer: because it makes your life so much easier, even though of course they don't perfectly capture exactly what you want to say about the distribution. So our update for the responsibilities will involve these quantities, which we can just evaluate on a computer and which have a concrete form we can plug in. This closes our loop, and it allows us to build an algorithm which iteratively updates, in succession, our discrete distribution on z, our Dirichlet distribution on pi, and our Gauss-inverse-Wishart distribution on the parameters of the Gaussian components. If you implement this algorithm (and here I have to go a little quickly, because I also want to talk about the topic model), you get a method that estimates marginal probability distributions for all the parameters of our Gaussian mixture model: the mixture weights pi k and the individual parameters mu and Sigma of the Gaussian components. We can use those to also directly infer a posterior over z, that is, which data point belongs to which cluster. And because it is a probability distribution, it can fix some of the issues that the non-probabilistic, point-estimation version, the EM algorithm, has. One of them is that it can effectively figure out how many mixture components there actually are, because a Dirichlet distribution can be sparse: it can find solutions that put zero probability on some of the components, if we start with prior parameters alpha that have values less than one. To give you a quick idea of how this works (here is the update equation, which you can find in the slides), here are a few snapshots of this algorithm running on the Old Faithful data set that we encountered in the last two lectures.
The plots are based on code written by Ankatrin Schallkamp. Here you see the algorithm initialized with one, two, three, four, five, six, seven components; this is already two or three steps in, it started with completely random assignments, and the algorithm has now honed in and found some clusters. Then it notices that it is actually better, in terms of raising the ELBO, lowering the KL divergence, to assign marginal probability zero to some of those clusters and instead grow some of the others. As that happens, some of those clusters drop out; they just vanish from the approximation, and in the end the algorithm finds a clustering that contains only two clusters, which is intuitive, because if we look at the data set, we really see only two clusters. This approximating distribution finds, in expectation, two Gaussian clusters with a non-trivial covariance structure; they are aligned to the data set and therefore provide, arguably, a better description of what is going on here than the EM version. That is the advantage of a prior, and I haven't even shown you the full probabilistic output, which of course assigns a probability distribution to all of these parameters: if you had very few data points, the algorithm would in fact be very uncertain about all of those parameters as well. Impressive, no? Now you might say, thank god I don't have to implement this, because it seems really complicated, but hang on: there is something you will have to implement soon in your exercises, and that is the topic model. Can we construct a variational bound for the topic model as well? Let's not spend too much time on the Gaussian mixture model and instead have a look at our topic model, so that I can help you a little with that homework. The advantage of doing this, by the way, is that you get to see the very same structured derivation twice, where the only thing that changes is the model underneath; maybe this helps you quickly get the gist of what we are trying to do, without focusing too much on implementation details, simply by seeing two different examples. So here is our topic model again; we've now seen this slide several times. Just to remind you: we are talking about a corpus of documents that is considered to consist just of counts of words, a bag of words, where each word is assumed to be generated by assigning a discrete distribution over topics to each document, assigning a discrete distribution over words from a vocabulary to each topic, and then, for each word in each document, drawing a topic from the document's topic distribution and drawing a word from the topic identified by that assignment. We have looked several times, at length, at the corresponding generative model. While the graph I just showed you is maybe nice and clean, this is what it looks like in math: this is our joint distribution over the unknown word-topic assignments, the document-topic distributions, the topic-word distributions, and the actual words that we get to see. Only the words are data; all the other quantities are latent variables.
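For reference, that joint has roughly this shape. I am writing alpha for the parameter of the per-document Dirichlet prior used below, and beta as a stand-in for whatever parameter the topic-word Dirichlet prior carries on the slide:

$$
p(w, c, \pi, \theta)
= \prod_{d}\mathcal{D}(\pi_d \mid \alpha)\;\prod_{k}\mathcal{D}(\theta_k \mid \beta)\;\prod_{d}\prod_{i}\prod_{k}\big(\pi_{dk}\,\theta_{k, w_{di}}\big)^{c_{dik}},
$$

with $c_{dik}=1$ if word $i$ of document $d$ is assigned to topic $k$, and zero otherwise.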
The only class of algorithms we have for this model so far are sampling algorithms: the Gibbs sampling algorithms, where we iterate between sampling c and pi and theta. In fact, I pointed out two lectures ago that there is a simplification, a speed-up, where we can marginalize out theta and pi, the two quantities we actually care about, in order to sample all the c's jointly, in a collapsed Gibbs sampling fashion, and only afterwards consider the implied posterior distributions for theta and pi. The reason this works is a structure in this model that is not unlike the Gaussian mixture model: when we condition on c, the distribution on theta and pi is convenient, and when we condition on theta and pi, the distribution on c is convenient. So maybe this hints at a situation where we can once again hope that our variational approach will help, by imposing from the outside that we want to separate the way we construct approximations on c from the way we construct approximations on theta and pi, because then the conditional factorization we already saw for the Gibbs sampler might help us here again. So our assumption is going to be that we want an approximation for those unknown variables pi, theta and c that factorizes between c on one side and pi and theta on the other. As before, in the Gaussian mixture model, we will not make any further factorization assumptions; but, spoiler alert, we will discover that the optimal approximation factorizes further, another case of induced factorization. And as before, we will construct this approximation by variational inference, which means we are going to try to minimize the KL divergence between the factorizing approximation q and the true posterior over those unknown variables, not by explicitly minimizing the KL divergence, but by maximizing the evidence lower bound, the ELBO, minimizing the variational free energy. As we have now seen several times, this amounts not to a numerical process of optimizing an ELBO (which it can also be, but here it won't be) but to a closed-form update, which we find by once again following the central, fundamental equation of these mean field approximations: construct the two approximations iteratively, the approximation on c and the approximation on pi and theta, by computing the expected value of the log joint distribution under the other approximation. So we construct the approximation for c by computing an expected value of the log joint under the approximation for pi and theta, and then vice versa. Here is the joint, the whole thing, and we want the expected value of this expression as a function of c, because we are looking for an approximate distribution on c, taking expected values over pi and theta under an approximation that we don't yet know. To do that, we look at this expression and realize that if we take its log, there are going to be a bunch of terms in here and in here that do not contain c at all, so they only contribute constants and we can subsume them in the constant. What we are left with are only the inner bits, which in the joint are a product over d and i and k, over the documents, the words and the topics, of pi d k times theta k, w d i, raised to the power c d i k. If we take the logarithm of that, we get a lot of sums over d and i and k of c d i k times the logarithm of pi d k theta k, w d i. Oh, there is a c d i k missing here, let me just fix that; there we go, now it is correct. And this also tells us exactly what we need to do: if we take the expected value under whatever the approximation for pi and theta is, we can drag it inside the sums, onto the individual terms, the expected values of the log of pi d k theta k, w d i.
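Spelled out, the update for the assignment variables therefore reads (this also introduces the quantities that will get a name in a moment):

$$
\log q^\star(c)
= \sum_{d}\sum_{i}\sum_{k} c_{dik}\,
\underbrace{\mathbb{E}_{q(\pi,\theta)}\!\big[\log \pi_{dk} + \log \theta_{k, w_{di}}\big]}_{=:\;\log\gamma_{dik}}
\;+\;\mathrm{const.}
$$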
Notice that this is really at the core of why we introduce these factorizations: if we didn't have this factorizing assumption, then at this point we wouldn't be able to drag anything inside here, and we would be stuck. By imposing the factorization we are able to compute those expected values. Well, we don't actually know yet what they will be, but that is entirely analogous to the situation in the Gaussian mixture model; we just accept that there will be some numbers here. And as a function of c, we notice that this log probability distribution is a sum over d, i and k of c d i k times some number, plus a constant. If we take the exponential of that, we know that the distribution for the c d i k will be a set of discrete distributions factorizing over d and i, with probabilities that are normalized versions of these numbers, whatever they might be, raised to the power c d i k. So we can introduce a helpful notation: we will call those numbers log gamma d i k, and use them to define actual probabilities, analogous to the responsibilities in the Gaussian mixture model, which we will call gamma tilde d i k. These are just the normalized versions of the gamma d i k, where the normalization happens over the topics, because each of these assignments has to be a probability distribution over the topics for word i in document d. So that is the first part; this is like computing the approximating distribution for the cluster memberships in the Gaussian mixture model. Now we can move on and ask what the corresponding step is for our approximating distribution for theta and pi, given that we now know we have a discrete distribution over c. For that, the top part here is the same as before; I have just changed something down here: the optimal approximating distribution over theta and pi is an expected value of the log of this big thing up here, under our now-found approximating discrete distribution over the c d i. If we look at this expression and take its logarithm, we notice again, as in previous lectures, that there is a big part on the left that depends only on pi and a big part on the right that depends only on theta; they have no effect on each other, they are just multiplied together, so the log gives a big sum over here with just pi's (and of course c's) in it, and a big sum over there with just theta's. Now we plug in the actual form of the Dirichlet prior. Along the way we notice that there are products over d on both sides that we can take out, and a product over topics k on either side, so we can drag those out as well. Then the Dirichlet priors for each individual document show up; remember, the business end of the Dirichlet is a product over pi d k to the power alpha d k minus one, that is the bit over here, and then there is a normalization constant which doesn't matter, we push it into the constant. And in here we have another sum, over d and over k and over i, of log pi d k, with a c d i k in front. That sum over i we can collapse into the convenient pseudo-count notation n that we have already encountered previously.
Now we can move forward and ask what the corresponding step is for our approximating distribution over theta and pi, given that we now know we will have a discrete distribution over c. For that, the top part here is the same as before; I have just changed something down here: the optimal approximating distribution over theta and pi will be an expected value of the log of this big expression up here, under our just-found approximating discrete distribution over the assignments c_di. If we look at this expression and take its logarithm, we notice again, as in previous lectures, that there is a big part on the left that only depends on pi and a big part on the right that only depends on theta; they do not affect each other, they are simply multiplied together. So taking the logarithm gives a big sum over here with only pi's (and, of course, c's) in it, and a big sum over there with only theta's in it.

Now we plug in the actual form of the Dirichlet prior. Along the way we notice that there are products over d on both sides that we can take out, and a product over the topics on either side, so we can drag those out too. We first see that the Dirichlet priors for each individual document show up; remember, the business end of the Dirichlet is the product over pi_dk raised to the power alpha_dk minus one, that is the bit over here, and then there is a normalization constant which does not matter, we push it into the constant. In here we have another sum, over d, over k and over i, of log pi_dk, and in front of that log sits a c_dik. The sum over i we can collapse into the convenient notation with pseudo-counts n that we already encountered. Those pseudo-counts were really helpful for us in the Gibbs sampler, where we implemented them, and they will be helpful again; if you do variational inference it is also convenient that, once you have found a good data structure to deal with those counts, you can simply reuse it. On the right-hand side there is an entirely analogous term that we can reuse directly, just with a different sum: over here we need the pseudo-counts of how often any word in document d has been assigned to topic k, and over there we need a count of how often, across all documents, the word v has been assigned to topic k.

We need to take the expected value of this expression under those discrete distributions, so we need expected values of the counts, n_dkv if you like. Those are really convenient to compute, because they are expected values of sums of samples from discrete distributions. As you may convince yourself, the expected value of c_dik is just gamma tilde_dik, looking back at the previous slide, and the c's are independent of each other over d and over i (and, once we have normalized the gamma tildes, also over k). So we can compute the expected counts simply by summing over the gamma tildes, because the expected value of a sum is the sum of the expected values (and under our approximation the c's are in any case independent of each other). That means we can now stare at these expressions knowing what will come out: here we just get sums over pseudo-counts, that is, over gamma tildes, and we can try to figure out what kind of distribution this is. This is the final step in the inner part of these variational approximations, just as before for the Gaussian mixture model.

We see, first of all, that there is a separate term for pi and a separate term for theta: there is no pi in this sum and no theta in that sum, so we actually get two separate distributions, one over pi and one over theta. How convenient, some induced factorization again. Secondly, what kind of distributions are they? They are independent over documents, and within each document, over all the topics (these probabilities have to sum to one), there is a product over pi_dk raised to some parameter. Up to normalization that is again a Dirichlet distribution, because a Dirichlet is a constant times a product over k of pi_dk raised to some power, and that power is just the quantity we have already seen here, minus one. So we just plug in those parameter values: our approximating variational distribution for pi will be a Dirichlet distribution, and the approximating distribution for theta, which is independent of it because of the induced factorization, will also be a Dirichlet distribution, because it has the same functional form. It just has a somewhat more annoying parameter vector, because we have to sum over the gamma_dik in a way that keeps track of which word actually has which identity, so we need an indicator variable that tells us how to collapse our pseudo-counts, or rather our approximate probability distributions, accordingly.
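As a concrete illustration of this bookkeeping, here is a minimal NumPy sketch of the two Dirichlet updates, under assumptions of my own: gamma_tilde[d] is taken to be the (N_d, K) responsibility array of document d, words[d] its integer word ids, and alpha, beta scalar symmetric prior parameters; all names are hypothetical.

```python
import numpy as np

def update_dirichlets(gamma_tilde, words, alpha, beta, V):
    """Variational Dirichlet parameters for q(pi_d) and q(theta_k)."""
    K = gamma_tilde[0].shape[1]
    # q(pi_d) = Dirichlet(alpha + expected document-topic counts sum_i gamma~_dik)
    alpha_hat = np.stack([alpha + g.sum(axis=0) for g in gamma_tilde])   # (D, K)
    # q(theta_k) = Dirichlet(beta + expected topic-word counts
    #              sum_{d,i} gamma~_dik * 1[w_di = v])
    beta_hat = np.full((K, V), beta, dtype=float)
    for g, w in zip(gamma_tilde, words):
        for i, v in enumerate(w):
            beta_hat[:, v] += g[i]   # scatter word i's responsibilities onto word id v
    return alpha_hat, beta_hat
```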
You can maybe imagine that implementing this requires a little bit of NumPy foo, but in principle, once you have dealt with this array, you have a Dirichlet distribution again. So, very much analogously to the Gaussian mixture model, we now have an explicit approximation for theta and pi, given a discrete approximation on the c variables, the word-topic assignments; and these approximations are Dirichlet distributions, actually quite similar to the cluster membership assignments in the Gaussian mixture model. The only thing missing to close the loop, to be able to continue this process, is the bit we simply assumed to have, the induction assumption we made on the previous slide: we somehow have to be able to compute the expected value of the log of those two probability distributions, or rather of their sum, because the log of pi times theta is the log of pi plus the log of theta. So we have to be able to compute those expected values of logs under the approximating distributions, and as before, as in the Gaussian mixture model, we get to do that because we know what the expected value of the log of pi and theta is under a Dirichlet distribution. We can look it up and find, again, that the expected value of the log is a difference of two digamma functions.

That tells us how to write our variational inference algorithm. It is going to be a for loop around the whole thing (I will show you pseudocode in a moment) which, at each iteration, given some assignments gamma tilde_dik of the individual words to topics, computes the pseudo-counts, those n variables, collapses them over i, and obtains updates for our Dirichlet distributions on pi and, with slightly more work, also on theta. Given those, we can compute the expected values of the log values of pi and theta, and that is really the business end, using digamma functions, which are available in software libraries, to compute the new values of the gamma_dik, which we then normalize, and we are done. I have already plugged in the expressions with the digamma functions here. You may notice that a minus digamma of some other sum is missing, namely the sum over this bit; if you are wondering why, have a look at it afterwards for yourself, when you do the implementation in the corresponding homework, and convince yourself that we do not need that quantity because it drops into the normalization. So with that we can build our algorithm.
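In formulas, with hat-alpha and hat-beta as placeholder symbols for the parameters of the two variational Dirichlet distributions (my notation, not necessarily the lecture's), the required expectations and the resulting responsibility update read

```latex
\mathbb{E}_{q}[\log \pi_{dk}]
  = \psi(\hat{\alpha}_{dk}) - \psi\Big(\sum\nolimits_{k'} \hat{\alpha}_{dk'}\Big),
\qquad
\mathbb{E}_{q}[\log \theta_{kv}]
  = \psi(\hat{\beta}_{kv}) - \psi\Big(\sum\nolimits_{v'} \hat{\beta}_{kv'}\Big),
\qquad
\tilde{\gamma}_{dik} \propto
  \exp\!\Big(\psi(\hat{\alpha}_{dk}) + \psi(\hat{\beta}_{k,\,w_{di}})
             - \psi\Big(\sum\nolimits_{v} \hat{\beta}_{kv}\Big)\Big).
```

The term $-\psi\big(\sum_{k'}\hat{\alpha}_{dk'}\big)$ is omitted in the last expression because it is the same for every topic k, so it cancels in the normalization over k; that is exactly the dropped digamma term mentioned above.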
Now I want to make a quick detour, a very minor additional point before we are done. It may be a bit confusing to you, and that is totally fine, that what we are doing here is trying to maximize this object, this ELBO, this evidence lower bound, in order to minimize the KL divergence between the approximation and the posterior, and yet we have never actually computed this function itself. The very thing we are maximizing we have not even evaluated; we are just building updates to these approximating distributions q, and it is entirely implicit that we are actually raising the bound. That is correct as it stands, but if we want to, and I very much recommend that you do this as a unit-testing and bug-fixing exercise, we can in fact compute this evidence lower bound. Here is the corresponding computation.

To compute the evidence lower bound we need, and this is just its definition, let me go back to show it again: the evidence lower bound is the expected value, under the approximating distribution, of the log of the ratio between the joint and the approximating distribution. Another way of writing that is the expected value, under the approximation, of the log of the joint, plus the entropy, because the expected value of minus log q under q is just the entropy of q. For that we can plug in what we know: we need to compute an expected value under q of the log of the joint distribution, here is our log joint again, and add the entropy. For the entropy, things are particularly convenient, because we know there is a factorization between those distributions, and there is also an induced factorization between theta and pi. So we get completely factorizing entropies: entropies of Dirichlet distributions over theta for every topic k, of Dirichlet distributions over pi for every document d, and, for the word-topic assignments, of a discrete distribution for every single word in every single document separately; those are just sums of logs weighted by the corresponding gamma tildes. The terms up here require us to compute expected values, under those Dirichlet and discrete approximations, of the log of the joint. Remember that the log of the joint is just a big sum in which we can rearrange terms; the only bit where theta and pi show up is in here, and that is also the only bit where c shows up, hidden in those n's, which come down in front when we take the log. To compute those values, you can maybe imagine, without doing it explicitly, that we can look things up in those tables again, using what we know about the expected values of the n's and about the expected values of log pi and log theta, namely that they are differences of digamma functions. So we can actually compute this object in closed form, and it gives us a number.

We do not need this number to run variational inference, because the update loop itself already improves the bound, but we can compute these numbers and watch them change, first to convince ourselves that we are actually raising this lower bound, and maybe secondly to check convergence. We do not have to check convergence this way; we could also just check whether the parameters alpha and beta and gamma tilde of our approximate distributions still change, and if they do not change anymore, we could also declare convergence. But it is reassuring to know that we are actually raising a lower bound.
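In symbols, the quantity being tracked is (again in my shorthand, with H denoting entropy and the fully factorized q from above):

```latex
\mathcal{L}(q)
  \;=\; \mathbb{E}_{q}\!\left[\log p(w, c, \pi, \theta)\right] + \mathbb{H}[q],
\qquad
\mathbb{H}[q]
  \;=\; \sum_{d}\mathbb{H}\!\left[q(\pi_d)\right]
      + \sum_{k}\mathbb{H}\!\left[q(\theta_k)\right]
      - \sum_{d,i,k} \tilde{\gamma}_{dik}\log\tilde{\gamma}_{dik}.
```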
With that, here is, finally, our pseudocode for variational inference in our topic model, the latent Dirichlet allocation model. It has a structure you might recognize if you have built the Gibbs sampler, because it is a very similar kind of process; it is just that the individual operations are a little different: they do not involve sampling random numbers, but assigning pseudo-counts that get iteratively updated. We begin by assigning a random topic probability distribution to every single word in every single document and by setting the bound to minus infinity. Then, for every document and every topic, we update the document-topic distribution according to the update rule we just derived; then we update the topic-word distributions by summing over those sufficient statistics, as before; and then finally, and this is maybe the more complicated, computationally more demanding part, for every document, every word and every topic, we recompute this word's topic assignment probabilities by computing an expected value via a bunch of digamma functions, which come from a library, and normalizing across the topics. If we want to, at the end of each iteration of the loop we can compute the bound, and we iterate until this bound does not change anymore, until it has saturated. That bound is then a lower bound on the marginal likelihood of this model, the evidence, the probability of w under this model. We can do various things with this bound, by the way, but we will get to that in the next lecture.
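Purely as an illustrative sketch, and not the lecture's reference implementation, the pieces above might be pulled together into one loop roughly like this; all names are hypothetical, the bound itself is not computed here, and convergence is instead checked through the change of the variational parameters, which is the second of the two convergence checks mentioned above.

```python
import numpy as np
from scipy.special import digamma

def variational_lda(words, K, V, alpha=0.1, beta=0.01, iters=100, tol=1e-4):
    """Mean-field variational inference for LDA (illustrative sketch).

    words: list of integer arrays; words[d] holds the word ids of document d.
    Returns the variational Dirichlet parameters alpha_hat (D, K) and
    beta_hat (K, V), plus the responsibilities gamma_tilde (list of (N_d, K)).
    """
    rng = np.random.default_rng(0)
    D = len(words)
    # initialize a random topic distribution for every word of every document
    gamma_tilde = []
    for w in words:
        g = rng.random((len(w), K))
        gamma_tilde.append(g / g.sum(axis=1, keepdims=True))

    alpha_hat = np.zeros((D, K))
    for _ in range(iters):
        old = alpha_hat.copy()
        # update q(pi_d) and q(theta_k) from the expected pseudo-counts
        beta_hat = np.full((K, V), beta, dtype=float)
        for d, (g, w) in enumerate(zip(gamma_tilde, words)):
            alpha_hat[d] = alpha + g.sum(axis=0)
            for i, v in enumerate(w):
                beta_hat[:, v] += g[i]
        # expected log pi and log theta via digamma functions
        Elog_pi = digamma(alpha_hat) - digamma(alpha_hat.sum(1, keepdims=True))
        Elog_theta = digamma(beta_hat) - digamma(beta_hat.sum(1, keepdims=True))
        # recompute the responsibilities and normalize over the topics
        for d, w in enumerate(words):
            log_g = Elog_pi[d][None, :] + Elog_theta[:, w].T   # (N_d, K)
            log_g -= log_g.max(axis=1, keepdims=True)          # numerical stability
            g = np.exp(log_g)
            gamma_tilde[d] = g / g.sum(axis=1, keepdims=True)
        if np.max(np.abs(alpha_hat - old)) < tol:              # parameters saturated
            break
    return alpha_hat, beta_hat, gamma_tilde
```

One could run this on a toy corpus such as words = [np.array([0, 1, 2, 1]), np.array([3, 4, 3])] with K=2 and V=5 to see the updates in action.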
For today I think we should wrap it up here, and next lecture I will get to tell you some more fun, special material. What we finished today is the introduction of the final tool in our toolkit: variational inference, a powerful mathematical framework for constructing fully probabilistic approximations, that is, approximations that are themselves probability distributions. In many cases we do not have to impose a strict parametric form on those approximations; instead, we can find optimal approximations by imposing only a very mild requirement, namely that the approximating distribution should factorize in a certain way. We then find such approximations by iteratively increasing a lower bound on the evidence, thereby minimizing the KL divergence from the approximation to the posterior, and the actual update steps involve computing expected values of the log joint under the approximating distribution. An advantage of this approach over Monte Carlo is that it is an optimization process: a deterministic process that ends at some point and then gives us a full probability distribution. This can be much more efficient than Markov chain Monte Carlo, because we do not have to wait for a Markov chain to mix and we do not have to check that it mixes; it is very easy to watch for and notice convergence. However, as you have clearly seen, doing these derivations requires some work. It requires knowing your model very well, and more than anything else it requires very careful model design, so that the derivations later actually work out in your favor. Using carefully designed graphical models with carefully chosen exponential-family distributions really helps with those derivations; even then, they tend to be tedious, as you have now clearly seen. That is why there is an old joke in the community: even though Markov chain Monte Carlo methods may take a long time to converge and mix, you can try to derive your variational bound in the time it takes the Markov chain to mix. There is a hidden truth behind that joke, which is that MCMC methods are a good initial way of testing a model, because they tend to be relatively easy to implement, while variational methods tend to be the tool you might want to use in production, because they are more stable, can be made computationally more efficient, and are occasionally also easier to interpret and to work with. With that, our toolbox for this term is complete. Of course there are more algorithms that one might use, which I did not have time to cover in this comprehensive but nevertheless finite lecture course. Nevertheless, I think we now have a very powerful tool set, one that provides ample opportunity to build very explicit, expressive, powerful probabilistic models that quantify uncertainty in structured representations of knowledge. In the next two lectures we will finally think about a few more ways of improving on and deviating from those structures, and how to use them in practice in our topic model, but that is for the next lecture. Thank you very much for your attention, and see you next time.