Hello and welcome to probabilistic machine learning, lecture number 23. In the last two lectures we took a brief detour from the applied problem we are currently studying to look at a related but slightly different class of generative models called mixture models. We started that conversation with an even older algorithm, the clustering algorithm K-means, and we convinced ourselves that one way to think of K-means is as iteratively performing maximum likelihood inference in a Gaussian mixture model. In that model the data are assumed to be generated by several different Gaussian distributions, and each datum is assigned to one and only one of them. We then saw that the EM algorithm constructs a maximum likelihood estimate for this model: a point estimate for mu, Sigma and pi, the three variables that parameterize the model. It does so in an iterative fashion, where each iteration of the loop consists of two alternating steps.

We also understood in the last lecture why we have to do things this way: in this graphical model, even the likelihood, the conditional distribution of X given those three parameters, has a relatively complicated form. One way to see this is to note that, if you think of the parameters as variables, the observations create a collider structure, so observing the data induces a conditional dependence between mu, pi and Sigma. To deal with this, we introduced, in what at that point seemed like a rather ad hoc move, an additional latent helper variable z, which simply encodes the fact that each observation X is assigned to exactly one cluster. The reason this helps, and it is not immediately obvious from the graphical model, is that if we knew the value of this latent variable z, the resulting joint likelihood over mu, pi and Sigma would take a relatively simple factorizing form that we can optimize in closed form. I say this is not obvious from the graph because introducing z does not, of course, remove the collider structure. It is instead a property of this specific model: under this particular choice, pi becomes independent of mu and Sigma when we condition on z, and conversely mu and Sigma become independent of pi. An intuitive way of thinking about it is that the likelihood contains a sum inside a logarithm, which is not easy to optimize; by introducing z we can write the joint log likelihood as a product, and when we take the log, the products turn into one big sum instead of leaving us with a log over a sum. K-means really commits to this structure: it does not even have the parameter pi, it just assigns the indicator variables z, the responsibilities, in a hard fashion to every single datum.
What we realized in the last lecture is that we do not have to be this extreme: we can still get a tractable algorithm even if we keep the mixture weights pi in the model. The trick is to not assign the z values in a hard fashion, but to instead consider their predictive distribution given a particular point estimate for the parameters mu, pi and Sigma, which we subsume into a variable theta. We saw that we can compute these quantities, they show up naturally in our optimization problem, and we call them responsibilities. They are no longer binary variables but take the form of a probability distribution: numbers between 0 and 1 that sum to 1. Instead of setting z to a particular value and then maximizing over theta, we maximize the expected value of the complete-data log likelihood under this distribution over z. This gives us an additional sum, because for every datum there are now non-zero terms for every j, and that could in principle be a problem; but for this particular optimization problem it turns out to be fine, because each of the individual terms can be optimized easily, and a convex combination of them in this form does not make the optimization noticeably harder.

At the end of the last lecture we realized that this algorithm is just a specific instance of a more generic scheme that applies in this kind of setting, called the expectation maximization algorithm, because it does exactly this: it computes an expected value under a distribution that we can construct, and then maximizes the resulting expression as a function of its parameters. So we ended by phrasing the EM algorithm in generic form. It applies to situations in which we are trying to compute a maximum-likelihood-type estimate, where the maximum likelihood problem is hard to solve in closed form, because if we compute the gradient of this expression with respect to theta, we cannot analytically write down where its root is. The EM idea, and this is why it becomes a tool in our toolbox, is to ask whether we can invent a latent helper quantity, like the z in our case, the indicator of which datum belongs to which cluster, such that the expression would be easy to optimize if we knew z, and, ideally, remains easy to optimize when we take an expected value over the logarithm of the joint rather than the logarithm of the marginal. A slightly oversimplified way of phrasing this is: if only we could pull that sum out of the log, things would be easy, so let us see whether we can do that. If we can come up with such a structure, the EM algorithm consists of repeatedly iterating over two steps: first compute the predictive distribution for the latent quantity z given both the data and the current parameter estimates, and then maximize, hopefully in closed form (although we will see a little later today that it does not have to be closed form), the expected value of the complete-data log likelihood, the expression without the sum, under this predictive distribution.
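To make these two steps concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is only an illustration under the assumptions in the comments, with made-up function and variable names, not the reference implementation of the course:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # crude initialization: random means, shared variance, uniform weights
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
        log_r = np.log(pi) + norm.logpdf(x[:, None], mu, np.sqrt(var))
        log_r -= log_r.max(axis=1, keepdims=True)      # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood (closed form)
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N
    return mu, var, pi

# usage: x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 0.5, 300)])
#        mu, var, pi = em_gmm_1d(x, K=2)
```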
If we can do that, we repeat this process until convergence, and that is the EM algorithm. What we are going to do today is to take a closer look at this algorithm, here is the grey slide on which the last lecture ended, and we will first try to understand why it is actually a good idea, and why we can expect it to converge to a maximum likelihood solution, that is, to a local maximum of the actual marginal log likelihood, log p(x | theta), not of the complete-data log likelihood. Doing so will reveal a beautiful connection to a larger class of algorithms, and will provide us with a final, very powerful, sharp tool for our toolbox.

To understand why this algorithm is a good idea, let us first look again at this quantity, the predictive or conditional distribution of the latent, explanatory variables given the data and the parameters. I introduced this distribution in an arguably ad hoc fashion. It is perhaps the natural distribution to think about, what do we believe about the latent quantities given the data and the parameters, but beyond this nice interpretation we so far have no strong motivation for why this is exactly the right distribution to integrate over, to marginalize against. To see why it is, I will first show that no matter which probability distribution over z we put in here, maximizing the expected value under that distribution q is an interesting thing to do, because it raises a lower bound on the quantity we care about, the log likelihood; afterwards we will see why this particular choice is an especially good one. So let us consider an arbitrary probability distribution q(z) over our latent quantity z. In the specific case of EM we decided to use the predictive distribution, but for the argument that follows any probability distribution q(z) will do; we could call it an approximation to whatever we want to assign to z. First we will see that for any such approximation, the quantity we are maximizing, the expected value of the complete-data log likelihood under q, is in fact a lower bound on the quantity we want to maximize, the log likelihood. To see this, we proceed as follows: write down the quantity we want to maximize and introduce the latent quantity z; this is easy, we are always allowed to do it by the sum rule. Now assume that our approximation has the very mild property that it is non-zero wherever the joint is non-zero; then we are allowed to multiply by one inside the integral without accidentally dividing by zero. And then we apply a wonderful result, stated down here, called Jensen's inequality, which is widely used in analytical work well beyond machine learning. I will first tell you what it says, then we look at the theorem, and then we draw a picture of what it does.
The theorem says that if you exchange the order of the logarithm and the integral, that is, if you drag the log inside the expectation, you get a lower bound on the original quantity. So if we do this here, the resulting expression is a lower bound on the one above. Now, if we maximize this expression with respect to theta, notice that it is not quite what we are maximizing in EM, because there is this 1/q(z) inside; but that is not a problem, because q(z) does not depend on theta. It only adds the expected value of minus log q(z), the entropy of q, which is constant in theta. So when we maximize this expression, we are maximizing a lower bound on the log likelihood.

What does the theorem say again? It is called Jensen's inequality, and it states the following: take any probability measure over some probability space, in our case q(z) over the space of z, so the probability space consists of the space of z with q as the probability measure, that is our mu here; let g be a real-valued integrable function, in our case this ratio; and let phi be a convex function. We will actually use a concave function, the logarithm, but that is fine, it just flips the direction of the inequality. Then applying the convex function to the expected value of g is less than or equal to the expected value of the convex function applied to g. The picture for this: here is a random variable g with some distribution under mu, say the distribution looks like this, and here is our convex function. You can visually imagine that when we push this density through the convex function, the function changes more rapidly in the regions it maps to larger values, so under the change of measure more probability mass ends up concentrated at high values; as a consequence, the convex function evaluated at the mean lies below the mean of the transformed variable. In our case we apply a concave function, so everything is exactly the other way around, every plus turns into a minus and the less-than turns into a greater-than, and that is why we get a lower bound here.

So in EM, when we make the particular choice of setting q to this predictive distribution and then maximize the expected complete-data log likelihood, what we are doing, as we just saw, is increasing a lower bound on the log likelihood. You can picture a particular value of the marginal data log likelihood; what we construct in EM is a quantity that is below or equal to that value, and what EM then does is raise this quantity, push it up.
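As a quick sanity check, here is a tiny numerical illustration of Jensen's inequality for the concave logarithm (a sketch with made-up numbers standing in for q(z) and for the ratio p(x, z)/q(z)): the log of an expectation is always at least the expectation of the log.

```python
import numpy as np

# an arbitrary probability distribution q over a few discrete states z
q = np.array([0.2, 0.5, 0.3])
# arbitrary positive values g(z), standing in for p(x, z) / q(z)
g = np.array([0.7, 2.0, 5.0])

lhs = np.log(np.sum(q * g))     # log of the expectation
rhs = np.sum(q * np.log(g))     # expectation of the log
print(lhs, rhs, lhs >= rhs)     # Jensen (concave case): lhs >= rhs, prints True
```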
That certainly sounds like a good idea: if you push up a lower bound, then with a bit of luck you also push up the actual log likelihood. But of course there is no guarantee that this really happens, because we might just be increasing the lower bound without it ever touching the log likelihood. To make sure that we really increase the data log likelihood when we maximize the expected complete-data log likelihood, we will now see that the particular choice EM makes, setting q(z) to the predictive distribution of z given the data, is actually the optimal choice. It makes the bound tight, and if the bound is tight, then by raising the bound we also raise the data log likelihood.

To see this, here is another slide full of math. I have written down the lower bound again. I always have to be a bit careful when I talk about it, because it goes by two different names. In the context of EM, it is the expected complete-data log likelihood plus the entropy of q. We will shortly see, and let me say it now so that I do not confuse myself, that it is also a lower bound on the evidence, and that is why it is commonly called the evidence lower bound, the ELBO. I will get to that in a moment; if I occasionally slip and call it the expectation lower bound later on, this is why.

So let us play around with this quantity a little. We have the joint up here and we are integrating over z, so it is a good idea to split the joint into a factor that depends on z and one that does not. We can do this with the product rule of probability theory, as I have done here, writing the joint as the predictive distribution for z given x times the marginal for x. You can already see that this is the quantity we will eventually use, so it is very convenient that it shows up here. Now we have one expression that involves z and one that does not. Using the functional equation of the logarithm we can split them: we get the expected value under q of the log ratio between the predictive distribution and q, plus a term that does not depend on z, which we can drag outside the integral; the remaining integral is just the integral over q, and since q is a probability distribution, that integral is one. Now we simply rearrange the equation: drag this term to the left-hand side and keep the rest on the other side. We have just seen that the thing we are trying to maximize, the log likelihood as a function of theta, can be written as the lower bound we just introduced, minus the expected value under q of the log ratio between p(z | x, theta) and the approximation q(z). You may notice that this is an expression we have encountered before, in lecture 15 on exponential families, and back then I pointed out that this quantity, including the minus sign, is called the Kullback-Leibler divergence between q and p(z | x, theta).
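In symbols, and only as a transcription of the argument just made (generic notation, nothing beyond what was said above), the decomposition reads:

```latex
\log p(x \mid \theta)
  \;=\;
  \underbrace{\int q(z)\,\log\frac{p(x,z\mid\theta)}{q(z)}\,\mathrm{d}z}_{\mathcal{L}(q,\theta)\ \text{(the ELBO)}}
  \;+\;
  \underbrace{\left(-\int q(z)\,\log\frac{p(z\mid x,\theta)}{q(z)}\,\mathrm{d}z\right)}_{\mathrm{KL}\left(q(z)\,\|\,p(z\mid x,\theta)\right)\ \geq\ 0}
```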
If you go back to that lecture and remind yourself, you will find that it is called a divergence because it satisfies two important properties. First, this object is always greater than or equal to zero, a non-negative quantity; this relationship is known as Gibbs' inequality, and it is of course just a reaffirmation of what we already know, namely that the one term is a lower bound on the other, so the distance between them must be non-negative. Second, and in this context this is the more important statement, the quantity is zero if and only if q and p are equal to each other almost everywhere, mu-almost everywhere, where mu is the base measure over z. That means that in the specific case of EM, even though we did not realize we were doing it at the time, when we make the concrete choice to construct an approximation q that is equal to this predictive distribution, we are effectively setting this term in our equation to zero, because we are setting q equal to p. And that means that this quantity, the evidence lower bound, the expected complete-data log likelihood plus entropy, is then not merely a lower bound: it is exactly equal to the log likelihood at the particular location theta that we are currently considering.

So here is another visual story of what happens in EM. We are trying to maximize the log likelihood. Over the past few minutes we realized that we can write this log likelihood as the sum of a lower bound and a necessarily non-negative number, the KL divergence between q and p(z | x, theta); here are these two quantities. In the E-step, when we construct the predictive distribution for z given x and theta, we set the lower bound equal to the marginal log likelihood at the current theta; we make the bound tight. In the M-step, when we maximize this expression with respect to theta, we raise the bound by some amount, and because it is a lower bound on the log likelihood, and because at the point where we started the log likelihood was tightly bound, we necessarily raise the log likelihood at the same time. But because we have changed theta, at the new point, theta i plus one if you like, the bound is not necessarily tight anymore: there is now a new value of p(z | x, theta) at this new theta, while our previous approximation was p(z | x, theta i), the old one, so the gap is no longer necessarily zero. When EM now goes back and resets q to the new posterior p, the bound becomes tight again, and the process can repeat over and over.
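Putting the two steps together, this monotonicity argument can be written as a short chain of inequalities (a compact restatement of the story above, writing the E-step choice as q at iteration i equal to the posterior at theta i):

```latex
\log p\!\left(x \mid \theta^{(i+1)}\right)
  \;\ge\; \mathcal{L}\!\left(q^{(i)},\,\theta^{(i+1)}\right)   % the ELBO is always a lower bound
  \;\ge\; \mathcal{L}\!\left(q^{(i)},\,\theta^{(i)}\right)     % the M-step maximizes over theta
  \;=\;   \log p\!\left(x \mid \theta^{(i)}\right)             % the E-step made the bound tight
```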
To really drive the point home, let me rephrase what I just said on the slide I have been using as the EM summary. We can think of EM as an algorithm that iterates between computing a posterior distribution over some latent quantities and optimizing the model that produces this posterior; from that point of view it is simultaneously doing Bayesian inference and something like hierarchical Bayesian inference, and we can think of this quantity as an evidence for the model. Consider the posterior over z given theta: in that application of Bayes' theorem, the denominator is exactly this term, and without the log, that is the evidence. So what EM does is iterate between computing the actual posterior for z in this model, given the particular model parameters, which we can think of as the choice that sets the KL divergence between an approximation for z and the actual posterior to zero (almost a tautology, since the KL divergence is zero whenever the two distributions are the same), and then computing this interesting object L, which from this point of view can be called the evidence lower bound, and which is also, up to a constant, the expected value of the complete-data log likelihood. That lower bound we then raise, optimize, in the M-step, to increase the quality of the model, to fit theta better. So this really is an iteration of Bayesian inference of the kind we have discussed several times before: there is a model parameterized by theta, it has latent variables z, we compute the posterior over them, which essentially requires computing the evidence, then we raise the evidence, then we compute the posterior again.

At this point I hope you have a better grasp of what EM is. We will add EM to our toolbox, but not without noting a few interesting aspects, three insights that are useful if we want to make EM a tool we use in various applications. The first is that whenever we apply EM, it is once again a really good idea to have exponential families lying around in the probabilistic model we design. Why? Because they tend to make the optimization easy. We already saw this in the Gaussian mixture model: when we introduced z in this particular way and used Gaussians to represent the clusters and a discrete distribution for pi, both of which are exponential family distributions, the optimization turned out to be particularly convenient. This may have seemed like a lucky coincidence, but there is something more structural behind it: the decision to use exponential families to represent the distribution over the data is usually a good one in EM, not just in Gaussian mixture models. To see why, think of a generative model in which the joint distribution of x and z is an exponential family parameterized by theta. Then our evidence lower bound is the expected value under q of the log of this exponential family, plus a term that does not depend on theta. Because we take the logarithm, the exponential goes away, and, this is the beauty of exponential families all over again, we are left with an expression in which theta enters linearly in the term where x and z show up. The expectation under q does not affect theta at all, so theta can be pulled outside the integral, and what remains is an expected value under q of the sufficient statistics of our exponential family. That by itself does not yet guarantee that we can solve the optimization problem in closed form, but it often leads to optimization problems that can be solved in closed form.
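Written out, a hedged sketch in generic exponential-family notation (sufficient statistics phi, base measure h, and log-partition function log Z; these symbols are assumptions of the sketch, not the notation of the slides), the M-step objective becomes:

```latex
p(x, z \mid \theta) \;=\; h(x, z)\,\exp\!\big(\theta^{\top}\phi(x, z) \;-\; \log Z(\theta)\big)
\quad\Longrightarrow\quad
\mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right]
  \;=\; \theta^{\top}\,\mathbb{E}_{q(z)}\!\left[\phi(x, z)\right] \;-\; \log Z(\theta) \;+\; \text{const.}
```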
Why? Because if you take the gradient of this expression with respect to theta, the linear term simply contributes the expected sufficient statistics, a quantity we may well be lucky enough to be able to compute, and then the only thing that can hinder our progress is the shape of the gradient of the log normalization constant. For many practically used exponential families, and we have seen this for Gaussians and for discrete distributions, those gradients have a relatively elegant form in the natural parameters theta, and the resulting equation can be solved in closed form. So, long story short, once more: when you are building your graphical model, when you are writing down the generative process for the data you want to model, it is generally a good idea to use exponential families as the base types, as the first attempt at parameterizing your probability distributions, because it tends to make your life easy.

A second insight, which also extends how powerful EM is, is that we can use EM not just to maximize the likelihood of x given theta, but also to construct a maximum a posteriori (MAP) estimate for theta given x. Why? Because everything operates on the log, and when we add a prior, it simply runs all the way through. That may be too high-level, so here is what I mean. Suppose we care about the log posterior. We can take the evidence lower bound, the expression we are maximizing, and instead of the complete-data log likelihood for x and z consider a complete-data log posterior, if you like: we simply multiply by a prior over theta, which gives the posterior up to normalization, and the normalization does not depend on theta. Now think about what happens: there is an integral over z, and it does not affect the prior, because the prior enters as a log term, so we can write it as plus log p(theta) times the integral of q(z) over z, which is one, since q is a probability distribution. So we can just as well maximize this expression. By "just as well" I mean that if you already know how to compute the gradient of the ELBO, you get to choose a prior over theta, add its gradient to whatever gradient you have already computed, and follow that. When you do so, you are effectively maximizing the log likelihood plus the log prior, that is, the log posterior. So if you are having problems with EM, if it does not work in practice, the most common issue is that the likelihood itself is not well behaved: it has pathologies, it diverges, it is not properly regularized, or it does not have a unique maximum. In that case you can simply add a log prior, and you are then essentially constructing a MAP estimate. Adding the prior is a really simple thing, because you only have to be able to write it down as a function and compute its gradient; you do not have to perform any integrals. And by the way, this does not invalidate the previous point about exponential families.
If you are using exponential family distributions in the hope of getting a tractable optimization problem, you might still end up with maximum likelihood estimates at awkward points, but then there is also a very natural prior to choose: the conjugate prior of your exponential family, which, as you know from the lecture on exponential families, always exists. That conjugate prior might not be completely tractable as a probability distribution, but it does not have to be, because you only need its logarithm, not its normalization, if all you are computing is a MAP estimate.

But what do you do if you somehow end up with a model where this optimization is not tractable, where you cannot compute the exact maximum of the likelihood, or of the posterior, or of the ELBO, in this fashion? It turns out that you can still use EM: you can simply optimize the evidence lower bound numerically. Why is this a good idea? There is a final slide on this. Notice that the KL divergence, the difference between our lower bound and the quantity we actually care about, is minimized when q equals p. So if we optimize the evidence lower bound, this curly L, then at any point we are doing one of two things, or perhaps both simultaneously: we are either making the KL divergence smaller or we are increasing the log likelihood. If the former happens, we eventually reach the minimum of the KL divergence, where its gradient is zero, and at that point the gradient of the lower bound equals the gradient of the log likelihood; by continuing to follow it we are then necessarily increasing the actual log likelihood. So what you can do is simply compute this quantity, the lower bound, compute its gradient by automatic differentiation, and follow that gradient. If you do, sooner or later you are bound to also increase the log likelihood, and once the optimization terminates, once you have found a root of that gradient, you are necessarily at a mode of this expression.
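Here is a minimal sketch of what such a "generalized" M-step could look like. Everything in it is an assumption for illustration: the callables expected_cdll and log_prior are hypothetical placeholders for the model-specific expected complete-data log likelihood (with responsibilities held fixed) and an optional log prior (for the MAP variant above), and a finite-difference gradient stands in for automatic differentiation.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-6):
    """Finite-difference gradient; stands in for automatic differentiation."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def generalized_m_step(expected_cdll, theta, log_prior=None, lr=1e-2, n_steps=50):
    """Gradient-ascent M-step on E_q[log p(x, z | theta)] (+ log prior for MAP-EM).

    expected_cdll: callable theta -> expected complete-data log likelihood
                   under the current (fixed) responsibilities.
    log_prior:     optional callable theta -> log p(theta).
    """
    objective = (lambda t: expected_cdll(t) + log_prior(t)) if log_prior else expected_cdll
    for _ in range(n_steps):
        theta = theta + lr * numerical_grad(objective, theta)
    return theta

# toy usage: generalized_m_step(lambda t: -np.sum((t - 1.0) ** 2), np.zeros(3))
```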
And with this we have a new tool in our toolbox, the EM algorithm. It is a tool you might want to whip out when you are faced with a maximum-likelihood-type problem, where you want to find estimates theta that maximize the likelihood of some model, and this maximization is not of a particularly simple form, so you cannot just write down the likelihood and maximize it in closed form. Instead, we introduce a set of latent variables such that we are able to compute the posterior over these variables given the current values of our parameter estimates. Notice that, in contrast to how we usually do Bayesian inference, the variables z that we introduce here are, at least for the purposes of EM, purely a matter of algorithmic convenience; they are not necessarily the quantities we actually care about inferring. Arguably, what we care about inferring are the values of theta, the parameters. In the Gaussian mixture model, for example, z were the cluster membership variables, the assignments of data to clusters; those were not really the quantities we cared about. What we really cared about were the shapes of the clusters, and those live in theta. Nevertheless, we found that introducing this nuisance variable, if you like, is really helpful for maximum likelihood inference, because we can compute a posterior over it. The seemingly tautological statement to make is that this posterior minimizes the KL divergence to the current posterior; it provides the approximate distribution for z with the smallest KL divergence. That may be tautological, but what it also means is that this choice maximizes, or rather tightens, a lower bound on the log likelihood at the particular theta we are currently at. By raising this lower bound we then also necessarily raise the log likelihood, because the bound is tight; and if we then go back and recompute the posterior at the new value of theta, the bound becomes tight again and we can raise the log likelihood once more. On the slide I have actually called this quantity both a lower bound and an expectation; that is perhaps a typo, but it is also perhaps exactly the right choice: it is a lower bound, and it is an expectation under q, where q is the posterior for z given x. That is why the algorithm is called the expectation maximization algorithm: it iterates between computing an expected value under the posterior for z and maximizing that expression with respect to theta.

So we add this tool to our toolbox. It is a piece of machinery we can use to perform maximum likelihood inference in complicated generative models, but it is not entirely straightforward: to use it, we have to look very carefully at the model and come up with latent quantities z that make the optimization easy. It is not the kind of thing that is easy to automate; you typically have to provide a lot of structure to the algorithm to get it right, so it is again an area where a skilful human operator is key to achieving good performance. At this point we could spend some time implementing EM for various models, but I will not do that. Instead I want to skip ahead, because I really want to complete our toolbox with one additional tool, and the convenient thing is that this additional tool, which will be called variational approximations, is a direct generalization of the EM algorithm, at least from the point of view we have now taken. It is therefore relatively easy to add; it is just another small thought on top of what we have already done today. We will do that now, but maybe this is a good point for you to take a quick break before we continue.

Okay, I hope you have had your break. To get to this final tool in our toolbox, here is the thought process. We encountered the EM algorithm as a way of finding maximum likelihood estimates for models that involve some parameters theta and some data x. We did this by adding latent variables z which we did not actually care about, we used them purely for numerical convenience, as nuisance parameters, and we set the approximation to the actual posterior over those nuisance parameters. Now, a maximum likelihood estimate is an interesting thing to construct, but ideally, of course, we want Bayesian estimates for everything: we would like a posterior over theta, not just a maximum a posteriori estimate, which, as I pointed out, EM can also give us. Ideally we would like a probability measure over theta as well. So let us subsume theta into z: we simply consider theta part of the variable z, and our goal is now not to compute a maximum likelihood or maximum a posteriori estimate, but an actual posterior over z, which now includes theta. Unfortunately, there is a price to pay.
The price is that, in general, we will no longer be able to compute the posterior over z given x, because the model will simply be too complicated for that. For example, in our Gaussian mixture model it was clear that we cannot compute in closed form a full, exact posterior over all the cluster distributions and their assignments to the data, because that posterior would not have a simple parametric form. This challenge, the fact that we cannot compute the true posterior, was the original reason we turned to EM in the first place. It turns out, though, that we can use the very same mechanism we just used to maximize the likelihood to instead construct an approximate posterior over those latent variables. Let me go back to the EM picture. In EM we said: let us repeatedly construct a distribution q over z and set it equal to the posterior, because then this bound is tight, and then maximize this function as a function of the model parameters theta to increase the evidence, the marginal likelihood, of the model. We can now take the exact same expression and use the same mechanism to find a good approximation q, given that the full posterior is intractable. That is, we keep the model fixed, we do not change theta, the parameters of the model. Then this L is still a lower bound, but on a constant, and the distance between the lower bound and that constant is the KL divergence between the distribution q we are working with and the true posterior over all the variables z given x. So what we can do is come up with some way of introducing a family, a space, of approximate distributions q, and again maximize this lower bound. This maximization will now be the key part of the algorithm, the actual effort; it will not be a simple, straightforward step that we just write down. And when we maximize the lower bound, since the left-hand side is a constant, we make this gap as small as possible, which means we minimize the KL divergence between the approximation and the true posterior. That is exactly what we want: even though we are not actually raising the red bar, the q that comes out of this process will be as close as possible to the true posterior in the sense of this KL divergence.
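With the model held fixed, the same decomposition as before can simply be read the other way around (again just a transcription of the argument in generic notation):

```latex
\log p(x) \;=\; \mathcal{L}(q) \;+\; \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right) \;=\; \text{const.}
\qquad\Longrightarrow\qquad
\arg\max_{q}\;\mathcal{L}(q) \;=\; \arg\min_{q}\;\mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)
```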
This process, this idea, is called variational inference in machine learning, or, at least historically, minimizing the variational free energy. Why? Before we get into the details of the algorithm, let me make a brief historical and technical connection to where it comes from; let me just find the right slide, and here it is. The whole idea hinges on the insight we have been using throughout this lecture, but which I will state again because we are now shifting focus a little: it is the same equation as before, we are just thinking about it in a different way. We have realized that the logarithm of any probability distribution p(x) over some variable x, which we can think of as data, can be decomposed into two terms once we introduce an arbitrary distribution q(z) over some latent quantities z that are assumed to be related to x through a generative model p(x, z). The decomposition states that we can write the log of p(x), which is the evidence in this model, as the sum of the KL divergence between q(z) and the posterior over z under the generative model, and this lower bound, the evidence lower bound, the ELBO, so called because it is a lower bound on the evidence. We can think of q as a model that we are constructing, and what we do when we maximize the lower bound is minimize the KL divergence, pulling our model q as close as possible to the true posterior it really ought to be. We could also define this quantity L with a minus sign in front; then it becomes something to minimize rather than maximize, which is perhaps more convenient for the framework of optimization, and it is also the way physicists typically phrase this kind of process. The negative of this lower bound, which is then an upper bound, is the quantity associated with the term variational free energy.

The reason for this name, as far as I understand it, is a deep, old connection to modelling in physics. Going back essentially all the way to Newton, physicists have used the word energy for something that is minimized, something that is dissipated by a system: energy is naturally lost until the system settles into a state of lowest energy. Historically this was used for simple mechanical models, but in the late nineteenth and early twentieth century, when physicists increasingly began to model complicated systems, thermodynamic, fluid-dynamic and solid-state systems, they used energy as a representation of the quality of their model. They faced situations in which they had to describe the behaviour of something far too complicated to work out on a piece of paper, and of course in the late nineteenth century nobody had computers, so what physicists did was build statistical models of the systems they were dealing with and describe their behaviour as minimizing some energy. Today we can connect this to the idea of finding an approximation that gets as close as possible to the true posterior over the states of the system, to the statistics of the system, the state of knowledge we would have if we knew everything about the system and could really write it down. So the idea of free energy has, I think, changed its role: from a law of nature, the statement that systems always minimize their energy, to a word that describes the mismatch between reality, or the true posterior under some generative model, and an approximation that can be constructed in a tractable fashion. Across the timeline of physics, various great physicists, Helmholtz, Gibbs, Boltzmann, all names associated with thermodynamics and statistical physics, came up with models that describe the behaviour of a thermodynamic system in terms of a few simple basic statistics: the potential energy under some macroscopic, mean-field description of how the particles interact with each other, the entropy of the system, or the relationship between its pressure and volume.
They then said: the actual state the system settles into is the one that minimizes the free energy, which means it minimizes the difference between what really happens and what our model describes as happening. In machine learning, which is perhaps the contemporary continuation of the modelling process that physicists started more than two hundred years ago, we use a very similar idea. We can say that there is a true posterior over the quantities in the world that we would like to compute, but we cannot describe it with our computational tools, so instead we find an approximation that gets as close as possible to the true posterior in the sense of minimizing the KL divergence between the approximation and the true posterior, which is equivalent to maximizing the lower bound. Here I am showing David Blei as a placeholder for contemporary machine learning researchers. He is arguably not the inventor of variational inference as such; it is very difficult to pin that down to one person, and other names we could certainly mention are Chris Bishop and David MacKay and various others, and I am sure Michael Jordan was also involved. But I believe Blei is probably the inventor of the term ELBO, the evidence lower bound, so I will use him as a placeholder and hope that is fine. The reason I am showing this slide is that I believe there is a deep connection between what we do in machine learning these days, in particular in probabilistic machine learning, and the work of physicists in the past: we build models of the world and try to tune them so that they predict reality as well as possible, and we do that by constructing probabilistic generative models and then inventing numerical algorithms that find approximate posteriors or point estimates for them. The difference between machine learning and the kind of physics these fellows did is that we have access to massive computational power that they did not have historically, and we can use it to build much more powerful, much more elaborate models.

This may have seemed like a somewhat chaotic historical detour, and maybe it was, but the connection to physics will actually stay with us for the remainder of this lecture. For the moment, let us go back to what we are trying to do, and let me summarize the idea I have proposed. In our toolbox we have already collected various algorithms that, numerically and approximately, give us an estimate of, or something related to, the true posterior arising from a generative model and some data. What we are trying to build here is a new such approximation, another one, and it is based on the following insight: if we have a generative model that connects some latent quantity z to some data x, then there is an algebraic trick that allows us to rewrite the problem of minimizing the KL divergence between any approximation q(z), a probability distribution over the latent variable z, and the true posterior for z given x under the generative model, as an equivalent maximization problem in which, instead of minimizing the KL divergence, we maximize this evidence lower bound, the ELBO.
We know from the case of EM that this evidence lower bound might be easier to compute under q than the KL divergence itself, because it involves not the posterior over z but the log joint, the logarithm of p(x, z), and maybe under our q that is an integral we can actually solve. So one thing we could do, a first idea, would be to make a hard decision about which approximating q we are going to use. For example, I could say that q(z) should be a Gaussian distribution, with a mean and a covariance, maybe even with the covariance fixed, something really crude, and then just hope that I am lucky enough to be able to compute this evidence lower bound, the expected value under this Gaussian approximation of the logarithm of p(x, z). If I can compute it as a function of the parameters of my Gaussian, the mean and the covariance or whatever parameterization I choose, then I can compute a gradient with respect to those parameters and optimize. This idea exists, and it has a name: it is called parametric variational inference. But we should not adopt it just yet; that would be a bit of a knee-jerk reaction and we would be getting ahead of ourselves. Imposing an explicit structure on the approximation q, deciding that it is Gaussian for example, is actually not always needed. It turns out that it is sometimes possible to find an approximation over a large space of probability distributions without explicitly stating what that family is in terms of a set of parameters, and this idea is based on a really beautiful framework called the calculus of variations. It is one of the grand ideas of applied mathematics that you do not learn about in school; it is arguably as important as differential calculus, but it is rarely taught. It is connected, again, to a number of physicists, so I told you the physicists would stay with us: Leonhard Euler, Lagrange, and even Feynman, who arguably got his Nobel Prize for using it in the development of quantum field theory. I am mostly showing these faces to make the point that this idea goes back a long time, that it is a very powerful idea that has survived over the centuries, and the fact that it is still relevant today probably means it will stay relevant for at least a few more decades, long enough for your career, maybe. Instead of giving a big foundational introduction, we are going to use this idea directly in the context of our machine learning problems, and it goes like this; let me go to the right slide. We have our generative model p(x, z), and we want to construct an approximation q(z) that maximizes the ELBO, or equivalently minimizes the KL divergence to the true posterior for z given x under our generative model. If we impose no constraints at all on what q might be, then we actually know exactly what the best fit is: it is q(z) set to exactly the posterior p(z | x). But we assume we cannot compute that; if we could, we would not need to have this conversation in the first place. One thing we could do is impose that q(z) has a particular parametric form, that it depends on a set of numbers, for example that it is a Gaussian distribution, but that is too extreme a step. Instead, it turns out that we can place a vaguer form of restriction on this unknown distribution q and still get a closed-form, tractable answer out.
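Before we do that, and purely for contrast, here is what the fully parametric alternative mentioned a moment ago would look like if one did commit to a Gaussian q with mean m and covariance S (a sketch under that assumption, in generic notation; the second term is the Gaussian entropy):

```latex
\mathcal{L}(m, S)
  \;=\; \mathbb{E}_{z \sim \mathcal{N}(m,\,S)}\!\left[\log p(x, z)\right]
        \;+\; \tfrac{1}{2}\log\det\!\left(2\pi e\, S\right),
\qquad
(m^{*}, S^{*}) \;=\; \arg\max_{m,\,S}\;\mathcal{L}(m, S).
```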
A really powerful way of imposing such a restriction is to require only that q factorizes in a particular fashion. So we assume that there is a q(z) we are looking for, and we want it to minimize the KL divergence to the true posterior; we do not say what q actually is, but we do say that there are certain sets of variables among our latent variables: we take the set z of all latent variables, group it into subgroups indexed by i, and require that q can be written as a product over these individual factors. That is all we ask. Now let us see whether this gets us to a structure that actually tells us what each q_i is, without having to write it down explicitly. Write down the evidence lower bound L(q). Remember that L(q) is the integral of q times the log of the joint divided by q, or, using the properties of the logarithm, the log of p(x, z) minus the log of q. We impose our factorization structure: q is a product, so inside the logarithm we get a sum. Now let us pick out one individual variable, or one set of variables within z, indexed by j. If we focus on this particular subset, we can pull it out of the rest of the product and think of the first expression as an integral over q_j times all the other factors in the product, all the i that are not j; because of the factorization assumption, the corresponding integrals can be dragged inside, that is, we can do those integrals first and leave the integral over z_j as the outermost one. From the second part we get, by the same separation, one term that is the integral of q_j log q_j over z_j, and a collection of terms that involve only the other factors q_i; those are constants with respect to q_j, because log q_i does not depend on z_j, so it can be taken outside the corresponding integral, and the remaining integral over q_j is just one, since q_j is a probability distribution. So these are the only pieces that actually depend on q_j, and the rest is constant. If we now look carefully at the first expression, we see that it is, not quite a marginal, but an expected value: the expected value of the log of the joint under the approximating distributions q_i over all the other variables. From a functional perspective this means that all the other variables are integrated out, and once those integrals are done we are left with an expression that depends only on z_j. We can write this down using a dedicated notation: there is some function, not log p itself, call it log p tilde, that depends only on z_j. You may already have noticed that this is related to the idea of message passing in a graphical model, where we construct messages by summing out all the other variables and then pass the result forward as a function; if you would like to know more about this connection, ask me about it in the feedback and we can talk about it in the flipped classroom. This is in fact a notion called variational message passing, and it is exactly this construction.
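The key objects from this derivation, in the notation just described (z with subscript "minus j" standing for all latent variables outside group j, and p tilde mirroring the slide), are:

```latex
\log \tilde{p}(z_j) \;:=\; \mathbb{E}_{\,q(z_{\setminus j})}\!\left[\log p(x, z)\right] + \text{const.},
\qquad
\mathcal{L}(q) \;=\; -\,\mathrm{KL}\!\left(q_j(z_j)\,\|\,\tilde{p}(z_j)\right) + \text{const.\ (w.r.t.\ } q_j\text{)},
\qquad
q_j^{*}(z_j) \;\propto\; \exp\!\left(\mathbb{E}_{\,q(z_{\setminus j})}\!\left[\log p(x, z)\right]\right).
```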
So we are left with a function that depends only on z_j, and now we can just look at this expression and try to make sense of it. Maybe you have realized already that the expression we see here looks like a KL divergence: a KL divergence between q_j, just this one factor, and the constructed approximate distribution p tilde of z_j, which is a function of z_j alone. This is a very powerful insight, because we actually know how to minimize KL divergences in closed form; we do not have to write down some iterative process anymore, we know exactly where this functional KL divergence is minimized: at exactly the distribution p tilde of z_j. So let us use this insight to construct an actual algorithm, an algorithm that constructs an approximation to the true posterior for z given some data x under this generative model, by minimizing the KL divergence between the approximation q and the true posterior, without ever stating the explicit form of the approximation. The only assumption we put in is that q has this factorization property, that it is a product of independent factors. Under this assumption, the algorithm works as follows. We initialize somehow, with some initial distributions; admittedly this step is a bit vague, and we will see in practice how to handle it. Then we iterate over the individual variables and repeatedly compute the evidence lower bound, the lower bound on log p(x) under the approximation, specifically for the local set of variables z_j. We just saw on the previous slide that we can write this ELBO in this form, and we recognize that what we see is a KL divergence between our approximating factor q_j and some implicit probability distribution p tilde of z_j; it of course depends on the data, but it is basically a p tilde of z_j. We do not have to carry out this minimization by hand, because we already know which distribution minimizes this KL divergence: it is p tilde. So if we can find a probability distribution q_j star of z_j whose logarithm, the logarithm of its pdf, equals this function, then we have found the factor of the approximation that minimizes the KL divergence. And notice that this expression really is a function of z_j; it is not the value of that function at a few points, it is an entire function. It may therefore uniquely identify the approximation without our ever having to write down how q_j is expressed in terms of some parameters: it is already an entire function, the logarithm of a probability distribution, which minimizes the KL divergence. Once we have done that for z_j, we keep iterating over all the individual factors of the approximation, coming back to each of them in a loop. It will turn out, and this is something I am not showing here, that this iteration actually converges, because the ELBO can be shown to be concave with respect to each individual factor q_j (or, for its negative, convex).
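As a small preview of how this turns into working code, here is a hedged sketch of the coordinate iteration for a deliberately tiny toy model, not one of the course examples: data x_n drawn from N(mu, 1/tau), with priors mu ~ N(mu0, 1/lam0) and tau ~ Gamma(a0, b0) (rate parameterization), and the factorization assumption q(mu, tau) = q(mu) q(tau). The closed-form updates below follow from the boxed update rule under exactly these assumptions; all names are made up.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate-ascent mean-field VI for x_n ~ N(mu, 1/tau),
    mu ~ N(mu0, 1/lam0), tau ~ Gamma(a0, b0), with q(mu, tau) = q(mu) q(tau)."""
    N = len(x)
    E_tau = a0 / b0                      # initial guess for E_q[tau]
    a_N = a0 + N / 2.0                   # this update never changes
    for _ in range(n_iter):
        # update q(mu) = N(m_N, 1/lam_N):  log q*(mu) = E_tau[log p(x, mu, tau)] + const
        lam_N = lam0 + N * E_tau
        m_N = (lam0 * mu0 + E_tau * np.sum(x)) / lam_N
        # update q(tau) = Gamma(a_N, b_N):  log q*(tau) = E_mu[log p(x, mu, tau)] + const
        b_N = b0 + 0.5 * (np.sum((x - m_N) ** 2) + N / lam_N)
        E_tau = a_N / b_N
    return m_N, lam_N, a_N, b_N

# usage: x = np.random.default_rng(1).normal(2.0, 0.5, size=100)
#        m_N, lam_N, a_N, b_N = cavi_gaussian(x)
```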
This approach, optimizing the KL divergence between an approximation and the true posterior in a non-parametric, so-called variational form, by optimizing over the entire function at once, is perhaps the most general member of this class of approximation techniques called variational inference. It is also connected to, and in the machine learning literature sometimes simply called, mean field theory, because it arises from an analogous idea in physics: there we model the behaviour of a many-body system, many particles interacting with each other through potentials attached to each individual particle, by modelling the particles with separate, independent distributions, factors in a joint probability distribution, chosen such that this independent behaviour minimizes the KL divergence to the true distribution the particles would have if we considered all the interactions. Why is this called mean field? Because the effect of all the other particles, under these potentially very complicated interaction terms, is mediated solely by this object here, and this object, as you know from the previous slide, let me just go one slide back, is an average of the log joint under the distributions of all the other particles: a mean of the field generated by all the other particles, and it is the only channel through which the particles influence each other in the approximation. So we can think of every individual particle as behaving completely independently, in a potential driven only by these mean interaction terms. But that is just an aside: if you see the term mean field theory somewhere, you now know that it essentially refers to a variational approximation in which the approximation is imposed solely through factorization assumptions. Sometimes in the literature mean field theory also implies a maximal factorization, that is, everything is assumed completely independent of everything else, although what exactly that means depends on how the individual variables are defined. We will actually get to see an example of this.

With that, we have done the hard part, the derivation, and we have arrived at our final tool. We have not applied it to anything yet; we will need to do that in the next lecture, which will be entirely devoted to concrete examples of how this works. Today's job was the derivation of this algorithm. Variational inference, the final tool in our toolbox, is a general framework for constructing probability distributions that approximate the true posterior arising from a generative model, by minimizing the KL divergence between the approximation and the true posterior under some imposed constraints on what the approximation is. It is a very powerful framework, because what I just said leaves us with a lot of options for how to construct the approximation. If we put no constraints on q at all, then we know what the minimization gives us: the true posterior itself. In general we will not be able to compute that, but the variational framework, the calculus of variations, allows us to impose only very weak constraints, for example the assumption that q factorizes in a particular way, without explicitly stating what the factors of that factorization are. We have just seen that when we do this, the framework tells us, and this is the operational equation of the framework, that we should set the log of our approximating factor for the variable subset indexed by j to the expectation of the log of the joint under all the other factors of the approximation.
This term, the expected value of the log of the joint under all the other approximations, is called the mean field, and the resulting approximations are therefore called mean field approximations. We will see next lecture that, abstract as this statement is, it is actually possible to realize in practice, and it is particularly straightforward if we use exponential family distributions in p, in our generative model. Why? You can maybe already guess: because then the log of p consists of relatively simple terms that are linear in the parameters of the model, and all we need to compute, quote unquote, is the expected value under the approximation of the sufficient statistics of our exponential family distribution. Quite often these expectations can be computed in closed form, and they give us a function that explicitly tells us what q actually is. Now, I have used the words explicit and tractable quite a lot, but everything we have done so far is still rather abstract. To make it concrete, to make it a tool that you actually know and understand how to use, we have to apply this framework to a concrete problem, for example to the Gaussian mixture model and, to close the circle and come back to our running example, to our topic model. When we do so, you will see that variational approximations are a very powerful tool that can be applied to a large class of models. You will also find that they require a lot of manual work, for the derivations and for implementing them in practice on a concrete problem. Variational approximations are therefore, in practice, not the kind of tool you want to use when you first design a generative model; we have seen, and we also did this in our topic model example, that in such situations you may prefer to take a smaller subset of your data and run, say, a Markov chain Monte Carlo algorithm to get a first idea of what the model looks like and what its predictions look like. But variational approximations tend to be significantly more efficient than sampling, because they turn inference into an optimization problem, and they still return a full posterior. We will see how to do this in our practical examples, and then add this as the kind of high-performance tool you might employ in a finished product to get really fast, reliable, stable and quite efficient probabilistic inference at nevertheless relatively high fidelity. You will see how we do this in practice in the next lecture; I invite you to go there and see how the derivations actually work out. For now, we are done with this lecture. Thank you very much for your attention.