So what is EM? I told you on Monday that we've now spent this entire course building these beautiful probabilistic algorithms. These are methods that assign probability measures to latent variables, in particular to things you would like to know. We then infer those latent variables by extending these probability distributions into a joint measure over not just the things you would like to know, but also the things we can actually measure, the data, and then use Bayes' theorem to compute posterior distributions over those latent quantities. But whenever we do that, invariably, at least if we think hard enough, we discover that there are some other parts of our model, some ways in which you could change the model in a discrete or continuous fashion, which we have to drag around. And we call these things parameters. So at least from a computational perspective, you can think of this entire algorithmic setup as containing three types of numbers. Data: those are numbers you don't change. They just come from the disk. They're stored or measured, and they're never questioned. Variables: these are things you would actually like to know. You're uncertain about them, so you assign probability distributions to them. And parameters: these are the numbers you need to write down the model. You don't actually know what they are, and you don't care so much about them either, but you have to have them in your model, otherwise you can't make it work. And if you have to have them in your model, the natural question is: how do I set them? So we need algorithms to set those parameters, and I spoke about the general recipe for doing so on Monday. The philosophically correct answer is to say: those parameters are also just variables, so really you would like to do Bayesian inference. You would like to marginalize out the variables you already have a distribution over, the z's. Then you're left with an expression like this (with or without the log) which is a likelihood for the parameters theta. So you could multiply with a prior and normalize, and that would give rise to a posterior. But that's the sort of answer only a mathematician can give, because in practice the real reason we have these thetas in our model is that we don't know how to do Bayesian inference on them. If we did, we would do it, right? If they were linearly related to the data, we would just make everything Gaussian, and then it's fine. Maybe it's a little bit more linear algebra; we can do linear algebra even over infinite spaces. But these thetas tend to be really nasty. They come in through complicated nonlinear functions; maybe they even have non-continuous effects on our model. They're just really annoying. So we can't actually do this. By definition, we can't: that's essentially the definition of a parameter, something we can't do Bayesian inference on. What we can do instead, and that's the general recipe I've advocated so far across the course, is to compute, if the model is sufficiently differentiable, at least a second derivative. First, you find a mode of the joint distribution over y and z given theta. Then you do a second-order Taylor expansion at that mode, and call that a Laplace approximation.
And when we do that, it gives rise to quantities we can use. In particular, we can then compute an approximate evidence, an approximation to this quantity, by integrating out this Gaussian approximation. And that's actually not bad. It's a general recipe, and it works quite well if you have automatically differentiable code and the likelihood is sufficiently often differentiable. But it's not exact; it's an approximation to this distribution. And it's also a numerical problem: we then have to compute a gradient with respect to theta, and then we have to follow that gradient. And we've seen several times over the course that following that gradient might be hard; just calling gradient descent is not necessarily the most efficient algorithm. You can do it, but it's not going to be super fast. So I said on Monday: here's a third option, which I didn't actually motivate very well. I said you could also do something else, which is the following. We look at this expression up here again, and we realize that maybe the problem is that we take the logarithm of this integral, and this integral is the hard part: we don't know how to do it. So something you could try with your model is, in some sense, to move the logarithm inside the integral. And then we integrate against a probability distribution given by a quantity we can talk about in two different ways. One way is to say this is the posterior over z, given the data, for a particular choice of parameters. Or we could say this is an approximation to the generative model for the data, if we assume the parameters are exactly this theta star. So this is an expectation we're computing, an expected value: an integral of a function against a probability distribution. And then we maximize this quantity with respect to theta. I've deliberately written it with a theta and a theta star so that you can see that this term, the one that controls the integral, gets a fixed theta star, which we're not optimizing in this step. We're only optimizing the theta in the back. And then I said this method is famous, and it's called expectation maximization, EM. Here's an algorithm for it. I asked whether you'd heard of EM, and everyone raised their hands. And now in the feedback, some of you said: I've heard of it, but it's apparently difficult. So let's talk a little bit about it and see if we can get a sense for what it actually does. So here it is again, compacted down. It's a for loop; that's maybe the main message of this slide. Or maybe even a while loop with a convergence criterion. We iterate between two steps, which sometimes get moved around so that it's a bit difficult to see where the expectation is and where the maximization is. We compute an expected value, an integral of the log joint against the posterior at a particular point in parameter space. And then we maximize that expression with respect to theta. And when I say maximize, you tend to think of gradient descent. It's not impossible to do this with gradient descent; this is called generalized EM.
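To have the objects in front of us (this is just the expression from the slide, written with x for the data and z for the latents), the quantity being constructed, and the two alternating steps, are:

```latex
Q(\theta, \theta^\star)
  = \mathbb{E}_{p(z \mid x, \theta^\star)}\big[\log p(x, z \mid \theta)\big]
  = \int p(z \mid x, \theta^\star)\,\log p(x, z \mid \theta)\,\mathrm{d}z
% E-step: compute p(z | x, theta*), the distribution the integral is against
% M-step: theta* <- argmax_theta Q(theta, theta*)
```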
But the typical idea you should have in your head is that you probably won't be using gradient descent on this problem. Instead, we do an analytic step: we compute the gradient of this thing with respect to theta and set it to zero analytically, on a piece of paper. And once we've done that, we go back to E: we set theta star to the new theta and repeat. We keep doing this until either theta doesn't change anymore, or this quantity we've computed here, this Q, doesn't rise anymore. So, depending on how you want to implement it, you can either keep computing Q and check whether it keeps rising, or you can check whether theta keeps changing. One reason you might not want to compute Q is that you don't actually need it for anything else. It'll turn out that we can compute the update rules in the computer, once we've done the derivations, without ever evaluating Q, and then it's sometimes easier to just keep track of theta. So that's the algorithm. I can write it down, but of course you can't just write out an algorithm and use it without understanding why it works. What we're trying to do is maximize the evidence, which isn't even on this slide anymore, and instead I'm telling you to maximize this other quantity. So why is this a useful thing to do? Why would you want to do it? That's what the first half of this lecture is going to be about. And spoiler alert: there's going to be some annoying math. There are four or five different ways of explaining EM, and none of them gets around the annoying math. You can find other expositions that focus on other aspects than I do, on Wikipedia and in various textbooks, from Moritz Hardt to David Barber. A large part of my exposition is based on Chris Bishop's book, because he already has a focus on what might come next, what we might do on Monday, which I quite like. But we're going to need to do some annoying math, and I'll try to stay relatively high level about what we're trying to do. For that, I need to tell you the same thing a few times, because every time I say it, it'll become a bit clearer. Let me say it the first time now. It'll turn out that when we compute this quantity Q, the value we compute at this particular point theta star happens to be exactly the correct value of the evidence. So we're trying to compute this quantity, which is a function of theta, and the quantity we compute down here happens to equal it exactly at theta star. And that's surprising, because you don't see it from these two lines of math. It also doesn't just have the right value: at that point it has the same gradient as this function, and it is a lower bound, so it gives us a direction in which we can step such that we are guaranteed to also increase the thing on the left. And the argument for that is going to involve some lower bounds. So let's do that. What we're doing here is constructing this p of z (z are the latent variables; we'll do an example later on) given the data and the parameters. You can think of this quantity as an approximate distribution over z. Well, it's also a posterior distribution, but you can equally think of it as an approximate distribution over the latent variables that we don't know.
And you could call this distribution q, a probability distribution, to get rid of the x and theta business. Because then we can see that, no matter which approximation we use, any probability distribution q, as long as this q has the same support, or at least as much support as the joint over which we're trying to compute a marginal, lets us do the following, at first seemingly unmotivated, steps. On the left-hand side we have the thing we would like to compute, which is given by this integral. This is the evidence for the model, the likelihood for the parameters of our model, of which we take the logarithm. It's just written out: it's a marginal, an integral over z. And now we multiply by one: we multiply by q of z and divide by q of z. We're allowed to do that because q of z is supposed to have support wherever p of x and z has support, so we're not dividing by zero, and this expansion is legitimate. And now comes the operative part: we use Jensen's inequality. Here's the point where I need to check how good your math for machine learning course was. Who knows about Jensen's inequality? Not everyone. So Dr. Gabriela didn't do that last year, I guess. Jensen's inequality is stated here again. I'll read it out and then tell you what it means. It's an inequality that comes from a Danish mathematician, published in 1906. There are various other versions of it that came before; this is the elegant, clean form. It says, effectively: if you have a convex function phi and a random variable with some probability distribution, then phi applied to the expected value of the random variable is less than or equal to the expected value of phi applied to the random variable. Here it is again in a more general way of writing it down. This is a generalization of the secant-line picture you've probably seen in your first-year math courses. The picture from your first-year undergraduate classes is: if you have a convex function, then for any two points, the chord through those two points lies above the curve. Another way of thinking about this is that the average value of the function over that interval lies above the curve, and an average value is an integral of the function against the uniform probability distribution. What Jensen says is that this is true more generally, against arbitrary probability distributions, for convex functions. Now, the logarithm is a concave function, so we just turn the signs around and get a greater-than-or-equal. And so what this tells us is that the expression we're trying to compute is lower bounded by this quantity, which is given by our integral, effectively, up to some constants we'll talk about in a moment. And therefore we give this thing a name: we call it curly L of q. And notice where theta actually shows up. This q of z, in our case, is the posterior. Remember that we set q of z to p of z given x and theta star; I said we fix theta at that point.
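Written out (this is the chain from the slide, for any q with support wherever p(x, z | theta) has support, the inequality being Jensen's with the concave logarithm):

```latex
\log p(x \mid \theta)
  = \log \int p(x, z \mid \theta)\,\mathrm{d}z
  = \log \int q(z)\,\frac{p(x, z \mid \theta)}{q(z)}\,\mathrm{d}z
  \;\geq\; \int q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)}\,\mathrm{d}z
  \;=:\; \mathcal{L}(q, \theta)
```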
So that means the only part of this expression that depends on the theta we're actually optimizing is the bit up here. The extra term we get from this one-over-q in the back doesn't really matter, because it's just something in theta star, which we don't optimize. So this whole thing is a lower bound on the evidence, which we then maximize. When we maximize this expression in the M-step of EM, we increase a lower bound on the evidence. And that's why this thing L, in the literature of the past fifteen years or so, has been called the evidence lower bound, the ELBO. Who has heard of the ELBO before? It's amazing how this term has made it across the entire community: way more of you know about ELBOs than know about EM or about Jensen's inequality. As far as I can tell, the word ELBO comes from David Blei in New York, who started using it in variational inference because it's just a joke name; he thought it sounded funny, and it really stuck. So everyone has been calling it the ELBO. On Monday, I'll give it a few other names. It has some rather exalted interpretations in physics that come across much more seriously than "elbow", but the name works for everyone. So this is the ELBO. Okay, that was the first step toward understanding EM: what we do when we maximize this seemingly weird expression is that we actually maximize a lower bound on the evidence. There's this function called the evidence. It's a function of theta, the log probability of the data given just the parameters. And we now know that we can compute another function, L of q, which lies below it. But that in itself isn't really enough yet, right? Just having something below, which we push up, is useful; it's a minorization strategy. By moving theta around, we can take a step that gets us to a higher point of the lower bound. And because the lower bound rises, we can hope that the thing above it rises as well. There are only two options, right? Either it rises too, or we get closer and closer to it. But that second possibility, getting closer and closer to the evidence, doesn't really help us so much. It just means we get a better approximation of the evidence, while in the end we're going to return one particular theta and use the model that comes from it. So this by itself isn't the final answer. We could find lots and lots of lower bounds; for this to actually make sense, we need to know how tight this bound is. And here is where the amazing bit comes in. It turns out that if we choose our q to be this posterior, then we go to a point where the lower bound is exactly equal to the evidence. Now I'm going to remove this line below, because it's actually wrong. It's just the intuition we could have, but it's not what's actually true. I'll keep this bit: for any distribution q, we get a lower bound, as long as it has full support. But it'll turn out that if we choose for q the posterior p of z given x and theta, then we get exactly the correct number at that point. The bound will be tight: this will not be a greater-or-equal, it will be an equality. Why is that? This is where the other bit of annoying math comes in.
It has to do with these two people you've probably heard of before. They're called Kullback and Leibler, KL. They're on the left and the right because the pictures just fit better this way: Kullback is looking in one direction and Leibler in the other. And these are seemingly the only two pictures of these people out there in the world; I think there are two pictures of Kullback that you can find. People mispronounce their names. They misspell them: "Liebler" they sometimes write, or "Kulbach" with a "ch", whatever. Why is this? It's because these two people lived a very secretive life. They worked for the NSA for their entire careers. Kullback actually was the scientific director of the NSA toward the end of his career, before he retired and became a professor, as you do. And so we don't actually know what they did there for most of their lives, because they broke codes. First against the Germans, then against the Japanese, and then probably against the Russians. And they rarely spoke about what they did, because that's just what spies do. But every now and then they wrote a paper, and they wrote one about this thing for measuring distances (not distances, divergences) between probability measures. And the moment it was published, in 1951, when they thought it was safe, it was immediately read and picked up by Russian scientists, who were very eager to see what these two people were up to, because it seemed very important. So here it is. We've just seen that we're going to use this L. L is our ELBO, our lower bound on the evidence, and here it is again, copied over for your convenience. This is the expression we're working with. Now we can see that this is an integral over z, and there's x and z in here. So let's get rid of the x as much as we can. We rewrite this joint as a product of a marginal and a conditional using the product rule: p of x times p of z given x, both conditioned on theta. Then we realize that this part becomes much easier, because it doesn't depend on z at all. So we rearrange a bit. We keep the first terms here in the front, and then use the functional property of the logarithm: the logarithm of a product is the sum of the logarithms. So we get a sum, plus log of p of x given theta. This term doesn't depend on z, so we can take it outside of the integral, and at the back we're left with just an integral over q of z. But q is supposed to be a probability distribution, so that integral is one, and this factor goes away. And now we notice: here on the right-hand side is the thing we're trying to compute. That's the evidence, the quantity we're trying to work with. Ha! Okay, so now we can just rearrange the equation: we keep this here and move this over. And we see that the evidence we're trying to compute is actually equal to our lower bound minus the integral over q of z times the log of the posterior over z given x and theta, divided by q of z. And now we give a name to this object, this minus-integral over here. This is called the KL divergence. And I'm guessing everyone has heard of the KL divergence. No? Okay, I'll ask: who has heard of the KL divergence, the Kullback-Leibler divergence? Everyone has.
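Collected in one line, the rearrangement we just did reads:

```latex
\log p(x \mid \theta)
  = \mathcal{L}(q, \theta)
  + \underbrace{\left( -\int q(z)\,\log \frac{p(z \mid x, \theta)}{q(z)}\,\mathrm{d}z \right)}_{%
      =\;\mathrm{KL}\left[\,q(z)\,\|\,p(z \mid x, \theta)\,\right]}
```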
Okay, so you know that this is a divergence, which in particular means it is non-negative: it's always zero or larger. It also has some other interesting properties. It is a divergence, not a metric. It doesn't obey the triangle inequality, but for certain types of distributions, exponential families, it obeys some form of Pythagorean theorem. The most important thing is that it is zero, its lowest possible value, if and only if q and p are equal to each other. And that's the entire trick. This term tells us which q to choose if we want to make the bound tight. If we set q to the thing up there, then this term will be zero, and therefore our lower bound on the evidence is exactly correct. It's a tight bound. So what we do in EM is iterate between building a simple function, L, as an approximation to the evidence, one that has exactly the correct value at the current location theta and a shape that always lies below the evidence, and then moving up: we take a step that gets us to some (that's badly drawn) better point. Let's call this the optimum. And then we recompute a new lower bound on the evidence. Yes? The question is: what about the sign? It's just rearranging the terms. We keep this on the right-hand side, and then we just move it over here. Oh, why is there a plus here? That's just the definition of this divergence; that's how it's defined. Maybe it's just a question of how you're used to writing it, but that's it. No: there's a logarithm of a ratio here, which can easily be less than zero, right? So it's not obvious that what we're integrating up is a positive number. q is positive and p is positive, but the log of p over q doesn't have to be positive. Okay, so for everyone else, here's the story again with another picture. The equations on the left are just copied over from previous slides; at this point, there's nothing new anymore. I'm just saying what we have just found out: the evidence can be decomposed into two functions, which depend on the parameter values theta and on an arbitrary distribution q over z. As long as that probability distribution q has full support over z, we can always decompose it like this (the usual way to say the sentence is the other way around) into the KL divergence between q and p, plus some number. What we then usually look at is this expression back here, because it measures some form of dissimilarity, a divergence, between q and p. But we can also think about the thing in the middle, the gap, if you like. And well, it doesn't really matter which of them is the gap, right? Together they make up this thing. On the left-hand side, we have this expression, which we've now given a name to: it's called the ELBO. And the KL divergence is just this, by definition. So what we've now done is say, here's an idea for an algorithm, sort of in hindsight: if we can, and of course we can't always do this, we compute p of z given x and a particular theta, the posterior over z if we knew what theta was, and set q to this distribution. What that does is make this term, the KL divergence, zero. Boop. Therefore, these two numbers are now exactly equal, at this particular value of q, right?
Sorry, at this particular value of theta. These are functions of theta, and here it's just a line, right? This is not the theta axis, it's just a line. So at a particular point theta, those two are now equal to each other, and now we can optimize for theta. We take a step in theta space. And when we do that, we change the KL divergence again. We change all of these terms, really, but maybe the best way to look at it is that by changing theta, this q becomes a kind of outdated old posterior that is no longer correct. And therefore, L is no longer exactly the correct value of the evidence, because the distribution over z that we're integrating against is, in some sense, outdated. And therefore, a non-negative, in fact now non-zero, KL divergence shows up again. And then we can repeat. We close the gap again, and we go again and again. So, yes? The question is: we're moving around in theta space, and we keep having point values for theta, so why not use gradient descent on theta? Actually, you can. In the second half of this lecture, I'll show you why you might not want to, but you can. You can compute a gradient of this ELBO with respect to theta. And in fact, that's what people often do in the class of algorithms that arise from this idea, which are called variational inference; they just follow the gradient. So here is another way of writing down the EM algorithm. I know I have a lot of these slides that all say "this is EM"; here we've cleaned up the notation into the one slide that says what EM actually is. If you're in a situation where you need to estimate parameters of your model, and your model contains data, variables, and parameters (the data are called y, the variables are called z, and the parameters are called theta), then what you might want to do is maximize the evidence, this expression. You want to do that because you can't do Bayesian inference. If you could do Bayesian inference, theta wouldn't be called a parameter, it would be called a variable. So you want to maximize this expression, but we assume that this integral is actually difficult or intractable. What we do instead, and that's the EM algorithm, is initialize at some particular value theta zero, and then compute a posterior distribution over the latent variables given theta. What we're doing by that, we've now realized in hindsight, is effectively constructing an approximation to the distribution over z under the model, which happens to give us exactly the correct value of this function at this particular point in theta. And that's the case because this choice of q minimizes the KL divergence between the posterior and q, simply because it is the posterior over z given the data and theta. And then, just a moment, we optimize this expression, which is called the ELBO, with respect to theta. This function of theta is not that function of theta; it's just another function, which happens to have the property that it has the correct value at our starting point in theta, and that it is a lower bound. So when we increase it from a point where the bound is tight, we necessarily increase the evidence as well. And that's the cool idea.
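As a very small sketch of that loop (not the course's code; `posterior` and `maximize_elbo` are hypothetical model-specific callables you'd have to supply), the structure is just:

```python
import numpy as np

def em(posterior, maximize_elbo, theta0, tol=1e-6, max_iter=100):
    """Generic EM loop, assuming theta is a flat parameter vector.

    posterior(theta)  -> q(z) = p(z | data, theta)                 (E-step piece)
    maximize_elbo(q)  -> argmax_theta E_q[log p(data, z | theta)]  (M-step piece)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        q = posterior(theta)            # E-step: makes the bound tight at theta
        theta_new = maximize_elbo(q)    # M-step: pushes the bound up in theta
        if np.linalg.norm(theta_new - theta) < tol:  # track theta, not Q
            return theta_new
        theta = theta_new
    return theta
```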
So we keep doing this until we don't increase anymore, and then we're stuck in some local optimum of the evidence. In the picture I've drawn, it looks like we might converge to the global optimum, but of course we might get stuck in a local one. But at least we'll find an actual optimum, which is not guaranteed if you use the Laplace approximation, because there we only ever approximate this integral. I'll get to you in a moment. One question you might have: it sounds like I need to be able to compute the posterior over z given theta. Can I even do that? It sounds like it should be just as hard as the evidence, but it's actually not. If you think of Gaussian process regression, our base-case simple learning algorithm that underlies so many other classes of algorithms: for a particular choice of the parameters, we can of course compute the posterior over the latent function. That's the entire point. What's not so straightforward is to think about what the landscape over the parameters theta actually looks like. Now your question. So concretely, what is the thing we're trying to compute? I think the simple answer is: after the break we'll do a concrete example, and not Gaussian process regression but a very concrete one. Uh-huh. So the question is: is there a convergence rate for this algorithm? There's a lot of theoretical analysis of this algorithm. It depends a lot on how it is implemented and what kind of model it's actually applied to. What we're going to do after the break is an instance where we can do block coordinate ascent in theta space, and then you inherit all sorts of convergence statements from those types of algorithms. If you do it with gradient descent, you get a convergence rate from gradient descent. If you do second-order optimization, you get a second-order rate. But typically, this algorithm is used in a much more structured fashion, updating entire blocks of the parameter space to an exact maximum, some kind of Gibbs-style block coordinate ascent. So before we go into the break, I just want to point out a minor thing, to clean up the formalism a bit. What we're doing here is maximizing the evidence. And I said on Monday and a few times before: quite typically, you don't actually want to maximize the likelihood, because you have learned in all sorts of classes by now that maximum likelihood is really dangerous, right? The likelihood can be maximized at a point that is very bad, because it explains the data really well but is totally unphysical, or you simply know that it's the wrong value of theta. So what you might actually want to do is multiply this expression, p of x given theta, with a prior p of theta. And I've already mentioned a few times that that's usually not hard. When we did the numerical optimization of the parameters of a Gaussian process model, I didn't actually do this, but you could easily have added a term to the optimizer, basically to regularize it. There's already a logarithm here, so multiplying with a prior is like adding a log prior, and literally adding a function is not that complicated.
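In symbols, that's all the change is: one additive term on top of the log evidence.

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; \big[ \log p(x \mid \theta) + \log p(\theta) \big]
```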
And this actually holds true for EM as well. It's absolutely straightforward to write a variant of EM that does maximum a posteriori inference, so it maximizes a posterior rather than a likelihood. And the only thing you have to change on this slide (everything else is as before) is to add one term here. We're then minimizing the KL divergence to the, well, the joint actually, over x and z and theta. And this is really easy because, as you can see, this is a distribution over z, and there's no z in this new term. So we can really take it outside of the integral and literally just say: we maximize the ELBO plus the log prior over theta, whatever your prior over theta is. These priors over hyperparameters tend to encode hard real-world knowledge. For example, theta might be a rate, and you know it can't be lower than some stupidly low number, so you bound it away from zero just to make sure your model doesn't choose zero. They're often some kind of log-concave functions that just keep the model from going too high or too low in theta space, away from corners of the domain you don't want to be in. Yeah? Say again; you mean down here? Yeah. So the inequality should be the other way around. Yeah: we're maximizing this thing, and therefore we're maximizing this, so this should point this way. Okay. And now we take a break; let's continue at nine-oh-five. So, I've fixed the inequality now. And there was a question in the break about why Laplace is actually not the right thing to do, or why Laplace is bad. So maybe I should clarify this point. I said on the very first slide of today's lecture that to find good parameter values for your model, ideally you'd like to do maximum a posteriori inference, or maybe even full Bayesian inference over your parameters, but you can't do that. And one general option is the Laplace approximation. And maybe I dragged Laplace a little bit too far down; maybe I made it seem like a really bad thing to do. We used Laplace approximations for Gaussian process classification and for deep neural networks, and even for exponential families, and I argued that they are a good thing. We did model adaptation for classification with Laplace, and you can do that for deep models as well. There are people out there now at the big companies fitting architectures of deep neural networks using Laplace approximations. This is a good thing to do; I'm not saying it's bad. But it is fundamentally an approximation, because the actual distribution over z isn't the one you're approximating it with. When we do Laplace, we typically approximate the distribution over the latent variables with a Gaussian, even though that distribution isn't actually Gaussian. And then we track the dependence of this Gaussian distribution on theta and optimize for it. Because there is an approximation gap between the true distribution over z and the Gaussian distribution over z that we construct with this Taylor expansion in log space, the Laplace approximation, we can't be totally sure that we're actually optimizing the right thing. In practice, I think it's usually not a problem. It's just not as strong a mathematical argument. So there might be a confusion here about what we are finding a mode for.
The way we've used Laplace so far, in Gaussian process classification, in deep learning, and also for exponential families, is that we effectively optimize in z space. And then, at a particular point in z space where the gradient is zero, hopefully, we do a second-order Taylor expansion in z. That gives us a Gaussian distribution in z, and we can usually marginalize over Gaussian distributions in z because the integrals are closed form. That gives us a functional dependence on theta. So this is an approximation of the relationship between x and z that linearizes the relationship and makes the posterior Gaussian. And then we can integrate out z under this approximation in closed form and get a thing that depends on theta. And maybe, from this slide, we can also think of this as constructing a particular q of z that happens to be Gaussian. It's an approximation in function space: we insist on a Gaussian approximation q. And that Gaussian approximation q has a non-zero KL divergence to the true posterior, because the true posterior for these models isn't Gaussian. What we then do is just maximize this thing with respect to theta, and effectively we're hoping that the KL divergence stays roughly the same while we do that. If it doesn't grow too much, then we're probably increasing the evidence of the exact model as well, right? If we move around and this gap doesn't change much, we can keep moving up. But there could be a weird case where we focus too much on optimizing the approximation q, and we trick ourselves into a point in theta space where the Laplace approximation q looks really good, but the actual model underneath isn't good. For example: Gaussian approximations have these weak tails, they're just local, right? So one thing that could happen is that what we do to our model actually fattens its tails in a way that moves mass away without us noticing. EM is an algorithm that avoids this problem by making the bound tight at every step. But of course, we can only do that if we can actually compute the true posterior. And for the models we use in classification and deep learning, we can't compute the true posterior, so there we can't hope to do this. So why might EM still be a good thing? Well, maybe sometimes we can compute the actual posterior, and even do more than that. One particularly interesting case arises if the model we're using is, big surprise, an exponential family. Because when it is, we can do some trickery: we just write down the structure of the joint model, and because we're computing integrals against log probability distributions, we end up (I'm not actually going to do this slide in detail; you can look at it afterwards) with expected values that we need to compute over the sufficient statistics of the model. And these integrals, expected values of sufficient statistics of exponential families, often have closed-form expressions. You can just look them up on Wikipedia. It's not totally automatic for exponential families, but it's often the case.
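To make "closed form" concrete (anticipating the Dirichlet example mentioned in a moment), here is a small sanity check. This is illustrative only, not from the lecture's code, and assumes numpy and scipy are available:

```python
import numpy as np
from scipy.special import digamma

# For a Dirichlet with concentration alpha, the sufficient statistics are
# log pi_k, and E[log pi_k] = digamma(alpha_k) - digamma(sum(alpha)).
alpha = np.array([2.0, 3.0, 5.0])
closed_form = digamma(alpha) - digamma(alpha.sum())

# Monte Carlo estimate of the same expectation:
samples = np.random.default_rng(0).dirichlet(alpha, size=200_000)
monte_carlo = np.log(samples).mean(axis=0)

print(closed_form)   # roughly [-1.829 -1.329 -0.746]
print(monte_carlo)   # should agree to about three decimals
```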
So for example, for Gaussians, the sufficient statistics are moments, so they're linear and quadratic functions, and the expected value of a linear or quadratic function under a Gaussian is known. Or for the gamma and Dirichlet distributions, the sufficient statistics are logs, log probabilities, and the expected values of these log probabilities are known. They're gamma functions, basically: gamma functions and their logarithmic derivatives, the digamma functions. And then we can look at this derivation a bit and see that to optimize these models, to do the M-step of EM, we have to set those expected values of sufficient statistics equal to the gradient of the log partition function. And the log partition function is known, so its gradient is known, you can write it down, and then we often end up with equations that can be solved in closed form on a piece of paper. So the M-step becomes an analytic update step. And to show you what I mean by that, I'm going to show you an example. It's one of these textbook examples that feels totally out of place in 2023, after a course that has done deep learning and Gaussian process regression, but I'll do it anyway, because some of you also said in your feedback that you want a simple example every now and then. So just think yourself back into a simpler time. It's 1995 or something, and computer science is really easy. And what we're trying to do is learn a mixture model. The picture I would then show you in 1995 is this one on the left, of Old Faithful geyser in Yellowstone National Park. Has anyone ever been? It's very nice. You get to sit on a bench, there's this hole in the ground, everything smells of sulfur, and there's actually a big lodge (because the Americans would of course build a big lodge over here) that's never in the picture, because it's not so nice to look at. And this geyser has the property that, oddly enough, for hundreds of years, ever since white settlers arrived, it has erupted with a lot of regularity. That's why it's called Old Faithful. The other geysers come up every now and then; there are a lot more geysers down the field over here. And this thing has an eruption about every hour; it's between 50 and 90 minutes that you have to wait. They have a little clock there that counts down how much longer you probably have to wait. And then it spits water really high up in the air, a really impressive, dramatic thing, for about two to five minutes. So at some point someone, probably from the National Park Service, sat down and collected this data set of how long you have to wait versus how long the eruption is. And lo and behold: if you have to wait longer, more water collects in the cavern of the geyser, and that means the eruption is probably going to be longer. So that's interesting, and more interesting is that this data set has this structure: it seems to have a kind of bimodal distribution. There are two blobs, basically. And these kinds of data sets have occasionally been found in other applications as well, in biology and so on. Usually we only notice, because they tend to come up in data sets that are quite low dimensional, like this one, which you can look at and see: ah, okay. So what people want to do, in the old way of doing statistics, is: I'd like to have a model that learns this kind of thing, this structure.
And you could call this clustering, or a mixture model. Mixture models are associated with math that looks like this. So what is this? It says that these red dots are our x's, the data. And we think this model can be parameterized by three sets of things: pi, mu, and sigma. We use those to describe several Gaussian distributions that lie in this space. In this case, you can probably guess there are going to be two Gaussian distributions, because you can look at the data set and it looks a lot like two. That's also why it's in all the textbooks: it's totally uncontroversial that there will be two classes, two clusters. And we assume that both of these clusters are Gaussian, just because Gaussians are nice exponential families. So we say every individual datum, every red dot, is drawn independently (that's why there's a product here) from this mixture. A mixture is a sum over probability distributions with mixing coefficients pi k, such that the pi k sum to one; they are probabilities. So you could think of creating the next eruption by independently drawing one mixture component from the pi vector. Pi is a set of probabilities; we just draw one instance from this discrete probability distribution. And then, when we know which of the two clusters we're going to be in, we just ask where x is going to lie, by drawing from a Gaussian distribution with mean mu k and covariance sigma k. Another way of writing this, by the way, is to introduce a variable z, which gives us the identity of the cluster each data point is going to be in. So we think of z as a matrix, an array of size number-of-data-points by number-of-clusters, with a one-hot encoding. This is a binary matrix that is zero almost everywhere, and in each row there is exactly one 1, which says that the nth data point belongs to cluster k. You've probably seen models like this in undergraduate machine learning classes, maybe even this very data set. Another question, just for the back of your mind, if you can remember your own earlier courses: what kind of algorithm do people apply to these kinds of models? Because there are many, many different ways of doing inference in them. That's actually why they come up in the textbooks: for this particular setting, you can do at least eight different things and compare them to each other. Today, we're going to do EM. And here's another blast from the past. In 2007 or 2008, when I was working on structured models as well (I was at Microsoft back then), the thing people would do, even at the companies, if you got your data set (of course not Old Faithful, but some kind of customers interacting with some product), was to stand in front of a big whiteboard and turn this piece of math into a graph. Back then, graphical models were all the rage. And if I had taught this class back then (well, I taught it a bit later), we would have spent a lot of time on graphs. So let's remember how these graphical models work. We draw a circle for every variable in our joint distribution. On the previous slide, we had two sets of variables, called x and z. The x's are observed; that's why they're black. The z's are unknown.
We don't know which data point belongs to which cluster, so there's a white circle. And then all the parameters, the things we won't put probability distributions over but which show up in the probability distributions, often get drawn as little dots. In our case, those are the means, the covariances, and the mixture weights pi. And here is a nice addition to this notation that I haven't shown you yet, but it's very easy to understand. When there are multiple instances, so in some sense array-indexed variables, like n and k in this case, we draw boxes around the groups of variables that belong to these arrays. These are called plates in this notation, and we index them by how many copies we have: K copies along k and N copies along n. So this gives us an idea of arrays of these sizes overlapping with each other. Okay. Now, if it were, I don't know, 2006, and you were experts in graphical models, you would look at this graph and immediately see it. Since we haven't talked enough about it in this course, I have to tell you: inside this graph there is a v-structure hidden. In the third class, we talked about the elementary types of graphs, right? Chains, fan-outs, and colliders. And this is a collider structure, in mu and sigma and z. First of all, obviously, the arrows come together. So when we condition on x, mu and sigma and z become dependent on each other. And now the engineer knows: this is where my headache is going to be. That's why the algorithm is not going to be easy to write down: to find the correct values for mu and sigma and z, I have to optimize them jointly, because they become dependent on each other through the data. But there's actually an even worse kind of joint dependence, hidden by the plate notation: the k and n plates overlap, and that means every x_n affects an entire row of entries in z, and they all become dependent on each other. And that's a situation where an algorithm like EM might actually be interesting, because all the variables we're trying to infer depend on each other, and a simple optimization method is probably going to struggle to work well. In particular, if this is high dimensional, or we have large N and large K, there will be a lot of these variables, all depending on each other, and we have to navigate our way through. If we want to do this efficiently, with low computational cost, we have to scratch our heads a little bit. And then we notice, and this is where EM comes in: huh, if we actually knew what z was, if we imagine for a moment that someone told us which datum belongs to which cluster, then everything would become really easy. I'd just have to estimate the parameters of a bunch of Gaussian clusters. I learned that in, I don't know, a second-year stats class. You get some samples from a Gaussian, you do some maximum likelihood estimation for the mean and the covariance, done. And if I know which data belong to which cluster, I can also just count how many instances of each cluster I've seen, and I can do conjugate prior inference with a Dirichlet prior over pi.
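To see what "easy" means here, this is the complete-data log likelihood that the next slide refers to. With the one-hot z_nk, the logarithm turns the mixture's sum into a sum of simple exponential-family terms:

```latex
\log p(x, z \mid \pi, \mu, \Sigma)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \big[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]
```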
Ah, okay, so this is one of these situations where EM might actually be really useful, because it's an instance where the complete-data log likelihood is easy to write down, where the ELBO is easy to construct. And if you haven't yet seen that this is the case, I'll show you on the next slide. First, we've just realized that this is the actual practical instance of the exponential-family structure. Notice that these are exponential families, right? This product over k of pi k to the power z n k is a multinomial distribution, and multinomial distributions are exponential families. And these Gaussians, well, they're Gaussians; they're exponential families. So if we take the logarithm of this, sufficient statistics of Gaussians and multinomials show up. And then we might need to take expected values under some approximate distribution, but those expected values will just be expectations of log probabilities and of linear and quadratic functions. So maybe we're actually going to be able to do that. And of course we are, otherwise we wouldn't do this example. So how does this work? Notice (I'll go back one slide) that there is this variable z in our model now, which is maybe not actually something we care so much about. We don't care which data point belongs to which cluster; it just so happens that it really helps us structure the model, because if we knew z, things would become independent of each other. This is sometimes called induced independence, or induced factorization. So now we have a map from this model to the general structure for EM that we talked about in the first half of today's lecture. There's data x, there's a latent variable z that we have a probability distribution over, and there are parameters of the model, called mu and pi and sigma, with K copies of each. So we can try to do EM. What was the general recipe for EM? First, write down the posterior distribution over the latent variables z, given the parameters and the data: pi and mu and sigma and x. So what's the posterior over these discrete one-hot random variables? We can compute it with Bayes' theorem. We write down the prior probability for z n k to be one, multiply by the likelihood of the observation x n, and divide by the evidence, the marginal probability of x n. So this is the abstract way of writing it down: prior times likelihood divided by evidence. Now, what are these terms, actually? What is the probability, under the model, for datum number n to be in cluster k? It's just pi k. That's what the model says. It says, first of all, that all the data points are independent of each other (there's a big product in the model), so you don't need to think about where the other data points go. And the probability of being in class k is then just pi k. So we plug in pi k. What's the probability for x n to have a particular value if it is in cluster k? Well, it's the Gaussian probability of x n with mean mu k and covariance sigma k. What's the normalization constant? It's the sum over all of these terms. It's just a number, so we're going to compute it at some point. We're going to evaluate pi k under the model, and the PDF of a Gaussian at x n with mean mu k and covariance sigma k. And you can imagine that you can do that.
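For concreteness, here's a minimal sketch of that computation. It's illustrative only: the function name is made up, and scipy's Gaussian log-pdf stands in for the course's Gaussian class.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pi, mu, Sigma):
    """r_nk for data X (N, D), weights pi (K,), means mu (K, D),
    covariances Sigma (K, D, D)."""
    # log [ pi_k * N(x_n | mu_k, Sigma_k) ] for every n and k
    log_r = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(len(pi))], axis=1)
    # normalize each row in log space (numerically stable),
    # so that each row sums to one
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
```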
Even our Gaussian base class that we've used over the course, our Gaussian.py, has this function to compute log PDFs, which we need here. So what this is, the probability distribution over the z n k, is of course a discrete distribution, because the z n k are discrete: just binary numbers, for every possible value of n and k. So they just contain numbers; it's just a table, an array that contains probabilities. And we give a name to these probabilities. They're called r n k in this literature, r for responsibility: how much cluster k is responsible for data point number n. So this is an array of size number-of-data-points by number-of-clusters that you keep in memory. And it has the property that if you sum over its second axis (in Python numbering, axis one, not axis zero), you get one. So each row is a probability distribution. I won't dwell on the code for it. I'll show you the result of the simulation, but you can imagine implementing it yourself; it's just a few lines in Python, like the sketch above. Okay, so now that we have this probability distribution, indexed by this array, in the E-step we need to compute the expected value of the log joint probability distribution: the expected value, under this distribution, which is completely determined by this array r, of the log joint. That's the ELBO we are constructing. And for that (this is often a point of confusion, something to bend your head around), we just have to realize that an expected value under a discrete distribution is simply a sum: the function evaluated at the different values of n and k, multiplied by these r n k, because those are the probabilities, right? We tend to think of expected values as integrals, but for a discrete distribution it's just a sum: a sum over n and k of the probability, which is given by r n k, times the log joint. And what's the log joint? Well, we can go back a slide. Here is the log joint; it's this expression. So we just copy it over. Here we go. And that's it: this is our ELBO. And where's theta now? Just to make sure you're still following: is this clear for everyone? On the previous slides, I had this distinction between theta star and theta, the thing we keep fixed and the thing we're optimizing. So where is theta star, and where is theta? I think this is important to get across. Theta star, the current value of the parameters, is the set of parameters we used to compute r n k. There are pi's and mu's and sigmas in here, and we've used them to construct r n k. So now they are in here: this is r n k of theta star. And this is log p of x and z given theta, where theta is pi and mu and sigma. And what we're going to do now is optimize this expression with respect to these three parameters, while keeping fixed the ones inside this r. And when we do that, of course, the responsibilities won't be correct anymore, because if we change the weights and the parameters of these mixture components, they will lead to different estimates for the responsibilities. And that's exactly the KL gap opening up again. So now we do a little bit of math, which you can just follow along without letting it affect your mental health, because it's just one of these things you then do at the end.
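Putting the pieces together, the expression we're about to differentiate is:

```latex
Q(\theta, \theta^\star)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}
    \big[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big],
\qquad r_{nk} \text{ computed at } \theta^\star
```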
Well, maybe the main thing to take away is that doing EM requires you to bring out a piece of paper and do some derivations, as you do in this week's homework as well. So here's the expression again, copied over from the last slide. Now we need to optimize this thing with respect to only pi, mu, and sigma, and we ignore the fact that there's a pi, mu, and sigma inside the r. That's the entire trick, and we're allowed to do that because that's what EM says. So let's take the derivative of this expression with respect to mu first. There are capital-K many mu's in there, and each mu is a vector. We take the gradient with respect to one of them; let's call it mu l. That mu l only shows up in one term of this sum over k, the one where k is equal to l. So that sum vanishes, and we're left with the instances where k is replaced by l. There's still a sum over n. And now, which of these terms actually depend on mu l? The r n l stays out front, outside the bracket. This term does not depend on mu l, so we drop it, and we're left with the gradient of a log Gaussian with respect to the mean. Now, the log Gaussian is just this quadratic form: one half times x n minus mu l, transposed, times sigma l inverse, times x n minus mu l. We take the gradient of that quadratic form, the two comes down and cancels with the one half, and we're left with sigma l inverse times x n minus mu l. This expression is supposed to be zero. Now we rearrange. If we expand the bracket, we get r n l sigma l inverse x n and, with a minus in front, r n l sigma l inverse mu l, in separate sums. Oh, and the sigma l inverse is in both of these, so we can multiply from the left with sigma l, assuming it's actually invertible, which it hopefully is, being the covariance of a Gaussian. So the sigma l falls out completely; we can cancel it from the equation. And we just separate those two sums. There's the sum over r n l x n, that's this thing, and there's the sum over r n l mu l on the other side. Now, the sum of r n l over the n's is not something that sums to one, because it's a sum over this r array along axis zero, not axis one. So we just give a name to this quantity; we call it capital R l. That's just a vector you compute: literally an np.sum over r along axis zero. Divide by it, bring it to the right side, and we have our update for mu. Boom. This is the optimal mu under this model. And notice there's no gradient descent: we solved gradient-equals-zero exactly in this step. Yeah? Ah, so this is where we take the expectation, right? z is the thing you don't know, so we take its expected value, and the expected value of z under this distribution is just the probability, right? The expected value of a binary random variable is the probability that it takes value one. I know this is often confusing; I don't even know a really good way to write it down. It's one of these things that is so easy that it becomes difficult again. The expected value of the random variable z n k taking on value one, if that probability is r n k, is r n k, literally. So we just replace z with r. Okay, so this is an exact update. There's no gradient descent.
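Written out, setting the gradient to zero gives:

```latex
\nabla_{\mu_l} Q
  = \sum_{n} r_{nl}\, \Sigma_l^{-1} (x_n - \mu_l) \overset{!}{=} 0
\quad\Longrightarrow\quad
\mu_l = \frac{1}{R_l} \sum_{n} r_{nl}\, x_n,
\qquad R_l := \sum_{n} r_{nl}
```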
It is, of course, an exact update only if we keep the r_nk fixed. So now we do the same thing for Sigma. For Sigma it's a bit more complicated, so I will skip over it a little and just point out a few things; it's basically an analogous derivation. We take the derivative of this thing with respect to one particular Sigma, say Sigma_L. We again notice that Sigma_L only shows up in one term of the sum, so that sum drops and we can replace k with L; there's still a sum over n. And now we need the derivative of a log Gaussian with respect to the covariance matrix. That's one of these matrix differential calculus things where you can convince yourself of the result. Remember that Sigma shows up at two points in the Gaussian distribution. One is up in the exponent, where it appears as an inverse — and just as for a scalar function the derivative of x^{-1} is minus x^{-2}, there's a corresponding statement for matrix inverses, which is up here, this thing. And then there's a corresponding statement for the determinant under the square root, a chain-rule-style generalization of what you'd get with the absolute value of a scalar. If you believe those two equations up here, you can convince yourself that this is the correct gradient. Sigma_L shows up a lot, but it's the same Sigma_L; there's no sum over L here. So you multiply from the left once by Sigma_L, then one of those drops, and it drops completely here, and we're left with only one Sigma_L. So we can solve this equation for Sigma_L and find that it's given by this expression.

And these equations, by the way, look exactly as you would expect them to. If you knew which cluster each datum belongs to, you could estimate the parameters of that cluster: this is an empirical estimate of the covariance, and this is an empirical estimate of the mean — just, where is the centroid of this distribution?
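In symbols, the two textbook matrix-calculus identities referred to above (stated here, in my restatement, for the unconstrained derivative, ignoring the symmetry of Sigma, which yields the same stationary point) and the resulting update are:

```latex
\frac{\partial \log\lvert\Sigma\rvert}{\partial \Sigma} = \Sigma^{-1},
\qquad
\frac{\partial\, a^\top \Sigma^{-1} a}{\partial \Sigma}
  = -\,\Sigma^{-1} a a^\top \Sigma^{-1},
\qquad\text{giving}\qquad
\Sigma_\ell = \frac{1}{R_\ell} \sum_n r_{n\ell}\,
  (x_n - \mu_\ell)(x_n - \mu_\ell)^\top .
```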
And now, finally, we need to optimize for the cluster weights, the pi's. Whenever we optimize for something that's a probability, there's always the little complication that probabilities need to sum to one. So we need to enforce the constraint that the sum over the pi_k is one, and whenever we need a constraint, Lagrange multipliers show up. Have you done Lagrange multipliers? Everyone's nodding — and in your head you're probably going, I don't remember, though. It's this thing where a lambda shows up and you never really know why, or what you need to take the derivative with respect to. But if you just trust the process, it somehow works. So you write down the thing you actually want to optimize, and then the function that must be zero, the constraint: the sum over the pi_k minus one has to be zero, because the sum over the pi_k has to be one. Then you write this magic lambda in front and trust that at some later point you will find out what lambda is. So let's put that there, and take the derivative of this entire expression with respect to pi.

So the derivative of the first part with respect to pi is actually very nice and easy, because pi only shows up here. If you take the derivative with respect to pi_L, then pi_L only shows up in one term of the sum, the one where k equals L, and we just take the derivative of a logarithm: d/dpi of log pi is one over pi. So we get a one over pi_L with the r_nL in front. All the other stuff vanishes, because it has nothing to do with pi. And back here, we take the derivative of the constraint term with respect to pi_L; again pi_L shows up only once, as a single linear term in the sum, and we're just left with a lambda. Nice.

Okay, so that's the expression we set to zero, and then we somehow magically find out what lambda is. We rearrange; since there's no sum over L here, we can move pi_L outside the sum, put it here, and find that pi_L is given by — ah, this also kind of makes sense: it's counting how many data points, in expectation, are in this cluster, pretty much. That's what's relevant for pi. And then we just need to normalize, right? Summed over L, they must give one. So pi_L is minus one over lambda times R_L from the previous slide, and since the pi_L have to sum to one, now we know what lambda is: it's just minus N. So things cancel out, and we're left with estimates for pi_L where we count up how many data points we assign to cluster L, in expectation, and divide by the total number of data points, so that the sum over L ends up being one. And that's it — that's the algorithm.
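In symbols, the Lagrange computation just described is (again my compact restatement):

```latex
\frac{\partial}{\partial \pi_\ell}
  \Big[ \sum_{n,k} r_{nk} \log \pi_k
        + \lambda \big( \textstyle\sum_k \pi_k - 1 \big) \Big]
  = \frac{R_\ell}{\pi_\ell} + \lambda \overset{!}{=} 0
\;\Longrightarrow\;
\pi_\ell = -\frac{R_\ell}{\lambda}
\;\overset{\sum_\ell \pi_\ell = 1}{\Longrightarrow}\;
\lambda = -N, \qquad \pi_\ell = \frac{R_\ell}{N}.
```

Putting the three updates together, here is a minimal sketch of the M-step and the surrounding loop, under the same assumptions and invented names as the E-step sketch earlier (e_step refers to that sketch; the convergence check on the parameters is of the kind described in a moment):

```python
import numpy as np

def m_step(X, r):
    """Closed-form updates for fixed responsibilities r of shape (N, K)."""
    N, D = X.shape
    R = r.sum(axis=0)              # R_k = sum_n r_nk, i.e. np.sum along axis 0
    pi = R / N                     # cluster weights
    mu = (r.T @ X) / R[:, None]    # responsibility-weighted means, shape (K, D)
    Sigma = np.empty((len(R), D, D))
    for k in range(len(R)):
        Xc = X - mu[k]                                  # data centered on mu_k
        Sigma[k] = (r[:, k, None] * Xc).T @ Xc / R[k]   # weighted covariance
    return pi, mu, Sigma

def em(X, pi, mu, Sigma, tol=1e-8, max_iter=100):
    """Alternate E- and M-steps until the means stop moving."""
    for _ in range(max_iter):
        r = e_step(X, pi, mu, Sigma)                 # E-step (sketch above)
        pi_new, mu_new, Sigma_new = m_step(X, r)     # M-step
        done = np.max(np.abs(mu_new - mu)) < tol     # simple convergence check
        pi, mu, Sigma = pi_new, mu_new, Sigma_new
        if done:
            break
    return pi, mu, Sigma
```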
Now I'll show you the plot, and then we can answer questions. I'm not showing you the code, because you're writing it as your homework — I could show you the Python code, but it's really just four lines. So, we instantiate with an initial guess for the clusters. Here I've decided there are going to be two clusters; I've injected that from outside. Here's the dataset again, and I've created two standard Gaussian distributions — Gaussians with a unit covariance matrix. They only look so elongated because the plot is so elongated, right? It goes from 40 to 100 minutes on one axis and from two to five on the other; that's why they look like these weird extended things. And I've put them at some arbitrary points, like at three and four, and at 50 and 80. What people often do in practice instead is compute the overall dataset mean and covariance — that would be somewhere here — and then draw as many centers as you want from that Gaussian, setting them all to standard covariance.

So now we do the first step, where we compute an expected value. This will say: oh, this cluster has more mass for these data points than that one, because it has some elongation over here. So all of these points get assigned to this cluster, and these points largely to this one. Then we do a single maximization step: everything has moved over to the two fairly obvious clusters in the dataset, and we've already updated the means and covariances a little. That was the first M-step. Now we recompute, and of course the expected values become much better: pretty much all of these points are now almost completely assigned to this cluster, and all of those to this one.

And now we update once more. That was two M-steps; a third M-step you don't see anymore — it's converged. We do the third step and realize that nothing has changed. There's a little piece of code that checks how much things have changed, either by computing Q, the ELBO, and checking whether it is still rising, or by computing some norm of the change in the pi's, mu's and Sigmas and checking whether they've converged. And then we're done.

So yes, it's a two-dimensional dataset that you can look at and understand, but it's also a very efficient algorithm. It's so efficient that you can run it in a Jupyter cell and barely notice whether you've even hit Enter — it just works. And it's super stable, because every step increases the model evidence, so at some point it has to converge. It's actually quite difficult to initialize this in a way where it doesn't converge to a good choice of clusters.

Okay, questions? — Yes, how does this relate to K-means? So, the classic way of teaching all of this is to spend four weeks on the whole story, and maybe I can actually talk about it in the very last lecture: you start out with K-means. K-means is a much older algorithm that doesn't produce these probability distributions; it just produces the dots in the middle. That's usually taken as the starting point. It comes from the mid-20th century — Hugo Steinhaus and so on, and the Scottish Café. There are versions of K-means that you can run on this dataset, and you can show they have these nice convergence properties; there's some beautiful Lyapunov analysis showing it's a stable algorithm. But then you notice there are datasets where K-means doesn't work really well: when the clusters are off-balance, some small, some large, K-means tends to get confused. So you move to what's called soft K-means, which is K-means with an isotropic Gaussian around each center. And people use that to argue: ooh, see, there's a probabilistic interpretation of K-means. And then you realize these isotropic Gaussians don't work when the clusters are really elongated, so we need to learn the covariances of these distributions — and that's where we are now. Then EM comes in: we go from soft K-means to EM. Aha — EM can be used to learn the parameters of probability distributions.

What we're left with is the final step, which I haven't done yet, and which may be one of the questions in your head: where does the two come from? How did I get to pick two clusters? I mean, of course I can see it in this dataset, yeah? But I would like an algorithm that tells me there are two clusters, not five. And for that, we need an algorithm that can be uncertain about the parameters that make up the model. We need to go back to our graph and say: ooh, I would like to be uncertain about this thing; this really should not be a dot, it should really be a variable. And for that, EM doesn't work anymore. Hopefully on Monday I'll have time to do that part of the story for you.

— Because they cancel out, right? So minus one half times one is... one half, minus one half, minus — no, plus — maybe there's one too many.
Yeah, maybe — okay, let's think about it. If it were just a square root: one over the square root of x is x to the minus one half, and its derivative with respect to x is minus one half times x to the minus three halves, right? The exponent comes down and gives the one half. So that's minus one half times one over x times the square root of x. I need to check — this is one of those instances where, standing too close to the blackboard, I probably won't catch it. It might be that I copied one sigma too many in here. I'm pretty sure this line is correct; maybe this one is not. I checked it yesterday and it works out, and it is also the correct update — I'm very sure this is the right thing to do, if only by argument from authority, because this Gaussian mixture EM is in literally every single textbook; it's just the thing that everyone does.

Okay, I'll end it here. Summary: what is this algorithm we've come across? It's a way of fitting the parameters of models that include uncertainty — a way of dealing with the parts of the model that you don't put a probability distribution over, to compute a maximum likelihood or maximum a posteriori estimate under marginalization over the unknown variables. It consists of a two-step procedure: one in which we compute an expectation, and one in which we maximize that expectation. And we realized we can think about this process as computing an — oh, there it's wrong again; I always make this mistake sooner or later in the lecture — an evidence lower bound. It's an expectation that is a lower bound, but it's called the evidence lower bound; the name literally means evidence lower bound. And it looks like this. It's annoying math to look at sometimes. But the kinds of algorithms it turns into, as you've just seen for this mixture model example, tend to be ones where you have to do some careful modeling beforehand: staring at some graphical model or some piece of math, realizing that there is some variable such that, if only you knew it, everything else would be easier; introducing that variable into your model; computing posteriors over it; and then doing a bit of algebra on a piece of paper — as you actually do this week for the Kalman filter and its EM update. It's annoying because you have to do it on paper, and to this day there are many PhD students out there who have to do this. But what usually happens is that the derivations — when you do them right; of course the first time you do them wrong — give you updates that make total sense. You saw one example for the mixture model: of course the estimated mean is the responsibility-weighted empirical mean; of course the estimated covariance is the responsibility-weighted empirical covariance. And for the Kalman filter you'll get similar updates, so you can do your own type checking, or sanity checking: if the updates you get for the Kalman filter variables don't look the way they should, you're probably wrong. And people quite like doing these derivations; they're a bit like Zen meditation. It's not like watching a gradient descent fail to converge.
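To collect the whole algorithm in one place, here are the two steps with the update equations we derived, in the notation used above:

```latex
\text{E-step:}\quad
r_{nk} = \frac{\pi_k\, \mathcal{N}(x_n;\, \mu_k, \Sigma_k)}
              {\sum_{j} \pi_j\, \mathcal{N}(x_n;\, \mu_j, \Sigma_j)};
\qquad
\text{M-step:}\quad
R_k = \sum_n r_{nk}, \quad
\pi_k = \frac{R_k}{N}, \quad
\mu_k = \frac{1}{R_k} \sum_n r_{nk}\, x_n, \quad
\Sigma_k = \frac{1}{R_k} \sum_n r_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top .
```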
On Monday, we'll do the final bit: what do I do if I would like the parameters to be variables as well? That's going to be called variational inference, and we'll only touch on it briefly. And then on Thursday, I'm hoping to tie this storyline up with something that's not so much math as mostly history, and some stories fitting together. Maybe I'll talk about K-means and that part of the story. Thank you very much.