Welcome to probabilistic machine learning, lecture 22. We have momentarily departed from our ongoing extended modeling example to look at a slightly different type of model, which will lead us, as a detour, to a new tool that we can then use again in our topic-model example. That model class is one for clustering, and it will lead us today to the concept of mixture models. The algorithm we began with is a clustering algorithm called K-means, which was arguably invented by Hugo Steinhaus in the mid-20th century. It is an iterative procedure that creates a clustering of a data set of samples x_i by alternating between two algorithmic steps: first, for a set of K means, compute the distance to every single data point and assign each mean responsibility for the data points for which it is the closest mean; second, update the location of these means by computing the mean of the data points assigned to them. We saw in the last lecture that, while this algorithm works relatively well on simple data sets like this one, it has a few pathologies. In particular, it has no way, at least out of the box, of setting the number of clusters K, and it does not work particularly well when the clusters, although evidently present, have an inconvenient shape: if the clusters are imbalanced, so that one is very broad and wide and the other narrow and small, or if the clusters are spatially elongated, as in this example over here. We used this as motivation to think about how to improve the situation. Along the way we looked at the classic argument for why this algorithm converges, which is the existence of a Lyapunov function; it shows that K-means, even though not usually phrased this way, can be thought of as minimizing an empirical risk. At the end of the last lecture we saw that this risk function can be written as a sum, across clusters and data points, of squared distances that are not weighted but selected: each point contributes its squared distance to the cluster mean it is assigned to. We ended the last lecture by using this observation as a starting point to ask what a probabilistic interpretation of this algorithm would be. We do this, of course, because we are in a class on probabilistic machine learning, so it is natural to ask for the probabilistic interpretation of such a classic algorithm, but also because probabilistic modeling can help us understand and fix algorithms. This is a broader insight that does not just apply to K-means: it is actually a very frequent occurrence that if an algorithm has pathologies and you would like to fix it, at least in machine learning, it can be useful not just to do an asymptotic analysis but also to try to understand what its implicit assumptions would be if we interpreted it as a probabilistic algorithm. That is what we are going to do now. So how do we construct a probabilistic interpretation of a machine learning algorithm that is currently available to us in the form of an empirical risk minimization problem? This is actually a generic kind of recipe, and if you followed the analogous arguments earlier in the lecture course you may already guess how it works. We have been given a risk function that the algorithm is minimizing.
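(For reference, here is a minimal NumPy sketch of the two alternating K-means steps described at the start of this section; the data matrix X, the choice of K, and the random initialization are illustration assumptions, not part of the lecture slides.)

    import numpy as np

    def k_means(X, K, n_iter=50, seed=0):
        # X: (n, d) data matrix; K: number of clusters, which has to be chosen by hand.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=K, replace=False)]   # arbitrary initialization
        for _ in range(n_iter):
            # Assignment step: each datum becomes the responsibility of its closest mean.
            sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (n, K)
            assign = sq_dists.argmin(axis=1)
            # Update step: each mean moves to the average of the points assigned to it.
            for k in range(K):
                if np.any(assign == k):
                    means[k] = X[assign == k].mean(axis=0)
        return means, assign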
Let me write that down somewhere: the algorithm we are looking at finds a choice of vectors r and m, or actually a matrix r if you like, and m, that minimizes this loss function. What we can do now is actually always the same trick: minimizing an empirical risk often, not always but often, corresponds to maximizing a likelihood, for the following reason. If you are minimizing this function, then that is the same as maximizing minus this function, and we can take the minus inside the sum. In fact it is also the same as maximizing minus this function times a constant, plus a constant, because that does not change the location of the optimum: such transformations are monotonic, and monotonic transformations do not move the location of an optimum. So we can multiply the whole thing by a conveniently chosen set of numbers and parameters: we take it and multiply by one half times a parameter that we might call sigma to the minus two, and you can maybe guess why, plus a constant, because shifting a function by a constant obviously does not change the location of the optimum either. More generally, we can take any monotonic transformation of this function, and that is also not going to change the location of the maximum; in particular we can take the exponential of this expression. So this is the same, and here we have to be a little careful with our notation, as maximizing the following expression. I have taken the exponential, so the sum over the individual data points becomes a product, because the exponential of a sum is a product of exponentials. The inner sum, over the clusters, I have actually retained. Why am I allowed to do this? Because these r_ik, the responsibilities, are binary variables, so in this inner sum there is only one term that is actually non-zero; when we take the exponential, only one factor in the product differs from one, so we might as well keep the sum, it has the same effect of keeping the one term that remains. The exponential of the constant turns into multiplication with a constant, which I have called Z inverse. And now you can maybe see that the expression we have here is of a very common form that we immediately recognize, maybe you already recognized it when I wrote it down up here: the likelihood we are maximizing is a product, so that is an i.i.d. model for every single datum, over a sum across the responsibilities, of which only one term is non-zero, of a Gaussian likelihood for the individual observation x_i around the corresponding mean, with the scale parameter sigma that I introduced to make it an explicit Gaussian distribution; the interpretation of the means is retained.
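Written out, this chain of monotone transformations reads (in the notation of the lecture, with Z collecting the constants):

    \arg\min_{r,m} \sum_{i} \sum_{k} r_{ik}\, \lVert x_i - m_k \rVert^2
    \;=\; \arg\max_{r,m} \frac{1}{Z} \prod_{i} \exp\!\Big(-\frac{1}{2\sigma^2} \sum_{k} r_{ik}\, \lVert x_i - m_k \rVert^2\Big)
    \;=\; \arg\max_{r,m} \prod_{i} \sum_{k} r_{ik}\, \mathcal{N}\!\big(x_i;\, m_k,\, \sigma^2 I\big),

where the last equality uses that, for each i, exactly one of the binary r_{ik} equals one and all others are zero.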
Now we can immediately see the value of this kind of probabilistic interpretation, why it is useful to think about the algorithm in this way in addition to the minimization perspective: it tells us how some of the pathologies we previously identified arise. If we are maximizing this Gaussian likelihood, what K-means has essentially done is decide to maximize it only in terms of the means, ignoring the second parameter, the scale and perhaps even the shape, of this Gaussian distribution. Interestingly, we can get away with that because the maximum-likelihood estimate of the mean of a set of assumed Gaussian random variables is simply a weighted sum, at least for isotropic Gaussians: if there is just a scalar matrix here, this value sigma squared, it is not necessary to know it if all we care about is the means. That may seem convenient, but we already saw that we are paying a price for it: we end up with an algorithm that cannot adapt to the shape of the distribution, because we have no access to a parameter, which would have to go somewhere in here, that tells us what that shape is. Perhaps more importantly, we are also paying a price for just maximizing over the means while essentially maximizing this Gaussian representation of the clusters: we cannot draw from the representation, and we cannot talk about what the clusters actually mean. If we think about K-means in this way, we realize that we are building a model that can be thought of as implicitly assuming that the clusters have Gaussian shape. It is actually possible to motivate K-means in a slightly more general form without an explicit Gaussian assumption; it is often the case that probabilistic descriptions of algorithms introduce new parameters that amount to seemingly additional assumptions. But maybe it is good to think of the algorithm in this explicit way, because it provides an intuition for how the algorithm is going to behave: it would behave the same if we made the Gaussian assumption. So what would be a better way to phrase this clustering algorithm? Now that we know what the probabilistic model is, now that we see the algorithm is maximizing this likelihood, maybe we can think about what we would like to change to address the pathologies we identified before. We can start with perhaps the simplest issue, the shape: what can we do to allow this algorithm to take into account that each cluster could have a slightly different shape? A very general, almost universal, thing to do would be to change what is called the base measure, this Gaussian assumption: we could choose another parameterization and say that each cluster is a probability distribution that could have a more general form, not necessarily Gaussian. But maybe that is too big a step at this point in our analysis. A simpler thing we can do is allow the clusters to stay Gaussian but have a more general covariance matrix, a symmetric positive definite matrix called Sigma. The other thing that is maybe a bit displeasing about this model is the assumption that every single data point is assigned, in a hard fashion, to exactly one cluster. As a generative model, that would amount to some external force telling us: you are now going to draw this new datum x_i, and I will tell you exactly which cluster it belongs to. If we wanted to make this a bit more general, we could instead say there is a discrete probability distribution that randomly decides which cluster to draw from, and the responsibility, this binary variable, is drawn from that distribution.
These two changes, drawing the responsibilities from a probability distribution and giving the clusters a general Gaussian form with a general covariance matrix, define a class of models called Gaussian mixture models. They are models designed for data sets that look like this; this is a reproduction of an image from Chris Bishop's book. The idea is that if someone gives you a data set like this, then maybe visually you can immediately see that you could think of it as having been generated by drawing from Gaussian distributions, in this case three different Gaussian distributions in three different colors, where each datum was drawn from exactly one cluster, but the decision to draw from that particular cluster is itself made by drawing from a discrete distribution. So there is an underlying probability density under this model that is given by these three Gaussian distributions, called the base measures, and we draw from each of these clusters with certain probabilities that sum to one: in this case probability 20% for this cluster, 30% for this one, and 50% for that one. We can write such a probability distribution in math as this conditional distribution; it is a generative model for the data that says every single datum is drawn, there should be an i here, let me just fix that, now it is correct, i.i.d. from this distribution. The distribution says that we choose a cluster by first drawing from a discrete distribution represented by a vector pi, with entries that lie between zero and one and sum to one, and then, once we have the cluster identity, we draw the actual datum from a Gaussian distribution with two parameters, a mean and a covariance matrix. So this is our idea for how to improve K-means: explicitly represent what the clusters actually look like, namely as Gaussians; explicitly represent the process of assigning clusters with a discrete distribution; and improve the representation of the clusters by allowing a general shape variable.
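To make the generative story concrete, here is a minimal sketch of sampling from such a mixture; the specific weights, means and covariances below are made-up illustration values, not the ones from the figure.

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.2, 0.3, 0.5])                          # cluster probabilities, sum to one
    mus = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])    # one mean per cluster
    Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

    def sample_gmm(n):
        # First draw each datum's cluster identity from the discrete distribution pi,
        # then draw the datum itself from the Gaussian of that cluster.
        z = rng.choice(len(pi), size=n, p=pi)
        X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
        return X, z

    X, z = sample_gmm(500)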
So why isn't this an algorithm already? Why don't we just do Bayesian inference on a model like this? Let's see whether we can come up with a proper, fully Bayesian algorithm to replace K-means directly: Bayesian inference on what is called a Gaussian mixture model. To understand whether that is possible, we can write down the graphical model that corresponds to this representation: a variable node for every parameter of this model, all the means, all the covariances, and the vector of probabilities for the clusters. You could also put these into the plate with K copies, but then we would have to add the constraint that the weights sum to one, so I will just write it like this. Even from this simple graph we can already read off, without doing more advanced math, why it is hard to do Bayesian inference in a model like this. Why? Put quite simply, you can clearly see that this graph contains a collider: these variables all have arrows pointing inwards, so when we condition on the data that we get to see, the parents of these variables become interdependent. This is explaining away all over again. Another way of saying this is that even if we had a convenient prior on the mixture components, the parameters mu and Sigma, and on the mixture weights pi, then after multiplying with this likelihood we would get a complicated, fully interdependent joint probability distribution that has no obvious conditional independence structure. In our previous work with the topic model we saw that sometimes you can still get conditional independence in such situations if you are lucky, because the likelihood might turn out to be an exponential family for which you can find a conjugate prior. But in this case it is clear that we cannot find a conjugate prior, because this is not an exponential family: it is a product over a sum of exponential terms, and we cannot hope to get rid of this sum by somehow pulling it into an exponential; it is just not going to work because of this sum structure. So in general it will be very hard to find a closed-form solution, a family of priors such that when we multiply with this likelihood we get a tractable posterior. That maybe explains why we have to deal with approximate inference algorithms that are going to be of a similar nature to K-means. So, in search of a probabilistic replacement for K-means, we might do the same thing K-means does and compute a maximum likelihood estimate for those parameters pi, mu and Sigma. We have already realized that if we had hard assignments, if we just put binary values here in place of pi, ignored Sigma or assumed it to be a scalar matrix, and optimized only for mu, then we would get back K-means. We introduced these probabilities pi and the full covariance matrix Sigma in an attempt to heal some problems of K-means, to make it more capable. Now let's see whether that has actually made our job harder, whether we can still do maximum likelihood, or whether we have created a problem for ourselves. Remember how we got here: we took the loss function of K-means, applied an affine transformation and took its exponential, and noticed that it has this sum-of-Gaussians shape. Now let's go back, take the logarithm again, and try maximum likelihood once more. The key thing that has changed, apart from the Sigma we introduced, is that the binary responsibilities have been replaced by pi. The responsibilities were binary, all zero except for one entry equal to one, and I used that as a little dirty trick to drag the sum outside of the logarithm. If we now go back and try the same thing again, taking the logarithm of this expression, that trick no longer works, because we have introduced this vector pi of non-zero probabilities for every single cluster. So if we take the logarithm of our likelihood, the log likelihood will still be a sum over the individual data, but the other sum, over the entries of pi, now has to stay inside the logarithm, because every term in that sum is in general non-zero. If we now want to find the maximum likelihood solution, we have to be a little more careful and again treat this as an optimization problem: compute gradients and see whether we can set them to zero. We will notice that, because of this sum, things get a little harder, and we end up with an algorithm that takes a bit more work but in the end is still very analogous to K-means.
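In formulas, the log likelihood we now want to maximize is

    \log p(X \mid \pi, \mu, \Sigma) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(x_i;\, \mu_k, \Sigma_k\big),

and the inner sum has to stay inside the logarithm, which is exactly what makes the gradient computations below more work.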
So let's take the gradient of this expression with respect to mu. We can first remind ourselves of the form of a Gaussian PDF; I have written it down here just for convenience, so that we can stare at it while we do the derivations. Take the derivative of this expression with respect to a particular mu_j. First of all, the sum over the data remains, because every term in it contains mu_j. We get the derivative of the logarithm, which gives us one over the expression inside the logarithm, and then, by the chain rule, the inner derivative. There is only one term in the inner sum in which mu_j appears, this term here, and when we take its derivative with respect to mu_j we see that mu_j shows up inside an exponential function: the derivative of the exponential of something is that exponential times the inner derivative of its argument, so the exponential term stays around, and the inner derivative of the term in the exponent is the derivative of a quadratic form. Some of you may know this; otherwise you can convince yourself by writing out the entries of this double sum element-wise: just as the derivative of x squared is two times x, here the two comes down, cancels with the one half in front, and we are left with minus the inverse covariance matrix, the precision matrix. There is an inverse missing here, let me fix that; actually it does not matter much, because we will get rid of this Sigma inverse in a moment. What we are looking for is the root of this gradient, the point where it is zero, and we can multiply from the left with Sigma_j to get rid of that factor. But that is not the real challenge here: if we want to find this root, the problem is that mu_j shows up both here and, several times, inside the exponentials in this expression, so solving for zero directly would be very tough. But we can do a little trick. We suspend disbelief for a moment and introduce an auxiliary variable, suggestively called r_ji; you can guess that these are eventually going to become responsibilities. We just pretend for the moment that we know these numbers r_ji, given by this expression, and then we can actually solve for mu: if we multiply with Sigma from the left, we have an expression given by the sum over all i of r_ji times x_i, minus, moving it to the left-hand side, the sum over all i of r_ji times mu_j. The mu_j can be pulled out of that sum because it is the same in every term, and we can give a name to the sum of the r_ji over i, call it capital R_j, since it is the row-wise sum over this matrix r. Then we can write down an explicit form for mu_j: mu_j is one over capital R_j times the sum over i of r_ji times x_i. This is a weighted mean of the x_i, where each x_i gets weighted by these responsibilities. We can already guess what is going to happen: we will somehow find a way of setting those responsibilities r_ji, then set the means to these weighted means, and this will clearly be a variant of K-means.
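Collecting the result of this derivation (the explicit form of r_ji is the one identified later in the lecture as the responsibility):

    r_{ji} \;=\; \frac{\pi_j\, \mathcal{N}(x_i;\, \mu_j, \Sigma_j)}{\sum_{k} \pi_k\, \mathcal{N}(x_i;\, \mu_k, \Sigma_k)}, \qquad
    R_j \;=\; \sum_{i} r_{ji}, \qquad
    \mu_j \;=\; \frac{1}{R_j} \sum_{i} r_{ji}\, x_i .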
Instead of summing over hard-assigned cluster members x_i, every datum will now, in a soft fashion, be part of every cluster, and we simply sum over the weighted cluster memberships to get our weighted estimate of the mean. Can we do the same thing for Sigma? Actually we can, but in the interest of time I will just throw a few matrix differential calculus identities at you. We can compute the derivative with respect to Sigma in the corresponding way; the expressions are a little more complicated, because Sigma shows up in two different places in the Gaussian, and both of them are a bit awkward: once inside a determinant, of which we even take a square root, and once as an inverse inside the argument of the exponential. You can look up in books, or maybe you learned in undergraduate calculus classes, that those functions of matrices also have derivatives you can compute. In particular, the derivative of the determinant of Sigma with respect to Sigma is the determinant of Sigma times Sigma inverse. So if we take the derivative of the determinant to the power minus one half with respect to Sigma, the minus one half comes down, we get the determinant to the power minus three halves times the inner derivative, which is the determinant times Sigma inverse; one power of the determinant cancels and we are left with minus one half times the determinant to the minus one half times Sigma inverse. You can also look up that the derivative of an inner form, v transpose times Sigma inverse times v for any vector v, with respect to Sigma is given by minus Sigma inverse times v v transpose times Sigma inverse. We can use these identities to do the corresponding derivative. So we take the derivative of this with respect to Sigma_j; as before, we get a sum over the data of the derivative of the log, which is the same expression as before, times the inner derivative of each term, and there is only one term in the inner sum where our Sigma_j shows up. Now, this expression contains Sigma in two different ways, so we have to use the product rule. For the first term we take the derivative of the exponential, which, as before for the derivative with respect to mu, is the exponential itself times the inner derivative, and the inner derivative is the derivative of this quadratic form with respect to Sigma, which by the identity above involves Sigma inverse times v v transpose times Sigma inverse, with the one half in front. Then there is the derivative of the determinant to the minus one half with respect to Sigma, which we just computed: the minus one half goes in front, times the determinant of Sigma to the minus one half, which conveniently we already have in our expression, times Sigma inverse. So this determinant factor can be pulled outside of the brackets, and inside the brackets we are left with a Sigma_j inverse. Now we have a similar situation to our computation for mu: to find the root of this expression, we again pretend that we have these auxiliary variables r_ji, the responsibilities, and we multiply the inner expression from the left and the right with Sigma_j, whose inverse we have assumed to exist. Then those Sigma inverses cancel and we are left with a Sigma_j here.
We can then solve and find that Sigma_j should be set, analogously to the result for the mean, to one over capital R_j, the sum of the r_ji over i, times a similar expression as before, except that here we do not have a weighted sum over the x_i but instead a weighted sum over the outer products of their deviations from the mean; that is the typical form of a maximum likelihood estimate for a covariance. So if we had these r_ji, we could estimate both mu and Sigma. What about pi? Let's see whether we can estimate pi as well. Here the situation is actually quite similar, but there is an extra caveat: the weights pi_j are supposed to be, let me just go one slide forward, the elements of a probability distribution, so they should lie on the simplex; they should be non-negative values whose sum is one. To put in that constraint we introduce a Lagrange multiplier. In case you do not remember how constrained optimization with Lagrange multipliers works, you may want to check your notes from the Mathematics for Machine Learning lecture if you have taken it, but let me give you the three-minute, hand-waving explanation. We are trying to optimize some function, in our case the log likelihood, and we also want the solution to satisfy a constraint, which we represent as another function such that the constraint is satisfied whenever that function takes a particular value, say zero. In our case that is the sum over the pi_j minus one: if that is zero, then the sum of the pi_j is one. The whole idea is that we want to find a point where the gradient of our objective function, the log likelihood, points in the same direction as the gradient of the constraint function. If that were not the case, if the gradients pointed in slightly different directions, then we could move along the equipotential lines of the constraint function, so without changing the constraint, and simultaneously improve the objective; of course we would do that, so we could not have been at an optimum. The only point where the optimizer should stop is one where the two gradients are parallel, up to a sign, so they differ only by a scalar, and that is exactly what we write down: we want the gradient of our objective function to be a multiple of the gradient of the constraint function. Taking the gradient of this whole expression, it should be zero, and the gradient applies to both of these terms. Okay, so let's take those gradients. If we take the gradient of our log likelihood with respect to pi_j, then again the sum over the data stays, we get one over this expression as before, times the inner derivative, and the inner derivative is really simple: it is just the Gaussian density of the particular cluster j we are optimizing for. For the second part, the constraint function, we just get lambda times one at the entry j. We want this to be zero, because we want those two gradients to be parallel. A simple thing to do to get an expression we can work with is to multiply from the left with pi_j, because then we see that we get back the expression we have come to know and love, the sum over i of the responsibilities r_ji, plus lambda times pi_j.
Now comes the part where we have to figure out what lambda is. That is an extra piece of information, and we usually get it from the fact that the constraint is supposed to be satisfied. Think about what happens to this expression, which is zero, if we sum over j. Inside each term for i, summing over j gives a sum over the r_ji over j, and if we go back one slide to the definition of r_ji, we see that summing over j gives the same sum in the numerator and the denominator, so we just get one. So if we sum over j, every term in the outer sum contributes a one, and the whole thing is just n, the total number of data points. And here the sum over j is exactly what the constraint is about: it gives one, because the sum of the pi_j is supposed to be one. So zero is supposed to equal n plus lambda, and therefore lambda has to be minus n; sorry, that should be a little n. That means we can plug in our value for lambda, minus n, and solve for pi_j: pi_j is given by the sum of the r_ji over i, that is, capital R_j, divided by n. That was a lot of arithmetic, so let's not lose sight of the overall picture. We have actually found a way of optimizing the parameters pi, mu and Sigma of this Gaussian mixture model, not in closed form but in an iterative process that will obviously remind us of K-means. Why is that? Let me close the loop, because I have not actually told you yet what we do about the r_ji. In all three update steps we just derived, for mu, Sigma and pi, we had to pretend that we know what the variables r_ji are; if we knew those, we could set mu, Sigma and pi. But notice that if we knew what mu, Sigma and pi were, then we would also know the r_ji, because they are completely determined by these expressions. So what we can do is iterate between the two, and that is essentially what K-means does. We can thus write down an algorithm that performs an iterative maximization of the log likelihood for this Gaussian mixture model. This algorithm, for reasons that will become clear in a moment, is called the EM algorithm, and it is maybe good to mention that name already now because we will get to it shortly. We initialize the algorithm by setting the variables mu, and maybe Sigma and pi, in an arbitrary fashion, similar to how we did this for K-means, and then use those values to compute the responsibilities r_ji according to their definition. Having set those responsibilities, we can then compute our estimates for mu, Sigma and pi: we first compute the helper variables capital R_j and then set mu, Sigma and pi according to the rules we just derived, the roots of the gradient. At that point, the fact that we keep an explicit pi is actually a little superfluous, because pi is essentially just a representation of the values of the r_ji (I am obviously very bad at keeping the order of the indices in r_ij and r_ji straight), so we basically do not have to keep track of an explicit representation for pi; we can just write it in terms of the values of the r_ji.
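Putting the responsibility computation and the three update rules together, a minimal NumPy sketch of this iteration might look as follows; the initialization strategy, the small jitter added to the covariances, and the use of scipy.stats for the Gaussian density are assumptions of this illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        n, d = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(n, size=K, replace=False)]              # arbitrary initialization
        Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: responsibilities r[i, j] proportional to pi_j N(x_i; mu_j, Sigma_j).
            dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                             for j in range(K)], axis=1)          # (n, K)
            r = dens / dens.sum(axis=1, keepdims=True)
            # M-step: the weighted means, covariances and mixture weights derived above.
            R = r.sum(axis=0)                                     # (K,)
            mu = (r.T @ X) / R[:, None]
            for j in range(K):
                diff = X - mu[j]
                Sigma[j] = (r[:, j, None] * diff).T @ diff / R[j] + 1e-6 * np.eye(d)
            pi = R / n
        return pi, mu, Sigma, r

In practice one would evaluate these densities in log space for numerical stability, but the structure is exactly the alternation derived above.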
Now, what is the connection of this algorithm to K-means? Maybe you can already see it: we are iterating between setting the responsibilities and then setting the estimates. But let's make the connection a bit more precise. If we go back to the simplified assumption that the Gaussians we are trying to fit do not have a full covariance matrix Sigma but instead an isotropic, scalar covariance, then the connection to K-means, or rather to a soft version of K-means, sometimes called soft K-means, a construct from David MacKay's book, becomes a bit clearer. Then the responsibilities r_ij are given by those ratios, which simplify because all the Gaussians have the same normalization constant, so those cancel out and we are left with something like a softmax expression for the responsibilities. We can make that softmax expression hard by sending beta towards infinity; beta shows up here as an inverse variance, so sending beta to infinity means setting the variance of these Gaussians to zero, and then we get back the exact K-means algorithm, because the assignment is then literally: set the r_ij so that all values along j are zero except for the one that maximizes the expression, which becomes exactly one. So this is a first interesting takeaway: we can identify K-means very specifically as a maximum likelihood estimate of a Gaussian mixture model under the specific assumption that the Gaussian components do not just have an isotropic covariance, they are not just scalar, no, they are actually point masses, concentrated at exactly one point with infinite precision. That maybe also explains where the pathologies of K-means come from. It is not just that it makes a Gaussian assumption, and an isotropic Gaussian assumption at that, which would explain why it does not work well with elongated clusters; by assigning hard responsibilities, binary variables, this algorithm actually assumes that the clusters are point masses, and that is maybe why it also cannot deal with clusters of varying size, because there is no way of accounting for them. And it explains why there is no straightforward way to take the original K-means and use it as a generative model to sample additional data points: the underlying generative model is a mixture of Gaussian point masses.
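Written out for equal, isotropic covariances, with beta standing for the inverse variance (the exact scaling of beta is my shorthand; the lecture only needs that beta going to infinity corresponds to vanishing variance):

    r_{ij} \;=\; \frac{\pi_j \exp\!\big(-\tfrac{\beta}{2}\, \lVert x_i - \mu_j \rVert^2\big)}{\sum_{k} \pi_k \exp\!\big(-\tfrac{\beta}{2}\, \lVert x_i - \mu_k \rVert^2\big)}
    \;\xrightarrow{\;\beta \to \infty\;}\;
    r_{ij} \;=\; \begin{cases} 1 & \text{if } j = \arg\min_k \lVert x_i - \mu_k \rVert^2, \\ 0 & \text{otherwise,} \end{cases}

which is exactly the hard K-means assignment.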
Okay, fine, so now maybe we have found a probabilistic interpretation of the K-means algorithm. Was that really worth all this time, an extra lecture on it? Of course, our goal is not just to talk about K-means; we wanted to find a particular structure that we can use elsewhere as well, an algorithmic trick for probabilistic generative models in which it is not straightforward to find a maximum likelihood estimate in closed form. This is the algorithm I have already mentioned, called the EM algorithm. What we are going to do next is figure out where this name EM comes from, and we will arrive at that by asking a preliminary question: what actually is the nature of these variables, the responsibilities of individual clusters for individual data? How can we think about this responsibility; what kind of variable is it? It turns out that a very interesting interpretation arises if we change our notation a little bit. Go back to our Gaussian mixture model, which so far has the parameters mu, Sigma and pi, and introduce a new auxiliary helper variable; let's call it z and interpret it as the hard assignment of an individual datum to a particular cluster. This is an intermediate variable that we can use to generate the data set: we first draw cluster assignments for each of the data, and then for each datum draw the instance of the datum as a Gaussian random variable from that particular cluster. You will notice that this gets us closer to the idea of K-means, because K-means directly assigns a value to these z variables in a hard fashion, ignoring pi; that is another way of thinking about what K-means does, as a kind of maximum likelihood assignment to the z_i. It also, of course, reminds us immediately of our topic model, where we had a similar kind of structure: we introduced the word-topic assignments c_di, which, just like this z here, are the variables that, if only we had them, would allow us to do efficient inference on all the parameters of the model. So let's take a step back and do this a little more precisely. We will consider binary variables z_ij that provide a one-hot encoding: they are all zero except for one of the entries along the j direction, which identifies the cluster the datum belongs to. Once we have introduced this variable, our notation can conveniently be updated. We first notice that the marginal, or maybe prior, probability for the j-th entry of data point i to be the one that is switched on is exactly given by pi_j, so this prior distribution has a very simple form. We can also use this notation to write the joint over all of the z very efficiently, as a double product: independent over all the individual data points, and as a product over the individual entries, pi_j raised to the power of z_ij, which is convenient because the z_ij are binary, so all but one of the exponents are zero. In other words, it is the product, over all the individual data points, of the probability for the one cluster that each datum belongs to. The generative probability for the instances x_i, the individual data points, given the entire set of cluster identities z, can be written just as efficiently: as a product, over all the data points, of the individual cluster probabilities, the generative Gaussian distributions for each cluster, raised to the power of these z_ij. So we can write the conditional probability for all the x_i as a product over all the individual data times a product over all the cluster probabilities raised to this power. In particular, this allows us to write the marginal probability for the x_i themselves, the Gaussian mixture probability, using the structure of the graphical model, as a marginal probability for x, obtained by taking the joint over x and z and summing out z. The joint over x and z can be written, by the product rule, as the probability for x given z times the probability for z, and those two terms are exactly the pieces we have seen in our Gaussian mixture model so far: the probability for z is just pi, and the probability for x given z is this Gaussian probability evaluated for the particular value of z.
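In formulas, with the one-hot variables z_ij just introduced:

    p(z_{ij} = 1 \mid \pi) = \pi_j, \qquad
    p(z \mid \pi) = \prod_{i}\prod_{j} \pi_j^{\,z_{ij}}, \qquad
    p(x \mid z, \mu, \Sigma) = \prod_{i}\prod_{j} \mathcal{N}\big(x_i;\, \mu_j, \Sigma_j\big)^{\,z_{ij}},

and summing the joint p(x | z, mu, Sigma) p(z | pi) over z recovers the mixture density

    p(x \mid \pi, \mu, \Sigma) \;=\; \prod_{i} \sum_{j} \pi_j\, \mathcal{N}\big(x_i;\, \mu_j, \Sigma_j\big).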
Where the name EM algorithm comes from begins to become clearer when we now look at the joint probability distribution for x, the data we see, and the latent unknown variable z. Notice that in this graphical model I have already made this a little clearer: we are going to do maximum likelihood inference, so I have drawn small black dots for the parameters mu, pi and Sigma; these are the variables over which we are not going to place a probability distribution. But let's assume we would like to do inference on z, the unknown variable that is really just a helper variable. Then you could consider doing Bayesian inference: the joint distribution for x and z gives us the numerator of Bayes' theorem. If we write that down, we can use the notation I just introduced on the previous slide to write the joint probability for x and z given pi, mu and Sigma as the probability for z given pi, mu and Sigma, which by this graph is just the probability for z given pi, times the probability for x given z, pi, mu and Sigma. We can use this joint to compute the conditional distribution for the z's, the posterior, if you like, for z given the observed data: that is the prior times the likelihood, divided by the normalization constant, the evidence, which is the joint marginalized over z, and we have just computed that on the previous slide. But notice that the term in the numerator is given by the probability for z, which is pi, times the probability for x given z, which is the probability of x under this particular cluster, divided by the sum of all of these terms. Well, that is exactly our responsibility, exactly the r_ij variable we encountered in the previous derivations. So what is that variable? It is the discrete probability of choosing this particular instance of z. Another way of thinking about it is that, because this is a discrete distribution over binary variables, those values r_ij are actually the expected values of z, because the expected value of a binary variable under a discrete distribution is just the probability of it being one. So one way of thinking very specifically about what our version of maximum likelihood K-means, or soft K-means, does is that it iterates back and forth between computing the expected value of this latent quantity z, plugging that value in, and using it to maximize the likelihood for mu, pi and Sigma. Here it is on this slide again; let me say it once more. Our pseudo-code version of the maximum likelihood estimate for a Gaussian mixture model is: initialize the parameters of the mixture model randomly, then iterate between computing the expected value of the latent, unknown label-assignment variable z, which amounts to computing the responsibilities r_ij, and maximizing the likelihood with respect to the parameters of the model, mu, Sigma and pi. That is why this algorithm is called the expectation-maximization algorithm: it iterates between computing an expectation and maximizing a likelihood. Another way of thinking about this algorithm, and I never quite know whether this is really helpful or not, so see whether it helps you, is that the algorithm tries to invent, as well as possible, a value for this latent quantity z. One way to describe what is happening is that we have this directed graphical model, this generative model, which has the latent variable z; we do not know what z is, but if we knew it, if we could color it in black, then everything would be easy, because the whole graph gets neatly separated: we know how to infer pi if we know z, and we know how to set mu and Sigma if we jointly know x and z.
But we do not know what z is. So what EM does is iterate between, not actually setting z to a particular value in a hard way, but computing the expected value of z, and then using this expected value of z together with the structure of this graph to estimate mu, pi and Sigma in a maximum likelihood fashion; and then the other way round, keeping those values of mu, pi and Sigma fixed to compute the expected value of z. You will notice that this graph has a similar structure to the slightly more complicated one we encountered in our topic model, and I have already mentioned several times that we are going to add this algorithm, the EM algorithm, to our toolbox. It will become the penultimate tool in our toolbox; the ultimate one will be a generalization of the EM algorithm called variational approximations, but it is useful to think of both of them as aspects of a very similar, related idea. So, having derived EM in a somewhat pedestrian fashion, in a slow progression for one concrete model, the Gaussian mixture model, what I would like to conclude this lecture with is to take this specific algorithm and tell its story again at a more abstract level, one level up in the hierarchy of concepts, away from Gaussian mixture models, so that we can identify the structure we are really leveraging when constructing this algorithm. That will help us understand how to apply this algorithm as a tool in more general settings, for example our topic model. As a general tool, EM is an algorithmic idea that you might want to try in situations where you want to compute a maximum likelihood estimate, or actually also a maximum a posteriori estimate, we will see that that is not particularly more challenging, in models where you cannot directly compute maximum likelihood estimates. So there is a p of x given theta, a maximum likelihood type problem: we want to find the value of theta that maximizes this expression, or equivalently its logarithm, but we cannot do this in closed form; if we could, we would just do that. Now, there are sometimes situations, and this is when EM applies, where you can invent a latent variable z such that this log likelihood can be written as the log over the marginal of the joint. Of course everything can always be written like this, because it is just the sum rule, but the special aspect of this variable z is that when we introduce it, the optimization becomes easy, or would become easy if we knew z. So another way of phrasing this is: EM is an algorithm for models where you want to do maximum likelihood and there is a latent variable z that you do not know, but if you knew it, the optimization would be easy. The structure of EM is then to initialize the optimization with a particular value for the parameters theta, compute the conditional distribution, the predictive distribution, for these informative latent quantities z given the data and this particular parameter value theta, and then, and this is somewhat of a mouthful of a sentence, set the parameter estimate theta to a new estimate given by the location of the maximum of the expectation of the so-called complete-data log likelihood, where the complete-data log likelihood is the log of the joint distribution of both the data and the latent variable z.
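As a formula, that mouthful of a sentence is just the following iteration, with theta^(t) denoting the current parameter estimate:

    \text{E-step:}\quad \text{compute } p\big(z \mid x, \theta^{(t)}\big), \qquad
    \text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\; \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\big[\log p(x, z \mid \theta)\big].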
So why is this a good idea? It is a good idea because, instead of having to optimize a function without the z in it, which we do not know how to optimize, we now have a sum over a bunch of terms that we do know how to optimize. Our assumption was that this function is easy to optimize for a particular value of z, and we now have a sum, in fact a convex combination, because these weights are all non-negative and sum to one, of such terms that are easy to optimize, and that may be far easier to optimize than the original function. Another question you may have is why I am talking about this as a general tool: why can we hope that such situations are frequent? That is not immediately clear, but maybe it helps to realize that you get to choose what the variable z is; you get to invent it. You can always write this expansion for any variable z, because it is just the sum rule, so you can imagine that there are quite frequent cases in which you can invent an explanatory variable z that decouples things from each other so that they become easier. So this is the abstract, generic representation of the EM algorithm: maximize the expected value of the complete-data log likelihood at a particular value of the parameters, and keep iterating, closing the loop by recomputing the expected value, or rather the probability distribution, of the latent quantity. How does this relate to our situation with Gaussian mixture models? Let's make the connection once more and see how our way of doing iterative maximum likelihood for the Gaussian mixture model looks from the point of view of this abstract representation. Here is our graphical model again for the Gaussian mixture model. We wanted to do maximum likelihood inference on this model, so that is our probability distribution for the data given the parameters pi, mu and Sigma, and that log likelihood is exactly of this form: although it is i.i.d. over the individual data points x_i, each of these i.i.d. terms contains, inside the log, a sum over individual terms, and the log of a sum is hard to optimize in closed form. It would be nicer if there were a product here, because those expressions are easy to maximize in mu and Sigma, and actually also in pi. So what we do is introduce this latent quantity z, the explanatory variable that tells us which cluster each individual datum is supposed to belong to; well, it would tell us, if we knew it. We notice that when we introduce this variable, the structure of our problem gets easier: we can turn our sum into basically a product. This seemed like an accidental neat thing in our notation, but it is actually a really powerful thing. Instead of having to write the log of a sum, we now get to write the log of a product, where the individual terms of the sum appear as factors raised to the power of this explanatory variable z, which is binary, so only one of them is switched on. Because we now have a product in here, we can take the log inside, both products turn into sums, the exponents z_ij come down as factors, and we have terms here that are easy to optimize.
One of the reasons they are easy to optimize is that these are exponential family distributions: the discrete distribution is an exponential family, and so is the Gaussian distribution, and we know that finding the maximum likelihood parameters of exponential families, here mu, Sigma and pi, is possible because it essentially amounts to computing gradients involving the normalization constant and setting them to zero. That is actually what we have been doing in these derivations; we just did it explicitly. So these expressions are easy to optimize, and we want to do exactly that; the only problem is that by introducing our latent variable z, we now have that z lying around somewhere in our computation. The idea of EM is to compute the expected value of this expression over z, which is just another sum over the values of z, and optimize that function instead; because it is a sum over functions that are easy to optimize, we hope that it is easy, and it turns out that it is. That is exactly what we did: we iterated between writing down the conditional, or predictive, distribution for this explanatory variable z, which gives us the quantities we called responsibilities, a bunch of somewhat onerous, complicated terms, although, since this is a discrete distribution, that is not a big problem, because they are just numbers we can compute; and then maximizing this expected complete-data log likelihood, an expression that is a sum over individual terms weighted by the responsibilities, where each individual term is easy to maximize, or has a simple gradient, so it is easy to set the gradient of the whole sum to zero and solve, because it is just a linear combination of easy-to-maximize terms. And with that we have actually arrived at an early end of today's lecture. Via the detour of K-means and Gaussian mixture models and our manual derivation of an iterative maximum likelihood estimate for these types of models, we have arrived at a relatively generic algorithm called the EM algorithm, or expectation-maximization algorithm. It is an algorithm for iterative numerical optimization, for the construction of maximum likelihood, or in fact also maximum a posteriori, estimates, which essentially leverages the power of the sum rule: the fact that probabilistic models can always be expanded with latent variables can be used in the hope of finding a particular choice of latent variables such that the individual terms under the sum rule, or rather their log, become easy to optimize. The EM algorithm works as follows. We are in a situation where we would like to compute a maximum likelihood estimate for this probability distribution, whatever it may be, but we do not have a closed-form estimate for it. However, we might be able to invent a latent variable z, an explanatory variable, such that the so-called complete-data log likelihood, as if the log were inside the sum, is actually easy to optimize. In such situations, instead of optimizing this expression directly, we iterate: initialize with a particular value of theta, then compute p of z given x and theta, the predictive distribution for the explanatory variable z.
Then we compute the expected value of the complete-data log likelihood and maximize it as a function of theta. We then recompute the conditional, or predictive, distribution for z, recompute the expected complete-data log likelihood, maximize again, and iterate this process over and over until we reach a point where the algorithm converges at a particular value of theta. Perhaps the next thing we might want to try is to apply this EM algorithm to our topic model, but before we do that, in the next lecture we will pause for a moment at EM and first try to understand why this algorithm actually converges, and try to confirm to ourselves that it converges at the maximum likelihood estimate and not somewhere else. When we do that, we will discover an opportunity to generalize the EM algorithm even further, and what we will get out of that is a very powerful algorithmic notion for approximate probabilistic inference called variational inference. But that is the content of the next lecture; for today, let's keep it at this. Thank you very much for your attention, and see you in the next lecture.