Hello and welcome to lecture nine of our introduction to machine learning course. We are going to start talking about unsupervised learning in the remaining three lectures. We spent eight lectures talking about supervised learning problems. Just to remind you, supervised learning means that we have some input data and some output data that we are trying to predict, so these are always prediction problems. Essentially, in supervised learning we're learning the mapping between the input and the output, and this can be regression, if we're predicting a continuous variable, or classification, if we're predicting categorical outputs, and so on. In unsupervised learning, which is our topic from now on, we just have input data. We do not have output data, we don't have labels, we don't have any values to predict. We just have a data set and we want to do something with it, usually to find some structure in it. So unsupervised learning is often connected to data mining and similar things.

What can we hope to find in the input data if we don't have labels to predict? Broadly, at least in this course, we're going to talk about two different applications of unsupervised learning: dimensionality reduction and clustering. We're going to start with clustering; that's our topic for today.

So what is a clustering problem? Here's some toy data, a very simple data set with two variables, and the points are the samples. In the clustering problem we want to find whether there are groups, clusters, in this data. Here it's quite obvious to the human eye that this is one cluster, this is another, and this is one cluster, or perhaps two clusters. And that's what we want the clustering algorithm to tell us, especially if the data are higher-dimensional than two dimensions and we cannot just plot them in 2D and look at them so easily.

It turns out that the clustering problem is actually a very difficult, very complicated problem, and the problems start with even formalizing it. What does it mean to cluster the data? What is the definition of a cluster? It is very hard to say what captures our human intuition about what a cluster is, and later in this lecture we will see some examples. Even human observers might disagree about whether this, for example, is one cluster or two clusters. So this is hard to formalize. Many questions arise, like: how many clusters are in the data? Is it three or four in this case? Or perhaps it's only one cluster, and the data are not clustered at all, so you just have one blob. If you put it into some clustering algorithm, it will maybe give you several clusters, but perhaps the problem is ill-posed and this particular data set is just not clustered. Or if you have two clustering algorithms and they give you some clustering results, how can you compare them? How can you choose the better one? In the prediction setting you take a test set, apply the two models to it, and see which performs better; it's very easy, in principle at least, to say which prediction is better. Here you have two clusterings, so how do you decide which one is better?
So there are a lot of these conceptual problems. People sometimes say that clustering is not a science but an art, and there's some truth to it. We definitely don't have time to cover or even discuss all of that in this lecture today, but let this serve as a caveat: if you apply this in practice, these are problems that you have to think about. In this lecture we're going to discuss two particular clustering algorithms. One is k-means and the other is Gaussian mixture models. We'll start with k-means.

In k-means clustering, the loss function that I'm going to define now serves to formalize this notion of clustering: we want to put into math what it means to cluster, and this is just one particular way of doing that. So we have some input data x_i that we want to split into K clusters, and something important from the beginning: in k-means clustering we're choosing K from the beginning. We're clustering a given data set into four clusters, or into ten clusters. How to choose the number of clusters is a question that we're actually not going to discuss today. Let's say you know that you want to split the data into four clusters; how do you do it?

In k-means clustering we're going to represent each cluster by its average, by its mean, by some vector that I call mu here, which will in some sense represent the cluster. And the loss function of k-means is very simple; it's just a squared error, in some sense. It's a sum over all clusters, from one to capital K, and then the second sum is over all samples that belong to this cluster, and what I'm summing is the squared distance from my sample to this representative vector mu. This is what we want to minimize. So what are the parameters here? The mu's are the parameters, and so is the splitting into clusters, these S_k sets. These are just sets of indices: the set of samples that goes to cluster one, the set of samples that goes to cluster two, and so on. So these S_k and mu_k are what we can adjust in order to minimize this loss. There's another way to write the same loss, where the second sum runs over the entire data set, and I introduce these r_ik variables that are equal to one if sample i belongs to cluster k and zero otherwise. It's the same thing, just written slightly differently.
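Written out compactly, the loss just described, in both the set form and the r_ik form, is:

```latex
L\big(\{S_k\}, \{\mu_k\}\big)
= \sum_{k=1}^{K} \sum_{i \in S_k} \lVert x_i - \mu_k \rVert^2
= \sum_{k=1}^{K} \sum_{i=1}^{n} r_{ik}\, \lVert x_i - \mu_k \rVert^2,
\qquad
r_{ik} =
\begin{cases}
1 & \text{if } i \in S_k,\\
0 & \text{otherwise.}
\end{cases}
```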
Here's an illustration of how this may look, again for a simple two-dimensional data set, with three clusters. These points are my sample points, these in the middle are the mu's, so this is mu_1, this is mu_2, and this is mu_3, and these are the edge lengths that enter the loss function. I'm taking the sum of the squared length of each of these edges, summing everything that belongs to this cluster, everything that belongs to this cluster, and everything that belongs to this cluster. And it is rather intuitive: if you assign this point, for example, to that other cluster, then your loss function will increase; you will do worse, because you will replace this short distance with this long distance. Or if you move this mean point mu somewhere else, all these lengths will increase. So the loss function makes sense. Good. How do we minimize it?

Here's the same loss again. The first piece of bad news is that it is not analytically solvable: for an arbitrary data set there is no formula for what the mu's and the r's (or the S sets) are. The second piece of bad news is that it is not convex. That's not so easy to see from this formula directly, but you will see it later on; it's non-convex, and badly so: it has a lot of local minima, it can have bad local minima, and we have to deal with that. The third piece of bad news is that gradient descent could in principle be used to minimize it, but it's actually a little messy to write down the derivatives. You have to deal with some constraints if you operate on these r variables, and with the S sets it's not even clear how to take a derivative at all. So it can get messy.

The good news is that there is an alternative approach that is very simple, very intuitive, and works well. For k-means it's called Lloyd's algorithm, so let me explain what Lloyd's algorithm does. It iteratively optimizes over the r's and over the mu's. This means we first find the optimal mu's holding the r's, that is, the split into clusters, fixed: we optimize the positions of the mu vectors. Then we hold the mu vectors fixed and optimize the splitting of points into clusters, and we keep repeating these two steps until it converges. So let's discuss both steps separately; it turns out, as you will see now, that each of the steps is actually very easy.

The first step: say we're holding the mu's fixed. What does that mean? We want to find into which cluster each point should go. If you think about it, each sample contributes to the loss function its squared distance to the representative vector of its cluster. So if the representative vectors, the mu's, are all fixed, where should your sample go? Well, it should go to the closest mu; that minimizes its contribution to the loss. So you just assign each point to the nearest cluster center. That's all; it's very simple. Here is how you can write it mathematically:
for each point, sample i goes into the cluster k whose mu minimizes this expression, so just the mu closest to x_i.

The second step: now we hold these sets fixed, so we're saying all these points belong to cluster one and all those points belong to cluster two; how can we choose the mu vectors to minimize the loss? That is also very simple and directly given here. If you look at this sum, with the assignments fixed it decomposes into one sum per cluster that you can optimize separately. So you take the sum that relates to cluster one, for example, and you want to choose the mu that minimizes it. But this is just minimizing a squared error loss; it's basically like estimating a Gaussian, where the maximum likelihood estimate of the mean is just the average of the samples. So it's a very simple regression problem, or Gaussian estimation problem: you can directly see from here that to minimize this one sum for one cluster, you just take the average. I think this is also very intuitive if you remember the sketch on the previous slide.

Okay, so we have these two steps, each of them super simple: here we assign each point to the nearest cluster center, here we set the mean of each cluster to the average of all points currently assigned to it, and then we alternate. It's actually so simple that it may even be surprising, the first time you encounter it, that this works at all; it seems so simple it's almost dumb. So let's see how it works, because it does work; I find this remarkable.
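Before we walk through the illustrated iterations, here is the whole procedure as a minimal NumPy sketch (my own illustration of what was just described, not code from the lecture): random initialization within the data range, then alternating the two steps until the assignments stop changing.

```python
import numpy as np

def lloyds_kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize the mu's randomly within the range of the data.
    mu = rng.uniform(X.min(axis=0), X.max(axis=0), size=(K, d))
    assign = np.zeros(n, dtype=int)
    for it in range(n_iter):
        # Step 1: assign each point to the nearest cluster center.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break  # converged: no assignment changed
        assign = new_assign
        # Step 2: move each center to the mean of its currently assigned points.
        for k in range(K):
            if np.any(assign == k):  # leave an empty cluster's center in place
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```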
Here's the illustration, in a very simple case where you have two-dimensional data; this is taken from this textbook. We will just go over the iterations and see what happens. We have to start somewhere, and in this case we just start randomly: we choose the two mu's, mu_1 and mu_2, randomly within the range of the data.

Okay, so now the first step: we assign all points to the closest cross, so these become red points and these become blue points. Now we do the second step: we move each cross to the average of all points of the same color, so the red cross moves here and the blue cross moves here. Great. Now we reassign all the points so that each point goes to the closest cross, then we move the crosses again, then we reassign the points again, then we move them again, and again, and you see that it almost doesn't move anymore. Actually, after this point I can stop, because this has converged: if I keep iterating, nothing changes anymore. Every point is already assigned to the closest cross, and each cross is at the average position of all the points of its color. You can keep iterating if you want, but nothing changes, so you can stop Lloyd's algorithm when it converges.

It's actually pretty easy to see that it will always converge, to a local minimum in this case. That's because there is only a finite number of ways to split the points into two clusters, and it's easy to see that each step of Lloyd's algorithm decreases the loss function or leaves it unchanged. So the loss never goes up during the iterations; it has to go down or stay constant, and since there are finitely many ways to reassign the points, at some point you will converge.

You can, however, converge to a bad local minimum: you converge to a local minimum, but it doesn't have to be the best possible solution, and that's something I want to illustrate here. This strongly depends on the number of clusters. If you do this in 2D with two clusters, like on the previous slide, it will, I think, more or less always converge to the good solution. But consider something like this: in this case you have a lot of clusters, and notice that the clusters are actually very simple here. The conceptual question of what a cluster is doesn't really arise, because these data are sampled from a bunch of identical spherical Gaussians, just located in different places, and I think the number of points per Gaussian is the same here. So in a way it should be the easiest possible data set for clustering: a bunch of identical spherical Gaussians, the same number of points per cluster. Then you start k-means, and this would be an example solution that it converges to, depending on the initialization, and you see clearly that it is not very good. What happens in particular is that you get areas like these:
one actual Gaussian, one underlying cluster, gets split into several, three in this case, or two in this case; and there are other examples, like this one, where two Gaussians are assigned to one cluster. And this is a minimum: if you iterate Lloyd's algorithm, nothing changes, because there is no simple local change, no single point that you can reassign to a neighboring cluster so that the loss decreases. You can decrease the loss, but you need a drastic change: you need to say, I split this cluster in two and merge these two clusters together at the same time, and then the loss will decrease. But Lloyd's algorithm will not achieve that. So you have to think of the loss function of this problem as having a lot of local minima: if you are in this local minimum, you can do gradient descent or Lloyd's algorithm and you are already at an optimum; you need to change things a lot to move to another local minimum that contains a better solution.

So what can we do? There are several essentially heuristic ways to do better. There are smart ways to initialize k-means clustering; something called k-means++, for example, chooses the initial points in a smarter way, and this can help. Another very simple thing that can help is to just run it a number of times, for example starting from ten different random initializations, and then choose the solution with the lowest loss; this can help a little too. A smarter thing that can help is that after it has converged, or even during the iterations, you check whether you can decrease the loss by splitting some cluster in two or merging two neighboring clusters together. There are several approaches like that, split-and-merge heuristics: you iterate Lloyd's algorithm, then you check whether merging this and splitting that would make your loss go down, and if so you do it, and then you continue with the Lloyd iterations.
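As a concrete illustration of the restart heuristic, here is a hedged sketch that reuses the lloyds_kmeans function from above; in practice one would typically reach for a library implementation instead (scikit-learn's KMeans, for instance, has an n_init option for restarts and init='k-means++' for the smarter initialization).

```python
import numpy as np

def kmeans_with_restarts(X, K, n_restarts=10):
    """Run Lloyd's algorithm from several random inits; keep the lowest loss."""
    best_loss, best = np.inf, None
    for seed in range(n_restarts):
        mu, assign = lloyds_kmeans(X, K, seed=seed)
        # The k-means loss: sum of squared distances to assigned centers.
        loss = sum(((X[assign == k] - mu[k]) ** 2).sum() for k in range(K))
        if loss < best_loss:
            best_loss, best = loss, (mu, assign)
    return best
```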
All right, so that is k-means clustering. Apart from these convergence difficulties, which can to a large extent be addressed with the smarter heuristics that I'm not going to explain in detail, there are some more fundamental drawbacks of k-means: even the global minimum of the k-means loss function may not be the best clustering you're after. That's what I want to discuss here.

Here is an obvious example, where the data are just very far from being Gaussian blobs in space: you have one blob and then this ring around it. You would maybe say these are two clusters, just one of them has a funny shape. K-means will never be able to find that, because assigning all the ring points to one cluster has a high loss: the distances of all these points to the mean of the ring cluster will be large, and k-means will consider it a bad solution.

There are, though, situations that may seem less extreme, where k-means will still give you something you don't expect. One example is where the clusters may even be perfectly Gaussian, but stretched: not spherical blobs, but very elongated blobs. If they're elongated enough, as in this sketch, k-means may decide to split them like this, so this becomes one cluster and this becomes the second cluster, because that minimizes the squared error: if you assign all these points to one cluster, some of the distances will be really large, and as far as k-means is concerned that is suboptimal.

Here is another example. The blobs are spherical, but one is really small and one is large, in the radius, the variance of the Gaussian if you want. What can happen, and actually will happen, is that they split like this. Even if the means are exactly correct, so here is the mean of my cluster one and here is the mean of my cluster two, where does this point go? It goes to its closest mean, which in this case may be this one. So a part of this Gaussian will be chopped off and assigned to the wrong cluster. And that's not a convergence problem; that's just what k-means will do in this case.

And here is a fourth example. These two clusters have the same geometric size, but the sample size is different: there are few points here and many points here. What can happen is that again k-means chooses to chop off a part here, because even though these distances are all pretty large, that's the price it pays, the distances here to the mean become a little smaller, and since there are so many points here, the loss can be smaller if you assign some of these points to the other cluster.

So you may get problems with k-means if the sample sizes of your clusters are very different, or if the variances are very different, or the covariances are far from spherical. All of that can be addressed if we use a model of the cluster that is more complicated than what k-means implicitly assumes, and this brings us to Gaussian mixture models. The last three examples on this slide were cases where the data within each cluster were Gaussian, and if you are happy to assume Gaussian data, then it's often better to use a Gaussian mixture model, which explicitly assumes that your samples come from a mixture distribution, a mixture of several Gaussians. So you have a Gaussian here, a Gaussian here, a third Gaussian here, and the mixture distribution has a density that is just a weighted sum of several Gaussians: these are the weights, and these are the means and covariance matrices of the Gaussians. The weights sum to one, so that the entire thing integrates to one, as a probability density function should. And that is what we want to fit to our given data set; that's the problem of Gaussian mixture model fitting.
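Written out, the mixture density just described is:

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0.
```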
If you think about it, this is actually a little similar to what we did before with k-means and Lloyd's algorithm, so intuitively one could try an iterative approach similar to Lloyd's algorithm. I'm not deriving anything here, just saying one could try something like that; what would it be? We assign each point to the closest Gaussian. Not the closest in the k-means sense, where closest just meant the mu vector with the smallest Euclidean distance; here it will be a bit more complicated, because the Gaussians have some shape, the covariance matrix, and they also have these weights, the pi's. But we can still assign each point to the best-fitting cluster, in the sense that this posterior expression is the largest: for every point we check which of the Gaussians it is most likely to have come from, and put it there. That's one step, and in the next step we update the parameters of each Gaussian based on the points assigned to it. So one could try something like that, and it would probably work.

But let's see what happens if we actually try to derive how to optimize the likelihood of this model, because this is now a nice probabilistic model: we have the complete probability of observing our data set given the parameters. So we can use the maximum likelihood principle: we want to find the parameters that give our data set the highest likelihood. Let's see what happens if we actually go through this math. I will directly write the log-likelihood rather than the likelihood itself: this is the sum over all the samples, this is the logarithm, and inside is the same expression as before, which is another sum, over the mixture components. And just to remind you, the multivariate Gaussian has this exponent in its density function.
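Reconstructed from the slide, the log-likelihood and the Gaussian density it contains are:

```latex
\log L
= \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),
\qquad
\mathcal{N}(x \mid \mu, \Sigma)
= \frac{1}{(2\pi)^{d/2}\,\lvert \Sigma \rvert^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big).
```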
To see what happens, let's try to set the derivative with respect to one of the mu's to zero, because we want to find the maximum. So we just take the derivative and see. We're taking the derivative of this thing: we have the outer sum, which stays a sum; then we have the log of this inner sum, so it's a chain rule that I'm doing here. The inner sum goes into the denominator, because that's the derivative of the logarithm. Then, of the inner sum, only one term depends on mu_k, so only this one term survives the derivative, and it goes into the numerator, and then I have to take the derivative of the expression inside the exponent. The inverse of Sigma_k appears here, and this x minus mu is over here, and then we're done; this should be equal to zero if we are at the maximum.

Notice that this Sigma_k inverse here is constant in the sense that it is the same for every term when you sum over i, so I can take it out of the sum and cancel it, and I'm left with, let me rewrite it here, this red term times x minus mu, and that should be equal to zero. Let's call this red thing z_ik, and let's look at it for a second and try to understand its meaning. It should remind you of the Bayes formula, and it's good if it does. If you fix one sample i, and one component k, then z_ik tells you the posterior probability of your sample i to belong to cluster k: the numerator is the probability that it comes from cluster k, and the denominator is the sum over all possible clusters. So if you divide one by the other, you get this z_ik, which, if you sum over k, will sum to one, as a posterior probability should. For each i and each k, it just tells you the probability that sample i came from cluster k.
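In formulas (reconstructing the derivation just described), the stationarity condition and the responsibility term it contains are:

```latex
\sum_{i=1}^{n} z_{ik}\,(x_i - \mu_k) = 0,
\qquad
z_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad
\sum_{k=1}^{K} z_{ik} = 1.
```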
Remember what we're doing here: we're trying to solve for mu. And this is now a very simple expression that we can immediately solve, and the solution is just the weighted mean of all points. You sum over the entire data set, but points that have a high probability of coming from cluster k contribute more than points that have a very low probability of coming from cluster k. So you take the average over the entire data set, but with weights given by the probability that each point belongs to cluster k. This makes a lot of sense if you think about it: you have some points assigned to this cluster, some with high probability, some with very low probability, those far away, and you take the average of essentially the points with high probability, and this gives you the optimal mu.

However, writing it like that brushes a very important thing under the carpet: it might seem as if I had solved it, here's my solution. But of course I didn't solve anything, because I have these z terms in here, and z depends on this expression, which depends on mu, and on sigma, and on everything else. So this equation should be true if I am at the maximum, but these terms over here depend on mu too, so this is unfortunately not a formula I can use to just write down an analytical solution.

I can follow very similar steps for the other parameters. Taking the derivative with respect to sigma is a bit more cumbersome, but if one does it, one gets that sigma should be the weighted covariance matrix. And if you do the same with respect to the pi's, you get the formula that pi_k is basically the weighted fraction of points that belong to each cluster. I think these formulas for mu, sigma, and pi make a lot of sense, but as I said, they are not really giving you analytical solutions.
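Collected in one place (again a reconstruction from the verbal description), the weighted mean, the weighted covariance, and the weighted fraction of points are:

```latex
\mu_k = \frac{\sum_{i} z_{ik}\, x_i}{\sum_{i} z_{ik}}, \qquad
\Sigma_k = \frac{\sum_{i} z_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^{\top}}{\sum_{i} z_{ik}}, \qquad
\pi_k = \frac{1}{n}\sum_{i} z_{ik}.
```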
So what can one do? We have the optimal mu defined through the z's, which are in turn given through the mu's. But at this point we can finally do something very similar to what we did in Lloyd's algorithm: we can optimize them iteratively, and in this case this is called the expectation-maximization (EM) algorithm. In one step, called the E-step, the expectation step, we compute this probability z_ik, for each point to be in each Gaussian component, given the parameters mu, sigma, pi held fixed. The expectation step just means computing those red terms from the previous slide; that's the entire step, so simple. Then, in the M-step, you update all parameters of each Gaussian using the formulas we derived, these weighted averages, the weighted covariance, and so on. After you've updated them, you go back and reassign the points by computing the new posterior probabilities from the new parameter values, and then you update the parameters again.

This should remind you very strongly of Lloyd's algorithm, which also assigns points, then updates the mu's, reassigns points, updates the mu's again. Here we have more parameters to update, because we have to keep track of the covariance and the weight of each Gaussian component. So one difference is that we have more parameters; the second difference is that we don't hard-assign each point to one of the clusters. Instead, for each point we compute the probability that it belongs to each of the clusters; this probability can be very high for one cluster and very low for the others, but it will not be exactly zero for any of them.

It turns out that expectation-maximization is a very generic algorithm that is used in machine learning and statistics for many problems where you're optimizing the likelihood of a probabilistic model with latent variables. In this case the latent variables are the true cluster memberships: we can think that each point in our data set really came from one of the components of the mixture, but we don't know which component it came from, so these are latent variables that we cannot observe. But we can use this approach, where we estimate the probability distribution over the latent variables, then update the parameters of the model holding that fixed, then estimate the posterior again, then update the parameters again. One can derive and prove, in a very general setting, that if you do this in a latent variable model, then your likelihood will never go down; it will always increase or stay the same. Again, there can be local optima, but you are at least guaranteed to converge towards one of the maxima of the likelihood. I will mention another example of a latent variable model, which can also be optimized using EM, in the next lecture. So it's a very powerful approach.
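To make the two steps concrete, here is a minimal NumPy/SciPy sketch of EM for a Gaussian mixture (my own illustration, assuming full covariance matrices and ignoring the degenerate collapse case discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with full covariance matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize as in the lecture's illustration: random means within the
    # data range, spherical (identity) covariances, equal weights.
    mu = rng.uniform(X.min(axis=0), X.max(axis=0), size=(K, d))
    sigma = np.stack([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities z[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k).
        z = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                             for k in range(K)])
        z /= z.sum(axis=1, keepdims=True)   # normalize each row to sum to one
        # M-step: weighted means, weighted covariances, weighted fractions.
        nk = z.sum(axis=0)                  # effective number of points per component
        mu = (z.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (z[:, k, None] * diff).T @ diff / nk[k]
        pi = nk / n
    return pi, mu, sigma, z
```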
Okay, an illustration. I'm taking the same data from the same textbook, and now, instead of clustering with k-means, I'm fitting a Gaussian mixture model. Again we need to initialize with something, and here this shows two Gaussians: this is mu_1 and this is mu_2, the sigmas are initialized spherically, and the pi's are presumably initialized as 0.5 weight here and 0.5 weight here. Now we assign all points to these components, and here you immediately see the second difference from k-means: the points don't just have two colors. They have some posterior probability to belong to this or to that cluster, so for some points here in the middle it may be close to 50/50 and they will be of intermediate color; you get this spectrum of colors.

Once you've done that, you can compute the weighted mean, which is essentially the mean of the red points, with the redder points contributing more; and you compute the weighted covariance, and you get these two Gaussians after the first step. Now you hold the parameters of these Gaussians fixed and reassign the points; not much has changed here at step two. If you skip several steps, one Gaussian starts to cover this region and the other Gaussian starts to cover this region; this is already step five. This usually happens more slowly than in k-means, we need more iterations, but if you keep iterating, then after 20 iterations, for example, you have a split that makes sense: you have recovered the two components. And notice that the final Gaussians differ not only in the means but also in the covariances: this one is a bit more vertically oriented, this one a bit more horizontally oriented.

One remark: Lloyd's algorithm converges in the sense that it literally stops; you reach an iteration after which nothing changes anymore. That is not the case here. This will converge in the sense that it becomes stable, and the updates from iteration to iteration become very small, but usually it will not converge exactly; it's like gradient descent, you just approach the optimum closer and closer.

Okay. There's one tricky point that I also brushed under the carpet until now when discussing GMMs, the Gaussian mixture models. Consider the Gaussian mixture equation again, and consider a very simple case where we just have one-dimensional data. These points are my data points, that's the one dimension, and on the y-axis I have the density function. Let's say I want to cluster the data into two components, and it so happens, at one of the steps, that all these points are assigned to one of the components, and the other component looks like this: a very high, narrow Gaussian, such that this one point is very likely to come from it and all the other points are more likely to come from the other one. Then you keep iterating, and what happens is that this Gaussian shrinks more and more around this one point, becomes completely localized, and essentially diverges to a delta function. Why does this happen? Because the likelihood of this one point becomes super high: if the likelihood of this point is just this density value, then as the Gaussian becomes narrower and narrower and higher and higher, that likelihood diverges to infinity, and when you compute the total log-likelihood of the data set, it also diverges. So it turns out there is actually a way to make the likelihood go to infinity, and since we're trying to maximize the likelihood, that should be the best solution as far as the likelihood function is concerned. But that is of course not the solution we want. It's trivial.
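You can see the divergence directly from the one-dimensional density of a component centered on a single point: its peak value is

```latex
\mathcal{N}(x_i \mid \mu = x_i,\, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}
\;\longrightarrow\; \infty \quad \text{as } \sigma \to 0,
```

so the log-likelihood contribution of that one point, and with it the total log-likelihood, grows without bound as the component shrinks.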
It can always happen if you make the variance of one of the components go to zero. So this may become a problem if you are actually implementing a Gaussian mixture model algorithm and doing the iterations: one of the Gaussians becomes localized around one point, and then you get into this regime where it just shrinks and shrinks and your likelihood explodes. In practice it can happen that one of the Gaussians collapses, so to say, and one has to do something to prevent this degenerate solution. For example, if you see that the standard deviation of one of the Gaussians becomes super small, you basically kill this Gaussian and randomly re-initialize it in some other part of the data set, and then keep iterating. It's an edge case that a good implementation should take care of.

More philosophically, it also shows that there is some problem with using this loss function in the first place. In practice we can prevent the collapse from happening and get away with it, but conceptually it means there is something about this loss that's not great, because it allows these degenerate solutions. This can be prevented if you have some prior, a hyperprior, on the parameters, saying that sigma is not allowed to become very small; in a fully Bayesian setting you can put priors on sigma and mu, and this may address it. But in the setting as I explained it, this model can actually diverge.
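As a sketch of the edge-case handling just described (a hypothetical guard of my own, not a prescribed recipe): inside an EM loop like the one above, one can check the component covariances after each M-step and re-initialize any component that has collapsed.

```python
import numpy as np

def reset_collapsed_components(mu, sigma, X, rng, min_var=1e-6):
    """If a component has (nearly) collapsed onto a single point, kill it
    and re-initialize it randomly elsewhere in the data range."""
    for k in range(len(mu)):
        # The smallest eigenvalue measures the narrowest direction of the Gaussian.
        if np.linalg.eigvalsh(sigma[k]).min() < min_var:
            mu[k] = rng.uniform(X.min(axis=0), X.max(axis=0))
            sigma[k] = np.eye(X.shape[1])
    return mu, sigma
```

Library implementations address the same issue differently; scikit-learn's GaussianMixture, for example, adds a small ridge (its reg_covar parameter) to the covariance diagonal so the variances cannot reach zero.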
Okay, some comments on expectation-maximization versus gradient descent. We used EM here for the Gaussian mixture, but in principle gradient descent is possible too; I said it's cumbersome for k-means, but one can do gradient descent for k-means or for the Gaussian mixture model. Both are iterative algorithms, and both converge to a local optimum, in the sense that they are guaranteed to bring you to a local optimum, but the optimum can be local and you are not guaranteed to reach the global one.

Actually, with gradient descent you are not even guaranteed that your loss will not increase: it's possible that you take a too-large step and your loss goes up, or the likelihood goes down. In EM this never happens. EM comes with the guarantee that after each expectation-maximization iteration your likelihood increases or stays the same, which is an advantage. In fact, EM doesn't need a learning rate; there is no learning rate in the EM algorithm, so you don't need to think about how to choose it, or how to change it if you get stuck, and so on. That's also a good thing. Another good thing about EM is that you don't need to impose constraints on the parameters. For example, all the pi's should sum to one, so if you do gradient descent you somehow need to ensure that the pi's always sum to one; and all covariances should be positive definite matrices, so if you take a gradient descent step and one of the covariance matrices becomes non-positive-definite, what do you do? You need to somehow fix that, or make the pi's sum to one again. It's possible to do all of that, but you need to take care of it. In expectation-maximization these problems don't arise: on each step you get meaningful covariance matrices, because your covariance is just the weighted covariance matrix of your points, which by construction is positive definite; your pi's by construction sum to one; and the z_ik and all other parameters are meaningful on each step of EM. So this is a nice property.

One thing I still wanted to say on the previous slide is that the GMM, the Gaussian mixture model, can also converge to a bad local optimum. If you start it on that example from before, with a lot of different Gaussians in 2D, it can converge to a suboptimal solution, similarly to k-means. So all the heuristics that I briefly mentioned, split-and-merge of clusters, or initialization smarter than random, also apply to Gaussian mixture models. In fact, a GMM typically converges more slowly than k-means, so what is often done is that you run k-means first, get the k-means solution, and then initialize the Gaussian mixture model with it. This can help in practice too.

So let's revisit the difference between k-means and Gaussian mixture models. Similarly to what we discussed in the lecture on discriminant analysis, one can constrain the covariances in the Gaussian mixture model in different ways. You can constrain them such that the covariance of each cluster has to be the same; that's what linear discriminant analysis does, remember, in the classification setting. Here we have an unsupervised setting, we don't have labels, but we can still say: if I'm fitting two clusters, instead of fitting two covariance matrices, one here and one here, which could then be different, I'm saying they have to be the same, so they also have the same orientation. It's relatively easy to update the formula for the weighted covariance matrix so that you only fit one covariance matrix covering all clusters. This is not always a good idea, but it can be a good idea in some cases. Or you can constrain it differently: you can constrain it to be diagonal. Maybe you have a lot of variables, a lot of features, and then you have quadratically many parameters in the covariance matrix, a lot of things to fit; so maybe one says, forget about the off-diagonal terms, the correlations, I will just fit the diagonal of the covariance. You can even constrain it to be spherical. So all the same choices and considerations that applied for LDA apply here, except that in LDA we can at least hope to cross-validate, or use a test set, and then choose the best-working model; here you have to go with intuition, or with some heuristic of what a good clustering result is, to choose between these options. But in practice, if you have many features and your sample size is not big enough, it is often helpful to choose a simpler parametrization than the full covariance matrix.
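In a library implementation these parametrizations are typically a single switch; for example, scikit-learn's GaussianMixture exposes them via its covariance_type argument. A usage sketch, with placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # placeholder data; in practice, your own data set

# 'full':      one unconstrained covariance per component
# 'tied':      one covariance shared by all components (the LDA-like constraint)
# 'diag':      diagonal covariances, i.e. no correlations
# 'spherical': a single variance per component
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          n_init=5, random_state=0).fit(X)
    print(cov_type, gmm.bic(X))  # BIC: one common heuristic for comparing fits
```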
Okay, so a special case would be to take a spherical covariance matrix, just some sigma squared times the identity matrix, shared across all clusters: we say all clusters have the same covariance, and this covariance is spherical. This then becomes very, very similar to k-means. The main difference is that k-means performs what I earlier called hard cluster assignment: on the assignment step of Lloyd's algorithm, each point goes to the cluster whose center it is closest to, and that's it; it is simply said to belong to this cluster. That's the hard assignment. In the mixture model update, in the E-step, we compute these posteriors, and if a point is very close to a cluster it gets a high posterior probability of belonging to it, but the probabilities for all the other clusters are still non-zero; this can be called soft cluster assignment. But if the covariance is spherical and shared, then the notion of which Gaussian is closest is the same as just looking at the Euclidean distance to the mean: if you plug a spherical shared covariance into the multivariate Gaussian density, there is no covariance shape left, and everything depends only on the Euclidean distance from your point to the respective mu. So this becomes very close to k-means, with this one difference: hard cluster assignment versus soft cluster assignment in the E-step. And if you now send sigma to zero, so you don't fit sigma but make it really, really small, the soft assignment converges to the hard assignment. So you can see Lloyd's algorithm as the limiting case of expectation-maximization for the Gaussian mixture model, if you impose the constraint that the covariance is spherical and shared and let the variances go to zero.

I already mentioned that it may be convenient to initialize the Gaussian mixture model with the k-means solution. And just to get back, on this last slide, to the picture from before: these three examples were the cases where k-means would not give you the clustering that you expect, and these same three examples are the cases where a Gaussian mixture model will be more appropriate and will work correctly, in the sense that it will recover these two components here, these two components here, and these two components here; of course, in this case only if you use non-spherical, full covariance matrices. So you can think of these cases as solved by going to the Gaussian mixture model. This case, though, won't be: this is clearly a non-Gaussian shape. If you have complicated non-Gaussian shapes that you still want to call one cluster, just one cluster with a funny shape, then the Gaussian mixture model will not help you, and there is actually a huge literature on alternative, completely different, partially non-probabilistic, density-based and other ways of clustering data where the Gaussian mixture model assumptions do not apply. But this has to remain for another course. Thank you.