Gaussian mixture model using the EM method. Where do you use, and why do you need, a mixture of Gaussians? We have seen in many of our analyses, specifically if you recollect some of the discussions we had on the Mahalanobis distance criterion, that under the Bayes paradigm we converted the exponent of a multivariate Gaussian function into a distance criterion involving the covariance matrix. We made an assumption in most of those analyses that the distribution or scatter of the data, in whatever dimension it may be, 1, 2, 3 or even higher, is a Gaussian distribution. This assumption may not hold in practice, but of course most scientists and engineers still use a Gaussian distribution for modeling in many different applications, including signal processing, communication theory, vibrations and so on. One advantage is that the Gaussian seems to be the one closest to many natural distributions. The other main reason is that it is easy to do mathematical manipulations with a Gaussian function: in particular, its derivatives exist up to any finite order you need, it is very smooth, and it has various other advantages over other distribution functions. But there may be situations where a distribution is not strictly Gaussian in nature. In such cases there are methods which deal with multiple Gaussians: it is as if we want to cluster the data into several components or parts, and we assume that each of those clusters follows a Gaussian distribution. This leads us to an analysis based on the GMM, commonly called the Gaussian mixture model.

Let us look at the expression of the Gaussian mixture. This is the univariate Gaussian distribution in 1D, where mu indicates the mean and sigma is the standard deviation; sigma squared, which appears here, is the variance. Remember, sigma is called the standard deviation and the variance is sigma squared. This is the normalizing part of the function and this is the exponential part. We had seen this, and we had also extended it to the multivariate Gaussian case, where mu becomes a vector of the same dimension as the data sample or instance x. I have changed the G to N here, indicating the normal distribution: in the univariate 1D case sigma squared appears directly, while in higher dimensions it is replaced by the covariance matrix, whose determinant appears under the square root in the normalizing term, and the exponential term contains the distance function with the inverse of the covariance matrix. This is nothing new; we had this discussion earlier under the multivariate Gaussian distribution. We put it under the Bayes paradigm, we formulated distance functions, and we know under what properties of the covariance matrix we get linear or nonlinear between-class decision boundaries (DBs); in fact, the covariance matrix and its inverse dictate the corresponding property of the decision boundary.
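The slide expressions being pointed to are not reproduced in the transcript. For reference, the standard forms being described should be the following (the exact slide notation may differ slightly):

```latex
% Univariate Gaussian (1D): mean \mu, variance \sigma^2
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Multivariate Gaussian (d dimensions): mean vector \boldsymbol{\mu},
% covariance matrix \Sigma; the exponent is the squared Mahalanobis distance
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}
                \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```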
But now what we will do is extend this to a case where we have not just one multivariate Gaussian distribution in higher dimensions but multiples of them spread over the data. To do this we basically need to estimate the covariance matrix and the mean mu for a particular distribution, and one such method is maximum likelihood estimation, used when we need to estimate these for a given set of data samples. The ML method is a simple one to visualize: take the log of the probability function from the previous expression. Let us go back to the expression here; this is what we are taking the log of, and we did this when we formed a discriminant function for a particular class and derived the distance criteria. If we do that, these are the terms: you have a nonlinear term here and certain constant terms in these expressions. This expression is also nothing new to you. We need to take the derivative of this because we want to maximize it, so take the derivative with respect to the mean and the covariance term, because these are the two parameters to be estimated for that function. This gives you an expression to estimate the mean: the mean estimated by the maximum likelihood or ML method, where N is the number of sample points, is the trivial expression you can get from here, and the covariance matrix is given by the overall scatter matrix shown here. So the ML method for estimating the parameters mean and sigma gives you the same expressions you have seen earlier: this is how you estimate the covariance matrix and the corresponding mean for the data.

What happens if you have multiple Gaussians, what is called a mixture of Gaussians or a Gaussian mixture? This is sometimes called a linear superposition of a set of K Gaussians, where capital K is the total number of Gaussians. Typically K is more than 1, though in a special case K can equal 1, where you have just one Gaussian. The overall probability is now a summation over all K Gaussians, where pi_k is called the mixing coefficient for the kth Gaussian. The subscript k indicates the corresponding Gaussian, and this expression is the normal multivariate Gaussian distribution for component k; mu_k is a vector of the same dimension as the sample, and this is the covariance matrix for the kth Gaussian. Again, I repeat: this is the mixing coefficient for the kth Gaussian, and this is the normal multivariate Gaussian distribution for the kth Gaussian. The only constraints on the mixing coefficients are that each of them lies between 0 and 1 and that their summation equals 1. So they can basically be considered as weights; they are also called the weights of the corresponding Gaussian functions. If you take the log likelihood of this overall function, which is a function of the means, the covariances and the weight coefficients, it is given as a sum over all the data samples, where capital N is the total number of samples and p(x_n) is given here: the probability density for a sample x_n, as given by the mixture. So this is what you will get.
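Again the slide formulas are referenced but not shown; the standard expressions being described should be:

```latex
% ML estimates for a single Gaussian from samples x_1, ..., x_N
\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n-\hat{\boldsymbol{\mu}})(\mathbf{x}_n-\hat{\boldsymbol{\mu}})^{\top}

% Gaussian mixture: linear superposition of K Gaussians with weights \pi_k
p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,
  \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\Sigma_k),
\qquad 0 \le \pi_k \le 1,\quad \sum_{k=1}^{K}\pi_k = 1

% Log likelihood of the mixture over all N samples
\ln p(\mathbf{X}\mid\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)
  = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,
    \mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\Sigma_k)\right\}
```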
You can take the summation (the sigma) over the samples; the expression is the same. Replace the expression p(x_n) by the mixture here, and this is what you will get. Remember, inside the logarithm you have the summation over K Gaussians, and then you have it for as many samples as there are: k is the index indicating the kth Gaussian and n is the index referring to a particular instance or sample. The total number of samples is capital N and the total number of Gaussians is capital K. This is the expression for the log likelihood of a mixture of multivariate Gaussians, which is what we are considering. In this case maximum likelihood does not yield a closed-form solution, and you need an iterative method of optimization; that is what EM, or expectation maximization, gives you.

So let us take this example to show what we intend. Here is a scatter of a set of data samples obtained from a mixture of three Gaussians. You can see that the overall trend of the data does not follow a Gaussian distribution by itself, so we can cluster the data into three different components and say each of these individual components is a cluster following a Gaussian distribution. In fact, this particular data has been synthetically generated by three different Gaussian distributions, as shown by these three sets of iso-contours. These are asymmetric Gaussian distributions in 2D, with three different class means indicated by three different colors and their iso-contour lines. If you look at this particular plot, it shows a surface plot over two dimensions, where the height of the surface at each point reflects the probability density of the corresponding cluster or Gaussian. I repeat: this surface plot can be visualized as an extension of the plot here, whose iso-contour lines are curves of equal distance with respect to the class mean; if at each point you compute the probability, say at a point here, and translate it to a height, this is the plot you will get, a surface plot where the height on the right-hand side indicates the probability density of that function. So what I mean is that this is synthetic data obtained from three Gaussian functions, and the overall cluster density may look like this with respect to the three clusters. It is difficult to model this under a single Gaussian distribution, even in 2D. These are the sample points corresponding to the three clusters, and if you remove the cluster labels, the colors, this is the data you will have. You cannot model this well using a single Gaussian, and this data shows why you need multiple Gaussians, a mixture of Gaussians, to model it. Of course you could ask me: how do you know a priori how many Gaussian functions you need? That is not within the scope of today's discussion, and it is left to individual researchers to find out, for a given data set, what the optimal number K is. There are methods to find the optimal K: if a data set is given to you a priori, there are methods by which you can find the best number K of Gaussians to fit on the data.
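The three-Gaussian data behind the figure is not available, but data of the same flavor can be generated with a short sketch. The weights, means and covariances below are made-up illustrative values, not the parameters used for the lecture's figure:

```python
# Sample from a 3-component 2D Gaussian mixture to mimic the slide's example.
# All parameter values here are illustrative assumptions, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.5, 0.3, 0.2])             # mixing coefficients pi_k
means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
covs = np.array([[[1.0, 0.5], [0.5, 1.0]],      # asymmetric 2D covariances
                 [[0.8, -0.3], [-0.3, 0.6]],
                 [[0.5, 0.0], [0.0, 1.2]]])

N = 500
# Draw a component index for each sample according to pi_k, then draw the
# sample from that component's Gaussian.
z = rng.choice(3, size=N, p=weights)
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
```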
In this case, of course, since I know the data beforehand, I will say that the number of Gaussians is 3; but if you are given an arbitrary distribution which is non-Gaussian in nature, whether K equals 2, 3, 4, 10 or even more is in general very difficult to visualize in practice, though there are methods people adopt to find the ideal value of K for the expression. We can think of the mixing coefficients as prior probabilities for the individual components, so for a given value of K we can evaluate the corresponding posterior probabilities, called responsibilities, which can be visualized as latent variables in the expression we saw a couple of slides back. Let us get back to the expression of the log likelihood, but before that, look at the Bayes rule, in which we define an expression gamma_k as a latent variable under the Bayes paradigm. This is nothing new for you: this is the posterior probability, these are the class priors, and so on, which we have discussed earlier. What we are doing here is taking the class-conditional probability to be our normal distribution within the mixture of Gaussians: the numerator is the mixing coefficient times the corresponding Gaussian, and the unconditional prior in the denominator is the summation over all the Gaussians. What is the mixing coefficient pi_k here? The number of samples for a particular class divided by the total number of samples. Interpret the number of samples for a particular class as the number of points assigned to that class, which is not assigned beforehand: we do not know how many samples belong to a particular class k. That has to be found out, which in turn gives you the mixing coefficient pi_k.

So what does the EM algorithm do? The EM algorithm is an iterative optimization technique which operates locally to find the set of values of the parameters. What are the parameters we need to estimate for a mixture of Gaussians? The mixing coefficients, the set of class means, and the covariance matrices. Say somebody decides to fit a mixture of K Gaussians, K Gaussian components to be precise, on the data, so K is known. If there are K Gaussian classes you want to fit, you have K different means, each of the same dimension as the data, K covariance matrices, and K mixing coefficients, so there are K sets of three kinds of parameters to estimate. Of course, each mixing coefficient is a scalar quantity; for mu_k there are K means, each of dimension d; and there are K covariance matrices, each with d squared elements. These are the parameters one needs to estimate; let us see how EM does it.
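The responsibility expression being described, which matches the gamma used in the E step later (this is Bishop's standard notation; the slide may write it slightly differently), should be:

```latex
% Responsibility: posterior probability that component k generated x_n,
% with the mixing coefficient \pi_k playing the role of the prior
\gamma_k(\mathbf{x}_n) =
  \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\Sigma_k)}
       {\sum_{j=1}^{K}\pi_j\,
        \mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_j,\Sigma_j)}
```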
So there are two steps in EM: one is called the expectation and the other the maximization, and they are done iteratively one after another. You have an expectation step, or an estimation step as it is also called, followed by a maximization step, and you follow this pair of steps in sequence until a condition of convergence is satisfied. In the estimation step, for given parameter values we can compute the expected values of the latent variables, hence it is also called the expectation step; then you have a maximization step, which updates the parameters of the given model based on the latent variables, calculated using the ML method.

So let us look at the EM algorithm now. Given a Gaussian mixture model, our goal is to maximize the likelihood function, which you have seen if you slide back, with respect to the parameters pi, mu and sigma: the means mu, the covariance matrices sigma of the components, and the mixing coefficients pi_k. The first step is to initialize the means mu_j (j is just an index running from 1 to K), the covariance terms and the mixing coefficients, and evaluate the initial value of the log likelihood. You can start with an arbitrary set of random values here; but instead of being absolutely random, since you do not have class information, you could take the overall mean of the entire dataset as the initial mean of each cluster or Gaussian, and the covariance matrices could be initialized similarly. In fact, a fair amount of research has gone into finding a good starting point for any iterative or optimization method: the closer you start to the final solution, the faster you converge and the better and more accurate the solution you will have. So instead of starting with absolutely random values, which is possible here, you can use better estimates; but I am not touching those aspects in this particular talk.

So initialize, let us say, with some initial values, which could even be random. Then you go to the E step of EM, the expectation step, where you compute the latent variable as given by the expression here. This is gamma_k, the latent variable for the kth Gaussian, and the corresponding expression is given here. In the denominator you sum up all the Gaussian densities evaluated at the corresponding x using the current parameter values, and the corresponding kth Gaussian term is in the numerator. After you have estimated this, go to the third step, which is the M step. Mind you, in the second step you have estimated this gamma_j or gamma_k (the index has changed, but it is the same variable), and that goes inside the expression here to compute the corresponding mean and covariance term. This is the same as the ML step done earlier, except that the latent variable acts as a sort of weight which comes in here and helps you estimate a more accurate value of the mean and the covariance matrix. Remember, the mixing coefficient must also be calculated using the latent variable computed in the E step, in step number 2 earlier, as given here.
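The M-step update expressions referred to here should correspond to the standard form (Bishop, Pattern Recognition and Machine Learning, Chapter 9):

```latex
% M step: weighted ML re-estimates using the responsibilities from the E step;
% N_k is the effective number of points assigned to component k
N_k = \sum_{n=1}^{N}\gamma_k(\mathbf{x}_n)

\boldsymbol{\mu}_k^{\text{new}}
  = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_k(\mathbf{x}_n)\,\mathbf{x}_n

\Sigma_k^{\text{new}}
  = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_k(\mathbf{x}_n)\,
    (\mathbf{x}_n-\boldsymbol{\mu}_k^{\text{new}})
    (\mathbf{x}_n-\boldsymbol{\mu}_k^{\text{new}})^{\top}

\pi_k^{\text{new}} = \frac{N_k}{N}
```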
Once these mu_j's are available, all three quantities, as you can see in the three expressions here, the mixing coefficients, the covariances and the means, are computed using the data sample points and the latent variable, which in turn uses the normal distribution expressions given here for the E step. Then, in step 4, you obtain the Gaussian mixture using the parameters estimated in step 3: this is your GMM. What you need to do now is find out whether the corresponding likelihood estimated here truly represents the data samples. If it does not, and typically it will not within the first few iterations, you go back to step 2. So you keep repeating steps 2, 3 and 4; essentially you are repeating steps 2 and 3, the E and M parts, the expectation and the maximization, and that helps you converge to a better solution. We will have an illustration of this very soon in the next few slides. You put in a convergence criterion and say: I will keep repeating this process until the log likelihood of the Gaussian mixture satisfies some criterion of convergence. One such criterion is that in successive iterations, or over a set of a few iterations, the parameters do not change. What are the parameters? The mixing coefficients pi, the means mu and the covariances sigma; if they do not change over successive iterations, or over the last few iterations, that is one condition of convergence you can use. A similar one, given in step 4, is when the estimated log likelihood itself does not change over a set of iterations: you compute and store the log likelihood value from the previous iteration, compare it with the current one, and if the change is negligible, below a certain threshold, you say the criterion of convergence has been met and you have estimated the corresponding Gaussians for the data distribution you have.

So to wind up, let us take an example now in 2D; these images have been obtained from the e-book Pattern Recognition and Machine Learning by Bishop, whose reference was given at the beginning of this lecture. What does it show? This is a scatter in which it is almost visible that there are two different clusters of data, so it is possible to fit them with two different Gaussian distributions. Almost blindly I am selecting the value of K to be 2, but you could select K to be 3 or 4 also; there will be some amount of convergence, whether good or bad is another matter. In this case let us take the example of K equal to 2. The Gaussians have been initialized at these two places with the corresponding means and scatters: the two ellipses show that there is one Gaussian here (k = 1) and another Gaussian here (k = 2), the corresponding means are at the centers of the ellipses, and the distributions' scatter is in 2D. As you keep proceeding, remember what we did in the EM algorithm: there were two typical steps, starting from the initial set of (possibly random) values used to begin the cycle for the parameters of the Gaussian mixture model.
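Before continuing with the animated example, the whole loop, steps 1 to 4 including the log-likelihood convergence check just described, can be pulled together in a short sketch. This is a minimal illustrative implementation, assuming only a data matrix X of shape (N, d) and a chosen K; it is not the lecture's own code, and the simple initialization here stands in for the better initializations mentioned above:

```python
# Minimal EM-for-GMM sketch: E step, M step, and a log-likelihood
# convergence check. Initialization is deliberately simple (random data
# points as means, identity covariances, uniform weights) -- an assumption,
# not the lecture's recommendation. A production version would also use
# log-sum-exp for numerical stability.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: initialize means, covariances and mixing coefficients.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities gamma[n, k] via the Bayes rule.
        dens = np.column_stack(
            [pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
             for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted ML re-estimates of the parameters.
        Nk = gamma.sum(axis=0)                 # effective points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
        # Step 4: convergence check on the log likelihood (computed with the
        # parameters that produced this E step).
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```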
You use the E step to estimate the latent variables; then, in the third step, you go to the M step, where you estimate the parameters again using the latent variables, and you continue this process, keeping a watch on the log likelihood of course. As you proceed, you can see that this initial set of points marked in blue will be assigned to this particular Gaussian, and similarly for this other set of points, because they are the ones closest to the corresponding Gaussian. Remember, this was the initial stage, and you start assigning the distribution: the points lying close to the Gaussian marked by this blue ellipse will be assigned to that Gaussian, and the points which are now labeled in red will be assigned to this one. What do you do after this assignment? You recompute the parameters: the mixing coefficients, the means and the scatters. Let us see how the recomputed means look after the first iteration. Once you recompute, they will appear like this: this is the distribution for the points which have been assigned to the first Gaussian after step 1, and the corresponding one for the second is marked here, after the second step. This is how it gradually starts converging: you reassign the points once again, and at step number 5 you can see that one of the Gaussians is tending to converge to this compact cluster of points marked in blue, while the points here are becoming red, following the other Gaussian distribution. After 20 iterations you have almost converged, and this is the final stage of iteration: if you iterate further, you will not have a change in either the parameter values or the log likelihood criterion computed with this set of parameters. Let us have a look at this animation once again. This is the starting point; you do not know where the clusters are or where to put the Gaussians, so you can start almost anywhere, and then this is how the convergence takes place: second iteration, fifth iteration and the 20th iteration. So this is a method by which you can fit a set of Gaussians to a scatter of data points which does not follow a single Gaussian function, and it is used not only to form clusters but also to model data points, because afterwards you can apply all your methods of classification, or transform the clusters and group them under certain criteria, and so on. So this is one method which is used to model data as well as to form clusters. Thank you very much.
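As a practical footnote: in everyday use one would typically reach for a library implementation rather than hand-rolled EM. A minimal sketch with scikit-learn's GaussianMixture, on synthetic two-cluster data standing in for the Bishop example (the data values here are made up for illustration), and with K set to 2 as in the lecture:

```python
# Fit a 2-component GMM with scikit-learn; EM runs inside fit().
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([4, 4], 0.5, (100, 2))])   # two synthetic clusters

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)        # hard cluster assignments per sample
print(gmm.weights_)                # mixing coefficients pi_k
print(gmm.means_)                  # component means mu_k
print(gmm.covariances_)            # component covariances Sigma_k
print(gmm.predict_proba(X)[:5])    # responsibilities gamma_k(x_n), first 5
```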