So, we considered the density of the multivariate normal distribution in the previous class. Now, if we draw a random sample from a multivariate normal distribution, we want to estimate its parameters or carry out tests on them; in general, we want to make inferences on the parameters of a multivariate normal distribution. Firstly, I will discuss the estimation part.

Let us consider a random sample from a multivariate normal population, say $u_1, u_2, \ldots, u_n$, independent and identically distributed $N_p(\mu, \Sigma)$ random vectors. That means these are observations from a $p$-variate normal distribution, so $E(u_i) = \mu$ and the dispersion matrix of $u_i$ is $\Sigma$ for $i = 1, \ldots, n$. Clearly, if we define $\bar u = \frac{1}{n}\sum_{i=1}^n u_i$, then $E(\bar u) = \mu$, so the sample mean vector is an unbiased estimator of $\mu$. Similarly, consider
$$\frac{1}{n-1}\, S = \frac{1}{n-1}\sum_{i=1}^n (u_i - \bar u)(u_i - \bar u)'.$$
This is unbiased for $\Sigma$.

Let me give the interpretation. Arrange the sample as the $p \times n$ matrix $U'$ whose columns are $u_1, u_2, \ldots, u_n$: the components of $u_1$ are $u_{11}, u_{21}, \ldots, u_{p1}$, the components of $u_2$ are $u_{12}, u_{22}, \ldots, u_{p2}$, and so on up to $u_n$ with components $u_{1n}, u_{2n}, \ldots, u_{pn}$. If we denote the row vectors of this matrix by $y_1', y_2', \ldots, y_p'$, then $y_i'$ represents a random sample of size $n$ on the $i$-th component, which is $N(\mu_i, \sigma_i^2)$. Also write $u_i - \bar u = (u_{1i} - \bar u_1, \ldots, u_{pi} - \bar u_p)'$, where $\bar u_i = \frac{1}{n}\sum_{j=1}^n u_{ij}$.

Therefore $\frac{1}{n-1}\sum_i (u_i - \bar u)(u_i - \bar u)'$ is the matrix whose $(1,1)$ entry is $\frac{1}{n-1}\sum_i (u_{1i} - \bar u_1)^2$ (multiplying each centered vector by its own transpose puts the square of the first component in the first diagonal position), whose $(2,2)$ entry is $\frac{1}{n-1}\sum_i (u_{2i} - \bar u_2)^2$, and so on down to $\frac{1}{n-1}\sum_i (u_{pi} - \bar u_p)^2$, while the off-diagonal $(1,2)$ entry is $\frac{1}{n-1}\sum_i (u_{1i} - \bar u_1)(u_{2i} - \bar u_2)$, etc. Now $\frac{1}{n-1}\sum_i (u_{1i} - \bar u_1)^2$ is unbiased for $\sigma_1^2$, $\frac{1}{n-1}\sum_i (u_{1i} - \bar u_1)(u_{2i} - \bar u_2)$ is unbiased for $\sigma_{12}$, and so on. So $S/(n-1)$ is unbiased for $\Sigma$, and we are able to obtain unbiased estimators for both $\mu$ and $\Sigma$. We had the concept of minimum variance unbiased estimation in the case of a scalar parameter; since we are dealing with a vector parameter here, that concept does not carry over as such, though of course we can consider component-wise minimum variance unbiased estimation.
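As a quick illustration, here is a minimal sketch of these two estimators in NumPy, using a synthetic sample (the particular $\mu$, $\Sigma$, sample size, and seed below are arbitrary choices for illustration, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])                 # true mean vector (arbitrary)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])             # true covariance (arbitrary)
n = 500
U = rng.multivariate_normal(mu, Sigma, size=n)  # n x p matrix, rows are u_i'

u_bar = U.mean(axis=0)                          # sample mean vector, unbiased for mu
centered = U - u_bar
S = centered.T @ centered                       # sum of (u_i - u_bar)(u_i - u_bar)'
Sigma_unbiased = S / (n - 1)                    # unbiased for Sigma
# np.cov(U, rowvar=False) returns the same (n-1)-denominator estimate.
```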
Now, in the case of one variable, for example $N(\mu, \sigma^2)$, we have also looked at the maximum likelihood estimators: in the one-dimensional case the maximum likelihood estimator of $\mu$ was the sample mean, and that of $\sigma^2$ was $\frac{1}{n}\sum (x_i - \bar x)^2$. Here we can consider the analogue of that: with $S$ as defined above, we will get $S/n$ for the variance-covariance matrix $\Sigma$ and $\bar u$ for $\mu$. Let us prove this.

So, we consider maximum likelihood estimation of the parameters of a multivariate normal distribution. As before, $u_1, u_2, \ldots, u_n$ is a random sample from $N_p(\mu, \Sigma)$. Let us go back to the density function of $u_i$: in the previous class we saw that when $\Sigma$ has full rank it is given by
$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}.$$
Writing this density for $u_1, u_2, \ldots, u_n$, the likelihood function is
$$L(\mu, \Sigma;\, u_1, \ldots, u_n) = \frac{|\Sigma|^{-n/2}}{(2\pi)^{np/2}} \exp\Big(-\tfrac{1}{2}\sum_{i=1}^n (u_i - \mu)'\Sigma^{-1}(u_i - \mu)\Big).$$

Firstly, let us simplify the exponent $\sum_i (u_i - \mu)'\Sigma^{-1}(u_i - \mu)$. Adding and subtracting the sample mean vector and expanding,
$$\sum_{i=1}^n (u_i - \bar u + \bar u - \mu)'\Sigma^{-1}(u_i - \bar u + \bar u - \mu) = \sum_i (u_i - \bar u)'\Sigma^{-1}(u_i - \bar u) + n(\bar u - \mu)'\Sigma^{-1}(\bar u - \mu) + 2\sum_i (u_i - \bar u)'\Sigma^{-1}(\bar u - \mu).$$
Applying the summation in the cross term gives $\bar u - \bar u$, so that term is actually 0; it vanishes, and only the first two terms remain.

Next, each $(u_i - \bar u)'\Sigma^{-1}(u_i - \bar u)$ is a scalar, so it can be written as its own trace, and since $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ we can interchange the order of the factors and take the summation inside:
$$\sum_i (u_i - \bar u)'\Sigma^{-1}(u_i - \bar u) = \operatorname{tr}\Big(\Sigma^{-1}\sum_i (u_i - \bar u)(u_i - \bar u)'\Big) = \operatorname{tr}(\Sigma^{-1} S).$$
Therefore the likelihood function can be rewritten as
$$L(\mu, \Sigma) = \frac{|\Sigma|^{-n/2}}{(2\pi)^{np/2}}\; e^{-\frac{1}{2}\operatorname{tr}(\Sigma^{-1} S)}\; e^{-\frac{n}{2}(\bar u - \mu)'\Sigma^{-1}(\bar u - \mu)}.$$
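The decomposition above is easy to verify numerically. A minimal sketch, assuming NumPy and an arbitrary synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
n = 200
U = rng.multivariate_normal(mu, Sigma, size=n)
u_bar = U.mean(axis=0)
S = (U - u_bar).T @ (U - u_bar)
Sigma_inv = np.linalg.inv(Sigma)

# Exponent as it first appears in the likelihood ...
lhs = sum((u - mu) @ Sigma_inv @ (u - mu) for u in U)
# ... and after centering at u_bar and applying the trace identity.
rhs = np.trace(Sigma_inv @ S) + n * (u_bar - mu) @ Sigma_inv @ (u_bar - mu)
assert np.isclose(lhs, rhs)
```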
Now we want to maximize this. Let us first maximize with respect to $\mu$. No $\mu$ appears except in the last factor, so basically we have to minimize $(\bar u - \mu)'\Sigma^{-1}(\bar u - \mu)$. Since $\Sigma$, and hence $\Sigma^{-1}$, is positive definite, $(\bar u - \mu)'\Sigma^{-1}(\bar u - \mu) \ge 0$ with equality exactly at $\hat\mu = \bar u$. So $\bar u$ is the maximum likelihood estimator of $\mu$.

Having reduced the second exponent in the likelihood function to zero, the likelihood reduces to
$$L(\bar u, \Sigma) = \frac{|\Sigma|^{-n/2}}{(2\pi)^{np/2}}\; e^{-\frac{1}{2}\operatorname{tr}(\Sigma^{-1} S)},$$
and we now consider its maximization with respect to $\Sigma$. Taking logarithms (up to an additive constant),
$$\log L = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} S) = \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} S),$$
where in the second form we work in terms of $\Sigma^{-1}$ (the sign of the log-determinant term flips). Denoting the elements of $\Sigma^{-1}$ by $\sigma^{ij}$, differentiating $\log L$ with respect to $\sigma^{ij}$ and equating to zero gives $\hat\sigma_{ij} = s_{ij}/n$, that is, $\hat\Sigma = S/n$.

To prove that $S/n$ actually maximizes $\log L$, we must show that for every positive definite $\Sigma$
$$\frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} S) \;\le\; \frac{n}{2}\log\big|(S/n)^{-1}\big| - \frac{1}{2}\operatorname{tr}\big((S/n)^{-1} S\big),$$
i.e. that the difference of the two sides is $\le 0$. Multiplying through by 2 (the common factor $1/2$ can be removed from every term), and noting that $(S/n)^{-1} S = n I_p$ so its trace is $np$, the difference becomes
$$n\log|\Sigma^{-1}| - \operatorname{tr}(\Sigma^{-1} S) - n\log\big|(S/n)^{-1}\big| + np.$$
Combining the two log-determinant terms and taking $n$ out, this is
$$n\Big(\log\big|\Sigma^{-1} S/n\big| - \operatorname{tr}\big(\Sigma^{-1} S/n\big) + p\Big).$$
Now we do some manipulation: write $\Sigma^{-1} S$ as $\Sigma^{-1/2}\,\Sigma^{-1/2} S$ and use $|AB| = |BA|$, so that $\big|\Sigma^{-1} S/n\big| = \big|\Sigma^{-1/2} (S/n)\, \Sigma^{-1/2}\big|$. This type of breakup is allowed because we assumed $\Sigma$ to be a positive definite matrix: that is why $\Sigma^{-1}$ exists, and from the decomposition we did in one of the previous lectures the square-root matrix $\Sigma^{-1/2}$ is also available. The same argument can be used inside the trace, because $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ as well.
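Before completing the argument, here is a quick numerical check of the claim that $S/n$ maximizes this log-likelihood: a sketch assuming NumPy, comparing $\hat\Sigma = S/n$ against a few randomly generated positive definite alternatives (all concrete numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
U = rng.normal(size=(n, p))                     # synthetic sample
u_bar = U.mean(axis=0)
S = (U - u_bar).T @ (U - u_bar)

def log_l(Sig):
    # log-likelihood in Sigma with mu set to u_bar, constants dropped
    _, logdet = np.linalg.slogdet(Sig)
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.inv(Sig) @ S)

Sigma_hat = S / n
for _ in range(5):
    A = rng.normal(size=(p, p))
    candidate = A @ A.T + 0.1 * np.eye(p)       # random positive definite matrix
    assert log_l(Sigma_hat) >= log_l(candidate)
```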
Returning to the argument, the difference is therefore equal to
$$n\Big(\log \prod_{i=1}^p \lambda_i - \sum_{i=1}^p \lambda_i + p\Big) = n\sum_{i=1}^p \big(\log\lambda_i - \lambda_i + 1\big),$$
where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the characteristic roots of $\Sigma^{-1/2}(S/n)\Sigma^{-1/2}$, and since we started with positive definite matrices these are all positive. Now $\log x - x + 1 \le 0$ for all $x > 0$, with equality only at $x = 1$, so each summand, and hence the whole difference, is $\le 0$. This is exactly what we wanted to prove: substituting any $\Sigma$ into $\log L$ gives at most the value at $\Sigma = S/n$, so $S/n$ is the maximum likelihood estimator of $\Sigma$.

Also, if you look at the likelihood function, which is just the joint density written in the factored form above, we can conclude that $(\bar u, S)$ is sufficient for this problem: an application of the factorization theorem to the joint pdf of $u_1, u_2, \ldots, u_n$ shows that $(\bar u, S)$ is a sufficient statistic. This fact will be further useful in the inference problems.

So, let us summarize: we have considered the multivariate normal distribution and discussed several of its properties. One important point we saw was the use of a non-central chi-square distribution: the sum of squares of independent standard normal random variables is a central chi-square, but if we consider normal variables with non-zero means and take the sum of squares, we get a non-central chi-square. So I will now introduce non-central distributions; they are extremely useful in multivariate theory. Let me start with the non-central chi-square, and then gradually we will talk about the non-central t and non-central F distributions also.

So, we talk about the non-central chi-square distribution. Let us consider $x \sim N(\mu, 1)$. We have seen that if $x \sim N(0, 1)$ then $y = x^2$ has a $\chi^2_1$ distribution. Now, if $x \sim N(\mu, 1)$ then $(x - \mu)^2$ will be $\chi^2_1$, but what about $x^2$ itself? Let us derive the distribution of $y = x^2$. Consider the CDF of $y$: naturally $F_Y(y) = 0$ if $y < 0$, and for $y > 0$,
$$F_Y(y) = P(|x| \le \sqrt y) = P(-\sqrt y \le x \le \sqrt y).$$
Transforming to a standard normal $z \sim N(0, 1)$, this is $P(-\sqrt y - \mu \le z \le \sqrt y - \mu)$, so in terms of the function $\Phi$, the CDF of the standard normal distribution, we can write
$$F_Y(y) = \Phi(\sqrt y - \mu) - \Phi(-\sqrt y - \mu).$$
So we have derived the cumulative distribution function of $x^2$.
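This CDF can be checked against a library implementation. A sketch assuming NumPy and SciPy; note that `scipy.stats.ncx2` parametrizes the non-centrality as $\mathrm{nc} = \mu^2$, which is twice the $\lambda = \mu^2/2$ used in this lecture, and `mu0 = 1.7` is an arbitrary choice:

```python
import numpy as np
from scipy.stats import norm, ncx2

mu0 = 1.7                                       # mean of x ~ N(mu0, 1), arbitrary
ys = np.linspace(0.1, 10.0, 50)
# Derived CDF: Phi(sqrt(y) - mu) - Phi(-sqrt(y) - mu)
derived = norm.cdf(np.sqrt(ys) - mu0) - norm.cdf(-np.sqrt(ys) - mu0)
assert np.allclose(derived, ncx2.cdf(ys, df=1, nc=mu0 ** 2))
```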
We can also find the probability density function, since the derivative of $\Phi$ is $\phi$. Let us first revise the definitions: $\phi(t)$ denotes the PDF of the standard normal distribution, $\phi(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}$, and $\Phi(x)$ is nothing but its cumulative distribution function, the CDF of $N(0, 1)$. Differentiating $F_Y$, each term contributes a factor $\frac{1}{2\sqrt y}$ from the chain rule, and the two minus signs in the second term combine into a plus:
$$f_Y(y) = \frac{1}{2\sqrt y}\,\phi(\sqrt y - \mu) + \frac{1}{2\sqrt y}\,\phi(-\sqrt y - \mu), \qquad y > 0,$$
and $f_Y(y) = 0$ for $y \le 0$ (whether we include the point 0 in one piece or the other makes no difference). I will write only the part where the density is positive. Substituting the form of $\phi$ and using $\phi(-t) = \phi(t)$,
$$f_Y(y) = \frac{1}{2\sqrt{2\pi y}}\Big(e^{-\frac{1}{2}(\sqrt y - \mu)^2} + e^{-\frac{1}{2}(\sqrt y + \mu)^2}\Big).$$
Let us simplify this: $-\frac{1}{2}(\sqrt y - \mu)^2 = -\frac{y}{2} - \frac{\mu^2}{2} + \mu\sqrt y$ and $-\frac{1}{2}(\sqrt y + \mu)^2 = -\frac{y}{2} - \frac{\mu^2}{2} - \mu\sqrt y$, so keeping the common factor outside,
$$f_Y(y) = \frac{e^{-y/2 - \mu^2/2}}{2\sqrt{2\pi y}}\big(e^{\mu\sqrt y} + e^{-\mu\sqrt y}\big).$$
If we consider the expansions of $e^{\mu\sqrt y}$ and $e^{-\mu\sqrt y}$, the alternating odd terms cancel out and the even terms add up to twice their value, so the factor 2 goes away:
$$f_Y(y) = \frac{e^{-y/2 - \mu^2/2}}{\sqrt{2\pi y}} \sum_{k=0}^{\infty} \frac{(\mu\sqrt y)^{2k}}{(2k)!}.$$
Now substitute $\lambda = \mu^2/2$, so that $(\mu\sqrt y)^{2k} = (2\lambda)^k y^k$:
$$f_Y(y) = \frac{e^{-\lambda}\, e^{-y/2}}{\sqrt{2\pi y}} \sum_{k=0}^{\infty} \frac{\lambda^k\, 2^k\, y^k}{(2k)!}.$$
Multiplying and dividing by $k!$ and combining the terms in a particular way, this can be written as
$$f_Y(y) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!} \cdot \frac{1}{2^{(2k+1)/2}\,\Gamma\!\big(\tfrac{2k+1}{2}\big)}\; e^{-y/2}\; y^{\frac{2k+1}{2} - 1}.$$
Let us see how this comes about. The factor $e^{-\lambda}\lambda^k/k!$ is written first. Then, in $k!\,2^k/(2k)!$, the even factors $2k, 2k-2, \ldots, 2$ of $(2k)!$ cancel against $k, k-1, \ldots, 1$ of $k!$, leaving a net power of 2; the remaining odd factors $(2k-1)(2k-3)\cdots 1$, each halved, combine into $\Gamma\!\big(\tfrac{2k+1}{2}\big)$ up to a factor $\sqrt\pi$, which cancels against the $\sqrt\pi$ in $\sqrt{2\pi y}$; the leftover powers of 2, together with the remaining $\sqrt 2$, are put together as $2^{k + 1/2} = 2^{(2k+1)/2}$. Finally, there is the $e^{-y/2}$ term, and the power of $y$ is $y^k$ times the $y^{-1/2}$, written as $y^{(2k+1)/2 - 1}$.
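At this point the simplified closed form can be sanity-checked numerically, before we read off its interpretation. A sketch under the same assumptions and the same SciPy convention ($\mathrm{nc} = \mu^2 = 2\lambda$) as before:

```python
import numpy as np
from scipy.stats import ncx2

mu0 = 1.7                                       # arbitrary mean for x ~ N(mu0, 1)
ys = np.linspace(0.1, 10.0, 50)
# Closed form: e^{-y/2 - mu^2/2} (e^{mu sqrt(y)} + e^{-mu sqrt(y)}) / (2 sqrt(2 pi y))
derived = (np.exp(-ys / 2 - mu0 ** 2 / 2)
           * (np.exp(mu0 * np.sqrt(ys)) + np.exp(-mu0 * np.sqrt(ys)))
           / (2 * np.sqrt(2 * np.pi * ys)))
assert np.allclose(derived, ncx2.pdf(ys, df=1, nc=mu0 ** 2))
```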
Writing it down in this particular way gives it an interpretation:
$$f_Y(y) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\; f_{1+2k}(y),$$
where $f_m(y)$ denotes the density of the $\chi^2_m$ distribution. So the density of $y = x^2$ is a weighted sum of central chi-square densities, and the weights are Poisson probabilities. This is the PDF of the non-central chi-square on 1 degree of freedom with non-centrality parameter $\lambda$, where $\lambda$ is actually $\mu^2/2$: a Poisson($\lambda$)-weighted mixture of central chi-squares.

Now let us consider $x \sim N_p(\mu, I)$; that means I am considering $p$ components, so $x_1, x_2, \ldots, x_p$ are independent normals with means $\mu_1, \mu_2, \ldots, \mu_p$ and unit variances, and in general I am assuming $\mu$ to be non-zero, because at $\mu = 0$ this will simply give a central chi-square. Now I am looking at $y = x'x = \sum_{i=1}^p x_i^2$. This has a non-central chi-square distribution with $p$ degrees of freedom and non-centrality parameter $\lambda = \frac{1}{2}\mu'\mu = \frac{1}{2}\sum \mu_i^2$.

Let us look at this. Define $\Gamma$ to be an orthogonal matrix with first row $\mu'/\|\mu\|$ and the other rows orthogonal to it. Consider $z = \Gamma x$; then $z \sim N_p(\Gamma\mu, I)$. But what is $\Gamma\mu$? The first component is $\mu'\mu/\|\mu\| = \|\mu\|^2/\|\mu\| = \|\mu\|$, and the other components are 0 because the other rows are orthogonal to the first. Also, if I consider $x'x$, that is $z'\Gamma\Gamma'z = z'z$, because $\Gamma$ is orthogonal; so $x'x = \sum z_i^2$.

If I consider the first component of $z = (z_1, z_2, \ldots, z_p)'$, then $z_1 \sim N(\|\mu\|, 1)$, so $z_1^2$ follows $\chi^2_1(\lambda)$, the non-central chi-square distribution with 1 degree of freedom and $\lambda$ as the non-centrality parameter. (We write these as $\chi^2_p(\lambda)$ and $\chi^2_1(\lambda)$; sometimes the notation $\chi^2(p, \lambda)$ is also used; these are various forms of the same notation.) So what we are getting in $z'z = z_1^2 + z_2^2 + \cdots + z_p^2$ is: conditionally on $k \sim$ Poisson($\lambda$), $z_1^2$ is $\chi^2_{1+2k}$, while $z_2^2, \ldots, z_p^2$ are central $\chi^2_1$, and these are all independent. Adding up the degrees of freedom, $z'z$ given $k$ follows $\chi^2_{p+2k}$, and we conclude that the pdf of $w = z'z$ is
$$f_W(y) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\; f_{p+2k}(y).$$
Since we have written $y = x'x$ and $x'x = z'z$, this is actually the density of $y$.
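The general-$p$ mixture form can be verified the same way, by truncating the Poisson series. A sketch assuming SciPy, with an arbitrary mean vector (and again SciPy's $\mathrm{nc} = \|\mu\|^2 = 2\lambda$ convention):

```python
import numpy as np
from scipy.stats import chi2, ncx2, poisson

mu_vec = np.array([1.0, -0.5, 2.0, 0.0])        # arbitrary mean vector
p = mu_vec.size
lam = mu_vec @ mu_vec / 2                       # lambda = mu'mu / 2
ys = np.linspace(0.1, 20.0, 50)
# Truncated mixture: sum_k P(K = k) f_{p+2k}(y) with K ~ Poisson(lambda)
mixture = sum(poisson.pmf(k, lam) * chi2.pdf(ys, df=p + 2 * k) for k in range(80))
assert np.allclose(mixture, ncx2.pdf(ys, df=p, nc=2 * lam))
```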
We can now look at some elementary properties. If we consider $E(y)$, we can write it as $E[E(y \mid k)] = E(p + 2k)$, because for a chi-square distribution, given $k$ it becomes central and its mean equals its degrees of freedom; this equals $p + 2E(k) = p + 2\lambda = p + \|\mu\|^2$.

We can also consider the characteristic function of $y$:
$$\psi_y(t) = E\big(e^{ity}\big) = E\big[E\big(e^{ity} \mid k\big)\big] = E\big[(1 - 2it)^{-(p+2k)/2}\big],$$
using the known characteristic function of the central chi-square given $k$. To take the expectation with respect to $k \sim$ Poisson($\lambda$), write
$$\psi_y(t) = (1 - 2it)^{-p/2}\, E\big[(1 - 2it)^{-k}\big] = (1 - 2it)^{-p/2} \sum_{k=0}^{\infty} \frac{e^{-\lambda}}{k!}\Big(\frac{\lambda}{1 - 2it}\Big)^k = (1 - 2it)^{-p/2}\, e^{\lambda/(1-2it)}\, e^{-\lambda} = (1 - 2it)^{-p/2} \exp\Big(\frac{2it\lambda}{1 - 2it}\Big)$$
after combining the exponents. Once we are able to determine the characteristic function of the non-central chi-square distribution, other characteristics such as its variance can also be found easily, so I am leaving the discussion at this point.

Now, if you remember, the definitions of the t distribution and the F distribution made use of the chi-square; if that chi-square is replaced by a non-central chi-square, similar changes occur. So let me define the non-central F. Let $w_1 \sim \chi^2_p(\lambda)$ and $w_2 \sim \chi^2_q$, with $w_1$ and $w_2$ independent. Then $\frac{w_1/p}{w_2/q}$ is said to have a non-central F distribution with $p$ and $q$ degrees of freedom and non-centrality parameter $\lambda$. There is also the possibility that the denominator chi-square is non-central: if $w_1 \sim \chi^2_p(\lambda)$ and $w_2 \sim \chi^2_q(\tau)$, and of course they are independent, then $\frac{w_1/p}{w_2/q}$ is called a doubly non-central F. We can also take $x \sim N(\mu, 1)$ and $w \sim \chi^2_n$, independent; then $x/\sqrt{w/n}$ is called a non-central t, with $n$ degrees of freedom and non-centrality parameter $\mu$. These are some of the distributions that are used when we deal with the general multivariate normal distribution, and these quantities will appear in the distributions of the test statistics which we use for constructing tests on the parameters of a multivariate normal distribution, or for constructing confidence intervals, etc.
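The non-central F construction can be checked by simulation. A sketch assuming SciPy (the degrees of freedom, non-centrality, and seed are arbitrary; `scipy.stats.ncf` uses the same nc scale as `scipy.stats.ncx2`):

```python
import numpy as np
from scipy.stats import chi2, kstest, ncf, ncx2

rng = np.random.default_rng(4)
p, q, nc = 3, 12, 2.5                           # arbitrary dfs and noncentrality
w1 = ncx2.rvs(df=p, nc=nc, size=20000, random_state=rng)
w2 = chi2.rvs(df=q, size=20000, random_state=rng)
f_sample = (w1 / p) / (w2 / q)                  # the defining ratio
stat, pval = kstest(f_sample, ncf(dfn=p, dfd=q, nc=nc).cdf)
print(pval)                                     # typically well above 0.05
```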
Now, in the case of the univariate normal distribution, we had: if $x_1, x_2, \ldots, x_n$ is a random sample and we consider the sample variance $s^2 = \frac{1}{n-1}\sum (x_i - \bar x)^2$, then we obtained the distribution $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$. What could be a possible generalization of this to the multi-dimensional case? In the multi-dimensional setting, for the variance-covariance matrix $\Sigma$ we get the sample matrix $S$, and we are considering $S/(n-1)$ as the unbiased estimator. So what could be the distribution of $S$? For this we need the concept of a matrix distribution, which I will cover in the next lecture.

Let us look at one or two applications of sampling from multivariate distributions. Suppose I am considering $x_1, x_2, x_3$, where $x_1$ is the sweat rate, $x_2$ is the sodium content, and $x_3$ is the potassium content, with data on 20 items. It is assumed that $(x_1, x_2, x_3)'$ has an $N_3(\mu, \Sigma)$ distribution. The data are recorded as follows: for each item we write down the values of $x_1, x_2, x_3$. For item 1 it is 3.7, 48.5, 9.3; for item 2 it is 5.7, 65.1, 8.0; and so on up to item 20, which is 5.5, 40.9, 9.4.

If we want the maximum likelihood estimators of $\mu$ and $\Sigma$, we consider the sample mean vector with components $\frac{1}{20}\sum_i x_{1i}$, $\frac{1}{20}\sum_i x_{2i}$, $\frac{1}{20}\sum_i x_{3i}$; this gives the MLE of $\mu$. For the MLE of $\Sigma$ we need $\frac{1}{20}\sum (x_{1i} - \bar x_1)^2$, $\frac{1}{20}\sum (x_{2i} - \bar x_2)^2$, $\frac{1}{20}\sum (x_{3i} - \bar x_3)^2$ on the diagonal, and the cross-product terms $\frac{1}{20}\sum (x_{1i} - \bar x_1)(x_{2i} - \bar x_2)$, $\frac{1}{20}\sum (x_{1i} - \bar x_1)(x_{3i} - \bar x_3)$, $\frac{1}{20}\sum (x_{2i} - \bar x_2)(x_{3i} - \bar x_3)$. For calculation purposes simplifications can be done: for example, one may use $\frac{1}{20}\sum x_{1i}^2 - \bar x_1^2$ for the squared terms, since the centering terms cancel out, and similarly $\frac{1}{20}\sum x_{1i} x_{2i} - \bar x_1 \bar x_2$ etc. for the cross-product terms (a small computational sketch is given at the end of this section).

I end with an exercise. Let $x \sim N_5(\mu, \Sigma)$ with $\mu = (2, 4, -1, 3, 0)'$ and the $5 \times 5$ matrix
$$\Sigma = \begin{pmatrix} 4 & -1 & 1/2 & -1/2 & 0 \\ -1 & 3 & 1 & -1 & 0 \\ 1/2 & 1 & 6 & 1 & -1 \\ -1/2 & -1 & 1 & 4 & 0 \\ 0 & 0 & -1 & 0 & 2 \end{pmatrix}.$$
Partition $x = (x^{(1)\prime}, x^{(2)\prime})'$, where $x^{(1)} = (x_1, x_2)'$ and $x^{(2)} = (x_3, x_4, x_5)'$. Find the conditional distributions of $x^{(1)}$ given $x^{(2)} = (0, 2, -1)'$ and of $x^{(2)}$ given $x^{(1)} = (1, 5)'$. Also take $a' = (1, -1)$ and $b' = (1, 1, -2)$; find the distributions of $a'x^{(1)}$ and $b'x^{(2)}$ and the covariance between $a'x^{(1)}$ and $b'x^{(2)}$. Also find a $2 \times 2$ matrix $P$ and a $3 \times 3$ matrix $Q$ such that $Px^{(1)}$ and $Qx^{(2)}$ are independently distributed. I am leaving this as an exercise; you can try it.

So, in the next lecture I will consider a matrix distribution for the sample dispersion matrix $S$. It is called the Wishart distribution, and in the next lecture I will introduce it.
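Returning to the sweat-data example above, here is a sketch of the MLE computation in NumPy using the computational shortcut $\frac{1}{n}\sum x_i x_i' - \bar x \bar x'$; since the lecture quotes only three of the 20 rows, those rows stand in here as placeholders for the full data matrix:

```python
import numpy as np

# Placeholder for the 20 x 3 data matrix (sweat rate, sodium, potassium);
# only the rows quoted in the lecture are filled in.
X = np.array([[3.7, 48.5, 9.3],     # item 1
              [5.7, 65.1, 8.0],     # item 2
              [5.5, 40.9, 9.4]])    # item 20; remaining 17 rows omitted
n = X.shape[0]
mu_hat = X.mean(axis=0)                         # MLE of mu
# (1/n) sum x_i x_i' - x_bar x_bar'  equals  S/n, the MLE of Sigma
Sigma_hat = X.T @ X / n - np.outer(mu_hat, mu_hat)
```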