In module 6.1 we talked about the two schools of estimation: one is the Fisher school and the other is the Bayesian school. Fisher invented the notion of maximum likelihood estimation as a technique for point estimation of an unknown constant, scalar or vector. We will generally not be using maximum likelihood techniques in our discussion of data assimilation; we largely depend on least squares. But because of its underlying importance, one should have at least a nodding understanding of what the maximum likelihood estimation technique is. Once we have covered the basic aspects of maximum likelihood, which belongs to the Fisher school, we will talk about Bayesian estimation techniques in module 6.4. So modules 6.1, 2, 3, and 4 together contain an exposé of the basics of statistical estimation: least squares principles, illustrations, some of the fundamental theorems on intrinsic properties of estimates, the Gauss-Markov theorem, and a couple of other fundamental theorems that arise in Bayesian estimation. Once you understand these basics, you know how to evaluate the goodness of the estimate produced by a data assimilation procedure; that is the reason for including all these fundamental results from statistical estimation theory. They are no less important than the tools from multivariate calculus, matrix theory, linear algebra, and so on. Now, a quick exposé of the maximum likelihood method. Let Z = HX + V in the linear case, or Z = h(X) + V in the nonlinear case, where V is always random and V and X are uncorrelated. In Fisher's case we assume X is an unknown constant, so there is no prior distribution on X as in the Bayesian case. Given X, Z is random; given X, Z has a distribution, and that distribution is called the conditional distribution.
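As a minimal sketch of this setup, here is a scalar instance of the measurement model Z = HX + V (the values of H, the true X, and the noise level are made up for illustration): nature fixes X, but the additive noise V makes every observation Z random, and the spread of repeated observations is exactly the conditional distribution of Z given X.

```python
import random

random.seed(0)

# Hypothetical scalar instance of Z = H X + V: nature fixes X, noise V makes Z random.
H, x_true, noise_std = 1.0, 3.0, 0.5   # x_true is the constant nature picked (unknown to us)

def observe():
    """One realization of Z given the fixed, unknown X."""
    return H * x_true + random.gauss(0.0, noise_std)

# Repeated observations scatter around H * x_true: that spread IS the
# conditional distribution of Z given X.
samples = [observe() for _ in range(10000)]
mean_z = sum(samples) / len(samples)
assert abs(mean_z - H * x_true) < 0.05
```

The estimation problem of this module is the reverse direction: recover `x_true` from `samples` alone.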
The conditional distribution relates to the properties of the observation conditioned on the unknown. Nature plays a game with us: she picks a value of X and keeps it constant, but she does not care to tell us what X is. We make measurements on nature, and those measurements provide information about X; but the measurements are random, so Z is a random vector with an underlying distribution, and the properties of Z are conditioned on the value that mother nature has already chosen. Our aim is to uncover what mother nature picked. So the information about X is to be gleaned from Z, through the conditional distribution of Z given X. The conditional distribution always represents the new information that arises because we are able to make observations of the system. This quantity p(Z | X) can be looked at in two ways; it has a dual interpretation. For a given X, viewed as a function of Z, it is called the conditional distribution. But Fisher turned the table around: for a given Z, he asked, I have observed something; given Z, what is the most likely value of X that mother nature must have picked, such that I observed this Z? Let me spell out the difference. The conditional distribution means: given X, mother nature has already picked a value but did not tell you, and Z exhibits randomness.
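The dual reading just described can be sketched in a few lines. The same Gaussian density formula is read two ways: fix X and it is a density over Z (the conditional distribution); fix the observed Z and it is a function of X (the likelihood). The particular numbers and the choice h(x) = x are illustrative assumptions, not anything from the lecture's slides.

```python
import math

def gaussian_density(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Scalar model Z = h(X) + V with V ~ N(0, sigma^2); h(x) = x for simplicity.
sigma2 = 0.5
h = lambda x: x

# Conditional-distribution view: X fixed, a density over Z (integrates to 1 in Z).
x_fixed = 2.0
p_given_x = lambda z: gaussian_density(z, h(x_fixed), sigma2)

# Likelihood view (Fisher): Z observed, a function of X (need not integrate to 1 in X).
z_obs = 2.3
likelihood = lambda x: gaussian_density(z_obs, h(x), sigma2)

# Same formula, two readings: p(z | x) evaluated at z = 2.3 with x = 2.0
# equals L(x; z) evaluated at x = 2.0 with z = 2.3.
assert abs(p_given_x(2.3) - likelihood(2.0)) < 1e-12
```

The maximum likelihood method of this module maximizes the second function, `likelihood`, over the unknown x.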
So the randomness of Z conditioned on the value of X is p(Z | X): for a given X, a distribution over Z. That is the conditional distribution. Fisher turned the table around. He said: yes, mother nature has picked X, and she did not tell me; but I have the ability to make observations, and I have obtained a Z that is related to X. So he asked the following question: what is the most probable value of the unknown X that mother nature should have picked, given that I am viewing the observation Z? The quantities are the same, but one is viewed as a function of Z and the other as a function of X; one is called the conditional distribution and the other the likelihood function. That is the fundamental difference, and it is an enormous one: it is what led Fisher to concoct a new class of methods, the maximum likelihood methods. Fisher's principle, from around 1920: given Z, find the value of X that maximizes the probability of observing the sample Z. That is the basic idea of the maximum likelihood (ML) method, and that is the underlying principle of Fisher's strategy. Without loss of generality I can start with the nonlinear case. Let Z = h(X) + V, where V has mean 0 and covariance Σ and is uncorrelated with X; that is, V ~ N(0, Σ). Since h(X) is deterministic and V is random, Z is random: if V has a distribution with mean 0 and covariance Σ, then Z
has a distribution whose mean is h(X) and whose covariance is Σ. That is the conditional distribution. One of the basic tenets of Fisher's theory is that I should know the conditional distribution in its exact form: given X, I should know the distribution of the observation Z conditioned on X. This is the conditional distribution that Fisher's method rests on, so we need to have that distribution. This distribution involves X, and X is unknown; so what are we looking for? Remember that I can relate p to L, reading p(Z | X) as the likelihood L. Let X̂ be any estimate and let X̂_ML be the maximum likelihood estimate. How do I define the maximum likelihood estimate? It is the estimate that maximizes the likelihood of observing Z, compared to any other estimate: L(X̂_ML) ≥ L(X̂). The likelihood is a probability, the probability of observing the sample. So what does the left-hand side say? The probability of observing the sample Z when you set the parameter to X̂_ML is larger than the probability of observing Z for any other estimate X̂. Among all estimates, the maximum likelihood estimate gives the most probable value of the unknown, based on what is being observed. This inequality underlies the definition of the maximum likelihood estimation technique: I am interested in an X̂_ML that satisfies this property. Remember that L and the conditional distribution p are related, as we saw on the previous slide. Since the logarithm is an increasing function, taking the logarithm of both sides preserves the inequality: the natural logarithm of the likelihood on the left-hand side must be greater than or equal to the natural logarithm of the likelihood on the right-hand side. I am not going to derive this; Rao's book gives a beautiful treatment, and the book by Melsa and Cohen gives a very good derivation. A necessary
condition for this to happen is that the gradient of the log of the likelihood function vanish. The log-likelihood is ln L = ln p(Z | X), and its derivative with respect to X is (1/L) ∇_X L; at the maximum this must be 0. This necessary condition is extremely simple to see; it comes from the maximization property of the likelihood function. Good. Now I am going to illustrate this fundamental principle using a very simple example. Suppose I want to estimate an unknown μ; μ is a constant but is not observable, while Z is observable: Z = Hμ + V. In this case I am assuming μ is not even a vector; μ is just a real number, so n = 1. I have m observations, so H is m × 1, simply a vector, and I take H to be all ones. Therefore (Z₁, Z₂, …, Z_m)ᵀ = (1, 1, …, 1)ᵀ μ + (V₁, V₂, …, V_m)ᵀ, that is, Zᵢ = μ + Vᵢ for i = 1 to m; those are my observations. I am going to assume the covariance of V is R = σ²I. Now Hμ is a constant and V is a random vector; adding a constant to a random vector simply shifts the mean, so the distribution of Z is normal with mean Hμ and covariance σ²I. So the likelihood function is known in its functional form: it is Gaussian with H
μ as the mean and σ²I as the covariance. This is the explicit form of the multivariate Gaussian distribution. Please understand that the variable to be estimated is here called μ, because it is an unknown constant. When we consider this expression for a given μ as a function of Z, it is called the conditional distribution; for a given Z, considered as a function of μ, it is called the likelihood function. There are two variables, μ and Z, and whichever way you read it, the maximum likelihood method tries to find the optimal value of μ, optimal in the sense of maximizing this likelihood function. I hope that is clear. So I can compute the derivative. Given this setup, I can solve two problems: one is to estimate μ; but I can also formulate the problem as one of estimating σ², with μ known and the noise covariance unknown. So there are a number of estimation problems associated with this model. If you are interested in estimating μ, you consider the derivative of the log-likelihood with respect to μ; whichever variable you are interested in estimating, you maximize the likelihood with respect to that particular parameter. That means the derivative of the log of the likelihood function with respect to μ must be 0 at the maximum. If you are trying to estimate σ², the same principle applies: the derivative of the log-likelihood with respect to σ² must be 0 at the maximum. These are standard principles of
optimization. So you can see that as early as 1920, Fisher had mixed several ideas: the conditional distribution, its reinterpretation as a likelihood function, and the maximization of that likelihood as an optimization problem. You can see the role of optimization embedded in estimation theory: in least squares, least is best; in maximum likelihood, maximum is best. Optimization theory and estimation theory are inseparable: every estimation problem we are going to solve, we are going to solve as an optimization problem. That is why, since data assimilation is estimation and estimation is posed as optimization, you can see the intrinsic role played by optimization in data assimilation. So I can kill two birds with one stone: I take the unknowns to be estimated as the vector (μ, σ²) and compute the gradient with respect to both. Please remember L is a scalar function; if I differentiate a scalar function with respect to a vector variable, the gradient is a vector, here with two components. Given the expression for the likelihood function at the bottom of page 4, I can compute these derivatives explicitly: the derivative of the log of L with respect to μ is ∂ ln L/∂μ = (1/σ²) Σᵢ (Zᵢ − μ), and the derivative of the log of L with respect to σ² is ∂ ln L/∂σ² = −m/(2σ²) + (1/(2σ⁴)) Σᵢ (Zᵢ − μ)². These are interesting exercises, and I would very strongly urge you to do them. Now, at the maximum these derivatives must vanish: the first component must be 0 and the second component must be 0. The first component being 0 gives rise to the form of the estimator. Look at this: (1/σ²) Σᵢ (Zᵢ − μ) must
be equal to 0. This is a fraction, and a fraction is 0 only if its numerator is 0; therefore Σᵢ (Zᵢ − μ) must be 0. This tells you Σᵢ Zᵢ = Σᵢ μ, where the summation runs over i from 1 to m, so the right-hand side is m times μ. Therefore μ must equal (1/m) Σᵢ Zᵢ. What is that? The average value. Ta-da: we have rediscovered a formula we already knew. From least squares, when we did statistical estimation theory, we knew the average is the best least squares estimate; the average is also best in the sense of maximum likelihood. So μ̂_ML, the maximum likelihood estimate, is the sample average Z̄. The average has the very beautiful property of being optimal simultaneously in the least squares sense and in the maximum likelihood sense. Similarly, equating the second component to 0 and simplifying, you can readily get an estimate for the variance: σ̂²_ML = (1/m) Σᵢ (Zᵢ − μ̂_ML)² is the maximum likelihood estimate of the variance. One caution: this maximum likelihood estimate of the variance divides by m and is biased; the unbiased version divides by m − 1. Now, with both estimates in hand, I would like to talk about another related property, what is called the Cramér-Rao bound, which concerns the intrinsic optimality of the maximum likelihood estimate. Take the likelihood function and the log of the likelihood function. Assuming the Hessian of the log-likelihood with respect to X exists, the information matrix I(X) is defined as the negative of the expected value of the Hessian of the log-likelihood. It
can be shown that this information matrix is also equal to an outer product: take the gradient of the log-likelihood, a vector, multiply it by its transpose to form the outer product matrix, and take the expected value of that matrix. So the theory here is that the expected outer product of the gradient and the negative expected Hessian are related; they are equal. What is the fundamental result? There is a ton of theory that goes with this, but I want to expose you to the key existing result. Let X̂ be any unbiased estimate of X. Then Cov(X̂ | X) ≥ I(X)⁻¹: the covariance of any unbiased estimate is bounded below by the inverse of the information matrix. What does this inequality say? The inverse of the information matrix, on the right-hand side of the definition, is the smallest covariance that any unbiased estimate can achieve. When the maximum likelihood estimate attains this bound, as it does in our example, the covariance of the maximum likelihood estimate is less than or equal to the covariance of any other unbiased estimate. So if I have two estimates, X̂ and X̂_ML, one of the fundamental results is that the covariance of the other estimate is greater than or equal to the covariance of the maximum likelihood estimate. This inequality is called the Cramér-Rao inequality, or the Cramér-Rao bound. What is the bound?
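The example and the bound can both be checked numerically. This is a sketch under assumed values (m = 25 observations, σ² = 2, a made-up true mean): for the model Zᵢ = μ + Vᵢ with σ² known, the information for μ is I(μ) = m/σ², so the Cramér-Rao bound on Var(μ̂) is σ²/m, and a Monte Carlo experiment shows the ML estimate (the sample average) attaining it.

```python
import random

random.seed(1)
m, mu_true, sigma2 = 25, 4.0, 2.0
crb = sigma2 / m   # Cramer-Rao lower bound on Var(mu_hat): I(mu)^{-1} = sigma^2 / m

mu_hats = []
for _ in range(4000):
    z = [mu_true + random.gauss(0.0, sigma2 ** 0.5) for _ in range(m)]
    mu_hat = sum(z) / m                               # ML estimate: the sample average
    s2_hat = sum((zi - mu_hat) ** 2 for zi in z) / m  # ML variance estimate (divides by m)
    mu_hats.append(mu_hat)

# Empirical variance of the ML estimate across repeated experiments.
bar = sum(mu_hats) / len(mu_hats)
var_hat = sum((e - bar) ** 2 for e in mu_hats) / len(mu_hats)

# The sample mean attains the bound: its variance matches sigma^2 / m.
assert abs(var_hat - crb) < 0.3 * crb
```

Replacing the `/ m` in `s2_hat` by `/ (m - 1)` gives the unbiased variance estimate mentioned above.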
The inverse of the information matrix is the lower bound, the least value the covariance of an unbiased estimate can take, and this bound is attained by the maximum likelihood estimate. So what does this mean? The maximum likelihood estimate is the best estimate in the sense that the covariance of the estimate resulting from maximum likelihood estimation is the smallest among the possible values the covariance of an estimate can take. Now, we may be dealing with observations that are linear functions of the state, Z = HX + V, or nonlinear functions, Z = h(X) + V. With linear observations the computation is simpler. With nonlinear observations, the nonlinear function h enters the log-likelihood, which becomes a nonlinear function of X; maximizing it is not easy. We may not be able to get explicit expressions or solve for the zero of the gradient in closed form, so in the nonlinear case we may have to find the maximum iteratively. The process is the same: take the log of the likelihood function, compute its derivatives, equate the derivatives to 0, and solve the resulting equations; the solution gives rise to the optimal maximum likelihood estimates. All these steps are simple when the observations are linear and a little more complex when they are nonlinear. In the nonlinear case the methodology still holds, except that the solution has to be obtained numerically and iteratively; it gives rise to iterative optimization. Of course, we have already covered methods for iterative optimization, namely the gradient method and the conjugate gradient method.
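To make the iterative case concrete, here is a minimal sketch with an assumed nonlinear observation function h(x) = eˣ (my choice for illustration, not from the lecture): the log-likelihood gradient has no closed-form zero in general, so we climb it with plain gradient ascent, the simplest of the iterative methods mentioned above.

```python
import math
import random

random.seed(4)
m, x_true, sigma = 100, 1.0, 0.2
h = math.exp                     # hypothetical nonlinear observation function h(x) = e^x
z = [h(x_true) + random.gauss(0.0, sigma) for _ in range(m)]

def grad_log_L(x):
    """d/dx of log L(x) = const - sum_i (z_i - h(x))^2 / (2 sigma^2), with h(x) = e^x."""
    hx = math.exp(x)
    return hx * sum(zi - hx for zi in z) / sigma ** 2

# Plain gradient ascent on the log-likelihood; in practice a conjugate-gradient
# or Newton step converges faster, but the principle is the same.
x = 0.0                          # initial guess
step = 1e-4
for _ in range(5000):
    x += step * grad_log_L(x)

assert abs(x - x_true) < 0.1     # iterates converge near the true x
```

For this particular h the maximizer actually has the closed form x̂ = ln(z̄), which is a handy way to check the iteration.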
So we can use one of the well-known techniques we have already covered to maximize this log-likelihood function. Therefore the theory applies to both linear and nonlinear functions of the state. We would like to end this talk by asking you to do a homework problem: compute the first and second derivatives, the gradient and the Hessian, of the log-likelihood function. Again, my favorite coverage of this material is in Melsa and Cohen (1978); we also cover it in chapter 15. With this we conclude a broad and quick overview of maximum likelihood estimation. There are very few papers in the data assimilation literature relating to maximum likelihood; I would not say there are none, there are a couple, done in the context of the Kalman filter and related problems. Although in our illustrations we will not invoke maximum likelihood estimation, since we are largely concerned with least squares, it is better to know what is out there and what the alternative ways of thinking about these problems are. That is the reason I am introducing you to some of these techniques: so that they open our windows and our eyes to other related areas in estimation theory. Thank you.