In the last class I introduced methods of estimation: one of them was the method of moments and the other was the method of maximum likelihood estimation. Chronologically, the method of moments came first; it was given by the British statistician Karl Pearson around 1900 and was used for quite some time thereafter. However, from around 1922 onwards R. A. Fisher proposed a new method, called maximum likelihood estimation, and the popularity of maximum likelihood estimation stems from the fact that the estimators obtained by this method are more efficient and also satisfy certain asymptotic properties. So, first I will describe a few properties of the maximum likelihood estimators, and then we will look at a couple of examples before moving on to other methods.

Let us consider some properties of maximum likelihood estimators. These properties are proved under certain conditions, called regularity conditions. In general we have the model that the observables come from a distribution with a probability mass function or a probability density function f(x, theta), where theta belongs to a parameter space script Theta. Usually, if I am considering a one-dimensional parameter, then Theta is taken to be an open interval in the real line. For example, if I consider the Poisson distribution, the parameter lambda is positive, so Theta is naturally (0, infinity), an open interval in R; if we are considering a normal distribution with mean mu and variance unity, then mu ranges over the whole real line. So, in many practical problems this condition is satisfied.

The regularity conditions are as follows. We assume that the third-order partial derivative of log f with respect to theta exists for almost all x, for all theta in the interval |theta − theta naught| < delta for some delta > 0, where theta naught is the true value of the parameter. We also assume that the expectation under theta naught of ∂ log f(x, theta)/∂ theta, evaluated at theta = theta naught, is zero; equivalently, ∫ f'(x, theta naught) dx = 0. When I write the integral I am assuming the continuous case, where f is a density; similar statements can be written in the discrete case, with the integral replaced by a summation. Here f' denotes the derivative with respect to theta, evaluated at theta = theta naught. We further assume a condition on the second-order derivative: the information, which is the expectation of the negative second-order derivative of log f, is strictly positive. Finally, we assume a boundedness condition on the third-order derivative: it is bounded in absolute value by some M(x) for all theta in a neighborhood of theta naught, that is, for all |theta − theta naught| < delta, where M is integrable; basically we are assuming its expectation is finite.

Let us write down the likelihood equation. We have a random sample x_1, x_2, ..., x_n having the same distribution as x. We write the log-likelihood function, call it l(theta, x), which is the log of the joint density of x_1, x_2, ..., x_n; it is equal to Σ_{i=1}^{n} log f(x_i, theta). Then dl/d theta = 0 is called the likelihood equation.
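To make the construction concrete, here is a minimal sketch, not from the lecture, of building the log-likelihood as a sum of log densities and checking the likelihood equation for the Poisson(lambda) example mentioned above; the function names and the small data set are my own illustrations.

```python
# Minimal sketch (assumed example, not from the lecture): the log-likelihood
# l(lambda; x) = sum_i log f(x_i, lambda) for a Poisson sample, and its score,
# whose root (the likelihood equation dl/dlambda = 0) is lambda = x_bar.
import math

def poisson_log_lik(lam, data):
    """l(lambda; x) = sum_i [x_i*log(lambda) - lambda - log(x_i!)]."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

def poisson_score(lam, data):
    """dl/dlambda = sum_i (x_i/lambda - 1); setting this to zero gives lambda = x_bar."""
    return sum(x / lam - 1.0 for x in data)

data = [2, 0, 3, 1, 4, 2]                    # illustrative counts, not from the lecture
x_bar = sum(data) / len(data)
print(x_bar, poisson_score(x_bar, data))     # the score is (numerically) zero at x_bar
```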
So, we have the following result, which I state in the form of a theorem: with probability 1, for n sufficiently large the likelihood equation has a root, and this root converges to theta naught with probability 1 under theta naught. So, this root is consistent, in fact strongly consistent. Further, we have an efficiency result. Define the information I(theta) to be the expectation of (∂ log f(x, theta)/∂ theta)^2, and let theta bar be a consistent root of the likelihood equation. Then √n [ (theta bar − theta naught) I(theta naught) − (1/n) dl/d theta evaluated at theta naught ] goes to 0 with probability 1. As a consequence, √n (theta bar − theta naught) has an asymptotic normal distribution with mean 0 and variance 1/I(theta naught); that is, asymptotic normality is also satisfied. These are some of the desirable strong properties of the maximum likelihood estimator, and they made it a very popular method of estimation.

Now, let me give an example. Let x_1, x_2, ..., x_n follow a normal(mu, sigma^2) distribution. What we wrote above was actually the log-likelihood; let us first write the likelihood function itself, say capital L, a function of mu and sigma^2, which is the product of the densities: L(mu, sigma^2; x) = product over i = 1 to n of (1/(sigma √(2 pi))) e^{−(1/(2 sigma^2))(x_i − mu)^2} = (1/(sigma^n (√(2 pi))^n)) e^{−(1/(2 sigma^2)) Σ (x_i − mu)^2}. So, the log-likelihood is l(mu, sigma^2; x) = −(n/2) log sigma^2 − (n/2) log 2 pi − (1/(2 sigma^2)) Σ (x_i − mu)^2.

The reason for considering the log-likelihood in place of the likelihood function is, first of all, that log is an increasing function, therefore the maximization of capital L is the same as the maximization of small l; the problem does not change. Secondly, because the distribution belongs to the exponential family, taking the log simplifies the terms.

Here we are considering a two-parameter case, so the likelihood function has to be differentiated with respect to both mu and sigma^2, and for a maximum we have to check that the second-order Hessian matrix is negative definite. Consider ∂l/∂mu = Σ (x_i − mu)/sigma^2 = 0. This is easily simplified and gives mu = x bar. If we take the derivative with respect to sigma^2, we get −n/(2 sigma^2) + (1/(2 sigma^4)) Σ (x_i − mu)^2 = 0, which gives sigma^2 = (1/n) Σ (x_i − mu)^2. It can be checked that these are actually the maximizing choices of mu and sigma^2; I skip those calculations here. The solution for sigma^2 involves mu, so we substitute the solution mu = x bar. Thus the maximum likelihood estimators of mu and sigma^2 are mu hat_ML = x bar and sigma hat^2_ML = (1/n) Σ (x_i − x bar)^2. Note that sigma hat^2_ML is not unbiased, whereas mu hat_ML is unbiased.
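As a small illustration, here is a sketch of the normal-distribution MLEs just derived, mu hat = x bar and sigma hat^2 = (1/n) Σ (x_i − x bar)^2; the function name and the sample values are my own, chosen only for illustration.

```python
# Sketch of the MLEs for a normal(mu, sigma^2) sample: mu_hat = x_bar and
# sigma2_hat = (1/n) * sum (x_i - x_bar)^2 (divides by n, hence biased).
def normal_mle(data):
    n = len(data)
    mu_hat = sum(data) / n                                   # MLE of mu
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n    # MLE of sigma^2
    return mu_hat, sigma2_hat

sample = [4.2, 5.1, 3.8, 6.0, 4.9]                           # made-up observations
print(normal_mle(sample))
```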
In fact, in this particular problem these are also the same as the method of moments estimators.

Let us consider a special case. Sometimes, due to the physical interpretation of the parameters in a given application, we may have restrictions on the parameters in the form of constraints; say, for example, mu lies in an interval (a, b). By a linear transformation we can translate the data, so we may assume mu to lie in an interval (−m, m). Then the solution above has to be modified. The solution mu = x bar came from the derivative ∂l/∂mu, which is nothing but n(x bar − mu)/sigma^2. Since sigma^2 is positive we can concentrate on the numerator. Looked at as a function of mu, this is positive for mu < x bar and negative for mu > x bar; that means the likelihood, as a function of mu, is increasing up to x bar and decreasing thereafter.

Now, in the method of maximum likelihood we maximize the likelihood function over the given parameter space. So, if we put the restriction that mu lies between −m and m, then the solution must also lie in the interval [−m, m]. If x bar lies between −m and m we do not have to worry: the maximum is attained at x bar. However, suppose x bar is less than −m. Since mu is restricted to lie between −m and m and the likelihood is increasing in mu up to x bar, the maximum over the restricted range is attained at −m. Similarly, if x bar is greater than m, then the maximum over the restricted range is attained at m, because we cannot go beyond that value; we are looking only at the relevant portion of the likelihood function for the maximization problem.

So, the maximum likelihood estimator of mu under this restriction, the restricted estimator mu hat_RML, equals x bar if −m ≤ x bar ≤ m, equals −m if x bar < −m, and equals +m if x bar > m. Naturally, since the estimator of mu is modified, sigma hat^2_RML becomes (1/n) Σ (x_i − mu hat_RML)^2; that is, it is (1/n) Σ (x_i − x bar)^2 if x bar lies between −m and m, and otherwise x bar is replaced by −m or +m as the case may be.
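The following short sketch, with made-up data and my own function name, shows how the restricted maximizer works out in practice: x bar is simply clipped to [−m, m], and that clipped value is plugged into the estimator of sigma^2.

```python
# Sketch of the restricted MLE when mu is constrained to [-m, m]: since the likelihood
# increases in mu up to x_bar and decreases afterwards, the restricted maximizer is
# x_bar clipped to the interval, and sigma^2 uses the clipped value.
def restricted_normal_mle(data, m):
    n = len(data)
    x_bar = sum(data) / n
    mu_hat = min(max(x_bar, -m), m)                          # clip x_bar to [-m, m]
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n    # plug restricted mu into sigma^2
    return mu_hat, sigma2_hat

print(restricted_normal_mle([2.7, 3.1, 2.4, 3.6], m=2.0))   # x_bar = 2.95 > m, so mu_hat = 2.0
```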
Now, in many practical problems the solution of the likelihood equation, that is, the optimization of the likelihood, may not come so easily; sometimes it is not easy to obtain the solution in a closed form. Let us take one such case and consider the underlying distribution to be Cauchy, with probability density function f(x, theta) = (1/pi) · 1/(1 + (x − theta)^2). Suppose we have a random sample x_1, x_2, ..., x_n from this population. The likelihood function is L(theta; x) = (1/pi)^n · product over i = 1 to n of 1/(1 + (x_i − theta)^2). So, consider the log-likelihood function, which we call l(theta, x): it is equal to −n log pi + Σ_{i=1}^{n} [−log(1 + (x_i − theta)^2)]. Taking the derivative with respect to theta and setting it equal to zero, which is the likelihood equation, we get Σ_{i=1}^{n} 2(x_i − theta)/(1 + (x_i − theta)^2) = 0. Let me call this equation (1). You can see that this equation involves rational functions of theta; clearing the denominators gives a polynomial of high degree, so the solution of equation (1) cannot be obtained in a closed form, and even for a moderate value of n, say n = 5 or 8, the equation will be of high order and not easy to solve.

Therefore, numerical methods are used. C. R. Rao, the Indian statistician, proposed a method called the method of scoring, or the scoring method. In the method of scoring we proceed as follows. Let the likelihood equation be written as ∂ log L/∂ theta = 0; call it equation (2). Let theta naught be an initial value, and assume that the exact solution of (2), say theta, lies in a neighborhood of theta naught; that is, theta = theta naught + delta theta. We expand ∂ log L/∂ theta in a Taylor series around theta naught and neglect the third- and higher-order derivative terms. So, basically we are writing ∂ log L/∂ theta ≈ ∂ log L/∂ theta evaluated at theta naught + (theta − theta naught) · ∂^2 log L/∂ theta^2 evaluated at theta naught, having ignored the higher-order terms. We then replace the second-derivative term by its expectation. The reason is that log L is a sum, Σ log f(x_i, theta), and so is its second derivative; if x_1, x_2, ..., x_n are i.i.d. random variables and the relevant expectation exists, then by the law of large numbers this sum, suitably averaged, converges to its expectation, so we may replace it by that expectation. Since theta − theta naught = delta theta, the second term becomes −delta theta · I(theta naught), where I is the information which I introduced a little earlier, here taken for the whole sample.
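As a small illustration of why equation (1) resists a closed-form solution, here is a sketch, with my own function name and made-up sample values, of the Cauchy score, the left-hand side of (1); it is a sum of rational terms in theta, so in practice one locates its root numerically, for instance by the scoring method described next.

```python
# Score function for a Cauchy(theta) sample: dl/dtheta = sum 2*(x_i - theta)/(1 + (x_i - theta)^2).
# Equation (1) sets this to zero; for even moderate n there is no closed-form root,
# so a numerical scheme is needed.
def cauchy_score(theta, xs):
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

xs = [1.2, -0.7, 0.4, 3.1, 0.9]           # illustrative values, not from the lecture
for t in (-1.0, 0.0, 0.5, 1.0, 2.0):       # the score changes sign, so a root lies in between
    print(t, cauchy_score(t, xs))
```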
Call this approximation relation (3). Using relation (3) in equation (2), we get delta theta = [∂ log L/∂ theta evaluated at theta naught] / I(theta naught). Basically, if we start with an initial approximation theta naught and evaluate this, then delta theta is given by this formula. We then take theta_1 = theta naught + delta theta as the next approximation, and using theta_1 we again calculate delta theta by substituting theta_1 in the same formula, and continue until the desired level of accuracy is achieved.

As an example, let us consider the Cauchy distribution. We just saw that f(x, theta) = (1/pi) · 1/(1 + (x − theta)^2), so log f = −log pi − log(1 + (x − theta)^2) and ∂ log f/∂ theta = 2(x − theta)/(1 + (x − theta)^2). Consider E[(∂ log f/∂ theta)^2] = 4 E[(x − theta)^2/(1 + (x − theta)^2)^2]. For the Cauchy distribution this can be evaluated: it equals (4/pi) ∫ from −infinity to infinity of (x − theta)^2/(1 + (x − theta)^2)^3 dx; the power becomes a cube because the Cauchy density contributes another factor of 1/(1 + (x − theta)^2) in the denominator. Substituting x − theta = y, I get (4/pi) ∫ from −infinity to infinity of y^2/(1 + y^2)^3 dy = (8/pi) ∫ from 0 to infinity of y^2/(1 + y^2)^3 dy. With the further substitution y = tan phi (using phi for the angle to avoid confusion with the parameter theta), this becomes (8/pi) ∫ from 0 to pi/2 of tan^2 phi sec^2 phi / sec^6 phi d phi = (8/pi) ∫ from 0 to pi/2 of sin^2 phi cos^2 phi d phi, and since that integral equals pi/16, the whole expression is simply 1/2. So the information per observation is 1/2, and for the whole sample I(theta naught) = n/2. Therefore delta theta, which we wrote as [∂ log L/∂ theta evaluated at theta naught] / I(theta naught), is simply (4/n) Σ_{i=1}^{n} (x_i − theta naught)/(1 + (x_i − theta naught)^2). So, we have obtained the formula which can be used for the method of scoring: if theta naught is the initial approximation, substituting it on the right-hand side gives delta theta, and theta_1 = theta naught + delta theta.

As an application, let us consider one problem. Suppose a random sample of size 8 is 210, 195, 190, 199, 198, 202, 185 and 215. For the initial approximation we take the median of the observations, theta naught = 198.5. Carrying out the calculations, theta_1 = 198.4784887, theta_2 = 198.4656064, and so on; continuing, theta_14 = 198.4464555 and theta_15 = 198.4464509. You can see that the values agree to 5 decimal places, so we may stop here: we have 5 places of accuracy after the decimal if we stop at theta_15.
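The following sketch applies the scoring formula just derived to the eight observations, starting from the median; the function names are mine, and the iterates should come out close to those quoted above (theta_1 near 198.478, settling around 198.446), with the exact trailing digits possibly differing slightly in floating point.

```python
# Scoring iteration for the Cauchy example: theta_{k+1} = theta_k + delta_theta(theta_k),
# with delta_theta = (4/n) * sum (x_i - theta)/(1 + (x_i - theta)^2) and theta_0 = median.
import statistics

data = [210, 195, 190, 199, 198, 202, 185, 215]

def delta_theta(theta, xs):
    n = len(xs)
    return (4.0 / n) * sum((x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

theta = statistics.median(data)            # theta_0 = 198.5
for i in range(1, 16):
    theta = theta + delta_theta(theta, data)
    print(i, theta)                        # approaches roughly 198.446
```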
So, this method of scoring is quite useful for obtaining solutions of the likelihood equation when it cannot be solved in a closed form, that is, when the solution cannot be obtained in an analytical form.

Now I will consider a couple of examples of application of the method of moments, the maximum likelihood estimator, etcetera. Let x be the time between successive orders, and suppose it is assumed to follow a gamma distribution with parameters p and alpha. Suppose 10 observations are taken: 15.5, 4.5, 6.8, 46.0, 34.5, 4.7, 20.9, 8.2, 14.9, 17.7. We will find the method of moments estimators of p and alpha. That means we are assuming the form of the density to be (alpha^p / Gamma(p)) e^{−alpha x} x^{p − 1}, and it is assumed that both alpha and p are unknown, so the problem is to estimate both parameters. Sometimes in a gamma distribution the parameter p is known and we estimate only alpha; in that case the maximum likelihood estimator can be easily derived. But if both parameters are unknown, maximum likelihood estimation becomes quite complicated; in fact, the likelihood equations become quite complicated.

To see this, suppose we consider ML estimation. The likelihood function is L(p, alpha; x) = alpha^{np} / (Gamma(p))^n · e^{−alpha Σ x_i} · (product x_i)^{p − 1}. Taking logs, l = n p log alpha − n log Gamma(p) − alpha Σ x_i + (p − 1) log(product x_i), and the last factor can be written as (p − 1) Σ log x_i. You can easily see that differentiating with respect to alpha is straightforward, but differentiating with respect to p is a problem, because p occurs inside the gamma function; therefore the solution of the likelihood equations becomes complicated and we have to apply some numerical method, such as the scoring method, to get the solutions. So, analytical solutions to the likelihood equations are not possible, and we consider the method of moments.

For the method of moments we look at the first two moments about the origin: mu_1 prime = p/alpha and mu_2 prime = p(p + 1)/alpha^2. Call these equations (1). The solutions to equations (1) are p = mu_1 prime^2 / (mu_2 prime − mu_1 prime^2) and alpha = mu_1 prime / (mu_2 prime − mu_1 prime^2). Recall that in the method of moments we estimate mu_1 prime by the first sample moment, which is x bar, and mu_2 prime by the second sample moment, (1/n) Σ x_i^2. Substituting, the method of moments estimators of p and alpha are p hat_MM = x bar^2 / [(1/n) Σ x_i^2 − x bar^2], which we can of course write as x bar^2 / [(1/n) Σ (x_i − x bar)^2], and alpha hat_MM = x bar / [(1/n) Σ (x_i − x bar)^2]. For the observed random sample, x bar = 17.37 and (1/n) Σ x_i^2 = 467.443.
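As a short check, here is a sketch of the method-of-moments computation for the gamma example from the 10 observations above; the variable names are mine, and it should reproduce roughly p hat ≈ 1.82 and alpha hat ≈ 0.105.

```python
# Method-of-moments estimates for the gamma(p, alpha) example:
# p_hat = x_bar^2 / ((1/n) sum (x_i - x_bar)^2), alpha_hat = x_bar / ((1/n) sum (x_i - x_bar)^2).
data = [15.5, 4.5, 6.8, 46.0, 34.5, 4.7, 20.9, 8.2, 14.9, 17.7]

n = len(data)
m1 = sum(data) / n                       # first sample moment, x_bar = 17.37
m2 = sum(x * x for x in data) / n        # second sample moment about the origin, 467.443
var = m2 - m1 ** 2                       # equals (1/n) * sum (x_i - x_bar)^2

p_hat = m1 ** 2 / var
alpha_hat = m1 / var
print(p_hat, alpha_hat)                  # roughly 1.82 and 0.105
```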
So, p hat_MM ≈ 1.82 and alpha hat_MM ≈ 0.1048. These are the method of moments estimates in this particular problem.

Let us consider one application where we can calculate both the method of moments estimators and the maximum likelihood estimators. Suppose the birth times of children recorded in a maternity hospital are uniformly distributed over the day; over the day we can consider, say, 0 hours to 24 hours. Based on 37 birth timings, we want to find the maximum likelihood estimates and the method of moments estimates of the limits of the uniform distribution. Since we are recording over the day, the interval lies within 0 to 24, but it is some interval (a, b), and we want realistic values of a and b, estimated from the data.

For maximum likelihood estimation, the likelihood function is 1/(b − a)^n, where a ≤ x_i ≤ b for i = 1, ..., n, that is, each observation lies between a and b. Maximizing this is equivalent to minimizing b − a, which can be done by taking b as small and a as large as the constraints allow. Since all the observations lie between a and b, the restriction reduces to a ≤ x_(1) ≤ x_(n) ≤ b, where x_(1), ..., x_(n) denote the order statistics of x_1, x_2, ..., x_n. So L is maximized when b − a is minimum, which is achieved by choosing b = x_(n) and a = x_(1); that is, the maximum likelihood estimates of a and b are a hat_ML = x_(1) and b hat_ML = x_(n).

Now consider the method of moments estimates in this problem. Since there are two parameters, a and b, we take the first two moments: for the uniform distribution on (a, b), mu_1 prime = (a + b)/2 and mu_2 prime = (a^2 + ab + b^2)/3. The solutions of these equations are a = mu_1 prime − √(3(mu_2 prime − mu_1 prime^2)) and b = mu_1 prime + √(3(mu_2 prime − mu_1 prime^2)). So the method of moments estimators of a and b are a hat_MM = x bar − √((3/n) Σ (x_i − x bar)^2) and b hat_MM = x bar + √((3/n) Σ (x_i − x bar)^2).

Applying these to the available data, the recorded timings of the 37 births, we get a hat_ML = 00:26 hours, that is, 26 minutes past midnight, and b hat_ML = 11:46 p.m., while a hat_MM turns out to be about 1:16 a.m. and b hat_MM about 10:21 p.m. You can observe that there is some difference in the values; they are not the same. Now the question arises which one should be used, judged, for example, by the mean squared error criterion that I mentioned earlier.
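Here is a sketch of both sets of estimates for the uniform(a, b) case. The actual 37 birth times are not reproduced in this lecture, so the times below, in hours since midnight, are made up purely for illustration, and the function names are my own.

```python
# Uniform(a, b) estimates: MLEs are the sample minimum and maximum, and the MoM
# estimates are x_bar -/+ sqrt(3 * (1/n) * sum (x_i - x_bar)^2).
import math

def uniform_mle(data):
    return min(data), max(data)                              # a_hat_ML = x_(1), b_hat_ML = x_(n)

def uniform_mom(data):
    n = len(data)
    x_bar = sum(data) / n
    s2 = sum((x - x_bar) ** 2 for x in data) / n
    half_width = math.sqrt(3.0 * s2)
    return x_bar - half_width, x_bar + half_width

times = [0.7, 3.2, 8.5, 12.1, 15.4, 18.9, 22.3, 23.6]        # hypothetical birth times (hours)
print(uniform_mle(times))
print(uniform_mom(times))
```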
If we consider the mean squared errors of the estimates here, the maximum likelihood estimators would be preferred over the method of moments estimators, and in that case we would prefer them as the realistic estimates of the limits a and b in this particular problem.

There are some other methods of estimation. For example, there is least squares estimation, which I will discuss in detail in the next module on regression, and there is the method of minimum chi-square. Then there are methods developed using the concepts of decision theory: we have Bayes estimation and minimax estimation, and there are some special notions within Bayes and minimax estimation. That means we put some conditions and, under those conditions, we carry out the Bayes estimation: we have Bayes rules, empirical Bayes rules, limits of Bayes rules, generalized Bayes rules and extended Bayes rules; similarly, in minimaxity we have concepts like gamma-minimaxity, local minimaxity and so on. We have the concept of admissible estimators and, consequently, inadmissible estimators, and therefore we consider improved estimators.

One of the prominent concepts that we have not discussed here, but which is extremely useful in decision theory, is the concept of invariance. Many statistical problems exhibit a natural invariance. Consider, say, the normal(mu, sigma^2) distribution. If x is shifted by some constant c, so that we observe x + c, then x + c follows normal(mu + c, sigma^2); that means the same shift is observed in the mean of the distribution, one of the parameters. So, if x follows normal(mu, sigma^2), then x + c follows normal(mu + c, sigma^2). Along with this, if we impose the condition on the estimator that the estimator of mu should also shift by the same constant, then we get what are called location equivariant estimators; this is translation or location equivariance, and then we consider the best location equivariant estimator. Similarly, if we consider cx, it follows normal(c mu, c^2 sigma^2); this is called scale invariance, and we consider scale equivariant estimators or the best scale equivariant estimator, etcetera. We can also consider x going to Ax + b; then Ax + b follows normal(A mu + b, A^2 sigma^2). This is called affine invariance, and we consider affine equivariant estimators.

In many estimation problems it has been observed that if we impose the condition of invariance, we are able to get better estimators than the usual maximum likelihood estimators or the UMVUEs. For example, consider the estimation of sigma^2 in sampling from a normal distribution, that is, x_1, x_2, ..., x_n from normal(mu, sigma^2). The scale equivariant estimators are of the form c Σ (x_i − x bar)^2. If we minimize the mean squared error of c Σ (x_i − x bar)^2 with respect to c, the minimizing choice is c = 1/(n + 1). So, (1/(n + 1)) Σ (x_i − x bar)^2 is the best equivariant estimator here.
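The following small simulation sketch, not from the lecture and with parameter values chosen only for illustration, compares the mean squared error of c Σ (x_i − x bar)^2 for c = 1/(n − 1), 1/n and 1/(n + 1) when sampling from normal(mu, sigma^2); the 1/(n + 1) choice, the best scale-equivariant estimator mentioned above, should show the smallest MSE.

```python
# Monte Carlo comparison of three multiples of sum (x_i - x_bar)^2 as estimators of sigma^2.
import random

def mse_of_variance_estimators(n=10, mu=0.0, sigma2=4.0, reps=100_000, seed=1):
    rng = random.Random(seed)
    sums = {"1/(n-1)": 0.0, "1/n": 0.0, "1/(n+1)": 0.0}
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
        x_bar = sum(xs) / n
        s = sum((x - x_bar) ** 2 for x in xs)
        for label, c in (("1/(n-1)", 1/(n-1)), ("1/n", 1/n), ("1/(n+1)", 1/(n+1))):
            sums[label] += (c * s - sigma2) ** 2              # squared error for this replicate
    return {k: v / reps for k, v in sums.items()}             # average = estimated MSE

print(mse_of_variance_estimators())
```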
Problems of this nature abound in practice. In fact, if we consider some other group, even this can be improved: in 1964 Charles Stein proved that even (1/(n + 1)) Σ (x_i − x bar)^2 can be improved, and he proposed an improved estimator. So, there are various methods of estimation which are extremely useful in providing improved estimators. Those who are interested can refer to the books by Lehmann, Zacks, Ferguson and many other texts, and also to the lectures on statistical inference on NPTEL. In the next part of these parametric methods I will start with confidence intervals.