In the previous lecture I introduced the problem of point estimation, and we are considering parametric methods. That means we assume that the form of the population distribution is known, but it may depend upon one or more unknown parameters. We considered certain criteria for judging the goodness of estimators, for example unbiasedness and consistency. I also introduced the mean squared error criterion: an estimator which has a smaller mean squared error over the parameter space is considered better than one with a larger mean squared error. If the estimator is unbiased, the mean squared error reduces to the variance of the estimator. Therefore we have the concept of a uniformly minimum variance unbiased estimator, which we call, in short, a UMVUE.

I mentioned that, broadly speaking, there are two methods to obtain a UMVUE. One is the method of lower bounds: under certain conditions, or sometimes without conditions, one can obtain a lower bound for the variance of an unbiased estimator, and an estimator which attains that lower bound is the minimum variance unbiased estimator. In this particular course we will not be discussing those methods. Instead, let me briefly introduce another method, which is based on the concepts of completeness and sufficiency. I introduced sufficient statistics and gave a consequence called the Rao-Blackwell theorem: if there is an unbiased estimator which may not depend upon the sufficient statistic, then we can construct another unbiased estimator which is a function of the sufficient statistic alone and whose variance is less than or equal to the variance of the original estimator. Coupled with the concept of completeness, this gives a method for obtaining UMVUEs. So let me introduce that, and first let me consider applications of the factorization theorem, which basically produces a sufficient statistic in a given problem. Of course, one may see from the definition that if the conditional distribution of X_1, X_2, ..., X_n given T is independent of the parameter, and T itself is a function of some statistic U, then U is also sufficient. One can go further and consider minimal sufficiency, that is, the maximum reduction of the data, but I will not get too much into technical details here; rather, we look at direct applications.

So let us consider X_1, X_2, ..., X_n following the uniform distribution on the interval (0, theta). How do we write down the joint density? The joint probability density function of X_1, X_2, ..., X_n is f(x | theta) = product of f(x_i | theta) = 1/theta^n for 0 < x_i < theta, i = 1, ..., n. Now, in order to apply the factorization theorem we need to represent this in a slightly more compact form, because here the range involves theta. So we write it as (1/theta^n) times the indicator function of X_(n), the maximum of the observations, over the interval (0, theta), multiplied by the indicator functions of the remaining observations over the interval (0, X_(n)). The first factor can be considered as g(theta, X_(n)), and the remaining factor is a function of the observations alone. So here X_(n), the maximum of the observations, is sufficient.
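As a quick illustration (not from the lecture), here is a minimal numerical sketch of this factorization for a Uniform(0, theta) sample; the function names joint_pdf, g and h are my own choices, and the sample and theta values are arbitrary.

```python
import numpy as np

# Minimal numerical check of the factorization for a Uniform(0, theta) sample.
# (Illustrative sketch; names and values are my own, not from the lecture.)

def joint_pdf(x, theta):
    # Direct product form: 1/theta^n when every observation lies in (0, theta).
    x = np.asarray(x)
    return np.prod((0 < x) & (x < theta)) / theta**len(x)

def g(theta, x_max, n):
    # Factor depending on theta, and on the data only through the maximum X_(n).
    return (x_max < theta) / theta**n

def h(x):
    # Factor free of theta: all observations must be positive.
    return float(np.all(np.asarray(x) > 0))

rng = np.random.default_rng(0)
x = rng.uniform(0, 3.0, size=5)        # sample with true theta = 3
for theta in (2.5, 3.0, 4.0):
    print(theta, joint_pdf(x, theta), g(theta, x.max(), len(x)) * h(x))
    # the two forms agree for every value of theta
```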
If we remember, in one of the exercises on consistency I proved that X_(n) is consistent for theta. Now here I am observing that X_(n) is also sufficient. In the uniform distribution the mean is theta/2, which means 2 x bar is unbiased for theta, but 2 x bar is not based on X_(n). Therefore I can construct another estimator which is based on X_(n) and whose variance is smaller than that of 2 x bar; we will show this later.

Now let us consider some more examples. Say X_1, X_2, ..., X_n follow the beta distribution with parameters alpha, beta. I consider the joint pdf, that is, the product of f(x_i | alpha, beta), which equals the product over i = 1 to n of (1/B(alpha, beta)) x_i^(alpha - 1) (1 - x_i)^(beta - 1). This can be written as (1/B(alpha, beta))^n times the product of x_i^(alpha - 1) times the product of (1 - x_i)^(beta - 1). Here the whole expression can be considered as a function of the parameters alpha, beta and of product x_i and product (1 - x_i), and h(x) can be taken to be 1 itself. So here (product x_i, product (1 - x_i)) is sufficient.

Another way of looking at the concept of sufficiency is through distributions in the exponential family. Let me define the one-parameter exponential family and the multi-parameter exponential family. We consider f(x | theta) = c(theta) h(x) e^{q(theta) T(x)}. This is called a one-parameter exponential family. To give an example, consider X following Poisson(lambda). How do we write down the distribution? It is e^{-lambda} lambda^x / x! for x = 0, 1, 2, .... This we can write as e^{-lambda} (1/x!) e^{x log lambda}. So if I define q(lambda) = log lambda, T(x) = x, c(lambda) = e^{-lambda} and h(x) = 1/x!, then this is an example of a one-parameter exponential family; that means the Poisson distribution belongs to the one-parameter exponential family. Note that this exponential family is different from the exponential density that we discussed earlier. Let us take the exponential distribution itself, say f(x | mu) = e^{mu - x} for x > mu and 0 for x <= mu. This is not an exponential family, because the support depends on the parameter and the density cannot be put in the above form. On the other hand, consider f(x | lambda) = lambda e^{-lambda x}. Here we can take c(lambda) = lambda, h(x) = 1, q(lambda) = -lambda and T(x) = x, so this is again a one-parameter exponential family.

Let us consider the beta distribution that I wrote, Beta(alpha, beta). This is (1/B(alpha, beta)) x^(alpha - 1) (1 - x)^(beta - 1), which we can write as (1/B(alpha, beta)) e^{(alpha - 1) log x + (beta - 1) log(1 - x)}. Because two terms appear in the exponent, this gives rise to a multi-parameter exponential family, so let me introduce that here. In general we define the multi-parameter exponential family as f(x | theta) = c(theta) h(x) e^{sum over i = 1 to k of theta_i t_i(x)}, where theta is a vector parameter (theta_1, theta_2, ..., theta_k). This is called a k-parameter exponential family; let me label this form (*). So if we look at the beta density that I introduced here in this form,
then we can write it as c(alpha, beta) e^{(alpha - 1) log x + (beta - 1) log(1 - x)}, so that alpha - 1 is theta_1, beta - 1 is theta_2, log x is t_1(x) and log(1 - x) is t_2(x). So this is an example of a two-parameter exponential family.

Now let us look at distributions in the k-parameter exponential family, apply the factorization theorem and see the effect. Let X_1, X_2, ..., X_n be a random sample from the distribution (*). Then the joint pdf of X_1, X_2, ..., X_n is c(theta)^n times the product over j = 1 to n of h(x_j) times e raised to the double sum over j = 1 to n and i = 1 to k of theta_i t_i(x_j); let me index the observations by j because i is being used for the components. This we can write as c(theta)^n times the product over j of h(x_j) times e raised to the sum over i = 1 to k of theta_i times [sum over j = 1 to n of t_i(x_j)]. So, by the factorization theorem, we are able to express the joint pdf as a function of theta and of (sum_j t_1(x_j), sum_j t_2(x_j), ..., sum_j t_k(x_j)), times a function of the observations alone. Therefore we can say that (sum_j t_1(x_j), ..., sum_j t_k(x_j)) is sufficient, by the factorization theorem. To give an example, for the beta distribution considered here, (sum of log x_i, sum of log(1 - x_i)) is sufficient.

Let us take the more popular normal distribution: X_1, X_2, ..., X_n follow normal(mu, sigma^2). If I write down the joint pdf of X_1, X_2, ..., X_n, that is the product over i = 1 to n of (1/(sigma root(2 pi))) e^{-(x_i - mu)^2 / (2 sigma^2)}, which equals (1/(sigma^n (root(2 pi))^n)) e^{- sum (x_i - mu)^2 / (2 sigma^2)}. Expanding the exponent, we can write it as e raised to [- sum x_i^2 / (2 sigma^2) + n mu x bar / sigma^2 - n mu^2 / (2 sigma^2)]. So, in effect, this becomes e^{- n mu^2 / (2 sigma^2)} divided by sigma^n (root(2 pi))^n, times e raised to [(n mu / sigma^2) x bar - (1/(2 sigma^2)) sum x_i^2]. Now we can put it in the form of a two-parameter exponential family: the first factor is simply a function of the parameters mu and sigma^2; then we call theta_1 = n mu / sigma^2 with T_1(x) = x bar, and theta_2 = -1/(2 sigma^2) with T_2(x) = sum x_i^2. So naturally you can see that this is a two-parameter exponential family, and at the same time we conclude that (x bar, sum x_i^2) is sufficient. We can also say that (sum x_i, sum x_i^2) is sufficient, or that (x bar, sum (x_i - x bar)^2) is sufficient, because these are all one-to-one functions of each other; we can write it in any of these forms.

Now that the concept of sufficiency has been introduced, let me introduce the concept of completeness, and that will give a methodology to obtain the UMVUE. Let us use the notation P_theta: we have been writing that X has cdf F(x | theta), and in general we can use the abstract notation P_theta for the distribution of X, without mentioning x there.
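To make the factorization concrete, here is a small sketch (my own illustration, assuming NumPy) which evaluates the normal joint log-density both directly and through the two-parameter exponential family form above; the data enter the second form only through the sufficient statistics x bar and sum x_i^2.

```python
import numpy as np

# Sketch: the N(mu, sigma^2) joint density depends on the data only through
# the sufficient statistics (x_bar, sum x_i^2). Function names are my own.

def log_joint_direct(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

def log_joint_factored(x, mu, sigma):
    n = len(x)
    t1, t2 = np.mean(x), np.sum(x**2)            # sufficient statistics
    theta1 = n * mu / sigma**2                    # natural parameters as in the lecture
    theta2 = -1.0 / (2 * sigma**2)
    log_c = -n * mu**2 / (2 * sigma**2) - n * np.log(sigma * np.sqrt(2 * np.pi))
    return log_c + theta1 * t1 + theta2 * t2      # h(x) = 1 here

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=10)
print(log_joint_direct(x, 2.0, 1.5), log_joint_factored(x, 2.0, 1.5))  # equal
```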
So, a family of distributions {P_theta} of X is said to be complete if E[g(X)] = 0 for all theta implies P(g(X) = 0) = 1 for all theta belonging to Theta, where g is any function. To see what this means, let us look at a simple application first. Consider X following a Poisson(lambda) distribution and suppose E[g(X)] = 0 for all lambda. This is equivalent to the sum over x of g(x) e^{-lambda} lambda^x / x! = 0. We can multiply both sides by e^{+lambda}, which gives the sum over x of (g(x)/x!) lambda^x = 0. Now the left-hand side is a power series in lambda, and we are saying it vanishes identically over the entire positive real line. The only possibility is that the coefficients are all 0, that means g(x) = 0 for all x = 0, 1, 2, ..., which implies that the probability that g(X) = 0 is 1 for all lambda. So the family of Poisson distributions {P_lambda : lambda > 0} is complete.

Now we extend this concept of completeness from a family of distributions to a statistic. We say that a statistic T is complete if the family, let me say P^T, of distributions of T is complete. For example, in the Poisson case X is complete; similarly, if we take T = sum of X_i based on a random sample from Poisson(lambda), then T follows Poisson(n lambda), so T is also complete. And of course a consequence is that a function of a complete statistic is also complete.

Now this completeness concept is extremely useful, in the sense that it basically says that if I have an unbiased estimator of 0, then that estimator must be 0 itself. That yields something interesting. For example, suppose T is complete and two estimators, say h_1(T) and h_2(T), are unbiased for g(theta); then E[h_1(T)] = g(theta) and also E[h_2(T)] = g(theta). If I take the difference, I get E[h_1(T) - h_2(T)] = 0 for all theta. Now h_1(T) - h_2(T) is a function of T, and if T is complete then this implies that the probability that h_1(T) - h_2(T) = 0 is 1 for all theta. Basically this means that h_1(T) = h_2(T) almost everywhere; that is, an unbiased estimator based on a complete statistic is unique almost everywhere. Therefore the uniformly minimum variance unbiased estimator can be obtained. This is the content of a result called the Lehmann-Scheffe theorem. In fact, there is a slightly relaxed version of completeness called bounded completeness: if the implication above is required to hold only for bounded functions g, then the family is called boundedly complete; however, that is not required here. The theorem says: if T is complete and sufficient, then h(T) is the UMVUE of g(theta) = E[h(T)]. Now, once again, one can prove completeness directly for various families, for example the normal, binomial, Poisson distributions etcetera, but for the exponential family we have a result which straight away gives the completeness property.
I introduced the multi-parameter exponential family of the form f(x | theta) = c(theta) h(x) e^{sum of theta_i t_i(x)}. If we have a distribution of this nature, that is, a k-parameter exponential family (*), and the parameter space Theta contains a k-dimensional rectangle, then (t_1(X), t_2(X), ..., t_k(X)) is complete; this result is very useful in proving completeness for various distributions. Moreover, if X_1, X_2, ..., X_n is a random sample from (*), then (sum_j t_1(X_j), ..., sum_j t_k(X_j)) is complete and, of course, sufficient. That means the problem of obtaining the UMVUE reduces to the determination of a complete sufficient statistic; making use of that, we simply consider functions of it which are unbiased for the required parametric functions, and then we have UMVUEs.

So let me give an example. Let X_1, X_2, ..., X_n follow Poisson(lambda); then T = sum of X_i is complete and sufficient. If I consider x bar, which is simply T/n, then E[x bar] = lambda, so x bar is the UMVUE of lambda. This resolves the following issue: based on this sample I could have considered any number of unbiased estimators of lambda. For example, in the Poisson distribution U = (1/(n - 1)) sum (x_i - x bar)^2 is also unbiased for lambda, but since it is not a function of the complete sufficient statistic alone, we have variance of x bar less than or equal to variance of U.

Let us consider X_1, X_2, ..., X_n from the normal distribution, the popular one. We have already seen that it is a two-parameter exponential family; I showed the sufficient statistic in the form (x bar, sum x_i^2), or equivalently (x bar, sum (x_i - x bar)^2). So here (x bar, sum (x_i - x bar)^2) is complete and sufficient. Now E[x bar] = mu, and if I call s^2 = (1/(n - 1)) sum (x_i - x bar)^2, then E[s^2] = sigma^2. So x bar is the UMVUE of mu and s^2 is the UMVUE of sigma^2. Not only that, we can also consider unbiased estimation of other parametric functions. For example, in this problem a popular quantity is a quantile of the form mu + b sigma, where b is a real number. Basically, on the normal curve, as I have explained, mu is the centre, and mu - sigma, mu + sigma and so on are various positions; in general mu + b sigma is any position on the curve. So let me call this function q. For mu we have x bar; now let us consider estimation of sigma also. We can make use of the fact that W = (n - 1) s^2 / sigma^2 follows a chi-square distribution on n - 1 degrees of freedom, as I mentioned yesterday in the discussion of sampling distributions. Using this, I consider the expectation of W^(1/2): that is the integral from 0 to infinity of w^(1/2) times (1 / (2^((n-1)/2) Gamma((n-1)/2))) e^{-w/2} w^((n-1)/2 - 1) dw, where the factor multiplying w^(1/2) is the density of the chi-square distribution on n - 1 degrees of freedom.
So let us simplify these terms. We can write this as an integral from 0 to infinity where the constants remain as they are and the power of w becomes n/2 - 1. The integral is then Gamma(n/2) 2^(n/2), divided by 2^((n-1)/2) Gamma((n-1)/2), which gives root 2 Gamma(n/2) / Gamma((n-1)/2). So what we have proved is that the expectation of W^(1/2), that is, of (n - 1)^(1/2) s / sigma, equals root 2 Gamma(n/2) / Gamma((n-1)/2). That means we can write: the expectation of [Gamma((n-1)/2) root(n - 1) / (root 2 Gamma(n/2))] times s is equal to sigma. So we have obtained an unbiased estimator of sigma which is a function of (x bar, s^2); since (x bar, s^2) is complete and sufficient, this is the UMVUE of the standard deviation. Another thing: if I plug this into q, I get x bar + b root((n - 1)/2) Gamma((n-1)/2) / Gamma(n/2) times s, and this is the UMVUE of the quantile mu + b sigma. So you can see that this concept of a complete sufficient statistic is extremely helpful in deriving uniformly minimum variance unbiased estimators. And not only that: if we had not considered the complete sufficient statistic, then for the estimation of sigma we would perhaps simply have used the square root of (1/n) sum (x_i - x bar)^2, just as for sigma^2 we were using (1/n) sum (x_i - x bar)^2 or (1/(n - 1)) sum (x_i - x bar)^2. But as you can see, the estimator derived here is slightly different. If we use the criterion of minimum mean squared error, then some other estimator is also possible, but I will delay that; I will not be considering it right now.
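As a rough check of this derivation, the following sketch (my own, assuming NumPy and SciPy are available) simulates normal samples and compares the ordinary sample standard deviation s, which is biased for sigma, with the corrected estimator root((n-1)/2) Gamma((n-1)/2)/Gamma(n/2) times s obtained above.

```python
import numpy as np
from scipy.special import gammaln

# Simulation check that c_n * s is unbiased for sigma, where
# c_n = sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2). Illustrative only.

def c_n(n):
    # Computed on the log scale for numerical stability.
    return np.sqrt((n - 1) / 2) * np.exp(gammaln((n - 1) / 2) - gammaln(n / 2))

rng = np.random.default_rng(7)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))
s = x.std(axis=1, ddof=1)                 # usual sample standard deviation

print("E[s]       approx", s.mean())              # slightly below sigma = 2 (biased)
print("E[c_n * s] approx", (c_n(n) * s).mean())   # close to sigma = 2 (unbiased)
```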
Now let us consider general methods of obtaining estimators. So far we have discussed criteria for judging estimators and shown that there are estimators which fulfil those criteria, but for any population we can also give some general methods for constructing estimators. The first such method is the method of moments, introduced by Karl Pearson, one of the founders of the subject of statistics. Suppose X_1, X_2, ..., X_n is a random sample from a population with distribution f(x | theta), where I write theta in vector form; in general I assume it is a k-parameter distribution for k greater than or equal to 1, and we want to estimate theta_1, theta_2, ..., theta_k. Let us define the sample moments alpha_m = (1/n) sum over i of X_i^m for m = 1, 2, ..., and consider the population moments mu_m' = E[X_1^m] for m = 1, 2, .... Naturally, mu_m' will be some function of the parameters; let me call this function g_m(theta). So we have k equations: mu_1' = g_1(theta), ..., mu_k' = g_k(theta). Let me call this system (1). Suppose the solution of system (1) is theta_1 = h_1(mu_1', ..., mu_k'), ..., theta_k = h_k(mu_1', ..., mu_k'). In the method of moments we plug in alpha_1, alpha_2, ..., alpha_k for mu_1', mu_2', ..., mu_k'; that is, the method of moments estimators of theta_1, ..., theta_k are obtained as theta_i hat = h_i(alpha_1, alpha_2, ..., alpha_k).

So you can say that the basic idea is to estimate each population moment by the corresponding sample moment. Of course, when we write these equations the moments mu_1', ..., mu_k' must exist; if they do not exist, then we cannot write the equations. In general, method of moments estimators need not be unbiased, that is, sometimes they may be biased, but usually they are consistent. In fact, one can write down conditions: we have already done the weak law of large numbers, and from there alpha_m is consistent for mu_m'; if alpha_m is consistent for mu_m' and the functions h_1, h_2, ..., h_k are continuous, then the estimators theta_i hat will be consistent for theta_i. So, for example, if the sample follows Poisson(lambda), then x bar is the method of moments estimator (MME) of lambda and it is consistent. If I consider X_1, X_2, ..., X_n following normal(mu, sigma^2), then what are the moments? mu_1' = mu and mu_2' = mu^2 + sigma^2. If we solve the equations we get mu = mu_1' and sigma^2 = mu_2' - mu_1'^2. So, substituting, the method of moments estimators of mu and sigma^2 are mu hat = x bar, which is alpha_1, and sigma hat^2 = (1/n) sum x_i^2 - x bar^2, which is (1/n) sum (x_i - x bar)^2. Note that mu hat is unbiased for mu, but sigma hat^2 is biased for sigma^2, because we have seen that (1/(n - 1)) sum (x_i - x bar)^2 is unbiased for sigma^2; indeed the expectation of sigma hat^2 is ((n - 1)/n) sigma^2, so it is biased. Still, this is a simple and heuristic method for obtaining estimators of the parameters in a given problem; a small sketch of this example follows below.

Now, there may sometimes be some discrepancies. For example, here with two parameters I write two equations, and with one parameter I write one equation; but sometimes, due to a peculiarity of the distribution, the required number of equations may be more. For example, if I consider the uniform distribution on the interval (-theta, theta), then the mean is 0 and the first moment is not useful; so we consider the second moment, which is theta^2 / 3, and then we can use the second sample moment to estimate theta. Another thing observed about method of moments estimators is that we actually have to solve the equations. In the example I constructed here it is simple, but sometimes you may end up with very complicated functions; for example, for the gamma distribution, the two-parameter uniform distribution or the beta distribution, where the mean is a somewhat complicated function of the parameters, the solution of the equations will give rise to complicated functions. So certainly unbiasedness will be ruled out, and sometimes even the continuity of the functions may be in question.
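To make the plug-in recipe concrete for the normal example above, here is a minimal sketch (my own illustration, assuming NumPy): compute the first two sample moments and solve the same two equations with them in place of the population moments.

```python
import numpy as np

# Method of moments for N(mu, sigma^2): mu'_1 = mu and mu'_2 = mu^2 + sigma^2,
# so mu_hat = alpha_1 and sigma2_hat = alpha_2 - alpha_1^2. Names are my own.

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=50)

alpha1 = np.mean(x)              # first sample moment
alpha2 = np.mean(x**2)           # second sample moment

mu_hat = alpha1
sigma2_hat = alpha2 - alpha1**2  # equals (1/n) * sum (x_i - x_bar)^2, the biased estimator

print(mu_hat, sigma2_hat)
```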
A more practical and, you can say, theoretically sound procedure was proposed in 1925 by R. A. Fisher; it is known as the method of maximum likelihood. In the method of moments we make use of the moment structure of the distribution, whereas in maximum likelihood estimation we make use of the probability structure, or the density structure, of the distribution. Roughly speaking, let me give the interpretation. Suppose X_1, X_2, ..., X_n is a random sample from a distribution with either a pmf or a pdf; of course, one may have a somewhat different situation with a mixture, partly pmf and partly pdf, but for the time being let me write it in the simpler form f(x | theta). Let me consider the pmf representation first. In the pmf representation we write P(X_1 = x_1, ..., X_n = x_n) = product over i = 1 to n of f(x_i | theta). Now let me put this in a different way: if theta is the true parameter value, the probability that X_1 = x_1, ..., X_n = x_n is given by this expression, and for different values of theta this value will change. So, once the sample x_1, x_2, ..., x_n has been observed, we can consider this expression as the probability, or likelihood, of this particular sample being observed. I give it a new name and call it L(theta | x); this is called the likelihood function. We then take the value of theta which maximizes it: the value theta hat = theta hat(x) is called the maximum likelihood estimator of theta if L(theta hat | x) >= L(theta | x) for all theta. That means we are maximizing the probability, or likelihood, of observing that particular sample.

Let us consider a typical example. Suppose I take Poisson(lambda) and I specify that lambda is equal to either 1 or 2, that is, only two values are possible in the parameter space, and we observe, say, x = 1. If I observe x = 1, let us write down the probability of X = 1: that is e^{-lambda} lambda^x / x! with x = 1. If lambda = 1, this is e^{-1}; if lambda = 2, this is 2 e^{-2}. So we compare these values and ask which is larger, 1/e or 2/e^2. Multiplying both by e^2, we are comparing e with 2, and since e > 2 is true, the first value is larger. That means the likelihood of observing x = 1 is higher under lambda = 1 than under lambda = 2, so we say lambda hat = 1 is the maximum likelihood estimate; since the sample is already observed, we call it an estimate of lambda. So look at what has been done here. Two values, lambda = 1 and lambda = 2, are allowed, and we do not know which one is the correct value. We observe the sample, in this particular case one observation, and it is equal to 1. Now I calculate the probability of x = 1 under each lambda, getting e^{-lambda} lambda.
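The little comparison above can be written out directly; the following sketch (my own, with an ad hoc likelihood function) just evaluates the Poisson probability of the observed value x = 1 at the two allowed parameter values.

```python
import math

# Sketch of the likelihood comparison: X ~ Poisson(lambda) with lambda
# restricted to {1, 2}, and we observe x = 1. Illustrative only.

def likelihood(lam, x):
    return math.exp(-lam) * lam**x / math.factorial(x)

x = 1
for lam in (1, 2):
    print(lam, likelihood(lam, x))   # 1/e = 0.368... versus 2/e^2 = 0.270...

# The likelihood is larger at lambda = 1, so the maximum likelihood
# estimate here is lambda_hat = 1.
```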
Looking at this under both values: for lambda = 1 this equals e^{-1}, and for lambda = 2 it is 2 e^{-2}. Comparing the two, the simple inequality 1/e > 2/e^2 is equivalent to e > 2, which is true. Therefore this probability is higher under lambda = 1, and lambda hat = 1 is called the maximum likelihood estimate of lambda here. So you can say this is the fundamental principle of maximum likelihood estimation: we consider the likelihood function and look for the value of the parameter which maximizes it. I have given the probability mass function interpretation; now we generalize this, and if instead we have a pdf, then we maximize that. So, in general, we define the likelihood function as the joint pmf or pdf of X_1, X_2, ..., X_n, that is, L(theta | x) = product of f(x_i | theta), and we maximize it with respect to theta. If the likelihood function is maximized at theta hat(x), then theta hat(x) is called the maximum likelihood estimator of theta. I will illustrate this through various examples.

Let me consider a simple application which we have been using earlier for the discussion of consistency, sufficiency etcetera, the Uniform(0, theta) distribution. You can see that the likelihood function is L(theta | x) = (1/theta^n) times the indicator that 0 < x_(1) <= x_(2) <= ... <= x_(n) < theta, where x_(1), ..., x_(n) are the ordered observations. To maximize this, we see that the value is largest when theta is as small as possible, but the minimum allowable value of theta is x_(n). So theta hat, the MLE, equals X_(n). In fact, we already proved that X_(n) is consistent and that it is sufficient; we can also show that X_(n) is complete. Just briefly, let me obtain the UMVUE based on this to complete the discussion. We had obtained P(X_(n) <= x) = product of P(X_i <= x) = (x/theta)^n, so the density function of X_(n) is n x^{n-1} / theta^n for 0 < x < theta. If I take the expectation, I get E[X_(n)] = (n/(n+1)) theta, which means E[((n+1)/n) X_(n)] = theta. Also, for completeness, suppose E[g(X_(n))] = 0 for all theta; then the integral from 0 to theta of g(x) n x^{n-1} / theta^n dx = 0 for all theta > 0, that is, the integral of g(x) x^{n-1} over every interval of the form (0, theta) vanishes. Then, by differentiating with respect to theta (using, say, Lebesgue's differentiation result), one can prove that g(x) = 0 almost everywhere, which means X_(n) is complete. Now X_(n) is complete and sufficient, and ((n+1)/n) X_(n) is an unbiased estimator based on X_(n). So T = ((n+1)/n) X_(n) is the UMVUE of theta.
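To close the loop on the earlier claim that an estimator based on X_(n) improves on 2 x bar, here is a small simulation sketch (my own illustration) comparing the two unbiased estimators of theta for the Uniform(0, theta) model.

```python
import numpy as np

# Two unbiased estimators of theta for Uniform(0, theta): 2 * x_bar, and the
# UMVUE ((n+1)/n) * X_(n) based on the complete sufficient statistic.
# Simulation check that the latter has the smaller variance. Illustrative only.

rng = np.random.default_rng(11)
theta, n, reps = 5.0, 10, 200_000
x = rng.uniform(0, theta, size=(reps, n))

est_mean = 2 * x.mean(axis=1)              # 2 * x_bar
est_max = (n + 1) / n * x.max(axis=1)      # (n+1)/n * X_(n)

print("means:", est_mean.mean(), est_max.mean())   # both close to theta = 5
print("vars :", est_mean.var(), est_max.var())     # the UMVUE's variance is much smaller
```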
In tomorrow's class I will discuss a few more examples of maximum likelihood estimation and the method of moments, and the comparison between them, and then we will move on to the concept of interval estimation. So we stop today's lecture at this point.