Let us estimate the population parameters using two different methods. The first one involves the use of moments. I will first give a brief description of the method; it may seem a bit abstract, so I will then demonstrate the technique using some standard examples. What I request you to do is first listen to the procedure, then see how the parameters are estimated using the method of moments, and then work it out yourself and check whether you get the same answer. If there are difficulties in the middle, you can rewind and listen to the steps again.

The first step is to write down the expressions for the moments of the mass function or the density function. The mass function applies to discrete random variables and the density function applies to continuous random variables. We term these the distribution moments or population moments. What we are trying to do is first write down the expressions for the population moments and then equate them with the sample moments. We have already come across the moments of the population, and we will shortly define what is meant by a moment of the sample. When you write down a moment of the population, it is obviously going to be a function of the unknown population parameters. So we have unknowns on one hand; we have to relate them to the knowns, equate them in a suitable fashion, and thereby estimate the parameters. What do we know? We have a sample with us. So we equate the moments of the sample with the moments of the population and solve for the population parameters. The concept is pretty simple.

So when you write down the moments of the population, the expressions are functions of the unknown population parameters. The next step is to equate the moments developed above with the moments of the sample. If, say, two parameters of the population are to be estimated, then we need to write down two moment equations, so that we have two equations in two unknowns which may be solved.

We know that the ordinary moments and the central moments are defined in terms of expectations. For example, the population mean is written as the expected value of X, and the variance sigma squared is written as the expected value of X minus mu whole squared. The first moment of a population is E of X, which is nothing but mu; this is an ordinary moment, taken about 0. The second moment about 0 is E of X squared. We know that E of X minus mu whole squared equals sigma squared, but we saw in the first example set that E of X squared can be written as mu squared plus sigma squared. If you had forgotten, you may kindly refer to the first example set to see that this is indeed so. The population or distribution moments will be functions of the unknown population parameters theta 1, theta 2 and so on, so we have to write down that many moment equations first, one for each unknown parameter.
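Before proceeding, it may help to see the procedure stated compactly in symbols. The shorthand g_k for the k-th population moment written as a function of the parameters is my own notation, not from the lecture:

```latex
E[X^k] = g_k(\theta_1,\ldots,\theta_m), \qquad
m_k' = \frac{1}{n}\sum_{i=1}^{n} x_i^k, \qquad
g_k(\hat{\theta}_1,\ldots,\hat{\theta}_m) = m_k', \quad k = 1,\ldots,m .
```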
The main difficulty is that the moment equations may sometimes involve a combination of these parameters; they may not be explicit. As I said earlier, we have to equate the moments of the population, written as above, with the moments of the sample. The sample's k-th moment is calculated as 1 by n times the sum of x1 to the power of k, x2 to the power of k, and so on up to xn to the power of k, summed over the n random variables chosen in the sample. Please note that we are using the sample to find this moment.

So we can equate the population moments with the sample moments. Let us take k equal to 1. The first moment of the population is the expected value of X, which is mu, and that is equated with the first moment of the sample, 1 by n into sigma i equal to 1 to n of xi, that is, x1 plus x2 plus so on up to xn divided by n, which is x bar. So the sample mean is the moment estimator of the population mean: mu hat is equal to x bar. Very interesting.

We still have not estimated the population variance sigma squared, so we write down the second moment. The sample's second moment is calculated as 1 by n into sigma i equal to 1 to n of xi squared, that is, x1 squared plus x2 squared plus so on up to xn squared divided by n. This may be equated with the distribution's second moment, the expected value of X squared, which we saw just a moment back is mu squared plus sigma squared. Hence we equate the sample's second moment with E of X squared expressed in terms of the parameters to be estimated, mu hat squared plus sigma hat squared.

So that is the procedure. As a general rule, if there are m unknown parameters to be estimated for a population, we write down the first m moments of the distribution as functions of theta 1, theta 2 and so on up to theta m, equate them with the first m moments of the sample, and then solve the m equations to obtain the moment estimators theta 1 hat, theta 2 hat and so on up to theta m hat.

Let us demonstrate this with a simple example. Here we have a random sample comprising X1, X2 and so on up to Xn. The population parameters are mu and sigma squared, and based on the random sample we have to estimate them. We may be tempted to write immediately that the sample mean x bar equals mu hat, the estimated population mean, and we may also be tempted to write that the sample variance s squared equals sigma hat squared. We saw that the sample mean and the sample variance are unbiased estimators of the population mean and the population variance. But we are now going to apply the moment method and see whether these two guesses, mu hat equal to x bar and sigma hat squared equal to s squared, are indeed correct. It is better not to make any assumption beforehand without proper verification.
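As a small aside, here is a minimal Python sketch, entirely my own and not part of the lecture, of how the sample moments might be computed and equated with the population moments; the sample values are hypothetical, and the final two lines anticipate the result we derive next:

```python
import numpy as np

def sample_moment(x, k):
    """k-th raw sample moment: (1/n) * sum of x_i**k."""
    x = np.asarray(x, dtype=float)
    return np.mean(x ** k)

# Hypothetical sample, just for illustration.
x = np.array([4.2, 5.1, 3.8, 6.0, 4.9])

m1 = sample_moment(x, 1)          # equated with E[X] = mu
m2 = sample_moment(x, 2)          # equated with E[X^2] = mu^2 + sigma^2

mu_hat = m1                        # moment estimator of the mean
sigma2_hat = m2 - mu_hat ** 2      # moment estimator of the variance
print(mu_hat, sigma2_hat)
```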
So the expected value of X is mu, and equating it with the first sample moment, sigma xi by n, which is x bar, we indeed have mu hat equal to x bar: the estimated population mean is equal to the sample mean.

Now we write down the second moment of the population, E of X squared, which is equal to mu squared plus sigma squared. On the sample side we have 1 by n into sigma i equal to 1 to n of xi squared, that is, x1 squared plus x2 squared plus so on up to xn squared divided by n; we are just putting k equal to 2 because we are dealing with the second moment.

Now we rewrite 1 by n sigma i equal to 1 to n xi squared in the following way; we are doing some mathematical jugglery to get to the final answer. It can be written as 1 by n sigma i equal to 1 to n of xi minus x bar whole squared, plus x bar squared, where the summation applies only to the term xi minus x bar whole squared. Just verify this; it is not difficult, and I hope you are curious enough to work it out yourself, since these are standard derivations commonly encountered in this field of analysis.

How is it possible? We expand the right-hand side. Take 1 by n outside; inside the summation we have xi squared minus 2 xi x bar plus x bar squared, and then there is the x bar squared outside the brackets. Taking the summation term by term, 2 and x bar are constants and may be taken outside the summation sign, so we have 1 by n into sigma xi squared minus 2 x bar sigma xi plus n x bar squared, plus the x bar squared outside. Since x bar is sigma i equal to 1 to n of xi divided by n, sigma i equal to 1 to n of xi is n x bar, so minus 2 x bar into n x bar becomes minus 2 n x bar squared; and summing the x bar squared term n times gives n x bar squared. So inside the brackets we have sigma xi squared minus 2 n x bar squared plus n x bar squared, which is sigma xi squared minus n x bar squared. Dividing by n, the second piece contributes minus x bar squared, which cancels with the plus x bar squared outside, and we are left with 1 by n sigma i equal to 1 to n of xi squared. So this is where we started and this is where we have ended: the two expressions are equivalent.

With this background, we equate E of X squared, which is mu squared plus sigma squared, with the second sample moment 1 by n sigma i equal to 1 to n of xi squared, now written as 1 by n sigma i equal to 1 to n of xi minus x bar whole squared plus x bar squared. But we just found that mu hat is equal to x bar, so mu hat squared is x bar squared; this x bar squared cancels with the mu hat squared, leaving sigma hat squared equal to the remaining term.
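If you would like to convince yourself numerically, here is a quick check of that identity in Python; this is my own illustration with a hypothetical sample, not something from the lecture:

```python
import numpy as np

# Numerical check of the identity used above:
#   (1/n) * sum(x_i^2) == (1/n) * sum((x_i - xbar)^2) + xbar^2
x = np.array([2.0, 3.5, 5.0, 7.5, 9.0])   # hypothetical sample
xbar = x.mean()

lhs = np.mean(x ** 2)
rhs = np.mean((x - xbar) ** 2) + xbar ** 2
print(lhs, rhs, np.isclose(lhs, rhs))      # the two sides agree
```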
So mu hat squared plus sigma hat squared is equal to 1 by n sigma i equal to 1 to n of xi minus x bar whole squared plus x bar squared. And we know that mu hat squared is nothing but x bar squared; that follows from equating the first sample moment with the first population moment. Once this cancels out, we are left with sigma hat squared equal to 1 by n sigma i equal to 1 to n of xi minus x bar whole squared. A very interesting result: the sum of the squares of the deviations from the sample mean divided by n. But we know by now that it should be n minus 1 rather than n if we were to take the sample variance; the method of moments gives us n here.

So from the above, and using mu hat equal to x bar, we get sigma hat squared equal to 1 by n sigma i equal to 1 to n of xi minus x bar whole squared, and this is a biased estimator of the population variance sigma squared. If we had used 1 by n minus 1 sigma i equal to 1 to n of xi minus x bar whole squared, we would have obtained s squared, the sample variance, which is the unbiased estimator of the population variance. The method of moments instead led us to 1 by n sigma i equal to 1 to n of xi minus x bar whole squared as the estimator of the population variance, which is therefore biased. But even though it is biased, if you take a sufficiently large sample size, the difference between dividing by n and dividing by n minus 1 becomes negligible, and we do not really have to worry about the bias. So again we see the merits of having a large sample size.
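Before we move on to the next technique, here is a small Python illustration of the bias we just discussed; it is my own sketch, with a simulated sample, and it uses numpy's var with the ddof argument to switch between the n and n minus 1 denominators:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=15)   # hypothetical sample, true sigma^2 = 4

xbar = x.mean()
sigma2_mom = np.mean((x - xbar) ** 2)          # method-of-moments estimate, divides by n
s2 = np.var(x, ddof=1)                         # sample variance, divides by n - 1 (unbiased)

print(sigma2_mom, s2)                          # the two differ by the factor (n - 1)/n
print(sigma2_mom, s2 * (len(x) - 1) / len(x))  # sigma2_mom equals (n - 1) * s^2 / n
```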
Now we will go to the next technique, the method of maximum likelihood. You may feel that I have left the method of moments a bit abruptly, but we will shortly be doing an example set where both the method of moments and the method of maximum likelihood are demonstrated using suitable examples.

The first step is to define the likelihood function. Let us do this with a single parameter. Assume that even if there are two parameters in the population, the first parameter is unknown and the second parameter is known. It is a fictitious case, mainly meant for demonstration purposes; later we will take the more general case involving two unknown parameters. So let us represent the probability density function as f of x, theta, a function of the variable x and the parameter theta. We will take a random sample, and the moment you have taken a random sample you are going to do the measurements; that is the purpose of taking the random sample. A sample is available to you, and you are going to take the height, weight or marks in a particular subject, or, if it is a specimen from an industrial production unit, you may be subjecting the specimens drawn as the sample to certain tests, compressive strength, strain limit and things like that. You denote the observed values as small x1, x2 up to xn; the small x values denote the values taken by random variable X1, random variable X2 and so on up to random variable Xn. The random variable is denoted by capital X and the value taken by the random variable is denoted by small x.

Now we define the likelihood function of the sample, L of theta, where theta is the single unknown parameter, as the product f of x1, theta into f of x2, theta and so on up to f of xn, theta. What we are doing here is plugging the sample values into the probability density function. Even though x1, x2 and so on up to xn have taken specific values, we keep them general for now and can substitute the numbers at the end; do not be in a hurry to plug in the actual values of the random sample here.

Now, as the name maximum likelihood implies, what we are trying to do is to maximize this function. So we have f of x1, theta into f of x2, theta and so on up to f of xn, theta, and that function will be differentiated with respect to the unknown parameter theta: we want to find the value of theta which maximizes the likelihood function. We are not going to differentiate with respect to x, for which x would you use, x1 or x2 or xn? Many of us are used to maximizing or minimizing a function by differentiating with respect to x, so we may be tempted to do the same thing here, but we have to differentiate with respect to the unknown parameter or parameters theta. There can be more than one parameter, and when there is more than one you have to partially differentiate the likelihood function with respect to each of the unknown parameters. But let us not be in a hurry; we will come to that a bit later. First let us take the simple case involving a single parameter and differentiate the likelihood function with respect to it.

So the density function f of x, mu equal to 1 by root 2 pi sigma into exponential of minus x minus mu whole squared by 2 sigma squared represents the very frequently encountered normal distribution. We write mu because we assume sigma squared to be known and take only mu to be the unknown parameter to be estimated; this is for demonstration purposes. In several classes from now on we will also be assuming that sigma squared is somehow known to us. This is a kind of artificial construct, because mu and sigma squared are both unknown in practice; sigma squared represents the spread about mu, and if you do not know mu, how will you find the spread? Anyway, for the time being we assume that sigma squared is known and mu is unknown, for the purpose of demonstration.

Now we can write the likelihood function in terms of mu. We have taken a random sample of size n, and we plug in x1 for the first random variable value, then x2, then x3 and so on up to xn, so we have n such factors. When you multiply all these factors, f of x1, mu into f of x2, mu and so on up to f of xn, mu, the product can be represented compactly as 1 by 2 pi sigma squared to the power of n by 2 times an exponential term; the n by 2 arises because each factor carries a square root.
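As an aside, the product of densities can be transcribed directly into a short Python function before we simplify it by hand; this is my own sketch with hypothetical numbers, not part of the lecture, and it treats sigma as known:

```python
import numpy as np

def likelihood_mu(mu, x, sigma):
    """Likelihood L(mu) for a normal sample with sigma assumed known.

    Direct transcription of the product of normal densities; variable names are mine.
    """
    x = np.asarray(x, dtype=float)
    dens = np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)
    return np.prod(dens)

# Hypothetical sample with sigma taken as known (= 2.0).
x = np.array([9.1, 10.4, 11.2, 9.8, 10.5])
print(likelihood_mu(10.0, x, sigma=2.0))
print(likelihood_mu(12.0, x, sigma=2.0))   # smaller: 12 is a less plausible mean for this sample
```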
Coming back to the compact form: we are writing root 2 pi sigma as the square root of 2 pi sigma squared, and when it is multiplied n times we get 1 by 2 pi sigma squared to the power of n by 2. The exponential term is again very interesting: 1 by 2 sigma squared is common, and when you multiply exponential terms the arguments add up, so here you get sigma i equal to 1 to n of xi minus mu whole squared. So the likelihood function is L of mu equal to 1 by 2 pi sigma squared to the power of n by 2 into exponential of minus 1 by 2 sigma squared into sigma i equal to 1 to n of xi minus mu whole squared.

We can differentiate this function directly with respect to mu, because mu is the unknown parameter, or, to make life easier for us, we can take the natural logarithm on both sides. After you have taken the natural log you will get n by 2 into ln of 1 by 2 pi sigma squared, and then you have ln of e to the power of minus this term; ln of e to the power p is p ln e, which is p. So the second term is minus 1 by 2 sigma squared into sigma i equal to 1 to n of xi minus mu whole squared. This is the logarithm of the likelihood function.

Now we differentiate it with respect to the unknown parameter mu and equate it to 0. To see whether the solution indeed leads to a maximum value of L, we should also take the second derivative; we know from calculus that the second derivative should be negative. We will leave that verification as an exercise and take only the first derivative. Since the function depends on a single parameter, instead of writing dou L by dou mu we should actually write dL by d mu, so I will just make that correction here. The differentiation of the constant term gives 0, and when you differentiate the second term with respect to mu, the chain rule brings down 2 into xi minus mu with a minus sign; that minus cancels with the minus in front, the 2 cancels with the 2 in the denominator, and sigma squared, being a constant, can be taken out. You are left with sigma i equal to 1 to n of xi minus mu equal to 0.

So you essentially have to solve for mu. When you do that, mu summed n times becomes n mu, and the other part becomes sigma xi, so mu is nothing but sigma xi divided by n, which is the sample mean. We generalize this as mu hat equal to x bar: the sample mean is the maximum likelihood estimator of the population parameter mu. Here we stopped with a single equation because we had only one parameter to estimate; we assumed that sigma squared was already known to us.
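If you want to verify numerically that the maximizer really is the sample mean, here is a simple grid search over the log-likelihood; again this is my own sketch with a hypothetical sample, not from the lecture:

```python
import numpy as np

x = np.array([9.1, 10.4, 11.2, 9.8, 10.5])   # hypothetical sample
sigma = 2.0                                   # sigma^2 treated as known

def log_likelihood_mu(mu):
    # ln L(mu) = (n/2) ln(1/(2 pi sigma^2)) - (1/(2 sigma^2)) * sum (x_i - mu)^2
    n = len(x)
    return (n / 2.0) * np.log(1.0 / (2.0 * np.pi * sigma ** 2)) \
           - np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)

grid = np.linspace(5.0, 15.0, 10001)
best_mu = grid[np.argmax([log_likelihood_mu(m) for m in grid])]
print(best_mu, x.mean())   # the maximizer coincides (numerically) with the sample mean
```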
Now let us look at the population described by the normal curve in the general case, where neither mu nor sigma squared is known to us. In this second example we have two parameters to be estimated, and the density is f of x, mu, sigma squared equal to 1 by root 2 pi sigma into exponential of minus x minus mu whole squared by 2 sigma squared. We take the random sample, which takes the values x1, x2 and so on up to xn, and we write L of mu, sigma squared as the product of the density evaluated at each observation: the product from i equal to 1, the first sample value, to i equal to n, the last, of 1 by root 2 pi sigma into e to the power of minus xi minus mu whole squared by 2 sigma squared. Just as we had the summation sign sigma, we have the product sign here. When you take the product, this again becomes 1 by 2 pi sigma squared to the power of n by 2, and the exponential arguments add up. So ln of L is equal to n by 2 into ln of 1 by 2 pi sigma squared, minus 1 by 2 sigma squared into sigma i equal to 1 to n of xi minus mu whole squared.

Remember that we have two unknown parameters, mu and sigma squared, so we have an expression in two unknown variables. This ln of L should be partially differentiated with respect to mu to give the first equation and with respect to sigma squared to give the second equation. We now use the partial differentiation sign because there are two parameters. Differentiating first with respect to mu, we get 1 by L dou L by dou mu, which is 1 by sigma squared into sigma i equal to 1 to n of xi minus mu, set equal to 0.

The next expression is a bit more cumbersome: 1 by L dou L by dou sigma squared. Remember, you are differentiating with respect to sigma squared directly, so you have to treat sigma squared as the variable; do not think in terms of sigma. If that is confusing, put sigma squared equal to p. For the first term, n is a constant, and you can write n by 2 into ln of 1 minus ln of 2 pi sigma squared; the only piece that matters for the differentiation is ln of sigma squared, and differentiating ln of sigma squared with respect to sigma squared gives 1 by sigma squared, with the n by 2 outside. So after differentiation you get minus n by 2 into 1 by sigma squared. Similarly, once the dust settles for the second term, you are differentiating 1 by sigma squared with respect to sigma squared; just as the derivative of 1 by x with respect to x is minus 1 by x squared, this gives minus 1 by sigma to the power of 4. That minus combines with the existing minus to become plus, so you get plus 1 by 2 sigma to the power of 4 into sigma i equal to 1 to n of xi minus mu whole squared, and the whole expression is set equal to 0.

From these two equations we find that mu hat is equal to x bar and sigma hat squared is equal to 1 by n sigma i equal to 1 to n of xi minus x bar whole squared. So again we have a biased estimator: we take the sum of the squares of the deviations of the sample values from the sample mean and divide by n, not by n minus 1.
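The two likelihood equations and their solutions can be written compactly as follows; this is just the algebra described above set in symbols:

```latex
\frac{\partial \ln L}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\qquad\Longrightarrow\qquad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}

\frac{\partial \ln L}{\partial \sigma^2}
  = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
\qquad\Longrightarrow\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
```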
So this expression is nothing but n minus 1 into s squared divided by n, and hence it leads to a bias in the estimation of sigma hat squared, that is, in the estimation of the population variance.

So what are the properties of maximum likelihood estimators? When the sample size is large and theta hat is the maximum likelihood estimator of theta, then theta hat is an approximately unbiased estimator of theta, and the variance of theta hat is nearly as small as the variance that could be obtained with any other estimator. These two properties imply that the maximum likelihood estimator is approximately a minimum variance unbiased estimator. Another important property, whose proof we are not looking into, is that theta hat has an approximately normal distribution.

So this concludes our presentation on maximum likelihood estimation. It is a very interesting and useful thing to know. We will not really be using it further in our discussions, but it is good to be aware of it. It is also important to know whether the statistics of the sample we take lead to unbiased estimation of the population parameters. In our inferential statistics we are going to use the sample mean and the sample variance very frequently, and it is essential that we understand their properties, so this discussion really helped us. We also saw that if, instead of using n minus 1 in the denominator of the sample variance, we had used n, that would have led to a biased estimate of the population variance, which is not really very good, especially in the case of small samples.

From now on, we are going more into the analysis of samples. We are going to work with samples, and we are going to consider x bar instead of x more and more, treating x bar as a single entity rather than as a collection of n random variables, even though it is indisputably built from those n random variables. We are also going to discuss another important and interesting topic: we have covered point estimation so far, and next we will talk about interval estimation of the population parameters. All these things lead to useful techniques and procedures that are essential in the design of experiments and the analysis of experimental data. So we will wind up now; we will be looking at interval estimation shortly, and we will also do a few problems to drive home the concepts we have studied so far. Thank you.