Today, we will start the second module of this course, on parametric methods. I will first introduce the problem of statistical inference. Broadly speaking, if we consider our day-to-day practice of making scientific statements, then statements of this kind arise in various areas. For example, in agriculture, we ask what the per-hectare production of wheat is, say in the state of Punjab. Is it more than the per-hectare production in the state of, say, Tamil Nadu? We make statements about the average size of the holding of Indian farmers. In atmospheric sciences, we make statements like: what is the expected rainfall during the next monsoon season in the Indian Peninsula? Or, if a cyclone is approaching, what will the average wind speed be during the peak of the cyclone? In medicine, suppose a treatment is discovered for a particular kind of disease; then we will be interested in knowing the average effectiveness of the medicine. That means, out of n people, how many will be effectively cured of the disease by taking that particular line of treatment? Or, if another medicine is already available, whether the new medicine is more effective than the previous one, less effective, or equally effective, and whether it is a more costly treatment than the previous one. Statements of this nature abound in every area of human activity, be it economics, social science, industry, trade, physics and so on.

Now, statisticians treat such problems as problems of statistical inference, which we broadly classify into a few categories. We first consider the concept of a population: a statistical population is a collection of measurements on some characteristic of interest. For example, it could be the heights of adult males in a city; it could be the production of wheat per farm in the state of, say, Punjab; it could be the responses in favor of or against an ordinance by the government; it could be the age at marriage of adult females in an ethnic group; in trade, it could be the prices of a particular stock on each day over a month. All of these are examples of statistical populations. As I mentioned a little while ago, a typical problem of statistical inference could be to estimate the average height of adult males in a city, or the average production of wheat, or the proportion of people who favor a particular ordinance, or the average age at marriage of adult females in an ethnic group, or the average price of a particular stock. To answer these questions, we deal with the populations of measurements on these characteristics.

Now, there are two ways of looking at this. We may assume a distributional model for these measurements, and of course there are methods to determine what an appropriate distribution would be. For example, heights of adult males in a city could follow a normal distribution; the responses in favor or against may follow a binomial distribution; age at marriage may follow, say, a gamma distribution, and so on. There are methods of fitting distributions; right now, we are not going to discuss that.
But once we have fitted a model, then we can say that this statistical population can be described by a probability distribution, say F(x). Now, here we have an option. As I mentioned, if we assume a model like the normal distribution, the binomial distribution, the exponential distribution and so on, then this F is specified, but the parameter of the distribution may be unknown. In that case we write the distribution as F(x; θ); that means it is characterized by a parameter θ. This θ is called a parameter, and it could of course be a scalar or a vector. For example, if we write a binomial(n, p) distribution, then the parameter could be p or it could be (n, p): if the total number of trials n is known, then the parameter is p; otherwise it is (n, p). If I consider an exponential distribution with parameter λ, then the parameter is λ; if I say normal(μ, σ²), then my parameter is (μ, σ²). That means, by a parameter we refer to the characteristics of the population, in whatever way they may be defined. For example, in the normal distribution μ denotes the mean and σ² denotes the variance, whereas in a binomial distribution n denotes the total number of trials and p denotes the probability of success in each trial. In an exponential distribution, λ is actually the reciprocal of the average; 1/λ is the mean.

The problem of parametric inference is to make certain statements about the unknown parameter of the population. For example, I want to estimate the average height of adult males. This brings us to one area of statistical inference, called estimation. So, broadly speaking, let us divide the problem of statistical inference into parts. One is estimation, where I want to know the value of the parameter through some method. The other problem is that of testing of hypotheses. For example, I would like to check whether the average height of adult males in one city is different from that in another city, or whether it is less, or more. If we want to do a test, then the statement has to be in the form of an assertion, together with an alternative to it, namely that it is not true. So, broadly speaking, we subdivide the problem of statistical inference into estimation and testing of hypotheses. Later on, I will also discuss the case when there is no θ here, that is, when the distributional model itself is not known; that is called the problem of nonparametric methods, or nonparametric inference, and we will discuss it in another module of this course. Right now, we assume that our population is partially specified, in the sense that F is specified but the parameters may not be. If the parameters are not specified and we want to make certain inferences about them, then we are dealing with parametric methods, as I mentioned.

Now, let us consider the problem of estimation. In the problem of estimation, we proceed on the basis of a random sample, because we do not have the full population with us. To make the statistical inference, a sample from the population is taken, and the sample is used to draw the inference, which could be in the form of an estimator. Now, an estimator can be of two types: it can be a point estimator, or it can be an interval.
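Before the height example below, here is a minimal simulation sketch contrasting the two kinds of estimators. The sample, the true values 72 and 1, and the use of a normal-approximation 95 percent interval are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=72.0, scale=1.0, size=40)    # hypothetical heights (inches)

point = x.mean()                                # a point estimate: one number
half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))   # 95% normal-approx half-width
print(point)                                    # a single value for the mean
print(point - half, point + half)               # an interval of plausible values
```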
For example, we may say the average height of adult males is 6 feet. Then we are giving a single value for the average height, and this is called a point estimator. On the other hand, we may say that 95 percent of the time the average height will lie between 5 feet 11 inches and 6 feet 1 inch. Then we are giving an interval of values and, at the same time, associating a probabilistic statement with it; that means, roughly how often the statement is likely to be true. This is known as a confidence interval, or interval estimation. So, we subdivide the problem of estimation into the problem of point estimation and the problem of interval estimation.

To begin with, I consider the problem of point estimation. Let X1, X2, ..., Xn be a random sample from a population with distribution F(x; θ). As I mentioned, θ could be a scalar or a vector, and it lies in a space of values called the parameter space. We will use a statistic T(X1, X2, ..., Xn), and let us use the abbreviation x for the vector (x1, x2, ..., xn), to estimate a parametric function, say g(θ). Now the question is: what is the nature of g(θ), and what is the nature of T? Let us consider binomial(n, p). We may be interested in estimating the probability of success. For example, consider the responses in favor of an ordinance. We take a sample of size n and record the number of responses in favor; call that number X. Then the sample proportion X/n can be used as an estimator of the probability of a favorable response. Here X counts the favorable responses: each yes is recorded as 1, each no as 0, and adding the responses gives X. So it is a function of the observations, and we are using it to estimate the proportion. In a similar way, suppose I take the problem of normal(μ, σ²), and suppose I am considering the heights of adult males, which I assume follow a normal(μ, σ²) distribution. The problem is to estimate μ. I take a sample of adult males and record their heights; call them X1, X2, ..., Xn. I can use the sample mean x̄ to estimate μ.

Now the question arises: by what methodology can I propose estimators, and what are the criteria for choosing among them? Let us look at this problem. For example, to estimate the mean μ of a normal(μ, σ²) population we may use T1(x) = x̄. We may also use T2(x), the sample median: if one uses the logic that μ is also the median of a normal distribution, one may consider the median of the observations. Or we may use T3(x), the sample mode of x1, x2, ..., xn, if I give μ the interpretation that it is the point where the density function attains its mode. Now the question is that, in a given problem, T1, T2 and T3 may take different values, and therefore which one should be used?
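To see that the three estimators really do disagree on the same data, here is a minimal sketch; the population values, the sample size, and the crude histogram-based stand-in for a sample "mode" are all my illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu, sigma, n = 170.0, 8.0, 200           # hypothetical height population (cm)
x = rng.normal(mu, sigma, size=n)

t1 = x.mean()                            # T1: sample mean
t2 = np.median(x)                        # T2: sample median

# T3: a crude sample "mode" -- midpoint of the tallest histogram bin
counts, edges = np.histogram(x, bins=20)
k = counts.argmax()
t3 = (edges[k] + edges[k + 1]) / 2

print(t1, t2, t3)    # three different numbers, all aimed at the same mu
```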
I can give another analogy here. Suppose I consider σ², the variance. For the variance one may consider (1/n) Σ (xi − x̄)², the way we define the variance; this could be, say, my u1(x). But one may also consider other options in its place: somebody may consider the mean deviation about the mean, or the mean deviation about the median. Once again, if two or more estimators are available, what should be our method to analyze them? There is another problem: these estimators were written down in a somewhat heuristic way. If μ is the mean of the distribution, I consider the sample mean; if I give the interpretation that μ is the median, I take the sample median; if I take the interpretation that μ is the mode, then I take the sample mode. But there may be other parameters, or parametric functions, for which such a direct interpretation is not available, and in that case what should be our method of obtaining estimators? So, there are two problems: one is methods for obtaining estimators, and the other is criteria for identifying good estimators. We will take up both of these topics. Let me first consider some criteria for good estimators; then, when we obtain estimators by some methodology, we will check whether those criteria are satisfied.

Now, one may argue in this way: an estimator is based on the sample, and if I take a sample at another point of time, or some other person also takes a sample and gets an estimate, then on the average it should equal the true value of the parameter. This can be modeled in statistical terms as the criterion of unbiasedness. An estimator T(X) is said to be unbiased for estimating g(θ) if E(T(X)) = g(θ) for all θ. That means, on the average, the expected value of T is the same as the true value. If T is not unbiased, then it is said to be biased; for example, if E(T(X)) = g(θ) + b(θ), then b(θ) is said to be the bias of T in estimating g(θ).

Let us consider an example. Let X1, X2, ..., Xn be a random sample from a geometric distribution. For the geometric distribution, which we introduced in the first module, we consider the probability mass function f(x; p) = (1 − p)^(x−1) p for x = 1, 2, and so on. For this distribution the mean is 1/p. Suppose we want to estimate the mean, that is, the first moment μ1′ = 1/p. Then I can consider T(X) = x̄, so that E(T(X)) = E(x̄) = (1/n) E(Σ Xi) = (1/n) Σ E(Xi). Now each Xi has mean 1/p, so this becomes (n/n)(1/p) = 1/p. Therefore x̄ is an unbiased estimator of the mean of the geometric distribution.
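This unbiasedness is easy to check by simulation; a minimal sketch, with p = 0.25, the sample size, and the number of replications chosen arbitrarily (note that numpy's geometric generator matches the pmf above, with support 1, 2, ...).

```python
import numpy as np

rng = np.random.default_rng(seed=3)
p, n, reps = 0.25, 30, 20_000            # true p, sample size, replications

# numpy's geometric matches the pmf (1-p)^(x-1) p on x = 1, 2, ...
samples = rng.geometric(p, size=(reps, n))
xbars = samples.mean(axis=1)             # one x-bar per replication

print(xbars.mean(), 1 / p)               # the average of x-bar hovers near 1/p = 4.0
```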
Let us take another example. Let X1, X2, ..., Xn follow a gamma distribution with parameters r and λ; that means I am considering the density function (λ^r / Γ(r)) e^(−λx) x^(r−1) for x > 0. Suppose I want to estimate λ; let me call it g1(λ) = λ. Suppose I also want to estimate g2(λ) = r/λ, which is actually the mean of this distribution. We consider r to be known here, and let us first consider x̄. By the same argument as before, E(x̄) = (1/n) Σ E(Xi), and each Xi has mean r/λ, so this becomes r/λ = g2(λ). Hence x̄ is unbiased for g2(λ).

Now, to consider the estimation of λ, let us proceed in a slightly different way. Let me consider Y = Σ Xi. Then the distribution of Y is gamma: in the previous module I mentioned that the gamma distribution is additive when the scale parameter is kept fixed. Therefore Y follows gamma(nr, λ); that means the density of Y is (λ^(nr) / Γ(nr)) e^(−λy) y^(nr−1) for y > 0. If I consider E(1/Y), it equals the integral from 0 to ∞ of (1/y) f(y) dy, which is (λ^(nr) / Γ(nr)) times the integral from 0 to ∞ of e^(−λy) y^(nr−2) dy. That is (λ^(nr) / Γ(nr)) · Γ(nr − 1) / λ^(nr−1), which gives λ/(nr − 1). Since I am assuming r to be known here, I can adjust this coefficient; that means E((nr − 1)/Y) = λ. So T2 = (nr − 1)/Σ Xi is an unbiased estimator of λ.

So, the problem of unbiased estimation can be solved by suitably choosing the functions. However, this method is rather heuristic in nature; the more general form is to write down an equation of the type E(T(X)) = g(θ) for all θ and solve it. We do, however, have some general results, as seen in the geometric and gamma examples: if I want to estimate the mean of the distribution, I can take the sample mean, and it is unbiased. So, if the moment exists, that is, if the mean is defined, then the sample mean is always unbiased for the population mean: if μ = E(X) exists, then the sample mean x̄ is unbiased for μ. In fact, we may prove a slightly more general result. Suppose σ² = Var(X) exists; then we define the sample variance s² = (1/(n − 1)) Σ (Xi − x̄)², and this is unbiased for σ². These results are true irrespective of the distribution; the only conditions are the existence of μ and σ², respectively. So these results are quite useful, and they are used to derive heuristic estimators. For example, in the case of the gamma distribution I could easily derive the estimator of r/λ. However, to get an estimator of λ I considered an improvisation: since Σ Xi has expectation proportional to 1/λ, the role is reversed when estimating λ, and that is why I considered 1/Y, that is, 1/Σ Xi. And here some distribution theory is used, namely that a sum of independent gamma variables with a common scale is again gamma.
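The correction factor (nr − 1) matters; a minimal sketch comparing T2 with the naive reciprocal r/x̄ = nr/Σ Xi, under arbitrarily chosen values of r, λ and n (numpy's gamma generator is parametrized by shape and scale = 1/λ).

```python
import numpy as np

rng = np.random.default_rng(seed=5)
r, lam, n, reps = 3.0, 2.0, 25, 20_000   # known shape r, rate lambda, sample size

# numpy parametrizes the gamma by shape and scale = 1/lambda
samples = rng.gamma(shape=r, scale=1/lam, size=(reps, n))
t2 = (n * r - 1) / samples.sum(axis=1)   # T2 = (nr - 1) / sum(X_i)

print(t2.mean(), lam)                        # Monte Carlo average close to lambda = 2.0
print((r / samples.mean(axis=1)).mean())     # naive nr / sum(X_i) is biased upward
```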
Similarly, in the case of the geometric distribution, suppose I want to estimate the probability p of success in an individual trial; then I will have to consider a function of 1/Σ Xi. Of course, Σ Xi follows a negative binomial distribution, so I can use the properties of that negative binomial distribution and construct an unbiased estimator in a similar way.

So, although unbiased estimators have a nice heuristic justification, namely that on the average the estimated value equals the true value of the parameter, there can sometimes be problems. For example, sometimes unbiased estimators do not exist. To take a very simple example, let X follow a Poisson(λ) distribution, and consider g(λ) = 1/λ. Let T be unbiased for g(λ); then we should have E(T) = 1/λ for all λ. This implies Σ T(x) e^(−λ) λ^x / x! = 1/λ for all λ > 0. We can rewrite this condition as Σ T(x) λ^x / x! = e^λ / λ, and if I expand the right-hand side it is (1/λ)(1 + λ/1! + λ²/2! + ...). So this must hold for all λ > 0. If you look at this carefully, the left-hand side is a power series in λ, while the right-hand side is a power series in λ together with a term 1/λ; basically, a Laurent series appears here. The two series cannot agree on an interval, because if they agreed on an interval all the coefficients would have to match, and no power series can produce the 1/λ term. So this is impossible; that means an unbiased estimator of 1/λ does not exist.

A second thing is that sometimes unbiased estimators may be unreasonable. For example, in the above example take g1(λ) = e^(−2λ). Let us look at the same condition: Σ T(x) λ^x / x! = e^λ · e^(−2λ) = e^(−λ). Writing down the terms, T(0) + T(1) λ/1! + T(2) λ²/2! + ... = 1 − λ/1! + λ²/2! − ..., and matching coefficients gives T(0) = 1, T(1) = −1, T(2) = +1, T(3) = −1, and so on; that is, T(x) = (−1)^x. Now look at the problem: the function g1(λ) = e^(−2λ) always satisfies 0 < e^(−2λ) < 1, since λ is positive. Yet this unbiased estimator only takes the values +1 and −1. So it is an unreasonable estimator: to estimate a parametric function lying strictly between 0 and 1, we use the values +1 and −1, because depending on the observed x we report either +1 or −1. So unbiasedness alone does not give a good estimator here. One thing I would like to mention right here: we introduced the criterion of unbiasedness based on the reasonable idea that if we repeat the sampling a large number of times, then on the average the estimated value should equal the true parameter value. But this is only one criterion, and we have seen that sometimes an unbiased estimator may not exist, or even if it exists it may not be reasonable.
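This pathology is easy to exhibit by simulation; a minimal sketch, with λ = 1.5 chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
lam, reps = 1.5, 200_000

x = rng.poisson(lam, size=reps)
t = (-1.0) ** x                     # the unbiased estimator T(x) = (-1)^x

print(t.mean(), np.exp(-2 * lam))   # the averages agree (~0.0498): T is unbiased...
print(np.unique(t))                 # ...yet every single estimate is -1.0 or +1.0
```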
Therefore, we have some other criteria also; let us look at one or two of them. Another criterion is that of consistency. Now we consider the estimator T; let me write it as Tn, because it depends on n observations. Tn is said to be a consistent estimator of g(θ) if Tn converges to g(θ) in probability. If you remember, in the previous lecture we introduced the concept of convergence in probability: for each ε > 0, the probability P(|Tn − g(θ)| > ε) goes to 0 as n tends to infinity. If we give a physical interpretation of this condition, it means that as the sample size increases, that is, if the estimator is based on a larger sample, then the probability of its being close to the true value increases, because the probability of its being away from the true value decreases to 0. So consistency is a large-sample property: as we take more and more observations, we approach the true value, at least in the sense of probability. Now, recall the weak law of large numbers: if the mean of the observations is μ, then the sample mean converges to μ in probability, so the sample mean is consistent for the population mean. We also saw the strong law of large numbers. We therefore call convergence in probability weak consistency, and if Tn converges to g(θ) almost surely, it is called strong consistency; this terminology keeps the analogy with the weak and strong laws of large numbers. So x̄ is weakly consistent for μ, the mean of the population, by the weak law; of course, one has to assume that μ exists. Sometimes we may not be able to apply the weak or strong law of large numbers, and then one may try a direct verification.

Let me take an example here which is different from the consistency of the sample mean. Let X1, X2, ..., Xn follow the uniform distribution on the interval (0, θ). Since all the observations lie between 0 and θ, an estimator for θ can be X_(n), the maximum of X1, X2, ..., Xn. Let us check the consistency of X_(n) for θ. Consider P(|X_(n) − θ| > ε); since all the observations are between 0 and θ, θ is bigger than X_(n), so this probability is the same as P(θ − X_(n) > ε), which equals P(X_(n) < θ − ε), assuming ε < θ. Now, this is the same as the probability that each of X1, ..., Xn is less than θ − ε, because the maximum of the observations is less than θ − ε exactly when each observation is less than θ − ε. Because of independence, this simply becomes [F(θ − ε)]^n = ((θ − ε)/θ)^n, where F is the cdf of a single observation, and this naturally goes to 0 as n tends to infinity. So X_(n) is consistent for θ. In this case we have not used the law of large numbers; rather, we have gone for a direct verification of the result.
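The same calculation can be checked by simulation; a minimal sketch comparing the empirical probability with the exact value ((θ − ε)/θ)^n, with θ = 1 and ε = 0.05 chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
theta, eps, reps = 1.0, 0.05, 50_000

for n in (10, 50, 250):
    x_max = rng.uniform(0, theta, size=(reps, n)).max(axis=1)   # X_(n) per sample
    empirical = (np.abs(x_max - theta) > eps).mean()
    exact = ((theta - eps) / theta) ** n
    print(n, empirical, exact)      # both columns shrink toward 0 as n grows
```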
Now, we may also have a situation where there are two estimators: both may be unbiased, both may be consistent, or one may be biased and one unbiased, and then how do we compare them? For this we introduce the mean squared error criterion. The mean squared error, which we call the MSE, of T for g(θ) is defined as MSE(T) = E(T − g(θ))², and in terms of this we can make comparisons, because the smaller the mean squared error, the better the estimator. We say estimator T1 is better than T2 in the sense of mean squared error if MSE(T1) ≤ MSE(T2) for all θ, with strict inequality for at least one θ′ belonging to the parameter space Θ. Now, in case T is unbiased for g(θ), the mean squared error of T becomes E(T − E(T))², which is nothing but the variance of T. So an estimator T is said to be a uniformly minimum variance unbiased estimator of g(θ) if T has the smallest variance among all unbiased estimators of g(θ), over the full parameter space.

Now, the question is how to determine minimum variance unbiased estimators, or how to obtain estimators with smaller mean squared error; for that there are certain techniques. For example, to obtain a uniformly minimum variance unbiased estimator we have the method of lower bounds for the variance, and a second method based on sufficiency and completeness. We will not discuss these methods in detail in this particular course, as they have been discussed in another course on statistical inference which is also available on NPTEL. Here I will briefly mention the concepts of sufficiency and completeness; regarding the lower bounds, I will only mention the idea, because the bounds entail various conditions on the density function and so become a bit theoretical, and this is a course on statistical methodology. In the method of lower bounds for the variance, what we do is this: if T is an unbiased estimator for g(θ), then the variance of T must always be greater than or equal to a certain number. If that is so, and if I am able to obtain an estimator T whose variance equals that bound, then certainly it will be the minimum variance unbiased estimator. There are bounds such as the Fréchet-Rao-Cramér lower bound, the Bhattacharyya bounds, and the Chapman-Robbins-Kiefer bounds. For a detailed discussion of these bounds, you may look at the lectures on statistical inference.

Let me briefly mention the concepts of sufficiency and completeness here. First, sufficient statistics. We have the same setup: a random sample X1, X2, ..., Xn from a population with distribution F(x; θ). A statistic T(X) is said to be sufficient if the conditional distribution of X1, X2, ..., Xn given T = t is independent of θ, almost everywhere.
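To see concretely what this definition asserts, here is a small simulation sketch for the Poisson example worked out just below: conditionally on T = Σ Xi, the data carry no further information about λ. The comparison with a Binomial(t, 1/n) law for X1 given T = t follows from the multinomial conditional distribution derived below; the specific values of n, t and λ are arbitrary choices.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=4)
n, t, reps = 5, 10, 400_000

# X1 | T = t is Binomial(t, 1/n); exact pmf at 0..3 for reference
exact = [math.comb(t, k) * (1/n)**k * (1 - 1/n)**(t - k) for k in range(4)]
print("binomial(10, 0.2):", np.round(exact, 3))

for lam in (1.0, 2.0, 4.0):
    x = rng.poisson(lam, size=(reps, n))
    keep = x[x.sum(axis=1) == t]                  # condition on T = sum X_i = 10
    freq = np.bincount(keep[:, 0], minlength=t + 1)[:4] / len(keep)
    print("lambda =", lam, np.round(freq, 3))     # same conditional law for every lambda
```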
To give a simple example, suppose X1, X2, ..., Xn follow a Poisson(λ) distribution, and I consider T = Σ Xi, i = 1 to n. If I consider the conditional distribution of X1, X2, ..., Xn given T = t, then certainly this probability is 0 if t ≠ Σ xi. If t = Σ xi, then this probability equals P(X1 = x1, ..., Xn = xn, T = t) divided by P(T = t). Now, t = x1 + x2 + ... + x_{n−1} + xn; so if the value of t is fixed, the value of xn is also fixed. We can therefore write this as P(X1 = x1, ..., X_{n−1} = x_{n−1}, Xn = t − Σ_{i=1}^{n−1} xi) divided by P(T = t). Using independence, this becomes P(X1 = x1) · ... · P(X_{n−1} = x_{n−1}) · P(Xn = t − Σ_{i=1}^{n−1} xi), divided by P(T = t). We have also seen the additive property of the Poisson distribution: if X1, X2, ..., Xn are independent Poisson(λ), then Σ Xi follows a Poisson(nλ) distribution. Making use of this fact in the calculation, the expression becomes [e^(−λ) λ^(x1) / x1!] · ... · [e^(−λ) λ^(x_{n−1}) / x_{n−1}!] · [e^(−λ) λ^(t − Σ xi) / (t − Σ xi)!], divided by e^(−nλ) (nλ)^t / t!, since T follows Poisson(nλ). You can easily see that λ cancels out, and we are left with t! / (x1! · ... · x_{n−1}! · (t − Σ_{i=1}^{n−1} xi)!) multiplied by (1/n)^t. This is independent of λ; so T = Σ Xi is a sufficient statistic here.

Now, we have a very strong result: if, in a given decision problem or estimation problem, a sufficient statistic exists, we can make use of it to create better estimators, and ultimately this leads to minimum variance unbiased estimators. This result is called the Rao-Blackwell theorem: if T is sufficient and U(X) is unbiased for g(θ), then h(T) = E(U(X) | T) is also unbiased for g(θ), and Var(h(T)) ≤ Var(U(X)) for all θ. We will show that by combining this concept with another concept, completeness, we can actually get the minimum variance unbiased estimator in estimation problems where unbiased estimators exist.
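A small sketch of the Rao-Blackwell improvement in the Poisson example: take the wasteful unbiased estimator U = X1, and condition on the sufficient statistic T = Σ Xi. The identity E(X1 | T) = T/n = x̄ is a standard fact following by symmetry; it is not derived in the lecture, and the values of λ, n and the replication count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=8)
lam, n, reps = 3.0, 20, 50_000

x = rng.poisson(lam, size=(reps, n))
u = x[:, 0]                      # U = X1: unbiased for lambda, but ignores most data
h = x.mean(axis=1)               # h(T) = E(X1 | T) = T/n = x-bar (Rao-Blackwellized)

print(u.mean(), h.mean())        # both close to lambda = 3.0: both unbiased
print(u.var(), h.var())          # ~3.0 versus ~0.15 = lambda/n: a drastic reduction
```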
Before that, let us mention another result. This method of proving sufficiency directly from the definition is somewhat complicated. In this particular case I was able to guess a statistic which I could then prove to be sufficient, but in general problems it may not be so easy; another difficulty is that it involves the computation of the conditional distribution, which again may not be easy. Here we were dealing with a discrete distribution, so it was easy to derive the conditional distribution; but if we were dealing with a continuous distribution, this interpretation would not be available, and the calculation of the conditional density might involve a lot of algebraic work. So there is another result, called the Neyman-Fisher factorization theorem, which involves writing the joint distribution of the observations, f(x1, x2, ..., xn; θ), as a product of two terms: one term g_θ(T(x)), and another term h(x). If this factorization is available, that is, if one term involves the parameter only through the statistic T while the second term is free of θ, then T(X) is sufficient. Now, this is a necessary and sufficient condition, under certain regularity conditions of course, and therefore it can easily be used for obtaining sufficient statistics in a given problem. In the following lecture I will also introduce the concept of completeness and show how it can be used to derive uniformly minimum variance unbiased estimators. I will also introduce the method of moments and the maximum likelihood estimator as methods for deriving estimators.