So far we have been talking about data assimilation in deterministic models, both static and dynamic. The basic assumption was that the model is perfect. To do data assimilation I need observations. When we did the static deterministic model, even though we recognize that observations in general are corrupted by noise, to make things simple we assumed the observations did not have any noise. When we came to dynamic data assimilation under the perfect-model assumption, we assumed the observations are noisy and that we know the observational covariance. With this we have pretty much completed data assimilation in static deterministic and dynamic deterministic models.

The next topic is stochastic models, which again can be static or dynamic. Where does the randomness come from? Randomness essentially comes from two sources: one is observation noise; the other is that the model may have a random forcing function. Why do we consider random model forcing? Models are approximations of reality; a model may capture the physics quite well, but there are still some leftover processes that are not accounted for. These leftover, unaccounted terms are called the model errors. If we knew exactly what error we were committing, we would have taken it into account; it is not acceptable to know that we have committed an error and not correct it. So whatever the model, you may want to account for the unaccounted terms, and one way to simplify the incorporation of model errors is to assume the model errors are random. 
So the addition of a randomized version of the model errors makes the model stochastic, and considering the observation noise makes the observation random, or stochastic, as well. We are going to move into a new realm where stochasticity in the observations as well as stochasticity in the model errors can both be part of our analysis. In going from deterministic to stochastic, the principles of data assimilation have to depend on statistical and probabilistic ideas. I am assuming the readers are familiar with the fundamental concepts of probability theory. Under that assumption, I am going to build some of the basic tools from statistical estimation theory that one needs in order to perform data assimilation in static and dynamic stochastic models. That gives rise to the mathematical background that underlies statistical estimation theory.

Please remember from the first lecture: data assimilation can be thought of as regression, and data assimilation can be thought of as estimation. Estimation within the deterministic context is what we have finished talking about; estimation within the stochastic context is what we are moving into. So the first topic in module 6 is called principles of statistical estimation; it is the preparatory work we need to do to gain an understanding of the fundamental principles involved in statistical estimation theory. This is part of the mathematical requirement of the course. Please go back: we have already talked about finite-dimensional vector spaces, matrix theory, multivariate calculus, optimization theory, matrix methods and optimization algorithms; now I am going to talk about statistical estimation algorithms. You may see this course is heavy in mathematics. Why is it heavy in mathematics? Because that is what data assimilation is all about. 
If you do not understand the mathematics, we may not be able to get to the crux of the algorithms that underlie the data assimilation process itself. It is very easy to use an algorithm that somebody else developed. But if, in addition to using the algorithms that others develop, you want to venture into the new world of developing newer methods, you need to understand the models, the algorithms behind the models, the data, and the process of bringing the model to the data, which is essentially an engineering process in my view. This engineering process involves a lot of mathematical preliminaries, and that is why our approach is quite mathematical.

With that preamble, I would like to describe the fundamental principles of statistical estimation. Let me pose the estimation problem. Let X be an unknown vector to be estimated; X is called the state, or the true state. For example, I want to know the temperature in the city of Bangalore this afternoon at 3 o'clock; the true temperature is not known, and I would like to estimate it. Often X is not directly observable; the state of the system may or may not be directly observable, but a function of the state may be directly observable. Z is called the observation, and the observation is related to the true state by a function, Z = h(X); we have used this notation before. R^n is the model space, X is the model state, Z is the observation vector, and the observation space is R^m. H is the map from the model space to the observation space; H essentially refers to the measurement system. If the map is linear, Z = HX; if it is non-linear, Z = h(X). This is very familiar territory for us, because we have used it several times. The problem: knowing Z, I want to get a best estimate X hat of X. 
So Z is related to X through the map h, but X itself is not directly observable; a function of X is observable. Knowing Z, I would like to estimate X. Where does the stochasticity come into play? There is an additive observation noise, and we are going to assume this noise is mean-zero Gaussian with known covariance. I would also like to generalize what we have been doing so far: we have been thinking of X as a deterministic state, but the state itself could be random, and then the observation Z is random in two ways, because the unknown itself is random. If the unknown X is fixed but random, what do we mean by fixed but random? It is a random variable: its value can differ according to a particular distribution. We just do not know the distribution from which the values of X are drawn; that is the distribution mother nature chooses. So X, the unknown state, is random, the noise is random, and hence Z is random. Given a random observation, I would like to recover X; X hat is the estimate of the unknown X. Because X and V are both random, I am going to simplify matters and assume X and V are uncorrelated.

X is a random variable that represents natural variability. For example, this year in some parts of the world the temperature is warmer than usual in the winter, while in other parts it is cooler than normal, and these variations are related to a phenomenon called El Niño, which occurs with some rhythm over time. Many of the weather variables around the world have a natural variation associated with them; that is what the distribution of X is all about. So X could be the temperature in a specific region of the world that I want to estimate, and X is a random variable subject to a certain natural variability controlled by other events that happen around the world. 
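The observation model just described, Z = HX + V with additive zero-mean Gaussian noise V of known covariance R, can be sketched in a few lines of code. All sizes and numerical values below (n, m, the entries of x_true, H and R) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: state dimension n = 3, observation dimension m = 2.
n, m = 3, 2
x_true = np.array([15.0, 20.0, 25.0])   # the unknown true state
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 1.0]])         # linear measurement operator (m x n)
R = 0.5 * np.eye(m)                     # known observation error covariance

# One noisy observation: z = H x + v, with v ~ N(0, R)
v = rng.multivariate_normal(np.zeros(m), R)
z = H @ x_true + v

# Averaging many independent observations shows the noise is zero mean:
Z = H @ x_true + rng.multivariate_normal(np.zeros(m), R, size=20000)
print(Z.mean(axis=0))   # close to H @ x_true = [25., 45.]
```

Each call produces a different z, which is exactly the point: the observation is a random variable whose mean is HX and whose covariance is R.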
In addition to the underlying natural variability there is also an observational error. Given the observation Z, I would like to produce an estimate X hat of the realization of X. V is the noise; the noise is normally distributed, E[V] = 0, and the covariance of V is R. X is the unknown, and I have assumed X in general could be random.

If you look into the statistical literature on this stochastic estimation problem, there are two competing schools of thought: one is the Fisher school, the other is the Bayesian school. Within the Fisher school, X is assumed to be a deterministic constant, call it mu, which is unknown. Fisher developed a method called the maximum likelihood estimation technique to estimate mu. Fisher's technique can be called point estimation, because mu is a point in a vector space of dimension n, where n is the dimension of the state vector X. So I am interested in estimating an unknown vector mu, and mu is a deterministic constant; Fisher formulated this as a point estimation problem and developed the method of maximum likelihood estimation.

As opposed to Fisher's approach there is the Bayesian approach. Within the Bayesian approach, X is considered to be random; X is said to have a prior distribution, and the prior distribution captures the natural variability of X. If X denotes the temperature distribution around the world, that distribution is subject to climatic conditions, and the climatic conditions themselves vary in some rhythmic fashion; therefore we can predict, to some degree of accuracy, the natural variability in X, and this natural variability is captured as the prior distribution. The prior distribution reflects our belief as to what X is, before we see the actual observation Z. For example, in the current year we know we are under the grip of El Niño. 
Under El Niño we know what kind of temperature variations could take place. So even though we have a prediction based on the prior, which comes from climatic data, we also make an actual measurement Z. Z contains some new information; the prior on X contains old information. I would like to combine the prior and the new information to get what is called the posterior distribution. X is random in this case; mu is the expected value of X, the expectation taken with respect to the prior distribution. You can think of X being a constant and X being random as two complementary points of view, as they exist in statistical estimation theory.

Given the function H and the assumptions about X and V as we have made them, I would like to construct a function phi from R^m to R^n. What is R^m? Z belongs to R^m; please remember that Z contains information. What is R^n? X is in R^n. So I would like to transfer information from Z to X: Z is known, X is not known. This information transfer I represent through a function phi that maps from R^m to R^n, with phi(Z) = X hat. So phi is the process by which I analyze the observation, and the output is X hat. You can think of it like this: phi is a process into which you feed the observation, and out comes X hat, an estimate of the unknown. If phi generates X hat, the estimate of X based on Z, phi is called an estimator. An estimator is a map from the observation space into the model space. What is an example? Given the reflectivity from the radar, which is Z, I would like to find the amount of rain. The state of the system is rain, but the measurement is reflectivity, which is Z. Z is random, so phi(Z) is a function of a random variable, and hence X hat is random. 
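To make the idea of an estimator phi : R^m -> R^n concrete, here is a small sketch of one possible linear choice, the weighted least squares map phi(z) = (H^T R^-1 H)^-1 H^T R^-1 z. This is only one of many estimators one could pick, and all matrices and numbers here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical example: n = 2 unknowns, m = 5 observations.
n, m = 2, 5
x_true = np.array([1.0, -2.0])
H = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0,  1.0],
              [1.0, -1.0],
              [2.0,  1.0]])
R = 0.1 * np.eye(m)                    # observation noise covariance

# A random observation z = H x + v:
z = H @ x_true + rng.multivariate_normal(np.zeros(m), R)

# One linear estimator phi (weighted least squares):
Rinv = np.linalg.inv(R)
x_hat = np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv @ z)

print(x_hat)   # a random estimate; near x_true when the noise is small
```

Because z is random, x_hat = phi(z) is itself a random variable: running the snippet with different seeds gives different estimates, which is exactly why we will need to characterize the distribution of X hat.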
So the estimate X hat is a random variable, and the goal is to obtain a probabilistic characterization of the estimate. What does that involve? There are two things. First, we need a way to design phi so that it outputs X hat, an estimate of the unknown X. Second, once it puts out an estimate, since X hat is random we need to be able to talk about the probabilistic characteristics of X hat. A complete probabilistic characterization involves knowing the entire distribution of X hat; that is often difficult to get, and in lieu of that we will sometimes be content with knowing the mean and the covariance. The problem of characterizing the properties of X hat is the problem associated with statistical estimation. If phi is a linear function of Z, then X hat is called a linear estimate; otherwise it is non-linear. So an estimate can be either linear or non-linear.

What is the summary? I have an unknown X which could be random; there is a natural variability, captured by a prior distribution. I make observations, and the observations are corrupted by noise whose distribution I know. I want to combine the prior distribution with the distribution of the observations to get the probabilistic characterization of X hat. So I want to design an estimator, and the estimator could be either linear or non-linear.

In Fisher's approach, X is a fixed constant, and Z = HX + V, where V is Gaussian random noise. Therefore the probability density function of Z given X is again a normal distribution, with HX as the mean and R as the covariance. If X is deterministic, the randomness in Z comes precisely from the randomness in V, with HX added to V. 
V is zero mean with covariance R, and if you add a deterministic quantity to a random quantity it simply shifts the mean without changing the covariance; that is essentially the analysis we have given in this discussion. In this case there are two approaches to estimation: one is called maximum likelihood estimation, and there is also least squares estimation.

In the Bayesian approach, P(X) is called the prior distribution; it is the belief we have about the unknown, its natural variability. I would like you to think of it as the information we have on climate; that is the prior information. Then, when you start taking the actual observations, the observation has a conditional distribution: given a particular realization of X, the observation has a distribution P(Z | X), called the conditional distribution, and generally that is known; the prior is given. So I can now compute the joint distribution P(X, Z). By the simple rule of conditional probability, P(X, Z) = P(Z | X) P(X), or it can be written as P(X, Z) = P(X | Z) P(Z). By equating these two, we can see that P(X | Z) = P(Z | X) P(X) / P(Z). Here P(Z) is the probability density of the observation; it can be expressed as the integral of the joint density P(X, Z) with respect to X: if you integrate the joint density with respect to X, you get P(Z). This relation has come to be called the Bayes rule. What does the Bayes rule say? If you give me the prior, and you also tell me the conditional distribution of the observation, I can combine them, and the result is what is called the posterior. What does it mean? I am updating the prior. The prior is the belief before I came into the game; when I started playing the game I got an observation, the observation gives me some new information, and the new information helps me revise my old belief. The new belief is called the posterior. 
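Bayes' rule can be illustrated with a tiny discrete example. The prior and likelihood numbers below are invented, and the two-state "El Niño year vs normal year" framing is only a toy version of the climate example from the lecture:

```python
import numpy as np

# Prior belief about the state X ("El Nino year" vs "normal year"):
prior = np.array([0.3, 0.7])            # P(X)

# Conditional distribution of the observation Z given each state.
# Rows: state; columns: observed category ("warm", "cool").
likelihood = np.array([[0.8, 0.2],      # P(Z | X = El Nino)
                       [0.4, 0.6]])     # P(Z | X = normal)

z_observed = 0                          # we observed "warm"

# Bayes rule: P(X | Z) = P(Z | X) P(X) / P(Z), with P(Z) = sum_x P(Z|x) P(x)
joint = likelihood[:, z_observed] * prior
p_z = joint.sum()                       # the marginal density of the observation
posterior = joint / p_z

print(posterior)   # belief in "El Nino" rises from 0.30 to about 0.46
```

The posterior is exactly the "revised belief" of the text: the warm observation shifts probability mass toward the El Niño state, and the evidence term P(Z) is just the normalizer obtained by summing (integrating) the joint density over the states.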
So the old belief changes to a new belief by virtue of getting new information through observations. When the posterior P(X | Z) is computed within the Bayesian setup, we can use it in a variety of ways: we can compute the mean, we can compute the covariance, and we can analyze many different properties of X based on the posterior distribution. So these are the two competing approaches to statistical estimation: in one, the stochasticity arises purely from observation noise and the unknown is fixed; in the other, the unknown is also random and the noise further corrupts the observation. In the Bayesian case I have a prior and a conditional distribution, and the prior and the conditional distribution, when combined, give you the posterior. The posterior is the new belief, the revised belief, and the posterior is the one we should use in our decision process.

We have now talked about the need for creating estimates of the unknown, be it random or deterministic. There are several properties of an estimate one has to be concerned with: one is called unbiasedness; another is the relative efficiency of the estimate. To understand what an efficient estimate is, we also have to understand what is called consistency of the estimate, and we need to worry about what is called sufficiency of the estimate. These are the norms against which estimates are evaluated, and the more of these properties we can induce in an estimate, the better the estimate. For example, I would like to have an unbiased estimate, I would like to have a relatively efficient estimate, and I would also like to have a consistent estimate. 
So while data assimilation is in principle an estimation problem, within the context of stochastic estimation we need to be aware of the different properties the estimate will possess, and the properties the estimate possesses depend on the design of the estimator, the function phi. How do you design a function phi, the estimator, such that the estimate is unbiased, efficient, consistent, and so on? Statistical analysis has been concerned with the development of this theory for well over a century, and there is a very well established body of literature. So if you want to become an expert in the area of stochastic data assimilation problems, you need to be cognizant of very many fundamental results from the statistical literature. You can see how many different areas of applied mathematics are involved in trying to make this area of data assimilation work.

I am now going to define what is called unbiasedness. When do I say an estimate is unbiased? Unbiasedness relates to the location of the mean of the sampling distribution, so let me first talk about the notion of a sampling distribution. Suppose I have a coin, and I do not know the probability with which the coin falls head or tail. Let P be the probability with which the coin falls head; then 1 − P is the probability it falls tail. I want to estimate this P. What do we do? We conduct experiments. We conduct experiments in which we do n tosses, and then we count in how many of these n tosses the head turned up. Let n_h be the number of tosses where head turned up and n_t the number of tosses where tail turned up, with n_h + n_t = n. So what is the estimate of P? 
P hat is essentially given by n_h / n. Let us assume we have picked n = 1000. I first conduct an experiment and get the first estimate, P1 hat, given by the first set of 1000 tosses. Then I conduct a second set of 1000 tosses; n remains the same, but the number of heads in the first set of 1000 tosses and in the second set need not be the same. Let me call the second estimate P2 hat, and in general the L-th estimate PL hat. So what are we trying to do? I fix n at 1000 and conduct an experiment to estimate the probability that the chosen coin falls head, and I do L experiments; let us assume L is 100. In the first set of 1000 tosses I count the number of heads and get P1 hat; in the second set of 1000 tosses I count the number of heads and get P2 hat. You can very easily see that even though the total number of tosses remains the same, the number of heads in each block of 1000 tosses need not be the same; they will be slightly different. So P1 hat in general need not equal P2 hat, which in general need not equal PL hat.

If I now plot the values that P1 hat, P2 hat, ..., PL hat take, they will occupy different points on the real line between 0 and 1, with P somewhere in between; there will be 100 points. Now we can divide the interval where these values lie into different bins, and for each bin count the number of times an estimate falls in it, drawing a bar for each bin. The bar refers to the number of times the estimate has fallen into that bin. This is called the histogram, and the histogram essentially approximates the sampling distribution of the estimate P hat. Please remember: P is the constant unknown, P hat is the estimate of P, and P hat is random because it depends on the outcomes of the tosses. 
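This repeated coin tossing experiment is easy to simulate. The sketch below fixes the true P at 0.6 (an arbitrary choice, needed only so that we can simulate), generates L = 100 estimates from blocks of n = 1000 tosses each, and bins them into a histogram:

```python
import numpy as np

rng = np.random.default_rng(1)

p_true = 0.6     # the unknown P, fixed here only so we can simulate
n = 1000         # tosses per experiment
L = 100          # number of repeated experiments

# Each experiment: n tosses; the estimate is (number of heads) / n.
p_hats = rng.binomial(n, p_true, size=L) / n   # L estimates P1, ..., PL

# The histogram of these L values approximates the sampling distribution.
counts, edges = np.histogram(p_hats, bins=10)
print(counts.sum())    # 100: every one of the L estimates falls in some bin
print(p_hats.mean())   # the estimates cluster around p_true = 0.6
```

The 100 values of p_hats are all slightly different, exactly as described above, and the bar heights in `counts` are the histogram that approximates the sampling distribution of P hat.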
So this estimator gives you an estimate, and the estimate is a random variable. If I repeat the experiment L times I get L different values of the estimate; because it is random, they are distributed over a range. I can divide this range into bins and compute the number of times the value of P hat falls in each bin; that gives what is called the histogram, and the histogram is an approximation to the sampling distribution. What is the sampling distribution? It is the distribution of the estimate conditioned on the fact that the unknown is x; in this case the unknown x is P and x hat is P hat. So even though the probability P that the coin falls head is a fixed unknown, its estimate varies; the estimate has a distribution, and that is what is called the sampling distribution.

It stands to reason to expect that the expected value of this estimate, which is a random variable, satisfies: the conditional expectation of x hat given x is equal to x, when x is a constant. In the first case x is a constant; in the second case x is random, so nature picks x from the prior distribution, and we take the expectation with respect to the prior of the conditional expectation, which itself reflects the randomness arising from sampling. Therefore we would expect our estimator x hat to be such that the expectation with respect to the prior of the conditional expectation of x hat given x, which is E[x hat], is equal to E[x], where E[x] is the mean of the original random variable x with respect to the prior. So these are the two conditions for unbiasedness, and they are very natural conditions. 
If an estimate is not unbiased, there is a bias: the difference between the expected value of x hat and x is called the bias, or, in the random case, the difference between E[x hat] and E[x] is called the bias. We also know bias arises in other ways. For example, if you have a voltmeter and you have been using it for a long time, its properties change; if the actual voltage is 15 volts it may always underestimate, with an error of minus 2. That is also called bias: the bias in the reading of an instrument, which can be corrected by calibration; you calibrate the meter against a standard and correct the bias. But here the bias arises because of the way I estimate: bias is a property of the estimator. So what is the desirable attribute of an estimator? A desirable estimator is one where the output of the estimator, which is an estimate, is unbiased. Since we are considering two alternate cases, where x could be deterministic or random: in the case of deterministic x, the conditional expectation of x hat given x must be x; in the random case, the repeated expectation, the expectation with respect to the prior of the conditional expectation, must equal the expectation of x, that is, the mean of the prior. That is the condition; we should always seek to force unbiasedness in the estimates.

Again consider a coin tossing experiment. The events are head or tail; the probability of head is P, the probability of tail is Q = 1 − P. We are given the results of M independent tosses; I assumed M = 1000 in my illustration. E[z] = P and the variance of z is PQ. You may ask where these results come from: they come from the standard Bernoulli distribution. z takes the value 1 when the coin falls head and 0 when it falls tail, so it is 1 with probability P and 0 with probability Q. 
So in our notation, z_i = P + V_i, where V_i equals 1 − P with probability P and −P with probability Q. Therefore the expected value of V_i is 0, and based on this calculation the variance of V_i is PQ, as it should be; consequently the expected value of z_i is P and the variance of z_i is PQ. So we first calculate the properties of V and then the properties of z; these are simple calculations that come from fundamental analysis.

Now I am going to talk about estimation of the sample mean. Now that we have seen the two formulations of the estimation problem, Fisher's formulation and the Bayesian formulation, and we have also seen the definition of unbiasedness and what the measure of bias is all about, we are going to illustrate the concept of bias using a simple coin tossing experiment. Example 13.2.1 is taken from our book: Lakshmivarahan, Lewis and Dhall, Dynamic Data Assimilation, published in 2006. Consider a coin tossing experiment; the events are the coin falling head or tail, the probability of head is P, the probability of tail is Q = 1 − P, and we are assuming P is a constant, so we are following Fisher's framework. We are given the results of M independent tosses of the coin; in the previous illustration I used M = 1000. We would like to get an estimate of P. The observations are z = 1 when it falls head and z = 0 when it falls tail; the probability of head is P and of tail is Q; therefore the expected value of z is P and the variance of z is PQ. Anybody who has done basic probability theory and statistics should recognize that this is a simple example of a Bernoulli random variable, taking the two values head or tail. 
We are going to rewrite this in our notation. The observations are z_i = P + V_i: P is the unknown to be estimated, V_i is the noise, and the observation is the sum of the value of the unknown P and the noise V_i. In order to make sure this z matches the previous description, we construct a noise V_i that equals 1 − P with probability P and −P with probability Q. With this, we first compute the expectation and the variance of V_i: the expectation of V_i is 0 and the variance of V_i is PQ. Once I know the mean and the variance of V_i, since z_i is the sum of a constant and a random variable, adding P to V_i simply shifts the mean; so the mean of z_i is P and the variance of z_i is PQ. This is a fundamental result from probability theory: if you add a constant to a random variable, the distribution of the sum is the same as the distribution of the original random variable except that the mean is shifted, while the variance remains the same.

Now the estimation problem: I perform M experiments; M could be 1000. z_i is the result of the i-th toss; please remember z_i takes the value 0 or 1, so the sum of z_i for i running from 1 to M is the total number of times head came up. The total number of heads divided by M is an estimate of the unknown; the estimate is denoted P hat. This P hat is a random variable with an underlying distribution, called the sampling distribution. It can now be verified that, since the expectation of a sum is the sum of the expectations, E[P hat] equals the average of the expectations of the M random variables z_i, i running from 1 to M, and the expectation of each z_i is P, as was shown in the previous slide. 
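The unbiasedness of P hat, and the variance PQ/M derived next, can be checked with a quick simulation. The true p below is assumed only so that the experiment can be run:

```python
import numpy as np

rng = np.random.default_rng(2)

p, m = 0.3, 500          # assumed true probability; tosses per experiment
q = 1 - p
trials = 100_000         # repeat the whole m-toss experiment many times

# P hat = (1/m) * sum of z_i, where z_i = 1 for head, 0 for tail.
heads = rng.binomial(m, p, size=trials)
p_hat = heads / m

# Theory: E[P hat] = p (unbiased), and Var(P hat) = p*q/m.
print(p_hat.mean())             # close to p = 0.3
print(p_hat.var(), p * q / m)   # both close to 0.00042
```

Increasing m shrinks the variance PQ/m toward zero while the mean stays at p, which is the numerical face of an unbiased estimate whose sampling distribution concentrates around the truth.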
So the expected value of the estimate is equal to the true value, which means the estimate P hat is unbiased. Next, the variance of P hat. The variance of a sum equals the sum of the variances if the random variables are independent; that is a result from basic probability theory. Here we are concerned with the sum of the results of independent tosses; the tosses are independent, so there is no correlation between two successive results. Therefore the variance of P hat is the expected value of (P hat − P) squared, and by the properties of expectation this reduces to 1/M² times the sum of the variances of the individual terms. The variance of each individual term is PQ, and I am adding M of them, so we get M·PQ divided by M², which is PQ/M. Therefore the sampling distribution of P hat has mean P, the unknown to be estimated, and variance Var(P hat) = PQ/M, which goes to 0 in the limit as M grows. So P hat is an unbiased estimate: as you can readily see, the estimate has no bias, and that is an important attribute of this particular estimate.

You can think of Z as the set of all observations. Let me go back to the structure of the estimator: the observations are z_1, z_2, ..., z_M; that is the vector Z given to us, and P hat = phi(Z), the estimator applied to Z. In this case the function phi is essentially the average of the components z_i, i = 1 to M. So this estimator, given by the average, is an unbiased estimator. That is the fundamental concept of unbiasedness, and unbiasedness is one of the properties of this particular estimator.

Why unbiasedness? We are often interested in the mean square error of the estimate X hat of X. Let us go back to X being the unknown and X hat being its estimate. If X is a constant, the expected value of the square of the difference, E[(X hat − X)²], is called the mean square error, the error in the estimate. Now I can add and subtract E[X hat] inside this expression, grouping it as (X hat − E[X hat]) + (E[X hat] − X); using (a + b)² = a² + b² + 2ab and the fact that the expectation of a sum is the sum of the expectations, this gives three terms. I am assuming X is a constant to start with; X hat is a random variable, and its expected value under the sampling distribution, E[X hat], is also a constant, so E[X hat] − X is a constant. In the cross term, the second factor E[X hat] − X is a constant and can be taken out as a common factor, leaving E[X hat − E[X hat]]; distributing the expectation operator inside gives E[X hat] − E[X hat], which is 0. Therefore the cross term with the coefficient 2 in this expression becomes 0. 
Therefore the mean square error is simply the sum of the two remaining terms. The first term is the variance of x hat; that follows from the fundamental definition of variance: the variance is the expected value of the square of the random variable minus its expected value. The second term, as you can readily see from the definition of bias (the estimate is unbiased if E[x hat] = x), is the square of the bias: when x is a constant, E[x hat] - x is a constant, and the expected value of a constant is itself, so the second term equals the square of the bias. So now you can see the impact of bias on the mean square error: the mean square error of the estimate equals the variance of the estimate plus the square of the bias. Since the squared bias is always non-negative, the minimum value of the mean square error occurs when the bias is 0, and in that case the mean square error equals the variance of the estimate. Therefore minimizing the mean square error, when there is no bias, is the same as minimizing the variance, because the mean square error becomes the variance when the bias is 0. This is one of the reasons we always look for unbiased estimates. Minimum variance estimation is one class of estimation that we will deal with, and it is related to the derivation of the Kalman filter.
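The decomposition just derived can be checked numerically. The sketch below is illustrative only; the function name, noise level, and sample sizes are my own choices, not from the lecture. It estimates a constant from noisy observations many times and verifies that the empirical mean square error splits exactly into variance plus squared bias:

```python
import random

def mse_decomposition(estimator, x_true, n_trials=20000, m=10, seed=0):
    """Monte Carlo check of MSE = variance + bias^2 for an estimator of
    a constant x_true from m noisy observations z_i = x_true + v_i."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        z = [x_true + rng.gauss(0.0, 1.0) for _ in range(m)]
        estimates.append(estimator(z))
    mean_est = sum(estimates) / n_trials
    bias = mean_est - x_true
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
    mse = sum((e - x_true) ** 2 for e in estimates) / n_trials
    return mse, variance, bias

# the sample mean is unbiased, so its MSE should reduce to its variance
mse, variance, bias = mse_decomposition(lambda z: sum(z) / len(z), x_true=5.0)
assert abs(mse - (variance + bias ** 2)) < 1e-9   # exact in-sample identity
```

For the sample mean the bias term is essentially zero, so the mean square error and the variance coincide, which is the point made above.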
Minimizing the mean square error is another criterion arising in estimation theory. So the mean square error criterion is one thing and the minimum variance criterion is another; the two criteria coincide when the bias is 0, and then the two problems become one and the same. That is one more reason we are motivated to find estimates with zero bias, that is, unbiased estimates. So far we have talked about the role of bias, about unbiasedness and some of the reasons for seeking it, and we saw that when the bias is 0 the mean square error equals the variance. Now I am going to move to the next attribute of an estimate, called relative efficiency. Let x hat a and x hat b be two estimates of the unknown x. We say x hat a is more efficient than x hat b if the variance of x hat a is less than the variance of x hat b. Suppose somebody gives you two estimates of the same unknown; how do we compare them? First we compute the variances of the estimates. Please remember these estimates are random variables; a random variable has a distribution and hence a variance. So each estimate, being random, has an associated variance, and the one with the smaller variance is said to be more efficient than the other. The ratio of the variance of x hat b to the variance of x hat a is called the relative efficiency of the two estimates. In the coin tossing experiment, let the first estimate x hat a be p hat, where p hat = 1 over m times the sum of zi for i = 1 to m. Let the second estimate x hat b be a single observation zi itself. So I am picking two estimates: one is the average of the observations arising from m tosses, and the other is one observation by itself.
So you can see the difference is in the sample size used by the estimator. It is a simple exercise to show that the variance of p hat is pq over m, while the variance of a single zi is pq. So for every m greater than or equal to 2 the inequality var(p hat) < var(zi) holds, which means the mean is more efficient than any single observation. I think this is a very fundamental result. So if you are trying to estimate a quantity and you have many observations, take the mean of a large number of observations. As the number of observations becomes large, the central limit theorem tells us that even though the average is a random variable, its sampling distribution becomes more and more concentrated around the unknown p, approaching a delta distribution in the limit. That is a very well known result, and at least the essence of that result is borne out by this example. Therefore, whenever there are different possible choices in designing estimators, we will look for estimators that give estimates which are unbiased and more efficient. More efficient means the variance of the estimate is smaller, and if the variance of the estimate is smaller, the confidence in the estimate is greater; that is why relative efficiency matters. Now, if one estimate can be more efficient than another, it behooves us to ask: is there a most efficient estimate? If there is always a possibility of improving the variance of the estimate, there is a fundamental interest in asking whether a most efficient estimator exists. The answer is yes, and one of the theoretical ways in which one can establish this most efficient estimate is by resorting to a technique called maximum likelihood estimation.
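As a quick check of the pq/m versus pq comparison, here is a minimal Python sketch (the function name and parameter values are mine, chosen for illustration): it simulates many repetitions of m tosses and compares the empirical variance of the average with that of a single toss.

```python
import random

def toss_variances(p=0.3, m=25, n_trials=20000, seed=1):
    """Empirical variances of two estimators of p: the average of m
    tosses (theory: pq/m) versus a single toss (theory: pq)."""
    rng = random.Random(seed)
    means, singles = [], []
    for _ in range(n_trials):
        tosses = [1 if rng.random() < p else 0 for _ in range(m)]
        means.append(sum(tosses) / m)
        singles.append(tosses[0])

    def var(xs):
        c = sum(xs) / len(xs)
        return sum((x - c) ** 2 for x in xs) / len(xs)

    return var(means), var(singles)

v_mean, v_single = toss_variances()
# theory: pq/m = 0.3*0.7/25 = 0.0084 versus pq = 0.21
assert v_mean < v_single   # the average is the more efficient estimate
```

The ratio of the two empirical variances is the relative efficiency, and here it is close to m, as the formulas predict.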
The maximum likelihood estimation technique was introduced by Fisher. Fisher's setting: I have an unknown x which is a deterministic constant; I have observations; the observations are going to give me an estimate; and I would like an estimate which is unbiased and, beyond being relatively more efficient, is most efficient, meaning nothing else is more efficient than the one given by maximum likelihood estimation. That is the theory developed by Fisher about a century ago. A second question: could it happen that a biased estimate is more efficient than an unbiased estimate? Again the answer is yes. Look at this now: bias is one attribute of an estimate and efficiency is another; they are two different attributes. So when we design estimators to estimate the unknowns in a data simulation problem, we need to be aware of the following questions. What are the underlying properties of the estimate we generate? Is it unbiased? Does there exist another estimate which is more efficient? What does it take to generate the most efficient estimate? Is the most efficient estimate always a linear estimate, or can it be nonlinear? Is it possible that a biased estimate will be more efficient than an unbiased estimate? These are the classes of questions that statisticians have worked on, developing a beautiful theory. I am trying to provide a snapshot of some of the fundamental underpinnings of this theory because of the intrinsic relation between estimation theory and data simulation. Now we come to the next property, which is called consistency.
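To make the maximum likelihood idea concrete before moving on, here is a small sketch (the grid search and all names are my own illustrative choices, not from the lecture): for coin tossing, the log-likelihood of the observed tosses is maximized over p, and the maximizer coincides with the sample mean, which is the well-known closed-form MLE for Bernoulli data.

```python
import math
import random

def bernoulli_loglike(p, tosses):
    """Log-likelihood of i.i.d. Bernoulli tosses with head-probability p."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return heads * math.log(p) + tails * math.log(1.0 - p)

rng = random.Random(6)
tosses = [1 if rng.random() < 0.3 else 0 for _ in range(200)]

# crude grid search for the maximizer of the likelihood over p in (0, 1);
# for Bernoulli data the MLE is known in closed form: the sample mean
grid = [k / 1000 for k in range(1, 1000)]
p_mle = max(grid, key=lambda p: bernoulli_loglike(p, tosses))
p_bar = sum(tosses) / len(tosses)
assert abs(p_mle - p_bar) < 1e-6   # p_bar (a multiple of 0.005) lies on the grid
```

The strict concavity of the Bernoulli log-likelihood guarantees a unique maximizer, which is why the grid search lands exactly on the sample mean here.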
Consistency is another fundamental attribute. Please recall that x hat is a random variable, so it has a probability distribution, called the sampling distribution of x hat; we discussed how sampling distributions arise in the context of coin tossing in the previous slides. Let x be the unknown and x hat the estimate. Consider an epsilon strip around x: the interval from x - epsilon to x + epsilon. If you integrate the probability density of x hat from x - epsilon to x + epsilon, you get the probability mass under the curve inside the strip, namely the probability that the absolute value of the difference between x hat and x is at most epsilon. The probability of the hatched area inside the strip is 1 minus the probability outside the hatched area. So I would like to ask the following question: what is the probability that my estimate x hat lies outside the epsilon band around x? That probability is 1 minus the probability that it lies inside, and it tends to 0 only when the probability of the hatched area becomes closer to 1. The probability of the hatched area becoming closer to 1 means the probability distribution becomes more and more peaked: it was originally spread out, then it narrows, and narrows again. We are looking for a thin, narrow region around the unknown within which essentially the entire probability mass of the sampling distribution resides, so that outside this thin strip the probability mass goes to 0. That is exactly what the relation tells you as m tends to infinity, where m is the number of samples. As I increase the number of samples, my estimate x hat, as a random variable, finds itself in an epsilon strip around x with probability approaching 1. What does that mean? The probability of my estimate lying outside the epsilon band, that is, the probability that |x hat - x| is greater than epsilon, goes to 0. If the sampling distribution satisfies this property, the estimate is called consistent: as the number of samples increases, the estimate becomes closer and closer to the truth, and the probability of it deviating from the truth by more than epsilon goes to 0 continuously as the number of samples goes to infinity. In the probabilistic language this has a special connotation, a special name: convergence in probability. The estimate x hat, which is a function of the number of samples, satisfies the condition that the limit, as m goes to infinity, of the probability that |x hat - x| exceeds epsilon is 0. If my estimate satisfies this property it is called consistent, so consistent estimators are very natural choices. This is called convergence in probability of x hat to x. So consistency of an estimate is another
fundamental attribute. So we have seen three attributes: unbiasedness, relative efficiency together with the notion of a most efficient estimate, and consistency. What are we looking for? We are looking for a consistent, unbiased, most efficient estimate; that is the ultimate goal from a statistical perspective. Sufficiency is another criterion. We are not going to go too deeply into it, since sufficiency is a little more technical: it concerns the conditions under which a chosen random sample carries enough information to obtain the required estimate. In other words, the sample used for estimation has sufficient information to provide a good estimate; under what conditions such sufficiency can be guaranteed is a rather technical matter, and I am not going into the details. One of the most thorough discussions of all these attributes, unbiasedness, relative efficiency, maximum efficiency, consistency, sufficiency, is found in one of the classic books on statistical analysis, by Professor C. R. Rao, published in 1973: Linear Statistical Inference and Its Applications. It is a classic book, and in my view anybody who wants to do data simulation, especially in the statistical arena, should have a copy in their personal library; it is a Bible with respect to most of the fundamental statistical principles and their applications.

Now I am going to continue with an example. Let mu be an unknown but constant quantity, and let the observations be zi = mu + vi. In this case we assume vi has a normal distribution, and the vi are independent, identically distributed. What does that mean? Think of a box, a random number generator: every time I ask, it delivers a random number, and the sequence v1, v2, v3, and so on consists of independent draws, independent in the same sense that the results of successive coin tosses are independent. So what is IID? The first I refers to the samples being independent; the second I, identically distributed, refers to the fact that all the samples are drawn from the same distribution, which does not change from one draw to the next. IID, independent identically distributed, is one of the standard starting assumptions in estimation theory. This is very similar to the coin tossing experiment, but not quite the same: in the coin tossing experiment the outcomes are 1 or 0, head or tail, whereas here I observe the unknown mu through zi = mu + vi, and vi is not discrete; it has a continuous Gaussian distribution with mean 0 and variance sigma square.
So if I have a bunch of M observations, what is the estimator? In this case we call it z bar: z bar is the average of all the zi. That is the formula for the estimator, and the estimate is the value of z bar. By taking the expectation of z bar and using the fundamental principle that the expectation of a sum is the sum of the expectations, one can readily verify that E of z bar is mu; hence this estimate is unbiased, much as the average yields an unbiased estimate in the case of coin tossing. In this case we can also compute the variance of the estimate; please remember the estimate is a random variable. Invoking the standard definition of variance from basic probability theory, a little calculation reveals that this variance is sigma square over M. Please remember that in the coin tossing experiment it was pq over M; here it is sigma square over M, very similar. So you can readily see that the variance of z bar, the average of all the zi, is 1 over M times the variance of a single random variable, which is sigma square. As M goes to infinity, sigma square over M goes to 0; the variance of the estimate becomes closer and closer to 0; that means the sampling distribution becomes peaked, and a peaked sampling distribution around the truth is exactly consistency. So this estimate, the average of all the observations, is simultaneously unbiased and consistent.
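These two facts, unbiasedness and variance sigma square over M, can be illustrated with a short simulation (the function name and parameter values are my own, for illustration only):

```python
import random

def zbar_stats(mu=2.0, sigma=1.5, m=50, n_trials=20000, seed=2):
    """Sampling mean and variance of z_bar for z_i = mu + v_i with
    v_i ~ N(0, sigma^2); theory: E[z_bar] = mu, var(z_bar) = sigma^2/m."""
    rng = random.Random(seed)
    zbars = []
    for _ in range(n_trials):
        z = [mu + rng.gauss(0.0, sigma) for _ in range(m)]
        zbars.append(sum(z) / m)
    mean_zbar = sum(zbars) / n_trials
    var_zbar = sum((b - mean_zbar) ** 2 for b in zbars) / n_trials
    return mean_zbar, var_zbar

mean_zbar, var_zbar = zbar_stats()
assert abs(mean_zbar - 2.0) < 0.05            # unbiased: E[z_bar] = mu
assert abs(var_zbar - 1.5 ** 2 / 50) < 0.01   # var(z_bar) ~ sigma^2/m = 0.045
```

Increasing m in this sketch shrinks var_zbar toward 0, which is the consistency property described above.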
So this is the reason we say: if you could repeat the experiment an unbounded number of times, the estimate would asymptotically converge to the exact value, but you may not have the resources for an unbounded number of experiments. If you have a large sample, though, the average of that large sample is unbiased, reasonably efficient, and consistent; the efficiency relates to the number of samples you have. So you can see why the average is a good estimate. We have also seen, in our static inverse problem, that the average is the best least squares estimate. For example, you may remember the following experiment: suppose I want to estimate my weight; I make M measurements on M different scales, or M measurements on the same scale at different times of the day. I then have M measurements which all look different, and I would like the best estimate of my weight. Our least squares theory tells you that, given M independent observations of your weight, the average gives the best least squares estimate. The same thing holds here: the average of the observations gives an estimate which is simultaneously unbiased and whose sampling variance goes to 0, so it is asymptotically very efficient, it is consistent, and it is unbiased.

Example 3.22 continued. In the previous exercise we assumed mu to be unknown, and we estimated it. Now I am going to consider the other part of the story: let us pretend mu is known but I do not know sigma square. So let us go back to the problem: zi = mu + vi, where mu is a constant and vi is the noise with mean 0 and variance sigma square.
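The weight example can be verified in a few lines (the readings below are made-up illustrative numbers, not real data): minimizing the least squares criterion over a fine grid recovers the sample average.

```python
# least squares estimate of a constant from m noisy readings: minimize
# J(x) = sum (z_i - x)^2; setting dJ/dx = -2 * sum (z_i - x) = 0 gives
# x = mean(z)
z = [70.2, 69.8, 70.5, 70.1, 69.9]   # hypothetical weight readings (kg)

x_grid = [69.0 + k / 1000 for k in range(2001)]
x_ls = min(x_grid, key=lambda x: sum((zi - x) ** 2 for zi in z))
assert abs(x_ls - sum(z) / len(z)) < 1e-3   # the minimizer is the average
```

The same conclusion follows analytically, since J(x) is a quadratic whose derivative vanishes exactly at the sample mean.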
So I can formulate several different estimation problems. Assuming I know sigma square, estimate mu; that is what we just finished. Now what are we going to do? We assume instead that I know mu but want to estimate sigma square. That means I know the observation is the unknown plus some additive noise, the additive noise has an inherent variance, and I do not know what that variance is; my goal is to estimate sigma square, the variance of the noise in the measurements, under the assumption that mu is known. There is one more version of the problem: neither mu nor sigma square is known. So you can see there are three kinds of problems: sigma square known and mu unknown, estimate mu; mu known and sigma square unknown, estimate sigma square; and neither known, estimate both simultaneously. This is a very classic example that every student of statistics goes through. The aim of this exercise is to acquaint ourselves with the fundamental properties of estimators and estimates, namely unbiasedness, relative efficiency, asymptotic efficiency, consistency, and so on.
So let us concoct an estimator for sigma square; call it sigma square hat. The zi are the observations and I know mu. From basic probability theory the variance must be the expected value of the square of zi minus mu. So what am I going to do? I am going to take the average of the squared differences between zi and mu: zi minus mu is the error, so sigma square hat is the average of the sum of the squared errors. You can see the least squares principle in the underpinnings here as well. But sigma square hat is a random variable, because the zi are random. For the expected value of the estimate: since the expectation of a sum is the sum of the expectations, it can be verified by applying that simple rule that the expectation of the estimate equals the true value; therefore this estimate is unbiased. I am not going to prove this here. One can also compute the variance of sigma square hat: it can be shown that the variance is 2 times sigma to the power 4 divided by m. As m goes to infinity this variance goes to 0, so the estimate is consistent. There are several good homework problems here. I would very strongly encourage you to use simple principles of basic probability and statistics to compute the variance of this estimator yourself: it is a random variable, it has a mean and a variance; please compute the variance and verify the result. I am hitting only the major conclusions; some of the derivations I am leaving as assignments, and I think they are worthwhile checks of whether you understand the fundamental principles involved in calculating these quantities, especially sample moments and the properties of sampling distributions.
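As a starting point for the suggested homework, here is a sketch (function name and parameters are mine, for illustration) that checks both claims empirically: E[sigma square hat] = sigma square and var(sigma square hat) = 2 sigma^4 / m.

```python
import random

def sigma2_hat_stats(mu=1.0, sigma=2.0, m=40, n_trials=40000, seed=3):
    """sigma2_hat = (1/m) * sum (z_i - mu)^2 with mu known.
    Theory: E[sigma2_hat] = sigma^2 and var(sigma2_hat) = 2*sigma^4/m."""
    rng = random.Random(seed)
    ests = []
    for _ in range(n_trials):
        z = [mu + rng.gauss(0.0, sigma) for _ in range(m)]
        ests.append(sum((zi - mu) ** 2 for zi in z) / m)
    mean_est = sum(ests) / n_trials
    var_est = sum((e - mean_est) ** 2 for e in ests) / n_trials
    return mean_est, var_est

mean_est, var_est = sigma2_hat_stats()
assert abs(mean_est - 4.0) < 0.05              # unbiased: E = sigma^2 = 4
assert abs(var_est - 2 * 2.0 ** 4 / 40) < 0.1  # ~ 2*sigma^4/m = 0.8
```

The analytical derivation of the 2 sigma^4 / m formula, which this simulation merely corroborates, is the assignment mentioned above.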
So, in the previous case we saw that if mu is known and sigma square is not, I can estimate sigma square with an estimator that is unbiased and consistent, much like the estimator for mu. Now we come to the harder case: I do not know mu, and I want to estimate both mu and sigma square. When mu was known, the estimator for sigma square was called sigma square hat. When mu is not known, z bar will be the estimator for mu, z bar being simply the average, and I am going to call S square the estimator for sigma square. So how do I estimate the variance now? If mu had been known I would have used mu, but mu is not known, so I use the sample mean z bar in its place: I compute the squared differences from z bar, sum them, and take the average. So S square is the estimate of the variance when the mean is not known, and z bar is the estimate of the mean, both computed from the given sample. I do not have any truth to rely on; I must rely on estimates everywhere. Now, from the basic definition of variance it can be verified that E of zi square is equal to sigma square plus mu square, because you already know zi = mu + vi.
From here we can also compute the expected value of z bar square, the square of the average: it can be verified that E of z bar square is sigma square over m plus mu square; call these results formula 4. It is a very simple calculation from basic probability and statistics; if you take a good course in probability theory and basic statistics you will do all these computations in detail. I am assuming many of you have taken courses of this type; if not, take this as motivation to learn some of the fundamental principles of estimation theory. So I have built all the basic ingredients: z bar, S square, E of zi square, E of z bar square, all given in 4. Now I ask: what is the mean of the estimate of the variance? Expanding S square and using the expressions in 4, E of S square equals (sigma square plus mu square) minus (sigma square over m plus mu square), which simplifies to sigma square minus sigma square over m, that is, (m - 1)/m times sigma square. It will take about 5 to 10 minutes to derive this, but I would like you to go over the details. Look at this now: the actual expected value ought to be sigma square. Therefore S square is a biased estimate, and the bias, E of S square minus sigma square, is minus sigma square over m. I can also compute the variance of S square: it is given by 2 times (m - 1) divided by m square, times sigma to the power 4. And please remember, minus sigma square over m is the bias.
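The bias of minus sigma square over m can be seen numerically (an illustrative sketch; names and parameter values are my own):

```python
import random

def s2_mean(mu=1.0, sigma=2.0, m=10, n_trials=40000, seed=4):
    """Average of S^2 = (1/m) * sum (z_i - z_bar)^2 over many samples.
    Theory: E[S^2] = ((m - 1)/m) * sigma^2, i.e. bias = -sigma^2/m."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        z = [mu + rng.gauss(0.0, sigma) for _ in range(m)]
        zbar = sum(z) / m
        total += sum((zi - zbar) ** 2 for zi in z) / m
    return total / n_trials

mean_s2 = s2_mean()
# with sigma^2 = 4 and m = 10: E[S^2] ~ 3.6, i.e. a bias of about -0.4
assert abs(mean_s2 - 3.6) < 0.05
```

Replacing the divisor m by m - 1 in the sketch removes the bias, which is why the so-called sample variance is conventionally defined with m - 1.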
So now you can really see: I have an estimate that makes sense, but it is a biased estimate. So far we had seen only unbiased estimates; for the first time we are seeing a very natural estimate that turns out to be biased, with its variance given by the expression above. Now, if you let m go to infinity, the bias tends to 0, and the variance also goes to 0 as m goes to infinity. So what does this mean? The estimate is asymptotically unbiased, though for any finite sample it is biased, and asymptotically it is also consistent. So unbiasedness and consistency must be judged with respect to finite samples versus infinite samples: what happens asymptotically may not happen for a finite sample. In statistics there are always two types of theory, finite sample statistics and asymptotic analysis, and the asymptotic analysis is generally easier than the finite sample statistics. We often derive conclusions about finite sample statistics by looking at their asymptotic analogues, to judge the impact of not having an infinite number of samples, and that is very clearly borne out by this example. Here again I have two kinds of estimates of the variance. The variance of sigma square hat, from the previous page, is 2 sigma to the power 4 over m; the variance of S square is 2(m - 1) sigma to the power 4 over m square. The first estimate assumes mu is known; the second assumes mu is not known. We already know that when mu is not known, S square is biased, and when mu is known, sigma square hat is unbiased; and their variances are as given. This is something extraordinary.
The unbiased estimate has a larger variance than the biased estimate. That is a nice, interesting property. So, sigma square hat is the unbiased estimate and S square is the biased estimate, and the biased estimate is more efficient than the unbiased estimate when it comes to the question of variance. So here you have to see: the choice of estimator, the given conditions under which you design it, what is known and what is unknown; all these things matter in the design of the estimator which spits out the value of the estimate. The properties of the estimate are very much related to what is known, what is not known, and how the estimator is designed, and we have to examine the properties of the estimate along many different dimensions: bias, efficiency, relative efficiency, consistency. Relative efficiency tells me which of two estimates is more efficient. So we simply cannot say that unbiased estimates are the only thing of interest. We have already seen that if an estimate is unbiased, the mean square error equals the variance, so minimizing the mean square error equals minimizing the variance; that is an advantage of an unbiased estimate. But if you are interested in the overall efficiency of the estimation, you cannot rule out the possibility of introducing a small bias in the estimate to gain more efficiency. So it all depends on what your ultimate goal is when you estimate. With this we come to the end of the first discussion on the design and properties of statistical estimates.
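Finally, a sketch comparing the two estimators on the same samples (names and parameters are my own choices); the biased S square should exhibit the smaller spread, matching 2(m-1) sigma^4/m^2 < 2 sigma^4/m:

```python
import random

def compare_variance_estimators(mu=0.0, sigma=1.0, m=8,
                                n_trials=50000, seed=5):
    """Empirical variances of sigma2_hat (mu known, unbiased) and
    S^2 (mu replaced by z_bar, biased), computed on the same samples."""
    rng = random.Random(seed)
    known, unknown = [], []
    for _ in range(n_trials):
        z = [mu + rng.gauss(0.0, sigma) for _ in range(m)]
        zbar = sum(z) / m
        known.append(sum((zi - mu) ** 2 for zi in z) / m)
        unknown.append(sum((zi - zbar) ** 2 for zi in z) / m)

    def var(xs):
        c = sum(xs) / len(xs)
        return sum((x - c) ** 2 for x in xs) / len(xs)

    return var(known), var(unknown)

v_unbiased, v_biased = compare_variance_estimators()
# theory: 2*sigma^4/m = 0.25 versus 2*(m-1)*sigma^4/m^2 = 0.21875
assert v_biased < v_unbiased   # the biased estimate is more efficient
```

This is exactly the trade-off discussed above: a small bias bought a smaller variance.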
I would like to ask the reader to verify all the variance expressions I have given, and I very strongly encourage you to derive them from the basic probability theory and statistics you may have had. The next slide provides a couple of very good references; these are some of my favourites. Melsa and Cohn (1978), Decision and Estimation Theory, is a small book published by McGraw-Hill; it is an excellent book, largely tailored to an engineering audience, especially electrical engineers, in the context of communication theory, estimation, and so on. The book by Sage and Melsa, Estimation Theory with Applications to Communications and Control, is another wonderful book, again tailored to electrical and communication engineers. Coming from an engineering background, I particularly like these two; and of course the book by C. R. Rao is the ultimate Bible when it comes to questions of statistical principles and techniques. With this we conclude our discussion of the properties of estimates. Thank you.