Welcome back to the course on dealing with materials data. We are presently going through sessions on the estimation problem for parametric distributions, so let us quickly recall what we have studied. There are two kinds of parametric estimation. One is point estimation, for which we talked about maximum likelihood estimation and the method of moments estimation. We also showed a weakness of the method of moments: in the particular case of the uniform distribution, the method of moments estimate may not give you a sensible answer, as it can deviate from the basic understanding of the problem. So one has to be careful while working with the method of moments. We also talked about interval estimation, and then about how to evaluate a point estimator; there we discussed the mean squared error of an estimator and what is called an unbiased estimator. Today we would like to learn about Bayesian estimation. Before we do that, we will revisit conditional probability and Bayes theorem, understand mutually exclusive and collectively exhaustive sets, and once again visit the total probability theorem; this is only to recall what we have already done in previous sessions. Then we will introduce the Bayes estimate, talk about the three basic elements of Bayes estimation, and explain what Bayes estimation itself is.
So let us start. Conditional probability and Bayes theorem are related; first let us talk about conditional probability. The probability of occurrence of an event A, given that an event B has already occurred, is given by P(A | B) = P(A ∩ B) / P(B). What does this really say? Suppose you have a sample space S, and in this sample space you have an event A and an event B. The formula says that the probability of A given B is the probability of the region A ∩ B, taken with reference only to the complete region B. What does this mean? Generally, when we find the probability of A, it is with reference to the whole sample space S. Here we are saying: look at the probability of occurrence of A as if your sample space were restricted to B. So you have to look at the occurrence of event A relative to the space B, and that is why the probability is divided by P(B). If you take P(B) to the other side, the formula says that P(A ∩ B) = P(A | B) P(B); as we said in the past, this can be read as the probability of A given B multiplied by the probability of B. Similarly, we can write P(B | A) = P(A ∩ B) / P(A), because now the reference set is A and we are looking at the occurrence of B given that A has occurred.
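The conditional probability formula can be checked with a tiny sketch. The die-roll events below are our own illustrative choice, not from the lecture; the course exercises use R, but the arithmetic is the same in any language.

```python
# Sketch: P(A | B) = P(A ∩ B) / P(B) for a fair die (hypothetical example).
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of one fair die roll
A = {2, 4, 6}            # event A: the outcome is even
B = {4, 5, 6}            # event B: the outcome is at least 4

P = lambda E: Fraction(len(E), len(S))   # equally likely outcomes

p_A_given_B = P(A & B) / P(B)   # restrict the sample space to B
print(p_A_given_B)              # 2/3
```

Note that P(A) alone is 1/2, but conditioning on B shifts it to 2/3, because within B two of the three outcomes are even.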
So instead of looking relative to S, we look at the probability of occurrence of B relative to A, and therefore only the portion of B which occurs along with A is counted. That is how we get this formula. Now, you will agree that for any event A, P(A ∩ B) + P(A ∩ B') gives you the complete probability of A. What I am trying to say is: look again at the sample space S, the set A and the set B. What we are looking at is A ∩ B, and then the rest of A, shown in another colour, is the intersection of A with the area outside B, that is, with the complement of B. It is this yellow area which has been shown here. These two areas together form mutually exclusive sets, and therefore P(A) = P(A ∩ B) + P(A ∩ B'). Let us try to generalize it. We say that there are events E1, E2, ..., En such that they are mutually exclusive, that is, Ei ∩ Ej is the null set for all i ≠ j, and the union of the Ei for i = 1 to n gives you the sample space. Such events are called mutually exclusive and collectively exhaustive. If we have this, then the two-set result, P(A) = P(A | B) P(B) + P(A | B') P(B'), can be generalized as shown here.
We can then write that the probability of any event A is P(A) = P(A | E1) P(E1) + P(A | E2) P(E2) + ... + P(A | En) P(En). This is known as the total probability rule or total probability theorem. So if we have mutually exclusive and collectively exhaustive sets or events, then the probability of any event A can be written as a summation of conditional probabilities, each multiplied by the probability of the event on which it has been conditioned. So now we come to Bayes theorem, and it puts things in a very simple way. From the formula P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A), it simply says that P(A | B) = P(B | A) P(A) / P(B). So if you put P(B) in the denominator you get this formula. Let us try to understand what this formula says, because it has a great bearing on Bayesian inference, which is very important in today's time. Once again let us understand it through a Venn diagram. There is a sample space S, an event A and an event B. Why am I looking at A ∩ B? Because this whole Bayes theorem has come out of the fact that P(A ∩ B) can be written as two types of conditional probabilities. So we are looking at the event A ∩ B in two different ways: on one hand, the occurrence of A relative to B is what we call P(A | B); on the other hand, the occurrence of B relative to A is what gives P(B | A). These are the two things being connected here.
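The total probability rule above can be sketched in a few lines; the partition probabilities and conditional probabilities below are made-up values for illustration only.

```python
# Sketch of the total probability rule with a hypothetical 3-event partition:
# P(A) = sum over i of P(A | Ei) * P(Ei), where the Ei are mutually exclusive
# and collectively exhaustive (their probabilities sum to 1).
p_E = [0.3, 0.5, 0.2]            # P(E1), P(E2), P(E3): illustrative values
p_A_given_E = [0.9, 0.4, 0.1]    # P(A | Ei): illustrative values

assert abs(sum(p_E) - 1.0) < 1e-12   # the partition must be exhaustive

p_A = sum(pa * pe for pa, pe in zip(p_A_given_E, p_E))
print(round(p_A, 2))   # 0.49
```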
Note that these two things are connected here: whether you are looking at the occurrence of A relative to B or the occurrence of B relative to A, they are proportional to each other, weighted by a factor. You can say that P(A | B) is proportional to P(B | A); so if you know one, you should be able to find the other. That is what is being indicated here. First, let us understand how this becomes useful in application by taking a very simple problem: an ice cream shop. There is an ice cream shop that sells only three flavours of ice cream, for our convenience chocolate, vanilla and strawberry. Since it is near a college, students, boys and girls, tend to visit most often. We have not taken care of the faculty, who may also go and eat ice cream there; we are talking about the students only. The shopkeeper has noticed that 55% of the girls prefer the chocolate flavour, 20% prefer vanilla and 25% prefer strawberry, while among the boys this distribution is slightly different: the boys buy the chocolate flavour 50% of the time, vanilla 30% of the time and strawberry 20% of the time. We are also given that girl students and boy students frequent the ice cream shop equally. Now we are asked two questions. First, find the probability that a strawberry flavour ice cream will be ordered by a random student visiting the shop; that is, for any student who comes, what is the probability that he or she orders strawberry? Second, suppose a student has ordered a strawberry ice cream; what are the chances that that student is a girl? Let us try to solve this problem; here we have simplified what we want.
We said that boy students and girl students frequent the shop equally. It means the probability of a boy visiting and the probability of a girl visiting are equal, and therefore each is half. As I said, we are not considering the case that the faculty might also visit the shop; we restrict ourselves to the students. Now C is for chocolate ice cream, V is for vanilla ice cream and S is for strawberry ice cream. If a girl visits, the chance that she buys a chocolate ice cream is 55%; it means the probability of a chocolate ice cream, given that a girl has visited, is 0.55, and similarly the same probability is 0.5 for a boy. These are the conditional probabilities given to us. So let us see what the question is. The first question asks: if anybody comes to buy an ice cream, what is the probability that the ice cream bought will be strawberry, that is, what is P(S)? Second, suppose a student comes and buys a strawberry ice cream; given that a strawberry ice cream has been bought, what is the chance that the student is a girl? This is what we are actually looking for. How shall we solve this? You must have done this sometime in your other courses, or at high school level. Here we use the total probability rule. We say that the probability that any person comes and buys a strawberry ice cream is the probability that strawberry is bought given that a girl visits, multiplied by the probability of a girl visiting, plus the probability that strawberry is bought given that a boy visits, multiplied by the probability of a boy visiting.
So suppose a girl visits the shop and buys a strawberry ice cream: the probability that a girl visits the ice cream shop is P(G), and given that, she buys a strawberry ice cream with probability P(S | G); these two are multiplied. Similarly for the boys, and therefore the answer here is 0.25 × 0.5 + 0.2 × 0.5, which simplifies to 0.225. Now let us try to answer the second question. A student has come and bought a strawberry ice cream; what is the probability that, given a strawberry ice cream has been bought, the student is a girl? Note that we do not have this probability directly; we have to look for it, but we already have the reverse conditional probability. So if you use Bayes theorem, it says that the probability that the person is a girl, given that a strawberry ice cream has been bought, is the probability that a strawberry ice cream is bought given that a girl has visited, multiplied by the probability of a girl visiting the ice cream shop, divided by the probability of a strawberry ice cream itself. This is where we apply Bayes rule or Bayes theorem, and the answer is 0.25 × 0.5 / 0.225, which is roughly 0.56. So more than half the time you will find that the buyer is a girl, and that makes sense: girls buy strawberry at a higher rate, 0.25 versus 0.2, and accordingly more than half the strawberry buyers are girls. So this is how Bayes theorem gets applied. Now let us come to the estimation part. So far we have recalled what we had done in the past. Let us look at parametric estimation in a different way. In parametric estimation we have data, right? What is the data?
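The whole ice cream calculation can be verified in a few lines. The numbers are exactly those from the lecture; the sketch is in Python, though the course exercises use R, and the arithmetic is identical.

```python
# The ice cream problem from the lecture, coded directly.
p_G, p_B = 0.5, 0.5        # girls and boys visit the shop equally
p_S_given_G = 0.25         # P(strawberry | girl)
p_S_given_B = 0.20         # P(strawberry | boy)

# Question 1, total probability rule: P(S)
p_S = p_S_given_G * p_G + p_S_given_B * p_B

# Question 2, Bayes theorem: P(girl | strawberry)
p_G_given_S = p_S_given_G * p_G / p_S

print(round(p_S, 3))          # 0.225
print(round(p_G_given_S, 3))  # 0.556
```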
We say that X1, X2, ..., Xn is an iid, that is, independent and identically distributed, sample from a density f(x; theta), where what we really mean is that small x1, small x2, ..., small xn is the data observed. So this is the sample, and this is the data we have observed. Then there is this parameter theta, which we say is unknown. But in parametric estimation we also assume, as we said, that this parameter theta is a value that the population takes. We do not know exactly what the population looks like; we are trying to understand the population through the small sample of data we have collected. So the data observed, x1, x2, ..., xn, is our small data set, and from it we have to find the value of theta. In the parametric estimation we did so far, we assumed that theta is some unknown but fixed value. We say: well, the population is unknown, but it has, for example, some mean value, and it is a fixed mean value for the population. Now we want to ask ourselves a question: is this a good assumption? That is our next point. Because this is not the first time that you have collected data to understand this theta. You had background information on theta; in other words, sometime in the past you had collected data and estimated theta. So when you have already done this exercise once, why not use that information and improve upon your new estimate? Again, I repeat: I have some background information on theta. How do I get it? Very simple: I have already done this exercise in the past, and from that past exercise I have some information as to what value theta can generally take.
That past result is also an estimate, and now the question is: having already done this exercise once and having got this background information, why should I simply repeat the process to get another estimate of theta? Why can't I use the background information that I already have? Let us take an example to understand it better. Suppose I have data coming from a normal population with mean mu and variance sigma0 squared; mu is unknown and sigma0 squared is known, only to keep the example simple. In the past I have done the analysis and found an estimate of mu, which I will denote mu hat. I have done it repeatedly, and you also know that x bar is a good estimator of mu, but x bar itself has its own distribution. Similarly for this mu hat; I am not saying whether you found the maximum likelihood estimator or some other estimator, but suppose you found some estimator. And let us assume that past information says that mu hat itself is distributed normally with some mean mu0 and some variance sigma1 squared, where mu0 and sigma1 squared are both known. So when I did this exercise in the past, I observed that my estimator of the mean also follows a normal distribution with a certain fixed value mu0 and a variance sigma1 squared. Then the question is: how do we use this background information together with the estimate from fresh data? Please note what is being said: you have done this exercise in the past and found that your estimator itself follows a distribution with a known mean and a known standard deviation, which in this case we assume to be normal, and now you have fresh data. How do we utilize this background information in order to get a better estimator?
It is certainly good to use the background information: if you already know something and you use it, it takes you closer to the reality. So this is the question, and the answer lies in Bayes theorem. Bayes theorem, as I have stated, says P(A | B) = P(B | A) P(A) / P(B). Let us translate it in terms of the parameter and the data. In terms of density functions, instead of probabilities, I can write it as follows. Before we go further, note that I can write: the probability of the unknown parameter theta, given that I have got fresh data, equals the probability of the data given the parameter theta, multiplied by the probability of theta, divided by the probability of the data alone, without theta. In terms of density functions: the density of theta given x1, x2, ..., xn equals the density of x1, x2, ..., xn given theta, multiplied by the density of theta, divided by the density of x1, x2, ..., xn alone. Now, these are my notations. To distinguish f as a probability density function of one variable from f as a density of another, I had written subscripts everywhere; let us remove them, understanding that all these f's are different and take the form of their argument. In simplified form, the same thing can be written in this fashion, and it has three parts, which is what we want to talk about. There are three parts, or three elements, to Bayes estimation. What we have is a prior distribution, that is, the background information, which is shown here in blue.
You see, the prior information is in blue. The data is given in green; the data term already involves some assumption on theta, so although I have not shown this part in green, it too depends on the data. And then there is a posterior distribution, obtained by incorporating the prior with the data, and that distribution is given in red. Now what is this f in the denominator, the pure density of x1, x2, ..., xn without theta? Well, it is the numerator integrated out with respect to theta. Imagine now that theta also has a density function; it has a distribution, it is a random variable. I think this can be a confusing idea, so I want to clarify it. Our data is random data, and it has an unknown parameter that needs to be estimated. So far, in parametric estimation, we have been taking the parameter only as a fixed unknown value. But the question arises: if, for the same population, I have collected data in the past and have already estimated this parameter theta once, or maybe many times, then why do I not make use of that information on theta when taking a new, fresh estimate, since in any case theta remains unknown? So instead of imagining theta as a fixed value, here we imagine that theta is also a random variable: it varies depending on the sample, and different samples give different values. And suppose we know that this theta follows a certain distribution which has no unknowns in it; that distribution is completely known. In that case this integral is possible. If you look at this, the first part is the likelihood function; if you remember, this is the function which we maximized as a function of theta, with x1, x2, ..., xn as the observed data. Now we keep that likelihood function, we know the distribution of theta, and when we integrate theta out we get the pure density f of x.
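The "integrate out theta" step can be sketched numerically for the normal example above. Everything here is illustrative: the data values, mu0, sigma0 and sigma1 are made up, and a simple grid sum stands in for the integral.

```python
# Sketch: the marginal density f(x1..xn) is likelihood(theta) * prior(theta)
# integrated over theta. Likelihood: normal with known sd sigma0; prior on the
# mean theta: normal(mu0, sigma1). All numeric values are hypothetical.
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

data = [4.8, 5.1, 5.3]      # hypothetical fresh observations
sigma0 = 1.0                # known data standard deviation
mu0, sigma1 = 5.0, 0.5      # known prior mean and sd for theta

def likelihood(theta):
    out = 1.0
    for x in data:
        out *= normal_pdf(x, theta, sigma0)
    return out

# f(x) = integral of likelihood(theta) * prior(theta) d(theta), via a grid sum
step = 0.001
thetas = [mu0 - 4 + i * step for i in range(8001)]
f_x = sum(likelihood(t) * normal_pdf(t, mu0, sigma1) * step for t in thetas)
print(f_x)   # a plain number: theta has been integrated out
```

Dividing the numerator by this f_x normalizes the posterior so that it integrates to one.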
These are the three important elements of Bayes estimation: there is a prior distribution, there is the data, and then there is a posterior distribution derived in this manner. And when this happens, you have something called a Bayes estimator. The Bayesian estimator of theta is the expected value of the posterior distribution of the parameter theta. Remember, this is the posterior distribution of theta given the data x1, x2, ..., xn. I have written these as small letters because they are the actual observed data we are calculating with; it is only theta that works as a random variable here. So with this, what we get is: once you have the posterior distribution of the parameter theta, the expected value of theta under the posterior distribution is called the Bayesian estimator of theta. Generally it is seen that it improves upon the estimate of theta from the previous estimators we have already used. This is what is called the Bayes estimator. In many of your R lessons you are going to work out certain Bayesian estimation and Bayesian hypothesis testing procedures using the Bayes estimator. And the most important thing to realize here is that, since the prior is derived from past experience, it does not contain any unknown parameters. With this let us summarize. We revisited conditional probability and Bayes theorem, and the interpretation of Bayes theorem as a technique to get the probability of B given A from the probability of A given B. We had a relook at Bayes theorem in terms of probability density functions, and then we introduced the Bayes estimator as the expected value of the posterior density obtained through Bayes theorem. What is the posterior density? Well, the posterior density is obtained by using Bayes theorem on the prior information and the data.
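For the normal example in this lecture, the posterior mean has a well-known closed form: with a normal likelihood (known sigma0) and a normal prior on the mean (known mu0, sigma1), the posterior is again normal, and the Bayes estimator is a precision-weighted average of the prior mean and the sample mean. The data values below are hypothetical; the formula itself is the standard normal-normal conjugate result, not something derived in this transcript.

```python
# Sketch of the Bayes estimator in the lecture's normal setting:
# data ~ Normal(mu, sigma0^2), sigma0 known; prior: mu ~ Normal(mu0, sigma1^2).
# Posterior mean = (mu0/sigma1^2 + n*xbar/sigma0^2) / (1/sigma1^2 + n/sigma0^2).
data = [4.8, 5.1, 5.3]      # hypothetical fresh observations
sigma0 = 1.0                # known data sd
mu0, sigma1 = 5.0, 0.5      # known prior mean and sd (the past experience)

n = len(data)
xbar = sum(data) / n
prior_prec = 1 / sigma1 ** 2    # precision = 1 / variance
data_prec = n / sigma0 ** 2

mu_bayes = (prior_prec * mu0 + data_prec * xbar) / (prior_prec + data_prec)
print(round(mu_bayes, 4))   # 5.0286, between mu0 = 5.0 and xbar ≈ 5.0667
```

Notice how the estimate is pulled from the sample mean toward the prior mean; as n grows, the data precision dominates and the Bayes estimator approaches x bar.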
You get the posterior density, and the parameter theta estimated from it is called the Bayesian estimator of theta. Thank you.