Welcome to the course on Dealing with Materials Data. In the previous session we started working with regression analysis, and we are going to take it further here. First, let us review what we did earlier. We considered the case of a simple linear regression relationship between a response variable y and an independent variable x, which we wrote as yi = alpha + beta xi + epsilon_i, where epsilon_i is a random error assumed to follow a normal distribution with mean 0 and variance sigma square. This is a very natural assumption to make about random errors. Please note a slight change of notation from the previous session: there we wrote yi = beta0 + beta1 xi; for simplicity I am now writing it as alpha + beta xi. So, in this case, alpha, beta and sigma are unknown parameters that need to be estimated, and we had derived those estimates using some freshly introduced notation. In the present session, we would like to draw some useful inferences about the parameters: first the regression coefficients, then the mean response of the regression, and finally the prediction of a future y value. We will do this by testing hypotheses concerning the above quantities, and simultaneously we will arrive at interval estimates of these parameters at some confidence level. First we start with inference about beta. You see, we assumed that yi and xi have a linear relationship. It is important to test whether any such relationship exists at all, and testing that hypothesis amounts to testing whether the parameter beta equals 0, because if beta is 0, yi is simply alpha plus a random error. The process is then completely random, with no systematic change in it. So, we want to test the null hypothesis H0: beta = 0 versus the alternative hypothesis H1: beta not equal to 0.
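As a quick recap of the previous session, the least-squares estimates A and B can be computed directly from the Sxx and Sxy notation. Here is a minimal Python sketch; the 5-point data set is made up purely for illustration.

```python
# Hypothetical 5-point data set, lying roughly on y = 2x (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Notation from the previous session.
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Sxy / Sxx          # B, the estimate of the slope beta
a = y_bar - b * x_bar  # A, the estimate of the intercept alpha

print(b, a)  # slope 1.95, intercept 0.15 for this data
```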
In the previous session, we saw that the estimator B of beta is normally distributed with mean beta and variance sigma square / Sxx. It is a good idea to recall that Sxx is defined as the summation of (xi - x bar) squared. Hence, we can easily say that (B - beta) divided by the square root of sigma square / Sxx is distributed as standard normal N(0, 1). Now, sigma square is also unknown, and it can be estimated from the sum of squares of the residuals, SSR = (Sxx Syy - Sxy squared) / Sxx, because SSR / sigma square is distributed as a chi-square distribution with n - 2 degrees of freedom. Please recall from last time why it is n - 2: we have n data points and we have estimated 2 parameters, alpha and beta, so the remaining degrees of freedom are n - 2. Just as with Sxx, we define Syy as the summation of (yi - y bar) squared, and Sxy as the summation of (xi - x bar)(yi - y bar). If we now replace sigma square in the earlier formula by its estimate SSR / (n - 2), the resulting statistic is distributed as a t distribution with n - 2 degrees of freedom. This is because SSR and B are two independently distributed random variables: SSR / sigma square follows a chi-square distribution and B a normal distribution, and therefore the ratio, with the appropriate multiplying constants, is distributed as a t distribution with n - 2 degrees of freedom.
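Under H0: beta = 0, the statistic above reduces to B times the square root of (n - 2) Sxx / SSR. A minimal sketch, reusing the same made-up data set from the recap (all numbers are illustrative):

```python
import math

# Same hypothetical 5-point data set (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Sxy / Sxx
SSR = (Sxx * Syy - Sxy ** 2) / Sxx           # sum of squares of residuals
t_stat = b * math.sqrt((n - 2) * Sxx / SSR)  # t statistic under H0: beta = 0

print(t_stat)  # 39.0 for this data: far out in the tail of t with 3 d.o.f.
```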
And therefore, we can reject the null hypothesis H0: beta = 0 when the absolute value of this statistic is greater than the critical value of t at n - 2 degrees of freedom with gamma as the level of significance. Please note that you have to take gamma / 2 because the t distribution is symmetric: the rejection, or critical, region lies in both tails, and you would like the total rejection probability to be gamma, so each tail carries gamma / 2. Therefore the value you take is the one defined as t at gamma / 2 with n - 2 degrees of freedom, and this gives the rejection or critical region. So, you reject H0 if the square root of (n - 2) / SSR multiplied by Sxx, times the absolute value of B, is greater than the cutoff value t at gamma / 2 with n - 2 degrees of freedom; if the statistic is large, you reject the hypothesis at the gamma level of significance. This also gives rise to an interval estimator of beta, because the complement of the critical region is the acceptance region of the hypothesis, and the probability of the acceptance region is 1 - gamma. Therefore the 100 (1 - gamma) percent confidence interval for beta follows: if gamma is, say, 0.01, then the 99 percent confidence interval for beta can be given as B plus or minus t at gamma / 2, n - 2 times the square root of SSR / ((n - 2) Sxx). Let us move on in the same fashion with respect to the constant alpha.
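The interval estimator above can be sketched as follows, again with the made-up data set. In practice t at gamma / 2, n - 2 would come from a table (or a statistics library); here it is hard-coded for gamma = 0.05 and 3 degrees of freedom to keep the sketch dependency-free.

```python
import math

# Same hypothetical 5-point data set (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
SSR = (Sxx * Syy - Sxy ** 2) / Sxx

# Tabulated critical value t_{gamma/2, n-2} for gamma = 0.05, n - 2 = 3.
t_crit = 3.182

half_width = t_crit * math.sqrt(SSR / ((n - 2) * Sxx))
ci = (b - half_width, b + half_width)
print(ci)  # roughly (1.79, 2.11); the 95% interval excludes 0
```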
Now, when we want to test whether alpha is equal to 0 or not, what we are really testing is whether, given a linear relationship, the y-intercept of the line is 0, that is, whether the line passes through the origin or not. Here again, in the previous notation, the estimator of beta 0 was normally distributed with variance sigma square times (summation of xi squared) / (n Sxx); in the new notation, with beta 0 replaced by alpha, the estimator A is normally distributed with mean alpha and that same variance. Again, SSR / sigma square follows a chi-square distribution with n - 2 degrees of freedom, as we saw earlier, and therefore when you replace sigma square by its estimate SSR / (n - 2), we find that the resulting statistic follows a t distribution with n - 2 degrees of freedom. And therefore, once again, we can set up the procedure to test the hypothesis that alpha is equal to 0, and it also gives us a way to get an interval estimate of alpha at 100 (1 - gamma) percent confidence in the same manner. Now we come to the next item of interest. A number of times our interest lies in the response when the input takes some given value x0. Suppose you are conducting an experiment with some variation in temperature: temperature becomes your independent variable and you are observing some response variable y. Sometimes you are interested in finding out what happens at one particular temperature. Now, this can be looked at in two ways. The first way, which we consider here, is called the mean response. We call it the mean response because the natural estimate of alpha + beta x0 is A + B x0, as alpha is estimated by A and beta is estimated by B, and the expected value of A + B x0 is alpha + beta x0.
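The same recipe applies to the intercept: since the variance of A is sigma square times (summation of xi squared) / (n Sxx), the studentized statistic under H0: alpha = 0 is A divided by its estimated standard error. A sketch with the same made-up data:

```python
import math

# Same hypothetical 5-point data set (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar
SSR = (Sxx * Syy - Sxy ** 2) / Sxx
sigma2_hat = SSR / (n - 2)            # estimate of sigma square

sum_x2 = sum(xi ** 2 for xi in x)
se_a = math.sqrt(sigma2_hat * sum_x2 / (n * Sxx))
t_stat = a / se_a                     # t statistic under H0: alpha = 0

print(t_stat)  # about 0.90 here, well below t_{0.025,3} = 3.182, so we
               # cannot reject "the line passes through the origin"
```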
So, it is a natural estimator. Suppose we write B, which is estimated as Sxy / Sxx, as the summation of (xi - x bar) yi multiplied by c, where c is the constant 1 / Sxx, and A as y bar - B x bar. In that case A + B x0 is nothing but a weighted average of the yi values, and this is what I want to show you: it is a linear combination of the yi with constant coefficients, and its expected value is alpha + beta x0. The variance of this estimator then works out to be sigma square multiplied by (1/n + (x0 - x bar) squared / Sxx); you can calculate this directly. The 1/n term comes from the y bar part, please remember. I think it is time to recall that if x1, x2, ..., xn are given and you know that the variance of each xi is sigma square, then the variance of x bar is sigma square / n, and that is where the 1/n term here comes from. Since the yi are normally distributed, A + B x0 is also normally distributed, with mean alpha + beta x0 and the variance we just calculated. Again, this sigma square is an unknown quantity; we will replace it with the sum of squares of the residuals divided by n - 2. Why is that so?
Please recall once again that SSR / sigma square being chi-square with n - 2 degrees of freedom implies that the expected value of SSR / (n - 2) is sigma square, and therefore we replace sigma square here by this estimator. When you make that replacement, you again have a normal random variable divided by the square root of an independent chi-square random variable, and therefore the whole statistic follows a t distribution with n - 2 degrees of freedom. I guess n - 2 degrees of freedom is by now well understood. This distribution also gives us a way to test the hypothesis that alpha + beta x0 = 0 versus alpha + beta x0 not equal to 0, and the same test statistic gives us the interval estimator, at 100 (1 - gamma) percent confidence, of alpha + beta x0, that is, the mean response at x0. Now we move to the prediction of a future y. Here some confusion can exist, so I would like to spend some time clarifying it. Suppose we are interested in predicting y for a given x0. A good example would be weather prediction. Suppose I want to predict tomorrow morning's weather; then I have to know the weather conditions, the independent variable x0, for tomorrow morning, and I should also know, in general, what the average response was in the past when such conditions prevailed. So, what we estimated previously is the mean response when the conditions take the value x0, and what we are trying to do now is to predict what is exactly going to happen tomorrow. This is the difference between the mean response, which is alpha + beta x0, and the actual response, which is y(x0).
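Putting the mean-response pieces together, here is a sketch of the interval at a chosen x0 (here x0 = 4, again with the made-up data set and the tabulated critical value t_{0.025, 3} hard-coded):

```python
import math

# Same hypothetical data (illustrative only); mean-response interval at x0 = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar
SSR = (Sxx * Syy - Sxy ** 2) / Sxx
sigma2_hat = SSR / (n - 2)

x0 = 4.0
mean_hat = a + b * x0                 # A + B x0, estimate of alpha + beta x0
se_mean = math.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / Sxx))
t_crit = 3.182                        # t_{0.025, 3}
ci = (mean_hat - t_crit * se_mean, mean_hat + t_crit * se_mean)
print(mean_hat, ci)  # 7.95, roughly (7.67, 8.23)
```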
Here I have given the temperature example once again: let y be the response and x0 the temperature at which the experiment is carried out. If several experiments are carried out at x0, we would like to estimate the mean value alpha + beta x0. However, when one single experiment is carried out, just as one weather forecast is made, we have to work not with the mean response but with the one response y, and that is the case we are now considering. So, let us start: y is normally distributed with mean alpha + beta x0 and variance sigma square, while A + B x0 is again normally distributed with mean alpha + beta x0 and variance sigma square times (1/n + (x0 - x bar) squared / Sxx), as we saw in the previous two slides. Therefore, if you take the difference of the two, the difference is also normally distributed, since the difference of independent normal random variables is normal; its mean is the difference of the two means, which is 0, and its variance is the sum of the two variances. Please recall that if x and y are two independent random variables, then the variance of x - y is the variance of x plus the variance of y. The same principle is being used here, and you get the variance sigma square times (1 + 1/n + (x0 - x bar) squared / Sxx). Therefore, if you divide this difference by its standard deviation, it is distributed as a standard normal variate, normal with mean 0 and variance 1. Again, sigma square is not known; it has to be replaced by the sum of squares of the residuals divided by its degrees of freedom, n - 2, and therefore this quantity, we find, follows a t distribution with n - 2 degrees of freedom, because the difference and the sum of squares of the residuals are two independent random variables.
And therefore, again we have a way to test the hypothesis that y is equal to alpha + beta x0 versus y not equal to alpha + beta x0, using this t distribution with n - 2 degrees of freedom, and the same distributional fact can be used to construct the prediction interval for y at input level x0, at 100 (1 - gamma) percent confidence. So, in summary, let us go through it. We have drawn inferences about four quantities: beta, alpha, the mean response alpha + beta x0 at an independent-variable value x = x0, and the prediction of y at x0. These can be written distributionally as follows. Beta is estimated by B, and B - beta, multiplied by the appropriate constant, has a t distribution with n - 2 degrees of freedom. Alpha is estimated by A, and A - alpha, multiplied by the appropriate constant, also has a t distribution with n - 2 degrees of freedom. The mean response alpha + beta x0 is estimated by A + B x0, and with the appropriate divider it has the same t distribution with n - 2 degrees of freedom. If you want to predict y(x0), it is also predicted by A + B x0; please remember that there is some difference between the two cases. The point estimate is the same, but the variance is different, because we are estimating different quantities. However, once the correct divider is used, the distribution is again t with n - 2 degrees of freedom.
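To make the distinction between the last two cases concrete, here is a sketch computing both intervals at x0 = 4 with the same made-up data: the point estimate A + B x0 is identical, but the prediction interval carries the extra "1 +" in its variance and is therefore wider.

```python
import math

# Same hypothetical data (illustrative only); both intervals at x0 = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar
SSR = (Sxx * Syy - Sxy ** 2) / Sxx
sigma2_hat = SSR / (n - 2)
t_crit = 3.182                        # t_{0.025, 3}

x0 = 4.0
point = a + b * x0                    # same point estimate in both cases

# Mean response: variance sigma^2 (1/n + (x0 - x_bar)^2 / Sxx)
se_mean = math.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / Sxx))
# Single future y: variance sigma^2 (1 + 1/n + (x0 - x_bar)^2 / Sxx)
se_pred = math.sqrt(sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / Sxx))

ci_mean = (point - t_crit * se_mean, point + t_crit * se_mean)
pi = (point - t_crit * se_pred, point + t_crit * se_pred)
print(ci_mean, pi)  # the prediction interval is the wider of the two
```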
These results give us two things: first, a way to test hypotheses, and second, a way to construct confidence interval estimators of each of these quantities. So, with this summary, we will move on to the next session of regression analysis.