the course on dealing with materials data. We are going to continue from the previous two sessions on regression analysis. Let us quickly review what we have done so far. We introduced the simple linear regression model. We estimated the parameters, namely the regression coefficients and the error variance. Later on we tested hypotheses on the regression coefficient, the mean response value and the prediction of a future response. This process of hypothesis testing also led us to interval estimates for all three of these quantities.

In the present session, we are going to study in detail the analysis part of regression. There are three aspects to it. Number one, having done all this estimation and inference on the regression equation, the question comes: how much of y, your dependent variable or response variable, is explained by the independent variable x? This is decided through the coefficient of determination. Then, because we already know that there exists something called a correlation between two variables x and y, we would like to find out the relationship between the coefficient of determination and the correlation coefficient. Next, suppose you are given some data. How do you go about carrying out the regression analysis? So, we will give you the steps, and having done it, the most important part is to know whether you are on the right path or not. That is, whether the regression analysis you have done is the correct approach or not, and this can be checked through what is known as residual analysis.

So, let us begin. We start with the coefficient of determination. When we write the expression yi is equal to alpha plus beta xi plus a random error epsilon, which is distributed as normal with mean 0 and variance sigma square, we are actually trying to express two kinds of variation that affect the value of y.
One is the random variation, which is given by epsilon and is characterized by the variance of the random error, sigma square. The other is the systematic variation in yi introduced by the values xi of the independent variable. The random variation, we can say, is estimated by the sum of squares of residuals, SSR. We have seen that in the past. The sum of squares of residuals is the summation of yi minus the estimated value of yi, that is, yi minus a minus b xi, whole square. The total variation in y we denote by S yy, and it is given by the summation of yi minus y bar, the mean of y, whole square.

Now, if S yy is the total variation and SSR is the variation due to the random error, then the difference between the two sums of squares should give us the variation that is explained by the input variable x. That is S yy minus SSR, the total variation minus the variation due to residuals. The coefficient of determination is then defined as R square, which is equal to S yy minus SSR, divided by S yy. In other words, we are estimating what fraction of the total variation in the y values is caused by the systematic variation introduced by xi.

Further, you can write R square as 1 minus SSR divided by S yy. The ratio SSR divided by S yy is never negative and cannot exceed 1, and therefore the R square value lies between 0 and 1. Now, when R square is close to 0, what does it imply? It implies that very little of the variation in y has been explained by the independent variable x, while when R square is close to 1, it means that most of the variation in y is explained by the input variable x.
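The definition of R square given above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available; the function name and the small data set are made up for this example and are not from the lecture.

```python
import numpy as np

def coefficient_of_determination(x, y):
    """Return (a, b, R2) for the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xx = np.sum((x - x.mean()) ** 2)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    s_yy = np.sum((y - y.mean()) ** 2)      # total variation in y
    b = s_xy / s_xx                         # estimated slope
    a = y.mean() - b * x.mean()             # estimated intercept
    ssr = np.sum((y - a - b * x) ** 2)      # sum of squares of residuals
    r2 = (s_yy - ssr) / s_yy                # explained / total variation
    return a, b, r2

# Made-up illustrative data (hypothetical, not the lecture's data)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
a_demo, b_demo, r2_demo = coefficient_of_determination(x_demo, y_demo)
print(f"a = {a_demo:.3f}, b = {b_demo:.3f}, R^2 = {r2_demo:.4f}")
```

For data that lie close to a straight line, as in this made-up set, R square comes out close to 1; for data with no linear trend it would be close to 0.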
Now, having understood this, let us try to establish a relationship between the correlation coefficient and capital R square. The correlation coefficient we write as small r. In our new terminology, small r is S xy, that is, the sum of the products of x minus x bar and y minus y bar, divided by the square root of S xx times S yy, where S xx and S yy are the sums of squares of x and y about their means. The sum of squares of residuals can be expressed in those terms as S xx S yy minus S xy whole square, divided by S xx. Please note that this is S xy whole square, that is, the square of S xy. So, please do not read it as S with subscript x y square. The same notation is used here. Substituting this into the definition, capital R square is S xy whole square divided by S xx S yy, which is exactly small r square. But please remember that small r lies between minus 1 and 1, and therefore we can only say that the absolute value of r is the square root of R square. So, except for the sign, the correlation coefficient and the square root of the coefficient of determination are equal.

Now, having known most of the quantities related to regression analysis, suppose one comes across data, say xi, yi, and has to apply regression analysis. How do you go about doing it? Here we explain the steps. The first step is to make a scatter plot of y versus x. You remember that in the previous slides we showed how to distinguish an exact linear trend from an approximate one. If the plot shows some kind of linear trend, then the simple linear regression technique can be applied. Once the regression coefficients are estimated, confirmation of the assumptions is very important.
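The relationship between small r and capital R square described above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name and the decreasing data set are hypothetical and chosen only so that r comes out negative.

```python
import numpy as np

def correlation_and_r2(x, y):
    """Return (r, R2) computed from S_xx, S_yy and S_xy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xx = np.sum((x - x.mean()) ** 2)
    s_yy = np.sum((y - y.mean()) ** 2)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    r = s_xy / np.sqrt(s_xx * s_yy)             # correlation coefficient
    ssr = (s_xx * s_yy - s_xy ** 2) / s_xx      # sum of squares of residuals
    r2 = 1.0 - ssr / s_yy                       # coefficient of determination
    return r, r2

# Made-up decreasing data (hypothetical), so the correlation is negative
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [9.8, 8.1, 6.2, 3.9, 2.0]
r_demo, r2_demo = correlation_and_r2(x_demo, y_demo)
print(f"r = {r_demo:.4f}, R^2 = {r2_demo:.4f}, sqrt(R^2) = {np.sqrt(r2_demo):.4f}")
```

Here r is negative while the square root of R square is positive, which is why only the absolute value of r can be recovered from R square.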
That is, you have estimated the regression coefficients and you have conducted the hypothesis testing to make sure that none of the regression coefficients is 0. It is then important to remember that the whole process of regression analysis has been carried out under three very important assumptions. The first assumption is that the residuals are random. The second is that the residuals are normally distributed, and the third is that they have a common variance sigma square for all i equal to 1, 2, and so on up to n. We have already seen how to make a scatter plot in the previous slides, so there is nothing more to show there, and we go on to the second step, which is called residual analysis.

First we define a standardized residual. You remember the expression for a residual; a residual divided by its standard deviation is called a standardized residual. The standard deviation, I think, is very clear by now. It is the square root of the sum of squares of residuals divided by its degrees of freedom, n minus 2, because the sum of squares divided by the degrees of freedom is the estimator of sigma square.

Now, we repeat the assumptions made on the error: first, that it is random; second, that it is normally distributed; and third, that it has a common variance. We know that the errors are estimated by the residuals. So, the first thing to make is a plot of the standardized residuals against the data. If the residuals are normal, then about 95 percent of the standardized residuals should lie between minus 2 and plus 2. I hope you recall that if the distribution is normal with mean 0, then the probability of the area between minus 2 sigma and plus 2 sigma is about 95 percent. So, when you take standardized residuals, sigma is understood to be 1.
So, they should lie between minus 2 and plus 2. The second thing we would like to have is a plot of the standardized residuals versus the fitted values, and if this appears as a random scatter, then the randomness is also confirmed.

Let us take an example. Here I have an example of germanium silicon alloys, in which the density of the alloy is studied against the percentage of silicon in the alloy. I have not shown you the scatter plot separately, but you see that these data points in blue are the density data points. On the x axis we have the amount of silicon in the alloy and on the y axis the density, and these points are the densities as measured at the silicon content given on the x axis. They lie pretty much on a straight line, and therefore it is perfectly fine to apply the linear model. Now that we have applied the linear model, there is a fitted line, and you can look at the residual plot. Here we have plotted the standardized residual against the percentage of silicon, and please note that the data lie completely between about minus 0.06 and plus 0.07. This is consistent with normality, because the data are well within the minus 2 and plus 2 limits. They are actually in a very narrow span, and you can see that the points are randomly scattered. So, our assumption of randomness is also justified.

Let us take another example. This is an example where I want to show what you see if the error is not random. Here we have data on linear thermal expansion in an alloy, and the thermal expansion along with the temperature is shown here. The plot we have made does not consider the error values given with the data. So, we have plotted 77 against minus 0.388, etcetera, and the temperature is taken in Kelvin. Let us move on.
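The standardized residual check described earlier can be sketched as follows. This is a minimal illustration, assuming NumPy; the function name is hypothetical and the data are simulated with a fixed seed rather than taken from the alloy examples. The divisor n minus 2 is the residual degrees of freedom for a straight-line fit with two estimated coefficients.

```python
import numpy as np

def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares; return (fitted, standardized residuals)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xx = np.sum((x - x.mean()) ** 2)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    b = s_xy / s_xx
    a = y.mean() - b * x.mean()
    fitted = a + b * x
    resid = y - fitted
    n = x.size
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # estimate of sigma
    return fitted, resid / s

# Simulated data (an assumption for illustration): linear trend plus normal noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(0.0, 0.5, size=x_demo.size)

fitted_demo, z_demo = standardized_residuals(x_demo, y_demo)
frac_within = np.mean(np.abs(z_demo) <= 2.0)
print(f"fraction of standardized residuals within +/-2: {frac_within:.2f}")
```

Plotting z_demo against x_demo, and against fitted_demo, gives the two residual plots discussed above: a random scatter supports the assumptions, while a visible pattern or a growing spread does not.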
You see that the actual scatter plot of the data does not appear to be exactly linear, but it is not very far from linear either. If we plot the line, it comes like this. This is the least squares line, that is, the regression line. But now look at the residual plot, which is temperature versus standardized residual. Please note that the residuals are well within the limits: this is 0.7 and here it is minus 0.6. So, the data are well within the minus 2 to plus 2 regime. But they are not scattered in a random fashion. There is a pattern to it, and this pattern is reflected here and here. Therefore, this very clearly says that the randomness is not justified, and therefore simple linear regression is not the correct model choice to express your thermal expansion in terms of temperature.

I could not get a real example in which the residuals show that the variance is not constant. So, I have artificially generated a graph. What we are trying to show is that in such situations it is not only that the errors have gone beyond plus 2 and minus 2; that is one aspect. Before that, we can also see that the error seems to be growing along the x axis. This is the x axis and these are the residuals, and you can see that the variation in the residuals tends to increase. This is called heteroscedasticity. It means that the assumption that sigma square is common for i equal to 1 to n is incorrect. It does not apply here; sigma square is actually changing. A plot can look something like this.

So, now to summarize what we have learnt today. We first defined the coefficient of determination, R square.
It shows the variation in y explained by the independent variable x compared to the total variation in y. That is, of the total variation of y, what fraction is introduced by the variation in x. If R square is close to 0, it implies that very little of the variation is explained by the input variable. If R square is close to 1, it implies that most of the variation is explained by the input variable x. The correlation coefficient and the coefficient of determination are related: the square of the correlation coefficient is the coefficient of determination, or the absolute value of the correlation coefficient is equal to the square root of the coefficient of determination. We have seen that standardized residual plots can be used to confirm the assumptions of randomness, normality and, which I had forgotten to mention, so let us mention it here, common variance of the error epsilon. Thank you.