Hello and welcome to the course on Dealing with Materials Data. In the past sessions we have covered several topics in regression, and we continue with it here; this is the fourth session on regression analysis. Let us have a quick review of what we have done so far. We introduced the simple linear regression model, its parameter estimation and its inference: inference on the regression coefficients, which we called alpha and beta, on the mean response, and on the predicted value. We then discussed the coefficient of determination, which tells us how much of the variation in the response variable is explained by the input variable x, and we showed the relationship between the coefficient of determination and the correlation coefficient: the coefficient of determination r² equals the square of the correlation coefficient, or in other words the absolute value of the correlation coefficient equals the square root of the coefficient of determination. We also talked briefly about how to approach a linear regression problem, that is, once you get the data, how you go about the analysis, and finally how you know that what you have done is correct, namely by checking the assumptions made about the errors: randomness of the errors, normality of the errors and common variance. So we have gone through this exercise before.

In this session we will see that certain models which are not directly linear can be transformed, through a mathematical transformation, into a linear regression model; we will learn this through an example. Then we will introduce the multiple regression model, how its parameters are estimated, and how inference on the regression parameters is carried out. We will then introduce polynomial regression as a special case of the multiple regression model. Finally, just to give a taste of what happens when the assumption of normality fails, recall that all the analysis we have done assumes that the response variable follows a normal distribution; when that assumption fails, one example we wish to give is the logistic regression model. We will not go into great detail about it; it belongs to the class of generalized linear models, and the point is simply that not all regression models are linear, and statistics provides treatments for such generalized linear models.

So let us begin. As I said, we take up transformation to linearity through a case study of fatigue crack growth. The Paris relationship, an empirical relationship for the fatigue crack growth rate per fatigue cycle under linear elastic fracture mechanics, is da/dN = C·(ΔK)^m, where a is the crack length, N is the number of fatigue cycles and ΔK is the range of the stress intensity factor. So ΔK is fixed and you obtain a rate of change of crack length per cycle. The data therefore come as pairs: ΔK is the independent variable, and da/dN, the rate of crack growth per fatigue cycle, is the response variable. ΔK is fixed independently and the corresponding da/dN is measured. So how do we transform this to linearity?
Well, in the Paris relationship we can take a logarithm on both sides. Taking logarithms of the Paris equation turns it into a linear equation: log(da/dN) = log C + m·log(ΔK), where C and m are called the Paris coefficients. This can be seen as y = α + βx, where x = log(ΔK) and y = log(da/dN). So the Paris relationship shown on the previous slide can be transformed into a linear relationship through a log transformation, and once that is done you can follow the regression model in which log C and m are estimated by least squares; it is simple linear regression because there is only one input variable.

What happens after that? The plot shows the data: the vertical axis is log(da/dN), the horizontal axis is log(ΔK), and the blue points are the data points of log(da/dN) versus log(ΔK). The straight line is the regression line we have estimated, and on top of that two pairs of lines are shown. Recall that we had an upper bound and a lower bound, in other words confidence limits on the estimated regression parameters and on the mean response. Here the middle line gives the mean response, the outer pair is the 99 percent upper and lower bound, and the inner pair is the 95 percent upper and lower bound. What this really shows is that, except for a few points, most of the data lie within the 95 percent confidence limits, which indicates that our model is adequate. If you write the fit without the logarithmic transformation, it says that da/dN = 8 × 10⁻⁹ · (ΔK)^2.89, and the R² is almost 99 percent. So we have a good fit to the data.
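To make the procedure concrete, here is a minimal sketch in Python of fitting the Paris coefficients by simple linear regression on log-transformed data. The numerical values of ΔK and da/dN below are hypothetical, made up purely for illustration; they are not the data from the slide.

```python
import numpy as np

# Hypothetical fatigue-crack-growth data (illustrative values only):
# delta_K in MPa*sqrt(m), dadN in m/cycle.
delta_K = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 27.0, 33.0, 40.0])
dadN    = np.array([6.2e-9, 1.1e-8, 2.1e-8, 3.6e-8, 6.6e-8, 1.2e-7, 2.1e-7, 3.7e-7])

# Transform to linearity: log(da/dN) = log C + m * log(delta_K)
x = np.log10(delta_K)
y = np.log10(dadN)

# Simple least-squares fit of y = alpha + beta * x
beta  = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

m = beta            # Paris exponent
C = 10 ** alpha     # Paris coefficient (base-10 logs were used above)

# Coefficient of determination r^2
y_hat = alpha + beta * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"m = {m:.3f}, C = {C:.3e}, r^2 = {r2:.4f}")
```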
Let us move on; there is another setting in which we would like to talk about weighted least squares. Recall from the previous session that one of the assumptions we must check is that there is no heteroscedasticity, in other words that the variance of ε, which is the same as the variance of the response variable y, is a constant σ²: even as yᵢ changes for i = 1, 2, …, n, the variance σ² remains the same. Recall also the diagnostic plot of the standardized residuals against the observation index (first data point, second data point, and so on); if the spread of the standardized residuals grows or shrinks systematically, there is a chance of what is known as heteroscedasticity. When such a thing happens, we model the variance of εᵢ as a constant divided by a weight: Var(εᵢ) = σ²/Wᵢ.

So σ² is still a common constant factor, and the variation from one data point to the next comes through the weights Wᵢ, i = 1, 2, …, n. In that case the least squares estimator needs to minimize the weighted sum of squares Σᵢ Wᵢ(yᵢ − a − b·xᵢ)². This is the same sum of squares as before, except that each term is normalized by the variance of yᵢ; since Var(yᵢ) = σ²/Wᵢ, the common factor 1/σ² comes out and this weighted sum is what must be minimized. Please remember that Wᵢ is a given value; it is not to be estimated. The two quantities to be estimated are a and b, and therefore we follow the same process as for simple linear regression: we take the partial derivatives with respect to a and with respect to b, set them to zero, and obtain two equations that must be solved simultaneously for a and b. This is a very simple algebraic exercise, so we will not go into its details.
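As an illustration, here is a small sketch of solving the two weighted normal equations directly, assuming the weights Wᵢ are supplied by the experimenter; the data and the choice of weights below are invented purely for demonstration.

```python
import numpy as np

def weighted_least_squares(x, y, w):
    """Weighted least-squares estimates (a, b) for y = a + b*x,
    minimizing sum_i w_i * (y_i - a - b*x_i)^2.
    The weights w_i are given, not estimated."""
    Sw, Swx, Swy = w.sum(), (w * x).sum(), (w * y).sum()
    Swxx, Swxy = (w * x * x).sum(), (w * x * y).sum()
    # Normal equations from the partial derivatives w.r.t. a and b:
    #   a*Sw  + b*Swx  = Swy
    #   a*Swx + b*Swxx = Swxy
    A = np.array([[Sw, Swx], [Swx, Swxx]])
    rhs = np.array([Swy, Swxy])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Hypothetical data whose spread grows with x (heteroscedastic),
# with weights taken inversely proportional to the assumed variance.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1 * x)   # error spread grows with x
w = 1.0 / x**2                                   # W_i proportional to 1/Var(y_i)
print(weighted_least_squares(x, y, w))
```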
Let us move on to the next topic, multiple regression. Recall that originally we wrote down the general equation and said that, to begin with, we would deal with simple linear regression by keeping only β₀ and β₁, which at some point we renamed α and β to keep the notation simple. Let us now consider the full case of the response variable: y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε. Here β₀, β₁, …, βₖ are constants, also known as the regression coefficients, and ε again represents the random error, with the same assumptions that its expected value is 0 and that it has a common variance σ².

To find the least squares estimates of these regression coefficients, suppose b₀, b₁, …, bₖ are the least squares estimates of β₀, β₁, …, βₖ. We form the estimated value of y, take the difference between the actual and estimated values, square it, sum over the data and minimize. Setting the partial derivatives to zero gives k + 1 simultaneous linear equations in the k + 1 unknowns, so it is a case of solving a system of linear equations for k + 1 unknowns, again a simple algebraic matter.

This can also be presented in matrix notation, which you must have used to solve simultaneous linear equations in your previous degrees. Let y be the n × 1 vector of the n observations y₁, y₂, …, yₙ, and let X be the n × (k + 1) matrix whose first column is all ones; the extra column of ones appears because of the intercept β₀. The vector β is (k + 1) × 1 and ε is once again n × 1. So the equation y = Xβ + ε can be set up, and dimensionally it matches: y is n × 1, X is n × (k + 1), β is (k + 1) × 1, so the product Xβ is n × 1, and ε is n × 1.

This matrix equation can be solved as follows. In place of β put b, the vector of estimates b₀, b₁, …, bₖ; in other words we consider the estimated relationship ŷ = Xb, so ε drops out. Multiplying both sides by X′ gives the normal equations X′y = X′Xb, and therefore, if X′X is invertible, which we are assuming, b = (X′X)⁻¹X′y, and this can be computed using matrix algebra.

Let us look at the expected value of b. It is interesting to note that this matrix equation looks just like the simple linear regression equation: the simple case yᵢ = β₀ + β₁xᵢ + εᵢ can be written as the row vector (1, xᵢ) multiplying the column vector (β₀, β₁)′, plus εᵢ, so the two formulations are equivalent. Going by that, in the same fashion, E(b) = E[(X′X)⁻¹X′y]; now (X′X)⁻¹X′ is a constant matrix, so this equals (X′X)⁻¹X′E(y), and since E(y) = Xβ it becomes (X′X)⁻¹X′Xβ = β, because (X′X)⁻¹ and X′X cancel each other. Similarly, if we define the constant matrix C = (X′X)⁻¹X′, then CC′ = (X′X)⁻¹; these are very simple calculations, so I leave the verification to you. Then the variance–covariance matrix of b is σ²(X′X)⁻¹, which I think you can also work out.

The residuals are our point of interest, because we do all the testing using residuals. The residual sum of squares is SSR = Σᵢ(yᵢ − ŷᵢ)², and here also we can show that SSR/σ² follows a chi-square distribution with n − k − 1 degrees of freedom. Please remember that n is the number of data points and k + 1 parameters are estimated from the data, so the degrees of freedom are n − (k + 1) = n − k − 1. This calculation should be clear to you, and therefore E[SSR/(n − k − 1)] = σ².
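Here is a minimal numerical sketch of these matrix computations on hypothetical data with two regressors; the variable names and values are invented for illustration only.

```python
import numpy as np

def multiple_regression(X_vars, y):
    """Least-squares estimates b = (X'X)^{-1} X'y for
    y = b0 + b1*x1 + ... + bk*xk, plus the residual-based
    estimate of sigma^2 with n - k - 1 degrees of freedom."""
    n = len(y)
    X = np.column_stack([np.ones(n), X_vars])   # column of ones for beta_0
    XtX = X.T @ X
    b = np.linalg.solve(XtX, X.T @ y)           # solves the normal equations
    residuals = y - X @ b
    ssr = residuals @ residuals                 # residual sum of squares
    k = X_vars.shape[1]
    sigma2_hat = ssr / (n - k - 1)              # E[SSR/(n-k-1)] = sigma^2
    cov_b = sigma2_hat * np.linalg.inv(XtX)     # estimated Var-Cov matrix of b
    return b, sigma2_hat, cov_b

# Hypothetical example with two regressors.
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=50)
b, s2, cov = multiple_regression(np.column_stack([x1, x2]), y)
print("b =", b.round(3), " sigma^2 =", round(float(s2), 3))
```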
With this, let us move on to polynomial regression. I want to show that polynomial regression is actually a special case of multiple linear regression. We write the model as y = β₀ + β₁x + β₂x² + … + βₖxᵏ + ε. Here β₀, β₁, β₂, …, βₖ are constants, again the regression coefficients, and ε represents the random error in the relationship, with the same assumptions that E(ε) = 0 and Var(ε) = σ². This can also be written as y = β₀ + β₁z₁ + β₂z₂ + … + βₖzₖ, where zₚ = xᵖ for p = 1 to k. Thus you can see that it is nothing but a multiple regression equation; it transforms into a multiple regression equation and can be solved the same way as before. Please remember, though, that the matrix X is going to be numerically heavy, because it contains powers of x and the numbers can become large, so solving the matrix equation, the system of simultaneous linear equations, sometimes needs special treatment. However, if k is 2, 3 or 4, it is simpler to solve the simultaneous equations directly and reach the solution without using matrix algebra.

Now we come to the final case. As I said, so far we have been assuming that the response variable y follows a normal distribution, with the regression model as its mean and σ² as its variance. If the variance varies proportionally across the data we treat it by weighted least squares, but otherwise we have assumed there is no heteroscedasticity; that is what we have assumed throughout this course. Now consider the case where experiments are performed at various levels of the input and the response y is either success or failure, or defective or non-defective. In such cases, if the probability of success can be expressed as p(x) = e^(a + bx) / (1 + e^(a + bx)), then such a model is called the logistic regression model. Please remember this is a very specific case; it is just to give you a taste that life is not all linear regression models. There are generalized models, and this is one of them.

Let us go to the next slide: how do we estimate the parameters? Let y be the response of an experiment from the logistic regression model; it can be expressed as a Bernoulli trial, since it is a success or a failure. If yᵢ is the response at input xᵢ and p(xᵢ) is the probability of success, then the likelihood contribution of each observation is p(xᵢ)^yᵢ · (1 − p(xᵢ))^(1 − yᵢ), and this gives the model for estimation once we substitute p(xᵢ) = e^(a + bxᵢ) / (1 + e^(a + bxᵢ)), where a and b are our unknowns. These a and b are best estimated by maximum likelihood estimation, and the log-likelihood is ℓ(a, b) = Σᵢ [yᵢ(a + bxᵢ) − log(1 + e^(a + bxᵢ))]. You can see that the first part is a nice linear part, while the second part introduces a little difficulty, so you can use a method such as gradient descent, or another numerical optimization method, to get approximate values of the estimates of a and b.
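As a small illustration of this generalized case, here is a sketch of estimating a and b by gradient ascent on the log-likelihood above. The success/failure data are synthetic, generated at a few hypothetical input levels, and the learning rate and iteration count are arbitrary choices made for this example, not part of the lecture.

```python
import numpy as np

def fit_logistic(x, y, lr=0.01, n_iter=20000):
    """Maximum-likelihood estimates of (a, b) in
    p(x) = exp(a + b*x) / (1 + exp(a + b*x)),
    obtained by gradient ascent on the log-likelihood
    l(a, b) = sum_i [ y_i*(a + b*x_i) - log(1 + exp(a + b*x_i)) ]."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # current success probabilities
        # Gradients of the log-likelihood with respect to a and b:
        grad_a = np.sum(y - p)
        grad_b = np.sum(x * (y - p))
        a += lr * grad_a
        b += lr * grad_b
    return a, b

# Hypothetical success/failure data at several input levels.
rng = np.random.default_rng(2)
x = np.repeat(np.linspace(-2, 2, 9), 20)          # experiments at 9 levels
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, true_p)                       # 1 = success, 0 = failure
print(fit_logistic(x, y))
```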
So with this we come to the end of this session; let us quickly summarize. We first worked on transformation to a linear regression model and took the case of the Paris equation, where a log transformation converts the Paris equation into a simple linear regression model; you can then carry out the analysis, estimate the Paris coefficients and do your further analysis. For the multiple regression model we saw that, in matrix notation, it looks very similar to the simple linear model, and you use matrix algebra to solve the matrix equation and obtain the least squares estimates. We also saw that the polynomial regression model can be transformed into a multiple regression model and solved numerically using matrix algebra; however, because the terms are powers of the independent variable x, it can become numerically a bit challenging, although if the powers are only 2, 3 or 4 the resulting system of linear equations can be solved directly. Finally, we saw a simple example of what happens when the normality assumption is not true: we took the case of the logistic regression model, which is an example of a generalized regression model, and saw generally how it is approached. Thank you.