Welcome to session 24 on Quality Control and Improvement with MINITAB. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. In the previous session we discussed the basics of regression, which is an important tool to identify variables that can be considered for further experimentation. The basic model that regression uses was proposed by Gauss, and it is the fundamental model we use here to understand the relationship between y and x: we are modeling the expected value of y with respect to x. There are two coefficients, in line with the basic line equation y = mx + c: one is the intercept of the model and the other is the slope. So this is simple linear regression — I am assuming a single x, and the expectation of y for a given x is a conditional mean. At different conditions of x we will have different values of y, and even if I reset the condition back to the same value, the output will differ; that is what we observed when we talked about analysis of variance. So the expected (mean) value of y is modeled here with respect to x. The slope β1 tells us, for a one-unit change in x, what change is expected in the mean of y. And if you extrapolate the fitted line, the value where it cuts the y-axis is the intercept β0.
So β0 is the expected value of y when x equals 0. Generally in regression we do not extrapolate, but that is the concept of the intercept. A physical interpretation is usually not possible for β0, but β1 has one: for every unit increase in x, β1 is the expected change in y. So β0 and β1 are the two important parameters that need to be estimated from this model; if I have their estimates, written β0-hat and β1-hat, I can write down the fitted function. How β0 and β1 are estimated is therefore important for us. And once we have estimated them, we have to do model adequacy checks, just as we did in ANOVA — I said earlier that regression is essentially an extension of analysis of variance. Conceptually, many lines can be constructed through the scatter of points: innumerable lines are possible, and out of them we want the one where the error between predicted and actual values is minimized. That error minimization gives the best-fit line.
So out of all the innumerable lines I could place, whichever gives the minimum total squared error is the best-fit line, and Minitab finds it for you automatically. Taking derivatives of the squared error with respect to β0 and β1 and equating them to zero gives the normal equations, and solving those gives the β0 and β1 estimates. So I have a set of x values and a set of y values — observations 1, 2, up to n — and from every pair I can calculate the averages ȳ and x̄ of the data set. Those values are used in the β1 estimate. The formula looks complex, but it is not difficult because I know all the values x_i for i = 1 to n; everything can be calculated, and once β1 is estimated, β0 can also be estimated. Minitab does this automatically based on the solution of the normal equations. This is known as least-squares estimation, and statisticians have shown these are unbiased estimates, so we can adopt them. Minitab will give the estimates automatically, and there is also an ANOVA table, like the earlier ANOVA analysis. Now, say y is on the vertical axis and x on the horizontal axis, with the y_i values in the vertical direction — a single x and a single y, and I am representing one point.
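The least-squares formulas described above can be written out explicitly: β̂1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂0 = ȳ − β̂1·x̄. As a plain-Python sketch of that arithmetic (an illustration, not Minitab's internal implementation), with made-up toy data lying near the line y = 2 + 3x:

```python
def least_squares(x, y):
    """Least-squares estimates of beta0 (intercept) and beta1 (slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xy = sum of (x_i - x_bar)(y_i - y_bar); S_xx = sum of (x_i - x_bar)^2
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx              # slope estimate
    beta0 = y_bar - beta1 * x_bar    # intercept estimate
    return beta0, beta1

# toy data (hypothetical, chosen to sit near y = 2 + 3x)
x = [0, 1, 2, 3, 4]
y = [2.1, 4.9, 8.2, 11.0, 13.9]
b0, b1 = least_squares(x, y)
```

For this toy data the estimates come out near the true intercept 2 and slope 3, which is the sense in which least squares recovers the underlying line.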
So one actual value is located at a point, and based on a line equation — ŷ = β̂0 + β̂1·x for a given estimation — we have developed some equation. For the given value x_i, we can also get a predicted value from that line: whenever I put x_i into the equation, I get the predicted value. But the actual value sits elsewhere. Now, how much does this point differ from the overall average ȳ of all the y values? That is the total deviation (y_i − ȳ) from the mean. Out of this, the part (ŷ_i − ȳ) is explained by the regression equation, and the remaining part (y_i − ŷ_i) is unexplained by the regression equation. The total variability about the overall mean is known as SST, the explained part is SS regression (SSR), and the unexplained part is SSE, so that SST = SSR + SSE. This is for one observation; there can be n observations, so each of these quantities carries a summation from i = 1 to n.
So it is the same sum-of-squares concept: sum of squares for regression and sum of squares for error. Because we are predicting y as a function of a single x, the degrees of freedom for the single predictor equals 1. With n observations, the SST degrees of freedom will be n − 1, the regressor degrees of freedom 1, and the error degrees of freedom n − 2 (n − 1 minus 1 is n − 2). The same concept as ANOVA is used here; the only difference is that x is a continuous variable — it can take any value, not predefined levels like level 1, level 2 — and y is continuous also. For every observation I can compute the contribution to SS total (the deviation from the overall average), to SS error (actual minus predicted), and to SS regression (predicted minus average — how much is explained). So analysis of variance can also be adopted here, and it is used in regression for model adequacy checks; we will see its interpretation in Minitab. Model adequacy checks are required here as well: constancy of variance — homoscedasticity — has to be checked; errors must be uncorrelated, for which the Durbin-Watson statistic is used; and the normality assumption can be checked with the Anderson-Darling test. We can store the residuals, and the residual here is nothing but the error.
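With those definitions, the decomposition SST = SSR + SSE can be sketched directly (a plain-Python illustration of the identity, not Minitab output; the data and coefficients are the hypothetical toy values β̂0 = 2.08, β̂1 = 2.97):

```python
def anova_decomposition(x, y, beta0, beta1):
    """Split total variability: SST (df n-1) = SSR (df 1) + SSE (df n-2)."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [beta0 + beta1 * xi for xi in x]               # fitted values
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variability
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained by regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained (error)
    return sst, ssr, sse

# toy data with its least-squares estimates beta0 = 2.08, beta1 = 2.97
x = [0, 1, 2, 3, 4]
y = [2.1, 4.9, 8.2, 11.0, 13.9]
sst, ssr, sse = anova_decomposition(x, y, 2.08, 2.97)
```

The identity SST = SSR + SSE holds exactly only when β̂0 and β̂1 are the least-squares estimates — for any other line the cross term does not vanish.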
So the error is known as the residual, and it is nothing but the actual value minus the predicted value for a given observation x_i: for a given x_i, the prediction is subtracted from the actual value, and that gives the residual. In Minitab you can save the residuals and do all kinds of analysis. Sometimes raw residuals are stored, sometimes standardized residuals; preferably we use standardized residuals, as they take care of certain other aspects. What we will do is analyze the residuals and carry out the model adequacy checks. This is important: the assumptions have to be satisfied to use the regression model. If the model assumptions are satisfied, we can say the model can be generalized — within the range of x over which it was developed, give me any value, even one not present in the earlier historical data set, and I can predict the expected value of y. So prediction of y is possible for a given value of x; only remember that it should stay within the boundaries where the regression model was developed. I cannot extrapolate a regression equation — that consideration is always there. We have already discussed heteroscedasticity: a funnel shape in the residuals indicates heteroscedastic behavior, while a curved pattern says a linear model may not be sufficient and you may have to incorporate second-order terms. That curvature will be reflected in the lack-of-fit test.
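Standardized residuals, as the lecture loosely defines them, are the raw residuals divided by the residual standard deviation (Minitab's exact standardized residual also adjusts each point for its leverage; this sketch uses the simpler residual/s form described here, on the same hypothetical toy fit as before):

```python
import math

def standardized_residuals(y, y_hat):
    """Raw residuals divided by the residual standard deviation (n - 2 df)."""
    n = len(y)
    resid = [yi - fi for yi, fi in zip(y, y_hat)]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))  # residual std. dev.
    return [e / s for e in resid]

# fitted values from the toy line y_hat = 2.08 + 2.97 * x
y = [2.1, 4.9, 8.2, 11.0, 13.9]
y_hat = [2.08, 5.05, 8.02, 10.99, 13.96]
sr = standardized_residuals(y, y_hat)
```

On a well-behaved fit, almost all standardized residuals fall between −2 and +2; values outside that band are the "unusual observations" Minitab flags later in this session.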
If you have repeated observations — multiple y values at the same condition of x — then only can I calculate this lack of fit, and it will tell me whether nonlinearity is present or not. A funnel pattern, by contrast, is non-constancy of variance, and a random band is what is expected, with no deviation as such. So we can plot residuals versus fits and check. If the variance is not constant, a transformation has to be applied to y, and then we regress the transformed y on the x variable. Now let us take one example to illustrate regression in MINITAB. An engineer is interested in the purity of oxygen — this is taken from Montgomery's Applied Statistics and Probability for Engineers, and the data set is given there. Purity of oxygen is the y variable and the percentage of hydrocarbon is the factor. This is historical data, not data from a statistical experiment, and I want to check whether, when the hydrocarbon level changes, the expected value of purity changes, and whether I can develop a generalized prediction equation so that, within the given range, I can predict the expected value of y. If all the model adequacy checks are OK, then we can do that. And why are we doing this? Because we do not have a theoretical function that can be used to model purity in terms of hydrocarbon level.
That is one of the constraints we have, and that is why we are doing empirical modeling here. So how do we do that? I go to the Minitab interface, where the data are in columns C1, C2, C3; I have taken two columns — one is the purity data and one is the hydrocarbon data. First let us see what a scatter plot shows, whether we can see a linear relationship or not. Go to Graph > Scatterplot, choose a simple scatter plot, with purity as the y variable and hydrocarbon as the x variable. Click OK and you get the graph, and it shows that a linear relationship exists — a positive one: as hydrocarbon increases, purity also increases. We can also check the correlation: Stat > Basic Statistics > Correlation, selecting purity and hydrocarbon, set the options and results, and click OK. What we see is that the correlation p-value is near 0 and less than 0.05, which indicates that hydrocarbon and purity are significantly correlated, and the correlation coefficient is 0.937 — more than 0.7, so 0.937 is very good. The p-value indicates statistical significance, meaning they are highly and linearly correlated. Pearson correlation is used here.
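The Pearson coefficient Minitab reports is r = S_xy / √(S_xx · S_yy), computable from the same sums used in the slope estimate. A small stdlib sketch (illustrative, reusing the hypothetical toy data rather than the textbook's purity data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# near-linear toy data gives r very close to +1
r = pearson_r([0, 1, 2, 3, 4], [2.1, 4.9, 8.2, 11.0, 13.9])
```

In simple linear regression r² equals the R² (coefficient of determination) that appears later in the model summary, which is why a correlation of 0.937 corresponds to R² ≈ 0.877.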
So then I can develop the regression equation: Stat > Regression > Fit Regression Model. It asks for the response — purity — and the continuous predictor — hydrocarbon. There is no categorical predictor here; that is also possible to incorporate, but we will not go into that complexity now. Among the many options you see, the one to note is "Include the constant term in the model", which should stay ticked, because statisticians have found that when the constant term — the intercept — is included, model performance is generally better; this has been seen in much research. So we will not omit the β0 estimation. The coding and stepwise options are not required at this stage. Under Storage you can store the residuals or the standardized residuals; let us store the standardized residuals, which are generally recommended — these are roughly the residuals divided by the standard deviation of the residuals. Save that, click OK, and Minitab produces the regression equation, which I can copy as a picture and paste so we can look at the results.
So here is the first result, the equation Minitab has generated: the intercept β̂0 is 74.28, estimated by the formulation we have shown, and 14.95 is β̂1, the slope — a one-unit increase in hydrocarbon level gives an average increase of 14.95 in purity. In the second set of results the coefficients are listed: the constant (β0) coefficient is 74.28, and the hydrocarbon (β1) coefficient is 14.95, together with their standard errors and the corresponding T and P values. P less than 0.05 indicates that β1 is statistically significant — there is a slope, so we can consider hydrocarbon level as a variable explaining the variability of purity. The constant β0 is also statistically significant, so we should retain it. Then you will find the model summary, which I can again copy as a picture and paste. One of the adequacy measures there is R, which is nothing but the correlation coefficient we saw, and R², known as the coefficient of determination, which with a single predictor is simply the square of the correlation. It comes out to 87.74, expressed as a percentage; as a proportion that is 0.877.
So 0.877, which is more than 0.7. This is also calculated by another formula — SS regression divided by SS total — which the ANOVA table will summarize. SS regression over SS total tells me how much of the variability of Y is explained by this hydrocarbon-level variable: around 87 percent, which is quite good. One variable is explaining that much of the variability of Y that you observe; when I change X, it influences the expected value of Y — that is the interpretation we can make. Then what other results do we get? The analysis of variance table, which I can copy, paste, and interpret. The regression row shows a p-value less than 0.05, meaning this fitted equation is significant, so we can adopt it, and the hydrocarbon-level p-value is likewise significant. You will also find a lack-of-fit test, which checks whether there is any nonlinearity in the model that would force us to go to higher-order equations; the formula is given in any standard book. If we have multiple observations at a given level of X, lack of fit can be calculated, and here its p-value comes out as 0.5, which is greater than 0.05 — so there is no lack of fit as such, and the linear model is quite sufficient and adequate to explain the variability.
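The R² arithmetic itself is a one-line ratio of the sums of squares from the ANOVA table; a minimal sketch (the numbers below are hypothetical, chosen only to reproduce the quoted 87.7 percent, not the textbook's actual sums of squares):

```python
def r_squared(ss_regression, ss_total):
    """Coefficient of determination: fraction of total variability explained."""
    return ss_regression / ss_total

# hypothetical sums of squares scaled so SSR/SST matches the lecture's 87.7%
print(round(100 * r_squared(87.7, 100.0), 1))  # -> 87.7
```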
So I do not need to go to higher-order terms here; that is the lack-of-fit result. One way of judging the regression equation is through the coefficients — β1 is significant, β0 is significant — and overall, to see whether the regression equation makes sense, we look at the regression p-value; generally the two will agree: when the coefficients are significant, the overall regression will be significant. That is the interpretation we can make. Then there are some unusual observations: when a standardized residual is beyond 2 in magnitude, Minitab flags that observation as unusual. We then have to decide whether to include or exclude it — whether it is an outlier that needs to be eliminated. This kind of information is also useful when we do regression analysis; that is why standardized residuals are used to identify outlier observations. And we have to be careful in dealing with outliers — there are many ways of handling them. What we can understand from this simple example is that every condition is satisfactory. One more thing we have to do is save the residuals. Let me check whether they were saved: go to Stat > Regression > Fit Regression Model > Storage and confirm that standardized residuals are ticked; the last stored column is the standardized residual, which can be treated as the residual here.
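The flagging rule described above can be mimicked with a simple threshold check (a sketch of the |standardized residual| > 2 rule; Minitab's own flagging also considers leverage, which this does not):

```python
def flag_unusual(std_residuals, threshold=2.0):
    """Return 1-based row numbers whose standardized residual exceeds
    the threshold in absolute value (candidate outliers)."""
    return [i + 1 for i, r in enumerate(std_residuals) if abs(r) > threshold]

# hypothetical standardized residuals, purely for illustration
flags = flag_unusual([0.3, -1.1, 2.4, 0.8, -2.7])
print(flags)  # -> [3, 5]
```

A flagged row is a candidate for investigation, not automatic deletion — as the lecture says, whether to exclude it is a separate judgment.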
And there are three checks we have to undergo. One is the normality assumption: go to Stat > Basic Statistics > Normality Test on the stored residuals and see whether they are normal or not. Here the p-value is more than 0.05, so there is no problem with the normality assumption. There are other graphs we can draw while doing this regression: one is residuals versus fits, which will indicate whether heteroscedasticity is present, and the second is residuals versus order, which shows whether any autocorrelation exists. When you draw these two graphs, what you observe is that the residuals versus fitted values are more or less random about the zero line, so there is no heteroscedasticity as such — though we can prove that with the Breusch-Pagan test, by taking the residuals and testing them; that is available in R, as I told you earlier. Since no pattern is observed and the plot seems random, autocorrelation may also come out to be negligible here. We can check all the assumptions formally, but what we are seeing is that at least the preliminary checks are satisfactory, and if you go to the book you will find that all the assumptions hold in this case. So we can generalize the equation — the equation we have written can be generalized — and the values 74.28 and 14.95 for the β0 and β1 estimates can also be obtained in Excel.
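Once the checks pass, prediction within the observed range of x is straightforward; a sketch using the fitted coefficients quoted in this example (the range limits below are placeholders standing in for the historical data's actual x-range, which the lecture does not quote — the point is only that the function refuses to extrapolate):

```python
def predict_purity(hydrocarbon, x_min=1.0, x_max=1.5):
    """Expected purity from the fitted line 74.28 + 14.95 * x.
    x_min/x_max are hypothetical bounds of the fitted range; the model
    should not be used outside the region it was developed on."""
    if not (x_min <= hydrocarbon <= x_max):
        raise ValueError("x outside the range the model was fitted on")
    return 74.28 + 14.95 * hydrocarbon

print(round(predict_purity(1.2), 2))  # -> 92.22
```

Calling it with a value outside [x_min, x_max] raises an error, which encodes the lecture's rule that a regression equation must not be extrapolated.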
So if you go to Excel and run the regression — which I have done earlier — you use Data > Data Analysis, where you have the Regression analysis tool, and it gives you the coefficients and the corresponding p-values, indicating whether the regression is significant or not. Excel reports beyond three decimal places: the p-value you see is of the order of 10⁻⁹, showing it is not actually 0, although Minitab reports it as 0 — Minitab does not display more than three decimal places, but Excel does. And the values agree exactly: the intercept is 74.28 in Excel and Minitab has also given 74.28; the β1 estimate is 14.95 in both, approximately. So regression is also possible in Excel and you can verify the results there. If you want to see the exact p-values that Minitab truncates, you can do it in Excel, or transfer the data to R and get the exact p-values from the analysis. So the generalized equation we can use is purity = 74.28 + 14.95 × hydrocarbon level. That is one example we have seen; this is the data set, and these are the graphs and equations.
So this is the model fit you can see — the line equation that was derived — with R² around 0.877; β̂0 and β̂1 are estimated here, the regression is found to be significant, and there is no lack of fit, so a linear equation is sufficient. The R² value is SS regression divided by SST, which is approximately 0.877, the value you see. This is the ANOVA analysis: the regression degrees of freedom is 1 because there is one variable, and the error degrees of freedom is n − 2 — there are 20 observations, so 20 − 2 = 18 — and n − 1 = 19 for the total. That is the basic interpretation. We can also generate the errors, because we have the regression equation: actual minus predicted gives the residuals, probability plots of the residuals are drawn, the Anderson-Darling test is done, and its p-value is greater than 0.05, indicating the residuals are more or less normal — most of the points lie on the line. Also, in residuals versus fits there is no trend as such and it seems more or less random; we can do the Breusch-Pagan test to confirm, and autocorrelation tests like the Durbin-Watson statistic can also be done. This is another example — selling price and annual taxes — and I will just repeat the analysis, one more example, where we want to see whether the annual tax is related to the selling price of the house and obtain the estimated relationship.
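The Durbin-Watson statistic mentioned here is easy to compute from the stored residuals in their time order: d = Σₜ(eₜ − eₜ₋₁)² / Σₜ eₜ², with values near 2 suggesting little first-order autocorrelation (near 0 strong positive, near 4 strong negative). A stdlib sketch on hypothetical residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest uncorrelated errors."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# hypothetical residuals stored from a fit, in observation order
d = durbin_watson([0.2, -0.1, 0.3, -0.2, 0.1, -0.3])
```

Identical consecutive residuals drive d toward 0, while strictly alternating signs drive it toward 4, which is why a pattern-free residuals-versus-order plot goes together with d near 2.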
So the first step: the data set is given in columns C5 and C6, again taken from Montgomery's Applied Statistics and Probability for Engineers. I want to see the relationship of taxes with sale price: taxes will be Y and sale price will be X — taxes depend on the sale price. We can plot the scatter diagram and see whether a linear relationship is plausible, with Y as taxes and X as sale price. What we see in the scatter plot is again a more or less positive relationship, and I can confirm the correlation coefficient: go to Correlation, select sale price and taxes, and click OK. The p-value is significant, and the correlation coefficient we get is approximately 0.876 — positive, meaning there is a positive relationship between sale price and taxes. Everything is fine, so I want to fit the equation: Stat > Regression > Fit Regression Model, change the response from purity to taxes, and change the continuous predictor to sale price. I again store the standardized residuals, which will be saved at the end, and under Graphs I select the residual plots — the normal plot of residuals and residuals versus fits, for the standardized residuals — then click OK. Minitab gives the equation and tells whether the constant is significant or not. What we see is that the constant is not significant, but as I said, researchers suggest we should keep the constant.
So we will keep it. The sale price is significant — β1 is significant — so we retain it. There is a positive relationship: the coefficient is positive, and the constant is negative; the constant cannot really be interpreted in regression, so a physical interpretation is not possible. The R² value is 0.7673, close to the R² calculated from the ANOVA analysis as SS regression over SS total, as in the previous calculations. The regression is significant and there is no lack of fit — the lack-of-fit p-value of 0.25 that you see is greater than 0.05. If I copy this image and paste it in another sheet, what we see in the analysis is: the regression is significant, lack of fit is not present, so the model can be generalized. The normal probability plot looks more or less OK, though we can do an Anderson-Darling test, and there is no particular pattern in the residuals-versus-fits diagram, so heteroscedasticity may not be a problem here. Autocorrelation also does not seem to be significant, because there is no trend as such and the observations appear random — but we can run the Durbin-Watson test and the Breusch-Pagan (BP) test to confirm. We have stored the residuals, so we can at least run the normality test: going to the last residual column and testing it, the p-value is greater than 0.05, so we can assume the residuals are normal. So we test the residuals, and if a condition is not satisfied, then we go for a transformation. The next case we will see is when the errors are non-normal.
In that scenario, what is to be done and how the regression model has to be developed — we will see that with more examples in our next session, and then some more complexities when we go to multiple regression. We will spend some time on multiple regression because in design of experiments we are not dealing with a single x: the single-x case is the one-way analysis of variance we have dealt with, the simplest condition, but that is not reality. There will be multiple x's influencing y, and we need to develop the mathematical model — the regression equation — which we finally have to optimize. We will discuss that case later, but at present we will stop here and continue in our next session with simple-regression complexities when the error is non-normal — how we deal with that, as in analysis of variance — and then go on to multiple regression and the modeling complexities we face there. So we will continue with that. Thank you.