Selecting the best regression model. Suppose you have a problem with a large number of regressors and one response variable. With so many regressors, you may wonder whether some of them are irrelevant for the response variable, and whether those irrelevant regressors can be removed from the model without affecting its predictive power. So, by the best regression model in the multiple linear regression setup, we mean a model that explains the variability in y while including as few regressors as possible, because the more regressors a model has, the greater the cost of collecting the data, and the maintenance cost also increases. The known algorithms for selecting the best regression model are basically classified into two classes: one is called the all-possible-regressions approach, and the other is called sequential selection. In the all-possible-regressions approach, if you have a problem with k - 1 regressors, then you consider all 2^(k-1) linear models. The next step is to fit those models and evaluate them with respect to some criteria. The criterion could be the coefficient of multiple determination, R square; it could be MS residual; it could be the adjusted coefficient of multiple determination, which I am going to talk about today; and it could also be the Mallows statistic, denoted by C_p. Let me talk about the criteria for evaluating a subset regression model. One criterion is the residual mean square, denoted MS residual. What we observed in the last class is that SS residual(p), by which we mean the residual sum of squares for a model involving p - 1 regressors, decreases as the number of regressors increases, but the same is not necessarily true for MS residual.
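As a small side illustration of the all-possible-regressions idea (not part of the lecture), here is a sketch in Python that enumerates the 2^(k-1) candidate models when k - 1 = 4, as in the example used below:

```python
from itertools import combinations

def all_subset_models(regressors):
    """Enumerate every subset of the regressors, from the
    intercept-only model up to the full model: 2**len(regressors)
    candidates in total."""
    models = []
    for r in range(len(regressors) + 1):
        for subset in combinations(regressors, r):
            models.append(subset)
    return models

models = all_subset_models(["x1", "x2", "x3", "x4"])
print(len(models))  # 16, i.e. 2**4 candidate models
```

Each tuple in `models` names the regressors of one candidate model; the empty tuple is the intercept-only model.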
So, MS residual may increase if you increase the number of regressors; that is what we observed in the last class. Now, what is MS residual? The MS residual associated with the model involving p - 1 regressors is SS residual(p) divided by n - p, and of course a lower value of MS residual(p) indicates a better fit. So, what we will do is plot the minimum MS residual against p, where p is the number of unknown parameters in the model; p goes along the x axis and the minimum MS residual along the y axis. Given a p, all possible models with p - 1 regressors are evaluated, and the model giving the minimum MS residual(p) is tabulated. To explain this, let me recall the example we discussed in the last class: the Hald cement data. Here we have four regressors and one response variable, so k - 1, which stands for the number of regressors, is equal to 4. Then we write down all possible models. First is the model with no regressors; there is only one unknown, the intercept, so p = 1 for this model. Next are the one-regressor models, for which p = 2 because the number of unknowns in those models is 2. Then come the models involving two regressors, with p = 3; the models involving three regressors, with p = 4; and finally the full model involving all four regressors, with p = 5. So first we write down all possible models, and then we take each model and fit it.
That is what we did in the last class. You consider the model y = beta_0 + beta_1 x_1 + epsilon, you compute SS total, SS residual, and SS regression, and you write down the ANOVA table. Here is the ANOVA table for the first model, and you have to do the same job for all 16 possible models. Now, what is the MS residual here? The MS residual for this model is 115.06, as mentioned in the last class. Similarly, you fit the second model, get its ANOVA table, and find its MS residual value. In this way you compute the MS residual for all the models involving one regressor, for all the models involving two regressors, and similarly for the other models. Now, the minimum MS residual among the one-regressor models is 80.35, so we tabulate that one; the minimum among the two-regressor models is 5.7, so we tabulate that one on the graph as well. For p = 1, the model with no regressors, the MS residual is about 226; for p = 2 it is 80.35; for p = 3 it is 5.7; for p = 4 the minimum is 5.33; and for p = 5 it is 5.98. We plot these points (I am not being very particular about the scaling).
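To make the fitting step concrete, here is a minimal sketch of fitting a one-regressor model by least squares and computing the ANOVA sums of squares; the toy data are made up for illustration and are not the Hald cement data:

```python
def simple_ols_anova(x, y):
    """Fit y = b0 + b1*x by least squares and return the ANOVA
    decomposition (SS_total, SS_regression, SS_residual)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope estimate
    b0 = ybar - b1 * xbar          # intercept estimate
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = ss_total - ss_res     # identity: SS_total = SS_reg + SS_res
    return ss_total, ss_reg, ss_res

# toy data, purely illustrative
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_reg, ss_res = simple_ols_anova(x, y)
```

The same decomposition, extended to several regressors, is what fills in each row of the ANOVA tables discussed here.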
So, what happens here is that MS residual(3) = 5.7, MS residual(4) = 5.33, and MS residual(5) = 5.98. Initially MS residual decreases, then it stabilizes for some time, and then it may increase; here it increases a little, from 5.33 to 5.98. For SS residual this cannot happen, because SS residual is always a decreasing function of p, but the same is not true for MS residual: it may also increase. If the newly added regressor is not relevant to the response variable, then the reduction in SS residual is not sufficient to compensate for the one degree of freedom lost in the denominator, and that is why MS residual can increase. So what do we do? These are the selection criteria: either we choose a value of p such that MS residual(p) is close to the MS residual for the full model, or we choose the value of p at which MS residual(p) turns upward. Here you can see that MS residual decreases until p = 4 and then turns upward, because at p = 5 the value is again larger than at p = 4. So here you can choose p = 4. The meaning of this is that the MS residual criterion suggests the model involving the regressors x_1, x_2, x_4; the other two models, the one involving x_1, x_2, x_3 and the one involving x_1, x_2, x_3, x_4, also have comparable values of MS residual.
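The MS residual criterion itself is simple arithmetic; here is a sketch using the values quoted in the lecture (for the Hald data, n = 13 and the intercept-only model has SS residual equal to SS total, 2715):

```python
def ms_residual(ss_res_p, n, p):
    """Residual mean square for a model with p unknown parameters
    (p - 1 regressors): SS_residual(p) / (n - p)."""
    return ss_res_p / (n - p)

# Intercept-only Hald model: SS_residual = SS_total = 2715, n = 13, p = 1
print(round(ms_residual(2715, 13, 1), 2))  # 226.25, the value quoted as 226 above
```

Adding a regressor lowers the numerator SS residual but also lowers the denominator n - p, which is exactly why the ratio can turn upward.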
So, either you can go for that model or for these two models, and if you prefer a model with two variables, then of course you go for the model that involves x_1 and x_2. This is how we evaluate models using the MS residual criterion. Next we are going to talk about one more criterion, called the adjusted coefficient of multiple determination, denoted by R-bar squared. Let me just recall R square: R square is the coefficient of multiple determination, equal to SS regression / SS total, which is also equal to 1 - SS residual / SS total. Now, if this R square is associated with the model involving p - 1 regressors, then I will put a subscript p on each of these terms and write R²(p). What we know is that SS residual(p) always decreases as p increases, and from this it follows that the coefficient of determination R²(p) always increases as p increases. From here we can say that R square is not a good measure of the quality of fit. Let me explain why: whether or not the newly included regressor is relevant to the model, SS residual always decreases and R square always increases when the number of regressors increases. That is why R square cannot distinguish between a newly added regressor that is relevant to the model and one that is irrelevant, and that is why R square (or SS residual) is not a good measure of the quality of fit. So that is why we introduce the adjusted coefficient of multiple determination.
So, what is that? R-bar²(p) is the adjusted coefficient of multiple determination; the only difference is that we replace SS residual(p) by MS residual(p), and we replace SS total by MS total. So R-bar²(p) = 1 - MS residual(p) / MS total. Now, this adjusted R square can be written in terms of R square. What is MS residual when there are p - 1 regressors in the model? It is SS residual(p) divided by its degrees of freedom, n - p. And MS total is nothing but SS total divided by its degrees of freedom, n - 1. So R-bar²(p) = 1 - [(n - 1)/(n - p)] SS residual(p)/SS total, and since SS residual(p)/SS total is nothing but 1 - R²(p), this can be written as R-bar²(p) = 1 - [(n - 1)/(n - p)] (1 - R²(p)). Since we have replaced SS residual by MS residual, and, as I said before, MS residual does not necessarily decrease as p increases, you can see that the value of R-bar²(p) will not necessarily increase with the addition of just any regressor. That is why this adjusted coefficient of multiple determination is a better measure than the usual R square. Now, let me illustrate how to get the adjusted coefficient of multiple determination for the first model. We have the fitted model and the ANOVA table for the first model, y = beta_0 + beta_1 x_1 + epsilon. Here the number of unknowns is 2, so p = 2, and I am basically computing R-bar²(2) for this specific model. So this is 1 - (n - 1)/(n - p) times (1 - R²(2)), where n = 13 for the Hald cement data, so n - 1 = 12.
So, n - p = 11, and we need 1 - R²(2). What is the value of R²(2)? It was 53.4 percent, that is, 0.534. R²(2) is nothing but SS regression / SS total = 1450 / 2715 (please refer to my previous class), which is 0.534. So this value is 1 - (12/11)(1 - 0.534) = 0.492. So the adjusted coefficient of multiple determination for the first model is 0.492; writing it in percent, it is 49.2 percent. Similarly, you fit the second model and get its value of the adjusted coefficient of multiple determination, and you do the same for all the models here, including all the models involving three and four regressors; these are the values. Then we plot the adjusted coefficient of multiple determination against p. Here also, for each p, all possible models with p - 1 regressors are evaluated, and the model giving the maximum R-bar²(p) is tabulated. Of course, a higher value of the adjusted coefficient of multiple determination indicates a better fit; that is why we take the maximum R-bar²(p) from each class. The values are: for p = 1 the value is 0; for p = 2 the maximum is 64.5; for p = 3 the maximum is 97.5; for p = 4 the maximum is 97.6; and for p = 5 it is 97.3. So we tabulate these values on the plot: 0 at p = 1, 64.5 at p = 2, 97.5 at p = 3, and so on.
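The computation above can be sketched directly from the formula R-bar²(p) = 1 - [(n - 1)/(n - p)](1 - R²(p)); the numbers are those used in the lecture:

```python
def adjusted_r2(r2_p, n, p):
    """Adjusted coefficient of multiple determination for a model
    with p unknown parameters fitted to n observations."""
    return 1.0 - (n - 1) / (n - p) * (1.0 - r2_p)

# First Hald model (x1 only): n = 13, p = 2, R^2 = 1450/2715 ~ 0.534
print(round(adjusted_r2(1450 / 2715, 13, 2), 3))  # 0.492, i.e. 49.2 percent
```

Note that a perfect fit (R² = 1) gives an adjusted value of exactly 1 for any p < n, while a mediocre R² is penalized more heavily as p grows.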
So, for p = 4 the value is 97.6, and for p = 5 it is 97.3. Here you can see that it is not always increasing; the value of the adjusted R square can also decrease sometimes. So the selection criterion is that you select the p where R-bar²(p) reaches its maximum. According to this criterion, among the two-regressor models the best is the model involving x_1 and x_2, and among the three-regressor models all three are good: the models involving x_1, x_2, x_3; x_1, x_2, x_4; and x_1, x_3, x_4 all have a very high value of the adjusted coefficient of multiple determination. And see the difference here: for the best three-regressor model it is 97.6, while for the best two-regressor model the value is 97.5. So there is only a little gain in the adjusted coefficient of multiple determination if you add one more variable: if along with x_1 and x_2 you add either x_3 or x_4, the value increases by only 0.1, which is not significant. So personally I would go for the model with the two variables x_1 and x_2; that is enough, because there is no significant increase in the adjusted coefficient of multiple determination from adding one more regressor. Next we will be talking about one more criterion, called the Mallows statistic, denoted by C_p. This statistic measures the overall bias, or mean square error, in the fitted model. The error for the i-th observation is the difference between y_i-hat, the i-th fitted value, and E(y_i), the expected response under the full regression model.
So, the difference between the fitted value and the expected value is the error; you square it, so it becomes a squared error, and then you take its mean, that is, its expected value. This is for the i-th observation; you sum over i = 1 to n and standardize by dividing by sigma squared. So the mean square error is (1/sigma²) times the sum over i of E[(y_i-hat - E(y_i))²], and it can be shown that this quantity can be estimated by C_p = SS residual(p) / MS residual - n + 2p, where SS residual(p) is for the model involving p - 1 regressors and MS residual is for the full model. I am not going into the details of why C_p is an estimate of this mean square error. Now, here we need to observe a few things. First, MS residual is computed using all the regressors in the model, while SS residual(p) is computed for a model with only p - 1 regressors. Now that we know the notation, what is the value of C_p for the full model? Here again is the expression: C_p = SS residual(p) / MS residual - n + 2p, where p is the number of unknown parameters in the model. When p = k, the number of regressors in the model is k - 1, so SS residual(p) is the SS residual for the full model. Then what is the value of C_p when p = k, that is, what is C_k?
So, C_k is for the full model: C_k = SS residual(k), the SS residual for the full model, divided by MS residual, minus n plus 2k. Since SS residual(k) / MS residual = n - k for the full model, this is nothing but (n - k) - n + 2k = k. So what we observed is that for p = k, that is, for the full model, the C_k value is exactly equal to k. And you can note that a low C_p value indicates a better fit. So our selection criterion for the model here is that a small value of C_p is desirable, and also that C_p should be close to p: if C_p is equal to p, that means the model involving p - 1 regressors is almost equivalent to the full model. Let me illustrate the Mallows statistic using our example. For p = 2, I am considering the model y = beta_0 + beta_1 x_1 + epsilon, and I will compute C_2 = SS residual(2) / MS residual - n + 2p. Here SS residual(2) = 1265, and for MS residual do not take the MS residual of this model; take the MS residual for the full model, which is 5.98. So C_2 = 1265 / 5.98 - 13 + 4 = 202.53. So 202.53 is the C_p value for this model. Similarly, you get the ANOVA table for the second model and compute its C_p value. These are the C_p values for the models involving one and two regressors, and these are the C_p values for the models involving three and four regressors.
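The Mallows statistic is again direct arithmetic; here is a sketch using the worked value C_2 from the lecture, together with the check that C_k = k for the full model:

```python
def mallows_cp(ss_res_p, ms_res_full, n, p):
    """Mallows C_p: SS_residual(p) / MS_residual(full) - n + 2p."""
    return ss_res_p / ms_res_full - n + 2 * p

# Model with x1 only: SS_residual(2) = 1265, full-model MS residual = 5.98, n = 13
print(round(mallows_cp(1265, 5.98, 13, 2), 2))  # about 202.54; the lecture quotes 202.53

# Full model (p = k = 5): SS_residual(k) = (n - k) * MS_residual, so C_k = k
print(mallows_cp((13 - 5) * 5.98, 5.98, 13, 5))  # 5, up to floating point
```

The small difference from the quoted 202.53 is only rounding in the intermediate values.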
So, I said that a smaller value of C_p indicates a better fit, and that C_p should also be close to p. Here you can see that the three models involving x_1, x_2, x_3; x_1, x_2, x_4; and x_1, x_3, x_4 are acceptable under the C_p criterion, because their values are small, especially 3.02, and a low C_p value indicates a better fit. Among these, the one with C_p = 3.02 is the best, but all three are acceptable because their values are reasonably small and also close to p. Now, for the one-regressor models, the smallest C_p is 138, but that is of course very far from 2, so it is not acceptable according to the C_p criterion. For the models involving two regressors, the smallest value is 2.68, and it is also very close to 3; so the model which involves x_1 and x_2 is an acceptable model in terms of the C_p criterion. Again, you can plot C_p against p, with p along the x axis and the minimum C_p along the y axis, and draw the line C_p = p. For p = 1 the value is 442; for p = 2 it is 138; for p = 3 it is 2.68, which is below the line; for p = 4 the minimum is 3.02; and for p = 5 it is 5. You plot these points and join them. The selection criterion is that a small value of C_p close to p is desirable, and from this graph you can see that for p = 3 the value is 2.68.
So, among the two-regressor models this one is the best in terms of the C_p criterion, and among the three-regressor models all three are acceptable in terms of the C_p criterion. Overall, we have talked about several different criteria for evaluating the models, and if you combine them, then for the two-variable models it appears that the model involving x_1 and x_2 is the best with respect to all the criteria, and among the three-regressor models this one is good with respect to all the criteria. So this is how we write down all the models involving at most k - 1 regressors, fit them, evaluate them using some criteria, and choose the best model out of all the possible models. Thank you very much.