This is my third lecture on dummy variables. The variables used in regression analysis are generally quantitative, but occasionally it is necessary to use qualitative variables, and dummy variables are used to incorporate that qualitative information in the regression analysis. Examples of qualitative variables are employment status, sex (male or female), or the origin from which the data were taken, say city A, city B, or city C. Let me give an example of the use of a dummy variable. Suppose you have two sets of data, one for males and one for females, on two variables: the response variable y and the regressor variable x. For instance, y could be expenditure per month and x income per month. If there is a significant difference in the response level between the two sets of data, it is not advisable to fit a simple linear regression model y = β₀ + β₁x to all the data together. Instead, we need to incorporate the qualitative information that one set of data is for males and the other for females. The model recommended here, using one dummy variable, is

y = x₀(β₀ + β₁x) + z₁(α₀ + α₁x) + ε,

where we are in effect using two dummy variables, including x₀, with values (x₀, z₁) = (1, 0) for the first set A (males) and (1, 1) for the second set B (females). The separate models for males and females are obtained by setting z₁ = 0 and z₁ = 1 respectively:

y = β₀ + β₁x for males,
y = (β₀ + α₀) + (β₁ + α₁)x for females.

So if the data are given in two sets, you fit this combined model and then test hypotheses on the dummy coefficients. Testing H₀: α₁ = 0 tests the appropriateness of fitting two parallel straight lines to the two sets of data. Testing H₀: α₀ = α₁ = 0 is equivalent to testing whether the same model can be fitted to both sets of data. And testing H₀: α₀ = 0 tests whether two straight lines with the same intercept can be fitted to the two sets. If all of these hypotheses are rejected, then the full model above is your final model for the data. A small numerical sketch of this two-set fit, with the test of α₁ = 0, is given below.
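To make this concrete, here is a minimal Python sketch on made-up data (the sample size, coefficient values, and noise level are all invented for illustration). It fits the full two-set model by least squares and uses the extra-sum-of-squares F statistic, which we will meet again below, to test H₀: α₁ = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data: two groups (z = 0 for males, z = 1 for females).
n = 30
x = rng.uniform(0, 10, n)
z = np.repeat([0.0, 1.0], n // 2)
y = 2.0 + 1.5 * x + z * (1.0 + 0.4 * x) + rng.normal(0, 0.5, n)

def ss_reg(X, y):
    """Regression sum of squares (about the mean) of a least-squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ beta
    return np.sum((yhat - y.mean()) ** 2)

ones = np.ones(n)
X_full = np.column_stack([ones, x, z, z * x])   # beta0, beta1, alpha0, alpha1
X_restr = np.column_stack([ones, x, z])         # model under H0: alpha1 = 0

beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
resid = y - X_full @ beta_full
ss_res = resid @ resid                           # residual SS of full model

# Extra-sum-of-squares F test for H0: alpha1 = 0 (one extra parameter).
F = (ss_reg(X_full, y) - ss_reg(X_restr, y)) / 1 / (ss_res / (n - 4))
print("F =", F, " critical value =", stats.f.ppf(0.95, 1, n - 4))
```

If F falls below the critical value, H₀: α₁ = 0 is accepted and two parallel lines suffice.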
If, for example, H₀: α₁ = 0 is accepted, that indicates you can fit two parallel straight lines to the two sets of data, but with different intercepts. In that case you can go for the model

y = β₀ + β₁x + α₀z₁ + ε.

So this is about two sets of data and a straight-line model. Now we will go for three sets of data and straight lines. To allow the fitting of three separate straight lines we form the model

y = x₀(β₀ + β₁x) + z₁(γ₀ + γ₁x) + z₂(δ₀ + δ₁x) + ε.

This is the model for three sets of data, where we are trying to fit a straight line to each set. Here x₀ = 1 is a dummy variable and z₁ and z₂ are two additional dummy variables, with the scheme (x₀, z₁, z₂) = (1, 1, 0) for set A, (1, 0, 1) for set B, and (1, 0, 0) for set C. The model can be rewritten as

y = β₀ + β₁x + γ₀z₁ + γ₁xz₁ + δ₀z₂ + δ₁xz₂ + ε.

Note that here we have two interaction terms involving the dummy variables, namely xz₁ and xz₂. This is the general model, which covers all possibilities of fitting three straight lines to three sets of data. Now we can test hypotheses such as whether we can go for three parallel lines for the three sets, or whether we can fit one single straight-line model for all three sets; you can form the appropriate null hypothesis to test each of these things. We will talk about two such tests here. A short sketch of how the dummy columns z₁ and z₂ can be generated from the set labels is given below.
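As a small illustration, here is one way to generate the dummy columns from a label giving the set each observation belongs to (the labels are hypothetical; the coding matches the scheme above):

```python
import numpy as np

# Hypothetical origin labels for each observation.
origin = np.array(["A", "A", "B", "B", "C", "C"])

z1 = (origin == "A").astype(float)  # 1 for set A, else 0
z2 = (origin == "B").astype(float)  # 1 for set B, else 0
# Set C is the baseline, z1 = z2 = 0, so it is described by beta0 + beta1*x.
```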
First, to test whether the three lines are identical, we test the hypothesis

H₀: γ₀ = γ₁ = δ₀ = δ₁ = 0 against H₁: H₀ is not true.

I hope you know how to test this sort of hypothesis for a model like this, because it is nothing but a multiple linear regression model: you can consider the dummy variables as regressor variables, and although the model involves interaction terms, that does not matter. I hope you can recall the extra-sum-of-squares technique. To test this hypothesis we use the test statistic

F = [SSReg(full model) − SSReg(restricted model)] / (6 − 2) ÷ [SSRes / (n − 6)].

You compute SSReg for the full model and subtract SSReg for the restricted model, which is the model under the null hypothesis, namely y = β₀ + β₁x + ε; that is, the null hypothesis suggests a single straight-line fit for the three sets of data. The full model has 6 parameters and the restricted model has 2, so you divide the numerator by 6 − 2 = 4, and you divide SSRes by its degrees of freedom, n − 6. This statistic follows the F distribution with degrees of freedom 4 and n − 6, and the critical region is: reject the null hypothesis if F > F_α(4, n − 6) at some level of significance α. The numerator is basically the contribution of these parameters, or the associated regressor variables, toward explaining the variability in y.

Similarly, to test whether the three lines are parallel, that is, whether you can fit three parallel lines to the three sets of data, you test the null hypothesis

H₀: γ₁ = δ₁ = 0 against H₁: H₀ is not true.

You can test this by the same technique, using the F statistic F = [SSReg(full model) − SSReg(restricted model)] / (6 − 4) ÷ [SSRes / (n − 6)]. The restricted model is the model under the null hypothesis: you just put γ₁ = 0 and δ₁ = 0 in the equation, so the restricted model is

y = β₀ + β₁x + γ₀z₁ + δ₀z₂ + ε.

You find SSReg under this restricted model; the full model has 6 parameters and this one has 4, so you divide by 6 − 4 = 2 and by SSRes over its degrees of freedom n − 6. This follows F(2, n − 6), and of course the critical region is: reject the null hypothesis if F > F_α(2, n − 6). A generic helper for this extra-sum-of-squares computation is sketched below.
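Here is a minimal generic helper for the extra-sum-of-squares F test on two nested design matrices; the function name and interface are just for illustration, not a standard library API.

```python
import numpy as np
from scipy import stats

def extra_ss_f_test(X_full, X_restr, y, alpha=0.05):
    """Extra-sum-of-squares F test for nested linear models."""
    n = len(y)

    def fit(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        yhat = X @ beta
        ss_reg = np.sum((yhat - y.mean()) ** 2)   # SS explained by the model
        ss_res = np.sum((y - yhat) ** 2)          # residual SS
        return ss_reg, ss_res

    ss_reg_full, ss_res_full = fit(X_full)
    ss_reg_restr, _ = fit(X_restr)

    df_num = X_full.shape[1] - X_restr.shape[1]   # number of extra parameters
    df_den = n - X_full.shape[1]                  # residual df of full model
    F = (ss_reg_full - ss_reg_restr) / df_num / (ss_res_full / df_den)
    crit = stats.f.ppf(1 - alpha, df_num, df_den)
    return F, crit, F > crit   # reject H0 if F exceeds the critical value
```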
Now I will explain all of this using the turkey data, which you may recall: there we had three sets of data, with y the weight in pounds and x the age in weeks, and the observations coming from three different origins. We are trying to fit a straight-line model for each of the three sets, so the general model to consider is

y = x₀(β₀ + β₁x) + z₁(γ₀ + γ₁x) + z₂(δ₀ + δ₁x) + ε

(you can call the first dummy variable x₀ or z₀, no problem), and this can be rewritten with the interaction terms as before. Let me explain how to estimate the regression coefficients; here you have six of them. The parameter vector β contains β₀, β₁, γ₀, γ₁, δ₀, δ₁, and the coefficient matrix is built column by column: the first column stands for x₀ (all ones), the second is the regressor variable x, the third is z₁ from the dummy-variable scheme for the three sets, the fourth is the product z₁x (you just multiply z₁ with x), the fifth is z₂, and once you have z₂ you can compute the sixth column, z₂x. I am sure you understand that the coefficient matrix X and the regressor variable x are two different things. So you have the coefficient matrix X, the parameter vector to estimate, and the y vector, and you can write this multiple linear regression model in the form y = Xβ + ε. You know that for a multiple linear regression model β̂ = (X′X)⁻¹X′y; you know X and you know y, so you can compute β̂, and these are the estimates of the regression coefficients. This gives the fit. The three separate straight lines are then obtained from the fitted model: the line for the first block by setting z₁ = 1 and z₂ = 0, the line for the second block by setting z₁ = 0 and z₂ = 1, and the line for the third block by setting z₁ = 0 and z₂ = 0. These three fits are exactly what one would find if one fitted each subset of data separately. Now let me compare this with what we got before, without using the interaction terms: there we considered that all three lines are parallel, so the slope for all three fitted lines was β̂₁. You can see that the fits we obtained here and the fits we had before are not very different. That is, we have fitted a general model, and the fits are not so different from those we had before by assuming three parallel straight lines for the three sets of data. So what we need to do is test statistically whether the three sets of turkey data really require three different straight-line fits, or whether three parallel straight lines are adequate. That we can do by testing a suitable null hypothesis. A sketch of the design-matrix construction and the least-squares fit appears below.
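Here is a compact sketch of the design matrix and the least-squares fit for the three-set model. The x values, y values, and origin labels are placeholders, not the actual turkey data.

```python
import numpy as np

# Placeholder data: age x, origin label, and weight y for each observation.
x = np.array([4.0, 8.0, 12.0, 6.0, 10.0, 14.0, 5.0, 9.0, 13.0])
origin = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
y = np.array([4.1, 8.3, 12.2, 4.4, 8.9, 13.1, 4.0, 8.1, 12.0])

z1 = (origin == "A").astype(float)
z2 = (origin == "B").astype(float)

# Columns: x0, x, z1, z1*x, z2, z2*x  ->  beta0, beta1, g0, g1, d0, d1.
X = np.column_stack([np.ones_like(x), x, z1, z1 * x, z2, z2 * x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
b0, b1, g0, g1, d0, d1 = beta

# The three fitted lines, recovered by setting (z1, z2) appropriately:
print(f"set A: y = {b0 + g0:.3f} + {b1 + g1:.3f} x")   # z1 = 1, z2 = 0
print(f"set B: y = {b0 + d0:.3f} + {b1 + d1:.3f} x")   # z1 = 0, z2 = 1
print(f"set C: y = {b0:.3f} + {b1:.3f} x")             # z1 = 0, z2 = 0
```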
Before doing that, here is the ANOVA table for the turkey data. The total sum of squares has 12 degrees of freedom because we had 13 observations. The residual degrees of freedom is 7 because the model has six parameters, and six parameters means there are six restrictions on the 13 residuals: you have the freedom of choosing seven residuals, and the remaining six have to be chosen in such a way that they satisfy those six restrictions. That is why the residual degrees of freedom is 7, and the regression degrees of freedom is of course 5. The regression sum of squares is the part of the variability in the response variable that is explained by the model, and here, if you can recall, R² is very high; most of the variability is explained by this model.

Now, what I wanted to say is that we observed that the three fitted straight lines are very similar to the lines we obtained before by considering a three-parallel-line fit, and we need to test that formally. The model we are fitting is

y = x₀(β₀ + β₁x) + z₁(γ₀ + γ₁x) + z₂(δ₀ + δ₁x) + ε.

First we check whether the three lines could be identical: we test H₀: γ₀ = γ₁ = δ₀ = δ₁ = 0 against H₁: H₀ is not true. We have the fitted model, and we are trying to check whether we really need such a general model or can just fit a single straight line to all three sets of data; we will get the answer by testing this hypothesis. How do we test it? The F statistic is [SSReg(full model) − SSReg(restricted model)] / 4 ÷ MSRes, where MSRes is SSRes divided by the residual degrees of freedom. If you recall the ANOVA table, SSReg for the full model is 38.711 and MSRes is 0.101. The restricted model is y = β₀ + β₁x; you fit this model to the given data and see how much of the variability it explains, and you can check that its SSReg is 26.20. So

F = (38.711 − 26.20)/4 ÷ 0.101 = 30.97,

and this follows F(4, 7), the residual degrees of freedom being 7. This is greater than F₀.₀₁(4, 7) = 7.85 (you refer to the tabulated value of the F statistic). So the test is significant, which means H₀ is rejected.
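As a quick check of the arithmetic, the numbers quoted above can be plugged in directly; only scipy's F quantile is used in place of the printed table.

```python
from scipy import stats

ss_reg_full, ss_reg_restr = 38.711, 26.20   # from the ANOVA table above
ms_res, df_num, df_den = 0.101, 4, 7

F = (ss_reg_full - ss_reg_restr) / df_num / ms_res
print(F)                                  # about 30.97
print(stats.f.ppf(0.99, df_num, df_den))  # about 7.85 -> reject H0
```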
So this test says that for the turkey data you cannot go for a single straight-line model: we fitted the general model, tested whether the three blocks of data can be fitted by one straight line, and fitting a single model for the turkey data is rejected; you have to go for something else. Next we test whether we can go for parallel lines. To test whether the three lines are parallel, we test the null hypothesis H₀: γ₁ = δ₁ = 0 against the alternative that H₀ is not true. Similarly we use the F statistic, which is SSReg for the full model minus SSReg for the restricted model; the full model has 6 parameters and the restricted model has 4, so you divide by 6 − 4 = 2 and then by MSRes, and this follows F(2, 7). We know that for the full model SSReg is 38.711, and you have to find SSReg for the restricted model, which is

y = β₀ + β₁x + γ₀z₁ + δ₀z₂ + ε.

You fit this model and find the SSReg due to it, which is 38.61; it is hardly any different from the full model's. So

F = (38.711 − 38.61)/2 ÷ 0.101 = 0.5,

which is less than the tabulated value F₀.₀₁(2, 7) = 9.55. So the test is not significant, which means we accept the null hypothesis: H₀ is accepted, and the parallel-lines fit y = β₀ + β₁x + γ₀z₁ + δ₀z₂ + ε is satisfactory. Well, let me conclude this part. The turkey data involve three sets of observations, and what we did here is: first you fit a general straight-line model involving three dummy variables, and then you test whether the three lines can be identical, and whether they can be parallel. From the tests we observed that the single-line hypothesis is rejected while the parallel-lines hypothesis is accepted, which means we can go for three parallel lines for the turkey data, and based on these tests we finally conclude that this parallel-lines model is satisfactory for the turkey data. The same numerical check as before, with these numbers, is sketched below.
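And the corresponding check for the parallel-lines test, again plugging in the quoted numbers:

```python
from scipy import stats

ss_reg_full, ss_reg_restr = 38.711, 38.61   # full vs parallel-lines model
ms_res, df_num, df_den = 0.101, 2, 7

F = (ss_reg_full - ss_reg_restr) / df_num / ms_res
print(F)                                  # about 0.5
print(stats.f.ppf(0.99, df_num, df_den))  # about 9.55 -> accept H0
```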
Now we will go for two sets of data and a quadratic model. Suppose we have two sets of data on y and x, and we have in mind a model of the form y = β₀ + β₁x + β₁₁x² + ε. We did not talk about this sort of quadratic model before; this is called polynomial fitting, but let me just say that it is not difficult, because if you consider x² as a new regressor, say z = x², then this is nothing but a multiple linear regression model. That is all for the time being; we will be talking about polynomial regression later on. So we have two sets of data, we are planning to fit a quadratic model, and we fit the model involving two dummy variables:

y = z₀(β₀ + β₁x + β₁₁x²) + z₁(α₀ + α₁x + α₁₁x²) + ε.

This is the same technique we used for the straight-line fit; instead of a straight line we are using a quadratic here. The dummy-variable scheme is (z₀, z₁) = (1, 0) for set A and (1, 1) for set B. Now we want to test several possibilities, such as whether we need two different quadratic models for the two sets of data, or whether we can go for an identical quadratic fit for both sets, or something else. We will talk about a few possibilities and how to test them.

First, to test whether we can go for the same quadratic fit for both sets of data or not, we test the hypothesis

(1) H₀: α₀ = α₁ = α₁₁ = 0 against H₁: H₀ is not true.

You know how to test this hypothesis using the extra-sum-of-squares technique. Of course, you understand that if this null hypothesis is accepted, we can go for a single quadratic fit; if H₀ is rejected, we conclude that the models are not the same for the two sets of data, that is, we cannot go for the same quadratic fit for both sets.

If the hypothesis in (1) is rejected, you then test

(2) H₀: α₁ = α₁₁ = 0 against H₁: H₀ is not true.

What this means is that the two quadratic models are different in intercept, but the null hypothesis says they have the same slope and curvature. So if H₀ is accepted here, we conclude that the fits for the two sets of data have the same slope and curvature, differing only in intercept.

Let me do one more: if the hypothesis in (2) is rejected, you can now test whether the curvature is the same for both fits or not:

(3) H₀: α₁₁ = 0 against H₁: α₁₁ ≠ 0.

The null hypothesis here says that the models differ only in the zero- and first-order terms; that is, the curvature is the same, but they have different intercepts and different slopes. So once you have a model, you can test many hypotheses covering many cases; a sketch of this nested sequence of tests is given below.
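Here is a minimal sketch of that nested testing sequence for the two-set quadratic model, again on made-up data; the names, data, and 5% level are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up data: two sets with a quadratic trend, z = 0 for set A, 1 for set B.
n = 40
x = rng.uniform(0, 5, n)
z = np.repeat([0.0, 1.0], n // 2)
y = 1.0 + 0.8 * x + 0.3 * x**2 + z * (0.5 + 0.2 * x) + rng.normal(0, 0.3, n)

def ss(X):
    """Return (SSReg about the mean, SSRes) for a least-squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ beta
    return np.sum((yhat - y.mean()) ** 2), np.sum((y - yhat) ** 2)

one = np.ones(n)
# Full model columns: beta0, beta1, beta11, alpha0, alpha1, alpha11.
X_full = np.column_stack([one, x, x**2, z, z * x, z * x**2])
restricted = {
    "(1) a0=a1=a11=0": np.column_stack([one, x, x**2]),
    "(2) a1=a11=0":    np.column_stack([one, x, x**2, z]),
    "(3) a11=0":       np.column_stack([one, x, x**2, z, z * x]),
}

ss_reg_full, ss_res_full = ss(X_full)
df_den = n - X_full.shape[1]
for name, X_r in restricted.items():
    df_num = X_full.shape[1] - X_r.shape[1]
    F = (ss_reg_full - ss(X_r)[0]) / df_num / (ss_res_full / df_den)
    print(name, "F =", round(F, 2),
          "crit =", round(stats.f.ppf(0.95, df_num, df_den), 2))
```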
So let me conclude. The dummy variable in regression analysis is used to incorporate qualitative information in the data. We understood that if you have, say, three blocks, then you need three dummy variables to fit a general model, and what is recommended is that if you have three sets of data, you should not go for a single straight-line fit for all the blocks together. You fit a general model involving dummy variables, and then you go for several tests: whether you can go for an identical straight-line fit for all the data, or parallel straight-line fits for the different blocks, and so on. Depending on the results of your hypothesis tests, you choose the final model. So I hope you understood the use of dummy variables in regression analysis: they are used to incorporate qualitative information available with the observations. That is all for today. Thank you very much.