So, this is my second lecture on dummy variables, and here is the content of this module: dummy variables to separate blocks of data, and interaction terms involving dummy variables. Let me explain the objective of this module once more. The variables used in regression analysis are usually quantitative, with a well-defined scale of measurement. Examples of quantitative variables are temperature, pressure, income and expenditure. But occasionally we need to use qualitative variables in regression analysis: say the employment status of a person (employed or unemployed), marital status (married or unmarried), sex (male or female), or origin (the different cities from which the data have been collected). There can be a significant difference in response level between two sets of data. For example, suppose the qualitative variable is marital status, the response variable is expenditure per month, and the regressor variable is income per month. I am interested in the relationship between expenditure and income, but there could be a significant difference in the response level, here expenditure, between a set of married people and a set of unmarried people. So we cannot simply pool the two data sets together and fit a single straight-line model if there is a significant difference in response level. Let me once more explain the turkey data we considered in the last class, and using this example I will show what I mean by a significant difference in response level. Here is the data from the previous class: y is turkey weight in pounds, and x is age in weeks.
So, we have 13 observations, in three blocks: the first four data points originated from Georgia, the next four from Virginia, and the last five from Wisconsin. The response variable is the weight of the turkey in pounds. The question is whether we can fit a simple straight-line model between x and y when we have three sets of data. Well, first we try a simple linear regression model between x and y; the fitted model is nothing but the fit of y = beta_0 + beta_1 x. Once you have the fitted model you can find the residuals. We can see that the residuals for the first block are all negative, the residuals for the second block are all negative, and the residuals for the third block are all positive. What does this mean? A residual is the observed value y_i minus the fitted value, so positive residuals mean that the fitted response values for the third block are smaller than the observed data, and for the first two blocks it is just the opposite. If you now plot these residuals graphically, the pattern indicates that there is a significant difference in response level between the blocks. So we cannot use a simple straight-line model between the regressor and the response ignoring the origin, that is, ignoring the qualitative information that the data come from three different origins. This is where dummy variables come in: a dummy variable is used to incorporate qualitative information into the regression analysis.
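The residual diagnostic just described can be sketched numerically. This is a small Python illustration with hypothetical numbers, not the actual turkey data from the lecture: two blocks sit at one response level, a third sits higher, and a single pooled straight-line fit leaves the higher block with systematically larger residuals.

```python
import numpy as np

# Hypothetical data in the spirit of the turkey example (NOT the lecture's
# actual numbers): y = weight, x = age, with three origin blocks G, V, W.
x = np.array([20, 24, 28, 32, 21, 25, 29, 33, 20, 24, 26, 30, 32], float)
block = np.array(["G"] * 4 + ["V"] * 4 + ["W"] * 5)
y = 2 + 0.5 * x
y[block == "W"] += 2.0  # the W block sits at a higher response level

# Fit one pooled straight line y = b0 + b1*x, ignoring the origin
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Grouping the residuals by block shows W is systematically under-fitted
for b in ("G", "V", "W"):
    print(b, np.round(resid[block == b], 3))
```

The sign pattern of the grouped residuals is exactly the signal the lecture points to: the pooled line cannot serve all three blocks at once.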
Well, as I already said, we need to fit a model involving two dummy variables, because we have three sets of data. The model we want to fit is y = beta_0 + beta_1 x + alpha_1 z_1 + alpha_2 z_2 + epsilon, where z_1 is the first dummy variable and z_2 is the second. The whole model can be treated as a multiple linear regression model and written in matrix notation as y = X beta + epsilon, where y is the response vector and beta is the coefficient vector (beta_0, beta_1, alpha_1, alpha_2). In the design matrix X we put x_0, which is 1 for all observations, then the column for the regressor variable x, and then the dummy variables z_1 and z_2. For the first block z_1 = 1 and z_2 = 0, for the second block z_1 = 0 and z_2 = 1, and for the third block z_1 = z_2 = 0. So you have the X matrix and the y vector, and you know how to find beta hat; here are the estimated regression coefficients, and this is the fitted equation. As I said before, alpha_1 hat estimates the difference in response level between the first block and the last block, alpha_2 hat estimates the difference in response level between the second block and the last block, and alpha_1 hat minus alpha_2 hat estimates the difference in response level between the first block and the second block. Going back to the data: looking at the residuals, it is very clear that there is a significant difference between G and W, and that difference in response level is estimated by alpha_1 hat.
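Here is a sketch of how the design matrix with the two dummy columns is built and how beta hat recovers the level differences. The block labels G, V, W follow the lecture; the numbers themselves are hypothetical, chosen so the fit is exact and the coefficients are easy to read off.

```python
import numpy as np

# Hypothetical noise-free data: G and V at level 2, W shifted up by 2,
# all with common slope 0.5 (NOT the lecture's actual turkey numbers).
x = np.array([20, 24, 28, 32, 21, 25, 29, 33, 20, 24, 26, 30, 32], float)
block = np.array(["G"] * 4 + ["V"] * 4 + ["W"] * 5)
y = 2 + 0.5 * x
y[block == "W"] += 2.0

z1 = (block == "G").astype(float)  # dummy for the first block
z2 = (block == "V").astype(float)  # dummy for the second block

# Design matrix of y = b0 + b1*x + a1*z1 + a2*z2 + eps (W is the baseline)
X = np.column_stack([np.ones_like(x), x, z1, z2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, a1, a2 = beta_hat
# a1 estimates the G-vs-W level difference, a2 the V-vs-W difference,
# and a1 - a2 the G-vs-V difference, exactly as in the lecture.
print("b0=%.2f  b1=%.2f  a1=%.2f  a2=%.2f" % (b0, b1, a1, a2))
```

With this construction the fit recovers intercept 4 for the baseline block W, slope 0.5, and level differences a1 = a2 = -2, so a1 - a2 = 0 (no G-vs-V difference), mirroring the lecture's conclusion.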
There also appears to be a significant difference between the second and third blocks, V and W, which is estimated by alpha_2 hat, while intuitively it appears that there is no significant difference between the first and second blocks, G and V, which is estimated by alpha_1 hat minus alpha_2 hat. But we cannot say whether a difference is significant just by looking at the value. We need to test the hypothesis H_0: alpha_1 = 0 against alpha_1 not equal to 0, and this can be tested using the t statistic alpha_1 hat divided by the square root of MS_residual times the third diagonal element of (X'X)^(-1), where MS_residual is nothing but sigma squared hat, because sigma squared (X'X)^(-1) is the variance-covariance matrix of beta hat. You can see the observed value is 9.55, which is bigger than the tabulated value, with 9 degrees of freedom; I explained in the last class why it is 9. So the test is significant at the 1 percent level. What does it mean that the test is significant? H_0 is rejected and the alternative is accepted, meaning there is a significant difference in response level, that is, in turkey weight, between Georgia and Wisconsin. Now let us check whether there is a significant difference between Virginia and Wisconsin. For that we test alpha_2 = 0 against alpha_2 not equal to 0 with the same test statistic, only now using the fourth diagonal element of (X'X)^(-1). The observed value is 10.43 and the tabulated value is 3.25, so the test is significant; in other words, there is a significant difference in response level between V and W. As I told you, intuitively it appears that there is no significant difference between G and V, which is estimated by alpha_1 hat minus alpha_2 hat.
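The single-coefficient t-test just described can be sketched as follows. The 9.55 and 3.25 above come from the lecture's own data, which are not reproduced here; this sketch uses hypothetical noisy data, but the formula t = alpha_1 hat / sqrt(MS_res * (X'X)^(-1)_33 ) and the degrees of freedom n - p = 13 - 4 = 9 are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (not the lecture's), with noise so MS_residual > 0
x = np.array([20, 24, 28, 32, 21, 25, 29, 33, 20, 24, 26, 30, 32], float)
block = np.array(["G"] * 4 + ["V"] * 4 + ["W"] * 5)
y = 2 + 0.5 * x + rng.normal(0, 0.3, size=13)
y[block == "W"] += 2.0

X = np.column_stack([np.ones_like(x), x,
                     (block == "G").astype(float),
                     (block == "V").astype(float)])
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
ms_res = resid @ resid / (n - p)  # sigma^2 hat, with n - p = 13 - 4 = 9 df

# t statistic for H0: alpha_1 = 0 (coefficient index 2); its estimated
# variance is MS_res times the matching diagonal element of (X'X)^-1
t_a1 = beta_hat[2] / np.sqrt(ms_res * XtX_inv[2, 2])
df = n - p
print("t =", round(t_a1, 2), "df =", df)
```

The observed t is then compared against the tabulated t value with 9 degrees of freedom, exactly as in the lecture.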
Let us check whether there is a significant difference between these two or not. For that we test the hypothesis alpha_1 minus alpha_2 = 0 against the alternative that alpha_1 minus alpha_2 is not equal to 0. The observed t value is alpha_1 hat minus alpha_2 hat divided by the square root of the estimated variance of alpha_1 hat minus alpha_2 hat; you can check that this variance estimate is 0.217, and the observed t value is 1.4, which is less than the tabulated value. So the test is not significant; that means the difference in response level between G and V is not significant. To summarize what we have done so far: we were given three sets of data, we fitted a model involving one regressor variable and two dummy variables, and we tested whether there is a significant difference in response level between the different sets. We observed a significant difference between the first and third sets (Georgia and Wisconsin) and between the second and third sets (Virginia and Wisconsin), but no significant difference between the first and second sets. Since there are significant differences between two pairs of sets, we cannot use a simple straight-line model ignoring the origins, that is, ignoring the qualitative information we have; we need the model involving dummy variables. Here is the graphical representation. This is the fitted model we have seen before, and the response in the first set, Georgia, can be estimated from this fitted equation: what we have done is put z_1 = 1 and z_2 = 0 in the general model to get the fitted equation for the first set.
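The test of alpha_1 minus alpha_2 = 0 is a special case of testing a linear combination c'beta = 0, whose estimated variance is MS_res times c'(X'X)^(-1)c. A sketch with hypothetical data (the value 0.217 above is from the lecture's data, not reproduced here); the contrast vector c = (0, 0, 1, -1) picks out alpha_1 minus alpha_2.

```python
import numpy as np

def contrast_t(X, y, c):
    """t statistic and df for H0: c'beta = 0 in the model y = X beta + eps."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    ms_res = resid @ resid / (n - p)        # MS_residual = sigma^2 hat
    se = np.sqrt(ms_res * c @ XtX_inv @ c)  # sd of c'beta_hat
    return (c @ beta_hat) / se, n - p

# Hypothetical data (not the lecture's actual turkey numbers)
rng = np.random.default_rng(0)
x = np.array([20, 24, 28, 32, 21, 25, 29, 33, 20, 24, 26, 30, 32], float)
block = np.array(["G"] * 4 + ["V"] * 4 + ["W"] * 5)
y = 2 + 0.5 * x + rng.normal(0, 0.3, size=13)
y[block == "W"] += 2.0
X = np.column_stack([np.ones_like(x), x,
                     (block == "G").astype(float),
                     (block == "V").astype(float)])

# H0: alpha_1 - alpha_2 = 0, i.e. c = (0, 0, 1, -1)
t, df = contrast_t(X, y, np.array([0.0, 0.0, 1.0, -1.0]))
print("t =", round(t, 2), "df =", df)
```

Setting c to a unit vector recovers the single-coefficient t-tests from before, so one helper covers all three comparisons in the lecture.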
To get the fitted equation for the second set you put z_1 = 0 and z_2 = 1, and for the third set you put z_1 = z_2 = 0. You can see the graph of these three straight-line fits, and you must have noticed that all three fits have the same slope: they are basically the same model with different intercepts. Later in this class we will ask whether we can always use straight-line models with the same slope for all sets of data. That might not be appropriate for every data set, though here it seems to be fine; so we will consider the general case, in which we can think of different straight lines for different sets of data. Before that, let me discuss the general case of r blocks: suppose instead of three blocks you have r blocks; how many dummy variables do you need? In general we can deal with r blocks by introducing r - 1 dummies in addition to x_0. How do we assign their values for the first block, the second block, and so on up to the r-th block? Suppose the dummy variables are z_1, z_2, ..., z_{r-1}. For the first block we put z_1 = 1 and all the others equal to 0; for the second block z_2 = 1 and all the others 0; and so on, so that for the (r-1)-th block z_{r-1} = 1 and the rest are 0. Stacked together, these patterns are nothing but the identity matrix of order r - 1. For the final, r-th block we put all the dummies equal to 0. So we have assigned values for the r - 1 dummy variables; now we also include the variable x_0, which is equal to 1 for all blocks.
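The assignment rule for r blocks can be written as a small helper: the first r - 1 blocks get the rows of the identity matrix of order r - 1, and the last block gets all zeros. The function name and labels here are my own; this is just a sketch of the rule stated above.

```python
import numpy as np

def dummy_columns(block_labels, blocks):
    """Build r-1 dummy columns for r blocks; the last block is the baseline.

    Block j (j = 1, ..., r-1) gets the row pattern e_j (j-th row of
    the identity matrix of order r-1); the r-th block gets all zeros.
    """
    r = len(blocks)
    Z = np.zeros((len(block_labels), r - 1))
    for j, b in enumerate(blocks[:-1]):
        Z[np.asarray(block_labels) == b, j] = 1.0
    return Z

# 13 observations in 3 blocks, as in the turkey example's layout
labels = ["G"] * 4 + ["V"] * 4 + ["W"] * 5
Z = dummy_columns(labels, ["G", "V", "W"])
print(Z)
```

For r = 3 this reproduces the z_1, z_2 assignment used earlier: (1, 0) for the first block, (0, 1) for the second, (0, 0) for the third.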
So, this is how we assign values for r blocks, and we can deal with r blocks using r dummy variables counting x_0. What is special about this assignment? You must have noticed that the dummy-variable columns are linearly independent. You can think of other assignments of values, but the condition is that all these columns must be linearly independent, and they must also form a linearly independent set when combined with the column for the regressor variable. Perhaps you see why this must be true: you finally write the whole thing in matrix notation, y = X beta + epsilon, and the coefficient matrix X has to be of full rank. That is why the dummy-variable columns must be independent among themselves, and not only that: they must also be linearly independent of the regressor column. So for two blocks, what model do we fit? For two blocks we fit y = beta_0 + beta_1 x + alpha z + epsilon. Suppose the fitted model is y hat = beta_0 hat + beta_1 hat x + alpha hat z. Then block A data are estimated by setting z = 0, giving the fitted equation y hat = beta_0 hat + beta_1 hat x, and block B data are estimated by setting z = 1, giving y hat = (beta_0 hat + alpha hat) + beta_1 hat x. So what we are doing is fitting the same basic model, with different intercepts, to several sets of data.
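The full-rank requirement can be seen concretely. In this sketch (hypothetical values, my own construction), using one indicator column per block alongside x_0 makes X rank-deficient, because the indicators sum to the x_0 column; dropping one dummy, as in the lecture's assignment, restores full rank.

```python
import numpy as np

# Three blocks of three observations each, hypothetical regressor values
x  = np.array([20., 24., 28., 21., 25., 29., 22., 26., 30.])
g1 = np.repeat([1., 0., 0.], 3)  # indicator of block 1
g2 = np.repeat([0., 1., 0.], 3)  # indicator of block 2
g3 = np.repeat([0., 0., 1.], 3)  # indicator of block 3
x0 = np.ones(9)

# g1 + g2 + g3 = x0, so including all three indicators breaks full rank
X_bad  = np.column_stack([x0, x, g1, g2, g3])
# The lecture's scheme: r - 1 = 2 dummies, last block as baseline
X_good = np.column_stack([x0, x, g1, g2])

print("bad: rank", np.linalg.matrix_rank(X_bad), "of", X_bad.shape[1])
print("good: rank", np.linalg.matrix_rank(X_good), "of", X_good.shape[1])
```

A rank-deficient X means X'X is singular, so beta hat is not uniquely determined; that is exactly why only r - 1 dummies are used.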
So what we have learnt up to now is that we have been fitting the same basic model, with the same slope but different intercepts, to different sets of data. But this might not always be appropriate: different sets may require different straight-line fits. So next we will talk about a model which includes interaction terms involving dummy variables. We will be talking about the general case here: suppose we have two sets of data and we are planning to fit a straight-line model for each set, but the two sets might have two different straight lines, in the sense that they might have different slopes, different intercepts, and so on. Suppose A and B denote the two sets of data and we are considering fits involving straight lines. There are four possibilities. The first possibility, case A, is two distinct lines: beta_0 + beta_1 x for set A and gamma_0 + gamma_1 x for set B; this involves four parameters. Here I am talking about the situation where you have two sets of data and you fit two completely different lines, y = beta_0 + beta_1 x for set A and y = gamma_0 + gamma_1 x for set B. The second possibility, case B, is two parallel lines: the first line is beta_0 + beta_1 x for set A and the second is gamma_0 + beta_1 x for set B; they have the same slope, so here you need to estimate three parameters. This is the situation where my line for the first set is y = beta_0 + beta_1 x.
This line is for the first set A, and I am fitting another line with the same slope, parallel to it, y = gamma_0 + beta_1 x for set B. That is the second possibility. The third possibility, case C, is two lines with the same intercept: the first line is beta_0 + beta_1 x for set A and the second is beta_0 + gamma_1 x for set B; the intercept is the same but the slopes can differ, so this again involves three parameters. In the graph, one line is y = beta_0 + beta_1 x for set A, and the other line has the same intercept but a different slope, y = beta_0 + gamma_1 x, for set B. The fourth possibility, case D, is just one line: we fit the same straight line, y = beta_0 + beta_1 x, for both sets of data, which involves only two parameters, and here is the graph of that single line. So we have discussed all four possible situations for two sets of data with straight-line fits. Now we look for a single model, involving a dummy variable, which can take care of all four possibilities. Here is the model. We can take care of the four possibilities at once by choosing two dummies including x_0: basically one dummy z together with x_0, where z = 0 for block A and z = 1 for block B, and as usual x_0 is always equal to 1. Then the model is y = x_0 (beta_0 + beta_1 x) + z (alpha_0 + alpha_1 x) + epsilon, and this can be rewritten as follows.
Now, since x_0 is always equal to 1, I can write this as y = beta_0 + beta_1 x + alpha_0 z + alpha_1 z x + epsilon. This is the model, involving one dummy variable, which can take care of all four possibilities. Let me explain. One observation you can make is that this model contains not only z but also an interaction term, z x, involving z. This is the model we have decided on, and now if you put z = 0 you get the model for A, and by putting z = 1 you get the separate model for B. So the separate models for A and B are given by setting z = 0 and z = 1 respectively. Putting z = 0 gives y = beta_0 + beta_1 x, the model for A, and putting z = 1 gives y = (beta_0 + alpha_0) + (beta_1 + alpha_1) x for B, which is nothing but gamma_0 + gamma_1 x, the model for B. So given two data sets you can fit this general model and then test whether you really need two separate models. Now you can carry out several tests: whether you really need two different straight-line fits for the two sets, or whether two parallel lines are enough, or a single line, or two lines with the same intercept; all four possibilities can now be checked. To test whether two parallel lines will do, that is, to test the appropriateness of case B, we would fit this model first; let me call it model (*). You consider the general model, fit it first, and then test whether two parallel lines are enough or not.
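Here is a sketch of the interaction model with hypothetical numbers: two sets lying on genuinely different lines, one fitted model y = beta_0 + beta_1 x + alpha_0 z + alpha_1 z x, and the two separate lines recovered by setting z = 0 and z = 1.

```python
import numpy as np

# Hypothetical noise-free data: set A on y = 1 + 2x, set B on y = 3 + 0.5x
xA = np.array([1., 2., 3., 4., 5.]); yA = 1 + 2 * xA
xB = np.array([1., 2., 3., 4., 5.]); yB = 3 + 0.5 * xB

x = np.concatenate([xA, xB])
y = np.concatenate([yA, yB])
z = np.concatenate([np.zeros(5), np.ones(5)])  # z = 0 for A, 1 for B

# Single model with the dummy AND its interaction with x:
#   y = b0 + b1*x + a0*z + a1*(z*x) + eps
X = np.column_stack([np.ones(10), x, z, z * x])
b0, b1, a0, a1 = np.linalg.lstsq(X, y, rcond=None)[0]

print("line A: intercept %.2f slope %.2f" % (b0, b1))            # z = 0
print("line B: intercept %.2f slope %.2f" % (b0 + a0, b1 + a1))  # z = 1
```

One fit yields both lines: (b0, b1) for A, and (b0 + a0, b1 + a1) for B, the gamma_0 and gamma_1 of the lecture's notation.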
So, first we would fit (*) and then test a hypothesis H_0. Which hypothesis do you need to test? To test whether the lines are parallel, you test whether alpha_1 = 0: H_0: alpha_1 = 0 against the alternative H_1: alpha_1 not equal to 0. I hope you know how to test this; I will give some more examples later on. This is the test for the second case, two parallel lines, and if this null hypothesis is rejected, it means you cannot use two parallel lines: you have to fit two different lines, or consider the other cases. Next, to test the appropriateness of case C, two lines with the same intercept, we would again first fit the general model (*) and then test whether the lines have the same intercept but possibly different slopes: you test H_0: alpha_0 = 0 against the alternative H_1: alpha_0 not equal to 0. And finally, to test the appropriateness of case D, the identical model for both sets, we have to test whether alpha_0 = alpha_1 = 0, since then both models are the same: we would test H_0: alpha_0 = alpha_1 = 0 against the alternative H_1 that H_0 is not true. Well, let me summarize this part.
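The joint hypothesis for case D, alpha_0 = alpha_1 = 0, involves two restrictions at once, so it is commonly tested with an extra-sum-of-squares F test comparing the general model (*) to the one-line fit; the lecture only states the hypothesis, so this particular test construction is a standard approach, sketched with hypothetical data.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

# Hypothetical data: two genuinely distinct lines plus a little noise
rng = np.random.default_rng(1)
x = np.tile(np.arange(1.0, 7.0), 2)
z = np.repeat([0.0, 1.0], 6)  # set A then set B
y = 1 + 2 * x + z * (2 - 1.5 * x) + rng.normal(0, 0.2, 12)

X_full = np.column_stack([np.ones(12), x, z, z * x])  # general model (*)
X_one  = np.column_stack([np.ones(12), x])            # case D: one line

# Extra-sum-of-squares F test of H0: alpha_0 = alpha_1 = 0 (q = 2)
n, p, q = 12, 4, 2
F = ((sse(X_one, y) - sse(X_full, y)) / q) / (sse(X_full, y) / (n - p))
print("F =", round(F, 1), "df = (%d, %d)" % (q, n - p))
```

A large F, compared with the tabulated F value on (2, n - 4) degrees of freedom, rejects the single-line model; the one-restriction tests for cases B and C remain ordinary t-tests on alpha_1 and alpha_0.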
So now we are given two sets of data, and we have a general model, involving the dummy variable and its interaction term, which covers all four possibilities. You fit this general model first for your data, and then you carry out several tests. You test the appropriateness of case B: if H_0 is rejected, it means the lines are not parallel, so you cannot use two parallel lines for the two sets of data. Next you check the appropriateness of case C: if that null hypothesis is rejected, you cannot use two straight lines with the same intercept. And if the last case, case D, is also rejected, you cannot fit the same straight line to both data sets. So you fit the general model and test the appropriateness of cases B, C and D; if all of them are rejected, you go for two distinct straight-line fits for the two sets of data. This is how we have generalized the two-set case. Now, instead of two sets, suppose we have three sets of data, say A, B and C, and we are again considering straight-line models; perhaps I should not start this now because I do not have time today. So in the next class I will talk about how to fit a general model for three sets of data, a model which covers all the possibilities. Thank you very much.