Today, we will start a new module called dummy variables, and here is the content of this module: dummy variables to separate blocks of data, and interaction terms involving dummy variables. Let me explain the objective of this module. What happens in regression analysis? In most cases we use quantitative variables, and these variables have a well-defined scale of measurement, for example weight, age, height, temperature, pressure, or income and expenditure. But occasionally it is necessary to use a qualitative variable, such as employment status (whether the i-th person is employed or unemployed), sex (male or female), marital status, or origin, like whether the observation is from Calcutta, Mumbai, or Delhi. These are called qualitative variables, and the objective of this module is to show how to incorporate such qualitative information in regression analysis. Let me give an example to illustrate qualitative information in a regression model. Consider the turkey data. Here the response variable y stands for the turkey weight in pounds and the regressor variable x stands for the turkey age in weeks. The first four observations are from Georgia, the next four are from Virginia, and the last five are from Wisconsin. I should mention that this data is from the book Applied Regression Analysis by Draper and Smith. So we have a response variable and a regressor variable, and we want to fit a relationship, say a simple straight-line model, between y and x. But the different origins of the turkeys may cause a problem. Let me explain this part.
Now, the problem is that there might be a significant difference in response level for the different origins, and we need to incorporate that information. You could fit a simple straight-line model for the first block, another for the second block, and a third for the last block, but that is not what we want: we want to fit a single model between x and y. On the other hand, if you pool all the data and fit a single model ignoring the origins, there could be a problem too; we will explain that part. So we want to fit a single model and, at the same time, incorporate the qualitative information we have, namely that the observations come from three different origins, between which there could be significant differences in response level. The nice idea here is how to incorporate the information that the data come from different origins while still fitting a single model. So let me talk about the model we are going to fit: dummy variables to separate blocks of data. Blocks could be different origins, different employment statuses (one set of data for employed persons, one for unemployed), or one set of data for males and one for females; this is what we mean by a block, and there could be a significant difference in response level between, say, males and females. We have to incorporate that into the model by using a dummy variable. So suppose we wish to introduce into a model the idea that there are two types of machines, say types A and B, that produce different levels of response, in addition to the variation that occurs due to the other regressors.
So, what I mean by this is that there could be two blocks of data, one for machine A and another for machine B, and there could be a significant difference in their response levels in addition to the variation due to the other regressor variables. How do we incorporate the information that one set of data is the production from machine A and the other set is the production from machine B? One way to do this is to add a dummy variable, say z. So consider the simple model with one regressor variable x and one dummy variable z; I should mention that this dummy variable can take the value either 0 or 1. Here is the model involving one regressor variable and one dummy variable: the response y equals beta naught plus beta 1 x, which is a simple linear regression model with one regressor, plus the dummy variable z with coefficient alpha, plus epsilon. So the model is y = beta naught + beta 1 x + alpha z + epsilon, where z = 0 if the observation is from machine A and z = 1 if the observation is from machine B. The model for the block A data, machine A, is therefore y = beta naught + beta 1 x + epsilon, because z = 0 for machine A, and for machine B the model is y = beta naught + beta 1 x + alpha + epsilon. But note that we are fitting a single model to all the observations together; I will explain this again later on. Now let beta naught hat, beta 1 hat, and alpha hat be the least squares estimates of beta naught, beta 1, and alpha respectively. Then the fitted model is y hat = beta naught hat + beta 1 hat x + alpha hat z, and the machine A data are estimated by setting z = 0.
So, the fitted model for machine A is y hat = beta naught hat + beta 1 hat x, and the machine B data are estimated by setting z = 1, so the fitted model for machine B is y hat = beta naught hat + beta 1 hat x + alpha hat. You can see that the fitted models for machine A and machine B are both straight lines, and both have the same slope beta 1 hat; the only difference is that machine A has a different intercept than machine B. For machine A the intercept is beta naught hat, whereas for machine B the intercept is beta naught hat + alpha hat. So what alpha hat does is simply estimate the difference in response level between machines A and B. This is a beautiful concept; you need to think a little to appreciate it fully. Now let me explain the technical part. You are given the model y = beta naught + beta 1 x + alpha z + epsilon, and I said that beta naught hat, beta 1 hat, and alpha hat are the least squares estimates of these parameters. Let me explain how to obtain them. Once you add the new variable, this becomes a multiple linear regression model, so you can write it as y = X beta + epsilon. In the X matrix, the first column corresponds to x naught, which equals 1 for all observations; the next column holds the regressor variable x (or there could be several regressor columns); and the last column is the dummy variable z. The dummy variable is 0 for the machine A rows, of which suppose there are n 1 observations, and 1 for the machine B rows. So this is the X matrix.
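The two-machine setup can be sketched in code. This is a minimal illustration with synthetic numbers (the lecture does not give actual machine data, so the values and coefficients below are hypothetical), fitting y = beta naught + beta 1 x + alpha z + epsilon by least squares:

```python
import numpy as np

# Synthetic data (hypothetical, not from the lecture): block A (machine A,
# z = 0) and block B (machine B, z = 1) share the same slope in x but
# differ in response level by a true alpha of 3.
rng = np.random.default_rng(0)
x = np.concatenate([np.linspace(1, 6, 6), np.linspace(1, 6, 6)])
z = np.concatenate([np.zeros(6), np.ones(6)])
y = 2.0 + 0.5 * x + 3.0 * z + rng.normal(0, 0.1, 12)

# Design matrix X = [x0 | x | z], with x0 = 1 for every observation
X = np.column_stack([np.ones_like(x), x, z])

# Least squares: beta_hat = (X'X)^{-1} X'y
b0, b1, a = np.linalg.solve(X.T @ X, X.T @ y)

print(f"machine A fit: y_hat = {b0:.2f} + {b1:.2f} x")
print(f"machine B fit: y_hat = {b0 + a:.2f} + {b1:.2f} x  (same slope, shifted intercept)")
```

As the lecture says, both fitted lines have the same slope b1; alpha hat is exactly the estimated shift in intercept between the two machines.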
The last rows are for machine B, and suppose there are n 2 such observations. So that is the X matrix, and y is simply the vector of observations on the response variable, y 1, y 2, up to y n1+n2. And what is beta? Beta is of course a vector: beta = (beta naught, beta 1, alpha). So this can be written as a multiple linear regression model y = X beta + epsilon, where X is this matrix, y is this vector, and beta is this vector of regression coefficients. You know how to estimate beta for a multiple linear regression model: beta hat = (X'X) inverse X'y. This is how we estimate beta naught hat, beta 1 hat, and alpha hat. Here I want to point out that for two blocks of observations you basically need one dummy variable taking the values 0 and 1; but in a sense x naught is also a dummy variable, because it always takes the value 1. So I will say that two sets of data, or two blocks, require two dummy variables including x naught: one is z and the other is x naught, where x naught = 1 always, and z = 0 for the first block (block A, machine A) and z = 1 for the second block (machine B). Now suppose that instead of two sets of data you have three, as in the turkey data example. How do we fit a single model for the turkey data, where there are three sets of data? Instead of three sets it could be any number of blocks, say r blocks; how do we fit a single regression model involving dummy variables to incorporate the qualitative information that the observations come from different blocks? Let me first talk about three blocks, and then we will talk about r blocks in general.
So, let me talk about how to handle three blocks; two blocks we have already done. How many dummy variables? Three dummy variables, meaning two real dummy variables, say z 1 and z 2, and of course x naught as the third. So how do we incorporate the information that there are three sets of data using the two dummy variables z 1 and z 2? You cannot set a single z equal to 0, 1, and 2 for the three blocks; a dummy variable takes only the values 0 and 1. Here is the idea: we use the two dummy variables z 1 and z 2, with (z 1, z 2) = (1, 0) for machine A (or block A), (0, 1) for machine B, and (0, 0) for machine C. The model would then be y = beta naught x naught + beta x + alpha 1 z 1 + alpha 2 z 2 + epsilon, where x is the regressor variable. So here we have three dummy variables including x naught: x naught, z 1, and z 2. As in the previous case, this can also be written in matrix form as y = X beta + epsilon. What is the X matrix here? The column for x naught is all 1s; then come the regressors (instead of one regressor there could be several, it does not matter, you just put them in the matrix); and then the columns for z 1 and z 2. For the first block, machine A, z 1 = 1 and z 2 = 0 in every row; for the second block, machine B, z 1 = 0 and z 2 = 1; and for block C, z 1 = 0 and z 2 = 0.
So, this is how you get the X matrix; the last rows are for machine C. And y is the vector of observations on the response variable, y 1 up to y n, while beta = (beta naught, beta, alpha 1, alpha 2). So you are all set to fit a multiple linear regression model, because you know X, beta, and y; everything is given here. The least squares estimate is beta hat = (X'X) inverse X'y, from which you compute beta naught hat, beta hat, alpha 1 hat, and alpha 2 hat. Suppose the fitted equation is y hat = beta naught hat + beta hat x + alpha 1 hat z 1 + alpha 2 hat z 2. So given three sets of data, we have fitted a single regression model. The machine A, or block A, data are estimated by setting (z 1, z 2) = (1, 0), so the fitted equation for block A is y hat = beta naught hat + beta hat x + alpha 1 hat. Similarly, the machine B data are estimated by setting (z 1, z 2) = (0, 1), so y hat = beta naught hat + beta hat x + alpha 2 hat, and the machine C data are estimated by setting (z 1, z 2) = (0, 0), so y hat = beta naught hat + beta hat x. So we have three sets of data, we fit a single linear regression model involving dummy variables to all the data, including blocks A, B, and C, and in the end there are three different fitted equations: one for block A, which differs from the fitted model for block B, which in turn differs from the fitted model for block C.
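The three-block coding can be sketched the same way. Again the data are synthetic (the lecture's machine values are hypothetical), with true offsets alpha 1 = 2 for block A and alpha 2 = -1 for block B, both relative to block C:

```python
import numpy as np

# Three blocks A, B, C coded with two dummy variables:
# (z1, z2) = (1, 0) for A, (0, 1) for B, (0, 0) for C.
rng = np.random.default_rng(1)
n = 5  # observations per block (synthetic)
x = np.tile(np.arange(1.0, 6.0), 3)
z1 = np.concatenate([np.ones(n), np.zeros(n), np.zeros(n)])
z2 = np.concatenate([np.zeros(n), np.ones(n), np.zeros(n)])
y = 1.0 + 0.4 * x + 2.0 * z1 - 1.0 * z2 + rng.normal(0, 0.1, 3 * n)

# X = [x0 | x | z1 | z2]; least squares as before
X = np.column_stack([np.ones(3 * n), x, z1, z2])
b0, b1, a1, a2 = np.linalg.solve(X.T @ X, X.T @ y)

# Three parallel fitted lines, one per block, recovered from one model:
print(f"block A: y_hat = {b0 + a1:.2f} + {b1:.2f} x")
print(f"block B: y_hat = {b0 + a2:.2f} + {b1:.2f} x")
print(f"block C: y_hat = {b0:.2f} + {b1:.2f} x")
```

A single fit produces the three block equations simply by plugging in the three (z 1, z 2) codes.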
So, we get three different fitted equations for the three blocks, with the same slope but different intercepts, and this is exactly what we wanted: a single model fitted to all the data, rather than separate models for separate blocks of data. Now, alpha 1 hat estimates the difference in response level between A and C, and similarly alpha 2 hat estimates the difference in response level between B and C. Then how do you estimate the difference in response level between A and B? That difference can be estimated by alpha 1 hat minus alpha 2 hat. So if, say, alpha 1 hat is large, you can say there is a significant difference in response level between A and C; but again, it is hard to say what I mean by a large alpha 1 hat, so we need a statistical test for the significance of alpha 1. What we will do is test the null hypothesis that alpha 1 equals 0 against the alternative hypothesis that alpha 1 is not equal to 0. If the null hypothesis is rejected, that means the alternative hypothesis that alpha 1 is not equal to 0 is accepted, which means there is a significant difference in response level between A and C. And we know how to test that: if desired, a t test can be performed to test the difference in response level between A and C.
So, we are formally going to test the null hypothesis H naught: alpha 1 = 0 against the alternative hypothesis H 1: alpha 1 not equal to 0, where alpha 1 is basically the difference in response level. I hope you understand what I mean by this difference in response level. Recall the turkey data example: one set of data is from Wisconsin, another set is from Georgia, and the response is the turkey weight. By the difference in response level between Georgia and Wisconsin, I mean whether there is a significant difference in the weights of the turkeys originating from Georgia and those originating from Wisconsin. That is what I mean by the difference in response level between two sets of data; for the turkey data it is the weight, and the question is whether the weights differ significantly for two different origins. To test this hypothesis we use a test statistic, which I am sure you can recall from the earlier modules on simple and multiple linear regression. The test statistic is t = alpha 1 hat / sqrt(C 33 MS residual), where C 33 is the third diagonal element of (X'X) inverse. Basically I am looking for the variance of alpha 1 hat: recall that the beta vector is (beta naught, beta, alpha 1, alpha 2), so in the variance-covariance matrix of beta hat, the third diagonal element is the variance of alpha 1 hat, and that is why I write the subscript 33.
So, I denote it by the subscript 33, and the critical region is: reject the null hypothesis if the modulus of t is greater than the tabulated value t alpha/2 at the alpha level of significance, with the residual degrees of freedom. I hope you remember all these things. This is how we test alpha 1, that is, whether there is a significant difference in response level between A and C. Similarly, we can test H naught: alpha 2 = 0 against H 1: alpha 2 not equal to 0, which concerns the difference in response level between B and C, that is, whether there is a significant difference in response level between the data from B and the data from C. The test statistic has the same form: t = alpha 2 hat / sqrt(C 44 MS residual), where alpha 2 is the fourth regression coefficient, so the variance of alpha 2 hat is given by the fourth diagonal element. The critical region is the same: reject if the modulus of t is greater than t alpha/2 with the residual degrees of freedom for the t distribution. So we are left with how to check the difference in response level between A and B. I said before that alpha 1 hat minus alpha 2 hat estimates the difference in response level between A and B, so we test H naught: alpha 1 minus alpha 2 = 0, which says there is no significant difference in response level between A and B, against the alternative hypothesis H 1: alpha 1 minus alpha 2 not equal to 0. Here also we use a t statistic: t = (alpha 1 hat minus alpha 2 hat) / sqrt(variance of (alpha 1 hat minus alpha 2 hat)).
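The three t tests can be sketched as follows. This is a hedged illustration on synthetic three-block data (the true offsets alpha 1 = 2 and alpha 2 = 0.5 are made up), using scipy only for the tabulated t value:

```python
import numpy as np
from scipy import stats

# Synthetic three-block data: (z1, z2) = (1,0) for A, (0,1) for B, (0,0) for C
rng = np.random.default_rng(2)
n = 6
x = np.tile(np.arange(1.0, 7.0), 3)
z1 = np.r_[np.ones(n), np.zeros(2 * n)]
z2 = np.r_[np.zeros(n), np.ones(n), np.zeros(n)]
y = 1.0 + 0.4 * x + 2.0 * z1 + 0.5 * z2 + rng.normal(0, 0.3, 3 * n)

X = np.column_stack([np.ones(3 * n), x, z1, z2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
df = len(y) - X.shape[1]           # residual degrees of freedom
ms_res = resid @ resid / df        # MS residual
cov_beta = ms_res * XtX_inv        # estimated Cov(beta_hat)

a1, a2 = beta_hat[2], beta_hat[3]

# H0: alpha1 = 0 (A vs C); variance of a1 is the (3,3) diagonal element
t1 = a1 / np.sqrt(cov_beta[2, 2])
# H0: alpha2 = 0 (B vs C); variance of a2 is the (4,4) diagonal element
t2 = a2 / np.sqrt(cov_beta[3, 3])
# H0: alpha1 - alpha2 = 0 (A vs B); note the covariance term
var_diff = cov_beta[2, 2] + cov_beta[3, 3] - 2 * cov_beta[2, 3]
t12 = (a1 - a2) / np.sqrt(var_diff)

t_crit = stats.t.ppf(0.975, df)    # two-sided test at the 5% level
for name, t in [("alpha1", t1), ("alpha2", t2), ("alpha1 - alpha2", t12)]:
    print(f"{name}: t = {t:.2f}, reject H0: {abs(t) > t_crit}")
```

The indices 2 and 3 here are the 0-based positions of alpha 1 and alpha 2 in beta hat, corresponding to the lecture's 33 and 44 diagonal elements.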
And this quantity you can compute, because the variance of (alpha 1 hat minus alpha 2 hat) equals the variance of alpha 1 hat plus the variance of alpha 2 hat minus twice the covariance of alpha 1 hat and alpha 2 hat. If you write down the variance-covariance matrix of beta hat, the variance of alpha 1 hat is the (3, 3) element, the variance of alpha 2 hat is the (4, 4) element, and the covariance between alpha 1 hat and alpha 2 hat is the (3, 4) element, so you can compute all of this. The critical region is the same: reject if the modulus of t is greater than t alpha/2 with the residual degrees of freedom; if the observed t is greater than the tabulated t, you reject the null hypothesis and can say there is a significant difference between the response levels of A and B. Well, I hope everything is clear, but I would like to illustrate the same thing for the turkey data. We have this data; as I explained, this is the age in weeks, this is the weight in pounds, and this is the origin. Now suppose you do not know the dummy variable technique and you want to find the relationship between x and y. First regress y against x, and here is the fitted model. Once you have the fitted model ignoring the blocks, compute the residuals. You can see there is something wrong with these residuals: for the first block of data the residuals are all negative; for the second block they are again all negative, though smaller overall compared to the first; and for the last block they are all positive. This indicates a significant difference in response level between the three sets of data, which implies that you need to introduce dummy variables, say z 1 and z 2, and fit the model.
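The residual diagnostic just described can be demonstrated in a few lines. Since the turkey values themselves are not reproduced here, the data below are synthetic, with three blocks at increasing response levels; fitting a single line while ignoring the blocks makes the residuals cluster by sign within each block:

```python
import numpy as np

# Synthetic data: three blocks with response levels shifted by 0, 1, 2
rng = np.random.default_rng(3)
x = np.tile(np.arange(1.0, 6.0), 3)
block = np.repeat(["A", "B", "C"], 5)
level = np.repeat([0.0, 1.0, 2.0], 5)          # shift in response level
y = 2.0 + 0.5 * x + level + rng.normal(0, 0.1, 15)

# Simple straight-line fit with no dummy variables
X = np.column_stack([np.ones(15), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b

# Residuals are systematically negative for the low block and positive
# for the high block -- the signal that dummy variables are needed
for lab in ["A", "B", "C"]:
    r = resid[block == lab]
    print(f"block {lab}: mean residual = {r.mean():+.2f}")
```

This mirrors the lecture's observation on the turkey residuals: block-wise runs of one sign point to a difference in response level that the single-line fit cannot absorb.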
So, considering the dummy variables z 1 and z 2, here is the model I just explained, and to clear up any doubt I have written down the X matrix. Here you see the X matrix: the first column is for the dummy variable x naught, which always equals 1. Here we have only one regressor, so you put the data for x in the next column; if you had x 1 and x 2, you would put both. Then come the dummy variables z 1 and z 2: for the first block they are (1, 0) in every row; for the second block, (0, 1); and for the last block, (0, 0). So this is the X matrix, and here is the y vector, the observations on the response variable, and you have four parameters: beta naught, beta, alpha 1, and alpha 2. You are all set to fit the multiple linear regression model y = X beta + epsilon. You do all the calculations, and here are the estimated parameters; the fitted equation shows beta naught hat, beta hat, alpha 1 hat, and alpha 2 hat. As I said, alpha 1 hat estimates the difference in response level between G and W, that is, between block A and block C, and alpha 2 hat estimates the difference in response level between V and W, that is, between B and C. And alpha 1 hat minus alpha 2 hat, which equals 0.27, estimates the difference in response level between G and V. Looking at this data, you cannot say whether this quantity, or the modulus of alpha 1 hat, is significantly big or not, so you have to go for a formal test. You test H naught: alpha 1 = 0 against H 1: alpha 1 not equal to 0; you compute the t statistic, and here is the t value, with 9 degrees of freedom. I hope you can recall why it is 9: there are 13 observations in total, so there are 13 residuals, and since there are 4 regression coefficients, there are 4 constraints on the residuals.
So, you cannot choose all the residuals independently; there are 4 restrictions. You have the freedom of choosing 9 residuals, and the remaining 4 must be chosen so that all 4 restrictions are satisfied. That is why the degrees of freedom is 9. The tabulated value is much smaller than the observed value, so we conclude that H naught: alpha 1 = 0 is rejected. That means the alternative is accepted, which means there is a significant difference in response level between the turkeys from Georgia and Wisconsin. Similarly, you test alpha 2 for the difference between V and W, and the test on alpha 1 minus alpha 2 gives you the difference in response level between G and V. And here you can see that if you take the fitted model and put (z 1, z 2) = (1, 0), you get the fitted model for Georgia; if you put (0, 1), you get the one for Virginia; and if you put (0, 0), you get the one for Wisconsin. If you plot them, you ultimately get three different fits: they are parallel, but they have different intercepts. And if you look at them carefully, you can identify the difference in response level between the turkeys originating from the three different origins. So we have to stop now. Thank you for your attention.