Hello everyone, welcome to this session on regression analysis. In the previous session we discussed simple linear regression and the different measures of goodness of fit. Today we will extend the concept of simple linear regression to multiple linear regression. Previously we covered the steps of simple linear regression, how to estimate the coefficients by least squares, and the measures of goodness of fit such as R square, the standard error, the prediction interval and so on. So, let us first recall what we did in simple linear regression. Today's topic, multiple regression, is just an extension of it; the fundamental concepts remain the same. If you have a data set with one independent variable x and one dependent variable y, then based on the sample data you can fit a regression of the form y = a + bx (or alpha plus beta x). You calculate the slope coefficient and the intercept through the least squares method, as we discussed in the previous session, and then you fit the line. That is simple linear regression. We also calculated the different measures of goodness of fit, which are very important in regression analysis. The coefficient of determination, R square, is the square of the correlation coefficient; technically, R square is the explained variation divided by the total variation. That is, the variation captured by the fitted line is the explained variation, and the spread of all the points around the mean, taken together, gives the total variation. As we discussed in the previous session, R square is nothing but SSR divided by SST.
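As a quick numeric sketch of the R square idea above (explained variation over total variation), the computation can be reproduced in a few lines of Python; the small data set is hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical sample (x: hours of training, y: sales) -- illustrative only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of y = a + b*x (np.polyfit returns [slope, intercept])
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # unexplained (residual) variation, SSE
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation, SST
r_squared = 1 - ss_res / ss_tot        # equivalent to SSR / SST
```

Because SST = SSR + SSE for a least-squares fit with an intercept, computing 1 - SSE/SST gives the same number as SSR/SST.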
R square generally ranges from 0 to 1. The correlation coefficient r can be anywhere from -1 to +1 (it is -1 for a perfectly negative relationship), but once you square it the result lies between 0 and 1. The higher the R square, the stronger the relationship between the independent variable and the dependent variable. We also discussed the standard error, which measures the variation of the data around the predicted values: how much deviation you can expect in future predictions. The standard error also helps in building a prediction interval, or confidence interval, around your estimates. For the details, you can refer to the previous session on regression and the measures of goodness of fit. Let me also recall the difference between R square and the standard error. R square is computed on the past data: it quantifies the strength of the relationship, that is, how well the independent variable explains the dependent variable, based on the historical sample. The standard error, on the other hand, gives you confidence about the future: once the regression line is established and you feed in a new value of the explanatory variable, the output you get is only an estimated value.
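The standard error and the rough 2-sigma band just described can be sketched as follows; the data and the new input value are made up for illustration, and the plus-or-minus-2-standard-errors rule is the large-sample approximation discussed later in this session:

```python
import numpy as np

# Hypothetical simple-regression data -- illustrative only
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
y = np.array([25.0, 44.0, 68.0, 85.0, 108.0, 126.0])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

n = len(x)
# Standard error of the estimate: typical deviation of y from the fitted line.
# n - 2 degrees of freedom because two coefficients (a, b) were estimated.
std_err = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

# Rough large-sample ~95% band around a new prediction (2-sigma rule of thumb)
x_new = 35.0
pred = a + b * x_new
low, high = pred - 2 * std_err, pred + 2 * std_err
```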
The standard error and the prediction interval together give you the confidence that the predicted value will fall inside a band around the estimate, at whatever confidence level you choose: 68 percent, 95 percent, 99 percent, as we discussed in the previous session. So, that is the summary of the previous two sessions on regression analysis. Now let us enter multiple regression. One more recap: in the example we illustrated for simple linear regression, we fitted a line with a certain intercept and slope, obtained the R square and adjusted R square values, the standard error, the number of observations, and the overall F test value with its corresponding p-value, which we found to be less than 0.05. So the overall regression was significant. At the individual level, since there was only a single variable, there was not much more to check, but the corresponding t test also gave p less than 0.05. So we could say the regression was well established, with a good R square and standard error, and from it we forecast the sales for 10 new salespersons. That illustration we also carried out in Excel. Now, let us move on to multiple regression. Remember the basic assumptions we discussed for simple linear regression: linearity, that is, a linear relationship between the independent variable and the dependent variable; independence of the observations; and normally distributed errors.
We also discussed equal variance, or homoscedasticity: the spread of the errors, that is, the distribution of the y values for a given value of x, should stay roughly constant. There should not be too much variation in the y values for a given x; the variation stays within a limit, so the variance of the errors is constant. Since we assume this constant, equal variance, it is called the assumption of homoscedasticity. All four of these assumptions carry over to multiple regression as well. Now, one more assumption is required for multiple regression: no multicollinearity. The independent variables should not have strong relationships among each other; they should be as independent of one another as possible. If there is a little multicollinearity, a small correlation between them, we can accept it and still fit the multiple regression line. But if there is high correlation between the independent variables, you cannot accept that regression; you have to reduce the multicollinearity first. There are concepts and methods for detecting and handling multicollinearity; we will take a separate session to discuss it along with other miscellaneous topics in regression analysis. For the time being, just note this additional assumption: there should be no multicollinearity among the independent variables. So, these are the five major assumptions for multiple linear regression. Keep them in mind as we enter multiple regression. In simple regression we fitted y = a + bx: one independent variable, one dependent variable, and a fitted line. Now suppose you have more than one independent variable.
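As a small preview of the multicollinearity check (covered in detail in a later session), here is a hedged sketch using the variance inflation factor (VIF), which is not named in this session but is the standard diagnostic: with exactly two predictors, VIF reduces to 1 / (1 - r squared), where r is the correlation between the two predictors. The data below are hypothetical and deliberately correlated:

```python
import numpy as np

# Hypothetical predictors: age and experience (deliberately correlated)
x1 = np.array([25, 30, 35, 40, 45, 50], dtype=float)   # age
x2 = np.array([2, 6, 10, 15, 19, 24], dtype=float)     # experience

# With two predictors the variance inflation factor reduces to 1 / (1 - r^2),
# where r is the correlation between the two predictors.
r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)

# A common rule of thumb: VIF above roughly 5-10 signals problematic
# multicollinearity, meaning the predictors are not independent enough.
```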
Look here: with independent variables x1, x2, ..., xn, the general form of the multiple regression line, or trend line, is y = a + b1 x1 + b2 x2 + ... + bn xn, where the b's are the least squares coefficients. To calculate them, you set up a series of equations through the least squares method: take partial derivatives, set them equal to zero, rearrange, and solve the resulting system of equations. That gives you the coefficients of the multiple regression, and once you have those coefficient values from the data set, you can fit your regression line. Here a is the intercept, the same as before. Remember that in two dimensions you can see the line: from the data you can see the relationship, with a as the intercept and b as the slope. That is your simple linear regression, and you can picture the graph. With two independent variables you can still think of a 3D surface, but with n independent variables it becomes, in effect, an n-dimensional picture. In that case the fitted function is called a hyperplane, because with many independent variables you get an n-dimensional object that you cannot visualize; you have to understand it theoretically and carry on with the calculation. So overall, the fitted surface is nothing but a hyperplane; remember that. The coefficients are again calculated by least squares, and the fitted function gives the predicted value of the dependent variable. Since this is nothing but an extension of simple linear regression, we are not going to focus on the hand calculation of the regression coefficients here.
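The least-squares solution of that system of equations can be sketched numerically: stacking a column of ones with the predictors and solving the least-squares problem recovers the intercept and slopes in one step. The data below are hypothetical, constructed so that y = 1 + 2*x1 + 1*x2 exactly:

```python
import numpy as np

# Hypothetical data: two predictors and one response, built so that
# y = 1 + 2*x1 + 1*x2 holds exactly (so the solution is known)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# Add a column of ones so the first coefficient is the intercept a
A = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares system (the normal equations) for [a, b1, b2]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coeffs
```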
Instead, we would like to look at the application and an illustration in Excel. Now let us understand the multiple regression concept with a numerical example. Suppose you have two independent variables, x1 and x2, and one dependent variable, salary. Age is one independent variable and experience is the other, and you have captured past instances as combinations (x1, x2, y): one data set of such samples. These samples will help you build a relationship between the independent variables x1, x2 and y. In that case the output, salary, depends on two variables, not just one: the person's age and their experience. If you have this sample data set and you fit the multiple regression y = a + b1 x1 + b2 x2 (there is an error term as well, but we are not focusing on it), that fitted equation is the multiple regression: the left-hand side is the estimated, predicted value and b1, b2 are the multiple regression coefficients. You can calculate them by least squares, but we will go to Excel and read off the values of a, b1 and b2, as well as the best fitted line for this data set. Look here: salary depends on age and experience, so both together explain the dependent variable, not age alone or experience alone. And what coefficient of determination do you get? Earlier you had only one variable; now you have two independent variables, and the multiple regression measures the explained relationship between the dependent variable and both independent variables together. So that R square is a combination of both variables.
Let us see how it works. For a given input, say age x1 = 40 and experience x2 = 14 years, what would the predicted y value be? Let us see using Excel. So, here we are in Excel; let me increase the font size so the data set is clearly visible. In the data set we have taken age and experience as the two independent variables and salary as the dependent variable. We know the procedure from the last session: once you install the Analysis ToolPak, you go to Data, then Data Analysis, select Regression, and click OK. Let me rerun it: for the y input I select the salary column as the dependent variable, and for the x input I now select both columns, age and experience, as the two independent variables. Since we are including the first row (the headers), I tick Labels, and for the output I select a cell; I have already coloured an output cell to save time. The same result will come, and we can see the summary output of the multiple regression, just like before. Look at the output: the intercept is there as before, and now look at the age and experience slopes. These are the coefficients of the two independent, or explanatory, variables: roughly 99 and 2,162. And if you look at the R square value: about 97 percent, which is very good.
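The same fit-and-predict workflow can be sketched outside Excel. The actual Excel values from the lecture are not reproduced here; the salary data below are a made-up stand-in built from an assumed relationship, so the prediction for age 40 and experience 14 is known in advance:

```python
import numpy as np

# Hypothetical stand-in for the lecture's salary data (the actual Excel
# values are not reproduced here)
age = np.array([25, 28, 32, 35, 38, 41, 45, 48, 52, 55], dtype=float)
exp = np.array([1, 3, 5, 8, 10, 13, 16, 20, 24, 28], dtype=float)
salary = 5000 + 100 * age + 2000 * exp   # assumed underlying relationship

# Fit salary = a + b1*age + b2*experience by least squares
A = np.column_stack([np.ones(len(age)), age, exp])
(a, b1, b2), *_ = np.linalg.lstsq(A, salary, rcond=None)

# Predicted salary for a 40-year-old with 14 years of experience,
# exactly the y-hat = a + b1*x1 + b2*x2 formula used in the lecture
y_hat = a + b1 * 40 + b2 * 14
```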
So the R square of the regression is very good, and the adjusted R square is, as I mentioned in the last session, the measure people actually prefer. The reason is that if you keep adding independent variables, or change the sample size, the plain R square can only stay the same or go up; the adjusted R square penalizes the model for each additional variable. The software performs that adjustment for you: with more independent variables, similar data, or a larger sample you might see a better plain R square, but the adjusted R square discounts it, and it will always be equal to or less than the actual R square. So the adjusted R square is the more reliable of the two; you can select whichever suits your decision-making process. Here we will consider the R square, or say the adjusted R square. The standard error is as I described: the deviation from your prediction, the gap on the upside and the downside. Adding 1 standard error on each side gives roughly a 68 percent interval, 2 standard errors roughly 95 percent; but this rule works only if the sample size is above about 25 or 30. If the sample size is small, say 5, 10 or 15, you cannot build the prediction interval directly with 1 or 2 standard errors; you have to go to the t table, pick the appropriate t value, and do the corresponding calculation. That too we discussed in the simple linear regression session.
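The penalty described above can be made precise with the standard adjustment formula, adjusted R square = 1 - (1 - R square)(n - 1)/(n - k - 1); this formula is not spelled out in the lecture, but it is the standard definition Excel uses. Plugging in lecture-style figures (R square of 0.97, n = 20 observations, k = 2 predictors):

```python
# Adjusted R-square from R-square, sample size n, and number of predictors k.
# It penalizes extra predictors, so it is always <= the plain R-square.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Lecture-style figures: n = 20 observations, k = 2 predictors
r2 = 0.97
adj = adjusted_r2(r2, 20, 2)
```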
So, these are the measures of goodness of fit; they tell you whether the model is good or bad. Now let us come to the overall analysis and the regression coefficient analysis. I have coloured the relevant cells, so it will be easier for us to read the output. First, the prediction: what is the predicted salary for an age of 40 and a work experience of 14 years? Let me write it here (it is already in the PPT): it is nothing but the intercept a, plus b1, the coefficient of age, multiplied by the new input for age, plus b2, the coefficient of experience, multiplied by the input for experience. That gives y-hat, the new forecast: the estimated, predicted value of salary. This is your regression outcome. Now, we have discussed the R square, the adjusted R square, and the standard error; also look at the observations: you have 20 samples in total. And if you look at the overall ANOVA table (let me use the pen), it tells you the overall strength of your regression. Reading the ANOVA table: for the degrees of freedom, you have 2 independent variables in total.
So the regression has 2 degrees of freedom, and the residual degrees of freedom are n - k - 1, which gives 17; the total degrees of freedom come to 19. Then you have the sum of squares for the regression, for the residuals, and the overall total, and the mean squares for each. Taking the ratio of the mean squares gives the F value: look, F is 366, which is quite good. The higher the F, the stronger the causal relationship you have established; if F is low, say less than about 2 or 2.5, we say there is no strong relationship. This is the F test of the overall regression, whether simple linear or multiple linear regression. Once you have F, the ratio of the mean square of the regression to the mean square of the residuals, you get the corresponding p-value. Let me make the outcome clear: look at the Significance F value; this is the p-value of the overall F test. Here it is of the order of 10 to the power minus 14, so it is far less than 0.05. Therefore you can accept the overall regression: the overall regression is established (I will say more about this with hypothesis testing shortly), which means you can build the relationship, and then move to the individual variable level.
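The ANOVA building blocks just read off the Excel table can be recomputed directly: split the variation into regression and residual parts, divide each by its degrees of freedom, and take the ratio. The data below are hypothetical stand-ins (not the Excel values), generated with the same shape as the lecture's example, n = 20 and k = 2:

```python
import numpy as np

# Hypothetical data with the lecture's shape: n = 20 samples, k = 2 predictors
rng = np.random.default_rng(0)
n, k = 20, 2
x1 = rng.uniform(25, 55, n)                      # age-like predictor
x2 = rng.uniform(1, 25, n)                       # experience-like predictor
y = 5000 + 100 * x1 + 2000 * x2 + rng.normal(0, 500, n)

A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
msr = ssr / k                           # mean square, regression df = k = 2
mse = sse / (n - k - 1)                 # mean square, residual df = 17
f_stat = msr / mse                      # large F => overall regression significant
```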
At the individual level, look at the p-values. You do not need to examine the intercept row; look at the two rows for age and experience. There you can see the standard errors of the coefficients, the t values from the t test, and the p-values. Look: one is less than 0.05, and the other is of the order of 10 to the power minus 14, which is also far less than 0.05. So both variables, age and experience, explain the dependent variable quite effectively: both are significant. Now, let us see how the overall ANOVA, the overall F test, establishes the strength of the regression. Let me delete this and set it up properly. Suppose you develop a hypothesis for the overall F test and the corresponding ANOVA table: is there really a relationship between the independent variables and the dependent variable, and do you need the regression line at all? You can establish that by setting up hypotheses. For the null hypothesis you assume all the coefficients are zero: here, with two variables, H0: b1 = b2 = 0. That is, under the null hypothesis the coefficients b1 and b2 in the fitted equation do not exist; there is no relationship. The alternative hypothesis is that at least one b_i is not equal to 0.
Those are the two hypotheses. Now, through the F test you calculate your p-value. From the analysis, the regression table gave us a value of about 1 point something times 10 to the power minus 14, which is far less than 0.05. Since it is less than 0.05, we conclude that the null hypothesis is rejected and at least one regression coefficient is non-zero; eventually, here we found that both are non-zero. Therefore, from the ANOVA table and the overall F test, you conclude that the regression is significant, so you can fit a regression line. That is the first step: the overall test. Now let us come to the individual variable level. From the overall test you established that at least one coefficient, one slope, is non-zero, because the p-value of the overall test is far less than 0.05, so you can say there is a relationship. But now you come to the individual variable level, because not all variables may actually be linked to the output variable; there might be only a couple, so we have to check each one individually as well. Here you do a similar test, but it is a t test, because you are testing a single coefficient at a time and the errors around the fitted values are assumed normally distributed. So we will use a simple t test at the individual variable level. In that case, suppose we take each variable one by one.
So, for age, say, your null hypothesis is b1 = 0 and the alternative is b1 not equal to 0, and you check the p-value. Look at this p-value, 0.0209, which is less than 0.05: you reject the null hypothesis and accept the alternative. So age is significant. Now come to experience. There also you propose the hypotheses: H0: b2 = 0 and the alternative b2 not equal to 0. (In the overall test it was "at least one", because there may be many variables, but here you are testing them one by one.) Checking the p-value here, it is very small: about 0.0003, which is less than 0.05. So again the null hypothesis is rejected and the alternative is accepted. That means both variables explain the dependent variable, and in the output you can also see the 95 percent confidence intervals for the coefficients; overall the p-values are quite significant. So the final summary: there is a relationship between the independent variables and the dependent variable. From the ANOVA table with the overall F test, and from the individual-level t tests, we conclude that there is a relationship between age and salary and between experience and salary, and we have fitted the line. This is what multiple regression is.
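The individual t tests above can be sketched numerically: each t statistic is the coefficient divided by its standard error, where the standard errors come from the diagonal of mse times the inverse of X-transpose-X. The data below are hypothetical, in the spirit of the lecture's age/experience example, not the Excel values:

```python
import numpy as np

# Hypothetical age/experience/salary data -- illustrative only
rng = np.random.default_rng(1)
n = 20
age = rng.uniform(25, 55, n)
exp = rng.uniform(1, 25, n)
salary = 5000 + 100 * age + 2000 * exp + rng.normal(0, 500, n)

X = np.column_stack([np.ones(n), age, exp])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

resid = salary - X @ b
mse = resid @ resid / (n - X.shape[1])          # residual df = n - k - 1 = 17
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
t_stats = b / se
# Compare |t| for each slope against the t table with 17 df (about 2.11 at
# the 5% level); rows 1 and 2 are the age and experience coefficients.
```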