Hello everyone, welcome to the session on regression analysis. Let us come to another example, so that you get some variety. Remember this particular example: you have a stall at an exhibition, with a number of salespersons, and based on the number of salespersons you can see the number of cars booked for the company. Ten samples are given to you: this is x and this is y. We fitted the regression line in the previous session and found the relationship between them.

Now suppose I would like to add one more independent variable, to illustrate the concept of multiple regression. Suppose we add another independent variable, space: the floor size of your exhibition. You might say that you buy a certain space for the exhibition, and that space also matters for the number of cars booked. So suppose you have put the space data in your Excel sheet, and you want to fit a line between cars booked and salespersons, x1, plus space, x2. Is there any relationship? If yes, what is it? Let us go to Excel.

Here I have prepared that for you. Here, as before, is the previous data set: the number of salespersons and the cars booked, and the corresponding analysis we did for simple linear regression. We found R square, the standard error, the overall p value, the p value for the individual-level t test, and the corresponding regression output. Now we add one more variable: we have added space, which we are treating as another explanatory or independent variable. So we have two variables now. Now let us run the
regression. I have already run it, but for your information let me delete this output. Go to Data, go to Data Analysis, and run the regression. Here you select the number of cars as your dependent variable, and now you select two independent variables; we have added space as another independent variable for the sake of illustration. Select the labels and the output cell; I have already done it here, but I am keeping it for your reference. Okay, let us run it. So you have now run the multiple regression with two independent variables, the number of salespersons and the space.

What happened? The analysis is the same as before. We will focus on the part where I have put a color. If you look here, R square is 82 percent, quite good. What was the previous R square for the single variable? 82 percent, so it is also quite good. You can also see the adjusted R square and the standard error, and the observations are as before, 10 observations. If you look at the overall F test from the other table, for the two independent variables, it is also significant. So there is a relationship, and you can fit a regression. But in that case your regression line will not be like this; it will now be y-hat = b0 + b1 x1 + b2 x2, because you have two independent variables. Let me write it like this and increase the font size. Sorry, that is too much.

So you can fit a line. For the overall test, the alternative is that at least one b_i is not equal to 0. That is established: the null is rejected, because we found that the overall test is significant. So overall, from the other table, the regression is settled. Now let us see whether the two
independent variables are both explaining the dependent variable or not; that is what we want to verify now. We have added two variables. Look at the number of salespersons; we can put a color here as well, it does not matter. You can see the two independent variables, their slope (coefficient) values, and the p values. This is the most important part today.

For this particular example, look at what happened to this p value; let me put a different color. It is not significant, because this p value is much greater than 0.05: it is 0.98. So it is saying that space is not significant, or that space cannot be considered as an independent variable to explain your sales, as per this data set only. This is the data we have put in column B, the space in square meters, say 10 square meters and so on, the space you got in the exhibition gallery. So you can see that only the salespersons are explaining the number of cars booked, the output variable y. Space is not related to the output, the number of cars booked, so you cannot consider space as an independent variable.

This is what you find from the individual t tests in the output. Here you can see that the number of salespersons is significant to your regression, that is, to your overall output, the number of cars sold. So you can say that b1 is not equal to 0, but for b2 we cannot reject that it equals 0. In that case, can you set your regression like this: y-hat (number of cars) = 73 + 11.29 x1 + (-0.06) x2? Is this your final regression line? The answer is no, because here you cannot include space as an
independent variable: it is not significant, its p value is quite high. This is the individual t test you need to do in multiple regression. If both have a low p value, that is, both are significant, then you can include both, fit your regression line, and make the corresponding forecast: for a given space and a given number of salespersons, what your predicted sales would be. But here you cannot include space as another explanatory or independent variable; the salespersons alone are sufficient. These are the additional or special cases of multiple regression that you need to keep in mind with your data.

So what is the summary? If you have many independent variables, you may not need to include all of them in your final regression. You have to see which are important, which are significant, and set up your final multiple regression accordingly. Clear?

Now come back to the PPT again; I will share one more interesting case with you. Let us see another example for your better understanding. Suppose you now have five independent variables. A candidate's performance, the monthly sales of a sales representative or a person at a sales counter, will depend on his aptitude test score during the interview process, his age, his anxiety score, his experience, and his academic score, the college GPA. These are the five variables you have included, and you have a good amount of data. If you want to set up a multiple linear regression, will all five variables come into your regression, or only a few of them? And given the data for a new candidate, what would be the predicted sales for that candidate? You need to predict that from the multiple
regression. Let us go to Excel, and you will see the outcome; we have already done it. You can see the data set here; let me increase the font size. Here the five variables are kept as they are, and the monthly sales are given. We will fit a regression the same way, considering all five. Ignore the green and orange colors for now; I will explain them later. First let us fit the regression. You take all five variables as your independent variables, select them in the regression tool, and run it. We have done it, and we found an R square of 90 percent, quite good. R square says that there is a strong relationship between the independent variables and the dependent variable.

Now look at the overall ANOVA test. The p value is very close to 0; I have reduced the scale, but you can see that it is not exactly 0, only close to 0, so it is significant. F is about 41, well above the critical value of roughly 2.5 to 2.9, so it is quite acceptable. So the overall ANOVA test says that there is a relationship between the independent variables and the dependent variable, and you can fit the regression.

But can you take all the independent variables in your data set? Look at the summary output and check the p values for these five independent variables. Ignore the intercept row, and ignore even the regression coefficient values; just look at the p values. Let me put a color here. Look here: the anxiety test score, the experience and the college GPA
are not explaining the dependent variable. If you check, the p values that are significant are well below 0.05. A value close to 0.05 you can also accept, allowing for a little fluctuation, and still fit the model. But if the p value is too high, you cannot consider that variable in your regression. Here it is 0.83, you cannot select it; here it is 0.99, you cannot; here it is 0.74, you cannot. You cannot select these three variables in your regression, so you cannot fit the regression by considering all of them.

Let me reduce the font size and show you where I have written this; come to the PPT. You cannot fit your regression line with the entire set: aptitude test score, age, anxiety, experience and college score. You cannot select all of them, only the significant ones. Which are they? Age and aptitude test score. These two are sufficient to set your regression line; the others you cannot select, because they are not significant. So this is the extension of multiple regression to larger data sets.

Now one more part I have to share with you. Look at this particular data set: here is the aptitude test score, here the age, here the anxiety score, here the experience, here the college academic score, the GPA. All of them have different units or different scales: here a value is 22, here it is 9, here it is 0 or 1, a binary number. Even the monthly sales could be in millions, in whatever currency. Since the data are on different scales or in different units, it is sometimes better to convert them into normalized data. So if
you do the data preprocessing at the beginning, reduce the scale, and normalize the data, then when you fit the regression you will get a similar regression output, but one that is better at describing the relationship and establishing your regression. That is what I have shown here with this data set, which is scaled down through the normalization process.

Before that we also did the exploratory data analysis. You can see that the monthly sales are explained heavily by the aptitude test score: it has a strong relationship with the output, the monthly sales. The age of the person also has a good correlation with the monthly sales. So both of them are significant; you can see the strong correlation. But look at the anxiety score, the only other one I have kept: look at the scatter, there is effectively no relationship. From the graph itself you can say that the anxiety score will perhaps not come into the final regression, because the data are so scattered that you cannot fit a regression effectively. So only these two will be sufficient. It is just one illustration, but exploratory data analysis has real merit.

Now the normalization. You can standardize each value, for example by subtracting the column mean and dividing by the standard deviation. If you do that calculation for all the columns and then run the regression, it gives an advantage, because you have scaled down the data and, to some extent, reduced the influence of any particular variable. Suppose one variable, x1, takes values of only 2 or 3, and another variable takes values like 5,000, 5 lakhs or 10 lakhs, so it has a high value, so it
might dominate the regression: the other coefficients might become less impactful, that is, less influential on the output, while x2 might influence the output too much. This type of situation can be avoided if you normalize the data for all the variables. To some extent it also helps manage multicollinearity, which we will discuss in other sessions, and it can help interpretability.

You can see here the data after I have normalized all of it; I will show it in Excel. We take these data into the regression and fit it. Also look at the individual variables and their R square values: if you fit each variable individually, none of them is really good, except age and aptitude test score, which show R square values of 45 percent and 63 percent. The rest are not good at all. In this way you can go through the variables one by one for your own analysis and see whether they really have a good relationship or not.

Once the exploratory data analysis is done on your side, you fit the regression. Here you can see the regression based on the normalized data. Because the data are normalized, the coefficients may change, and whatever coefficients you get are fine. Then look at the p values: the overall p value is significant for the overall test, and R square and adjusted R square are also good. But in the individual p values we see the same outcome: as per these data, anxiety score, experience and college GPA are not explaining the dependent variable effectively.
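The normalization step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the normalization used is the z-score (value minus column mean, divided by the sample standard deviation); the lecture performs this step in Excel, and the numbers below are hypothetical, not the lecture's data.

```python
from statistics import mean, stdev


def standardize(column):
    """Return z-scores: (value - column mean) / sample standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in column]


# Hypothetical values (NOT the lecture's data): two columns on very different scales.
age = [22, 25, 31, 28, 35, 40]
monthly_sales = [120000, 150000, 210000, 180000, 260000, 300000]

age_z = standardize(age)
sales_z = standardize(monthly_sales)

# After standardization both columns are unit-free, with mean ~0 and sd ~1.
for name, col in (("age", age_z), ("monthly_sales", sales_z)):
    print(name, [round(v, 2) for v in col])
```

After this transformation every column is centred on 0 with a standard deviation of 1, so no variable is large merely because of its unit.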
So exclude them. You have already fitted the regression, with the normalized data or the actual data, and all the independent (explanatory) variables have come in. Now, if you say that anxiety score, experience and college GPA will not come in as independent variables, you have to exclude them, because they are not explaining the dependent variable effectively: their p values are high. But you cannot simply delete them from the same regression line and keep the rest. You have to rerun the regression with these two independent variables only, because the final coefficients will change.

So let us go to Excel. You can see that we have done the final regression here without those variables; we have considered only aptitude score and age. Go to Data and look at the selection of the independent variables: we have selected only A and B... A, B and C, right. Here you select your output cells, from the normalized data or from the actual data, and your input cells are only the independent variables that are significant, not all of them. Then you select the labels and the output cell, run it, and you get the forecast like this. This is your final forecast, and you can do the corresponding analysis. The earlier output we will not use now, even if it is correct, because we have rerun the regression.

So this is the final sheet. You can see that everything is significant: the overall p value is significant, and both variables are significant, as I showed you in the PPT. You can also see the aptitude coefficient and the age coefficient, the slope values.
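To see why the regression has to be rerun rather than just trimming terms from the old equation, here is a minimal sketch in plain Python. It fits least squares through the normal equations, once with three predictors and once with two. The data are hypothetical (not the lecture's data set), the helper names `solve` and `ols` are made up for this illustration, and the choice of which variable to drop is simply assumed here rather than derived from p values.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no exactly collinear predictors).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares fit; returns [intercept, slope_1, slope_2, ...]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical candidate data (NOT the lecture's data set).
aptitude = [45, 52, 60, 48, 70, 65, 55, 75]
age = [30, 24, 35, 41, 28, 38, 26, 33]
anxiety = [7, 5, 6, 8, 3, 4, 6, 2]
sales = [40, 48, 58, 43, 72, 66, 52, 80]

full = ols([aptitude, age, anxiety], sales)   # all three predictors
reduced = ols([aptitude, age], sales)         # rerun without anxiety

print("full    (b0, aptitude, age, anxiety):", [round(c, 3) for c in full])
print("reduced (b0, aptitude, age):         ", [round(c, 3) for c in reduced])
```

Comparing the two printed lines, the intercept and the slopes of the retained variables generally differ between the two fits, which is why the final equation must come from the rerun and not from the original output with terms deleted.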
So this is the overall summary of multiple regression and of its different steps: reading the overall ANOVA test and its table, the overall R square, the adjusted R square and the standard error, and deciding how many independent variables to keep in your final regression by reading the individual p values.

In the next sessions we will discuss one more interesting part, multicollinearity: if the independent variables are correlated with each other, you also cannot select all of them. In our assumptions we considered that there is no multicollinearity among the independent variables. But if the data do have multicollinearity between the independent variables, can we keep all of them as they are? If the independent variables are related to one another, how do you reduce the impact: do you include them only partially, or exclude them? We will discuss that in a separate session on multicollinearity. Overall, this is the analysis of your multiple regression. The final output I was talking about is here: you have taken only aptitude test score and age, and this is the final forecast for the given data set. This is what the multiple regression
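As a small preview of the multicollinearity question raised above, one simple first look is the pairwise correlation between independent variables. The sketch below is an assumption-laden illustration, not the method of the upcoming session: the helper `pearson` is written for this example, the data are hypothetical, and pairwise correlation alone does not settle whether multicollinearity is a problem.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical predictor columns (NOT the lecture's data set).
predictors = {
    "aptitude": [45, 52, 60, 48, 70, 65, 55, 75],
    "age": [30, 24, 35, 41, 28, 38, 26, 33],
    "experience": [6, 2, 10, 15, 4, 12, 3, 8],
}

# Print the correlation for every pair of independent variables.
names = list(predictors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, "vs", b, round(pearson(predictors[a], predictors[b]), 3))
```

A value near +1 or -1 for a pair suggests the two variables carry overlapping information, which is the situation the multicollinearity session deals with.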