So, before we go to R square, one basic concept you should know is the correlation coefficient; as I told you in regression, the correlation coefficient fits very well here also. Look at these four data sets; I am just giving a summary of correlation between x and y. In graph A, if you fit a regression and look at the relationship, the points are perfectly correlated, so R equals exactly 1. In graph B the data are scattered here and there; there is a positive correlation and your regression line is quite good, but R lies between 0 and 1. Look at graph C: it says there is no correlation between x and y, so the correlation is 0, and you cannot fit a good model here because there is no relationship in the data. Remember, we will link all this with R square; both definitions of R square we have discussed will be illustrated now, and before I go to R square I am giving the background through the Pearson correlation coefficient. Now look at the fourth graph, D: it is negatively correlated, meaning if you increase x your y decreases, but the points fall exactly on the line, so it is perfectly correlated. So graph A is positively and perfectly correlated, graph D is negatively but perfectly correlated, and graph B is positively correlated, not fully perfect, but still a good relationship for the regression you have built. This is the correlation you have between x and y.
So, that is the essence. For the negative example, you might sense that R will not always be exactly minus 1; a couple of points could be scattered like this, and in that case too you can fit a regression and find a relationship. For example, during corona, if y represents the current number of corona cases and x represents the vaccination drive, then if you increase vaccinations every day, the number of corona cases will probably fall; that is negatively correlated. Positively correlated examples are plentiful in practice, but this is one example of negative correlation; it may not be fully negative, but it is strongly negatively correlated, and there is a relationship. So, these are the four aspects of the correlation coefficient, and this is its formula; using Excel or Python you can compute it. In simple regression analysis, if you take its square, it becomes R square, which also represents the measure of strength: how much of the relationship between the independent variable and the dependent variable is explained. So you get to know the strength: here R square equals 1 because all the data fall on the line; here also R square equals 1 even though R is minus 1; here you might get R square of, say, 85 or 95 percent, which is quite good; and here you might get R square equal to 0.
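As a quick sketch of what is described above, the snippet below computes the Pearson correlation coefficient with NumPy and squares it to get R square for a simple regression. The x and y values are made up for illustration; they are not the lecture's data.

```python
import numpy as np

# Hypothetical illustration: x = independent variable, y = dependent variable.
# (These numbers are made up; they are not from the lecture's data set.)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Pearson correlation coefficient r between x and y
r = np.corrcoef(x, y)[0, 1]

# In simple linear regression, R^2 is just r squared
r_squared = r ** 2

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

Since these points lie almost exactly on a straight upward line, r comes out very close to 1, matching the "graph A" case in the lecture.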
So, this is R square, also called the coefficient of determination: the strength of the regression between the independent variable and the dependent variable, and it can be derived from the correlation coefficient. Here I have derived it. For the sales and advertisement data set, I put in the data and found r = 0.9; the correlation between the independent and dependent variables comes out to be 0.9, or 90 percent. If you take the square (since r is less than 1, the square is smaller; no matter), R square comes out to be 0.81, or 81 percent. That means there is a good strength of relationship: y is being explained, or derived, by x very effectively. Above 80 percent we can consider it a good relationship; if it is 0.90 or 0.95, then the R square is very strong and you can say the regression you have fit is a very good fit. Therefore, R square is one of the most important measures of goodness of fit of the regression model. The basic, elementary definition of R square, or coefficient of determination, is a measure of how well the independent variable explains the dependent variable. Now, we found the R square value through the correlation coefficient, and in regression it works, but in general what is the other, common definition of R square in regression analysis? It is nothing but the ratio of explained variation to total variation. You might now be confused about what explained variation and total variation are.
It is nothing but the ratio of the regression sum of squares to the total sum of squares. I feel explained variation by total variation is the better way to understand it, but in books people use the terminology SSR by SST. What are SSR and SST, and what is this formula? I will illustrate here now. The numerator is the sum of squares due to regression, SSR, which is the sum of squared deviations between the predicted values and the mean value of the data. You might ask what the mean value of the data is: it is the mean of the entire Y (dependent variable) data. The denominator is SST, the total sum of squares, which is the sum of squared deviations between the actual values and the mean. So, let us see how this R square can be derived through a graphical presentation of SSR by SST. Let us draw the graph: suppose this axis is your X and this is your Y, you have the data, and you have fit a regression line, say Y = a + bX, or alpha plus beta X, whatever you have fit. Now, if you look at the data, a couple of points fall on the line and others are scattered around it; the deviations of the points from the line are what you minimized with least squares, and you found the predicted values, which we have already done. Now take the mean value: first calculate Y bar. How will you calculate Y bar? Y bar is nothing but the average of the data, the average of the Y_i.
So, suppose you get the average; say this is your Y bar. Now let us see the calculation. First, SSR. Look at the predicted values: these are the Y hat values, the points on the fitted line, and this is your Y bar. If you take the deviation of each predicted value from the mean, square them, and sum them, you get SSR. So SSR is nothing but: take the value on the line, subtract the mean value, square it, and sum over all observations. Note that down. Next, calculate SST, which is nothing but the deviation between the actual values and the mean. These are all the actual values, including the points that happen to fall on the line. SSR is smaller because there you took only the fitted values on the regression line and their deviations from the mean; now, for the total sum of squares, you take all the data and compute every deviation from the mean. So what will SST be? You take every y_i of your data set, subtract y bar, the mean we already calculated, square, and sum. That is the total sum of squares of all the data about the mean, while in the numerator you take only the fitted values about the mean: this ratio is your R square. Now think about one aspect: if SSR becomes equal to SST, your R square will become 1.
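A minimal sketch of these definitions, assuming hypothetical x and y values (not the lecture's data): fit a least-squares line with NumPy, then compute SSR, SSE, and SST and check that R square equals SSR divided by SST.

```python
import numpy as np

# Hypothetical data standing in for the lecture's scatter plot.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Fit y = b*x + a by least squares
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x          # predicted values (points on the fitted line)
y_bar = y.mean()           # mean of the dependent variable

ssr = np.sum((y_hat - y_bar) ** 2)   # explained variation (SSR)
sse = np.sum((y - y_hat) ** 2)       # unexplained residual variation (SSE)
sst = np.sum((y - y_bar) ** 2)       # total variation (SST)

r_squared = ssr / sst
print(f"SSR={ssr:.3f}, SSE={sse:.3f}, SST={sst:.3f}, R^2={r_squared:.4f}")
```

For least-squares regression, SST = SSR + SSE, so the same R square also equals 1 minus SSE/SST.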
So, you have a very strong fit when all the data fall on the line. When will all the data fall on the line? Let me put one small graph here so you will be able to understand the calculation. In the numerator you have the y hat values of the points on the line, and below, in the denominator, you have all the data. Now imagine every y_i becomes equal to its y_i hat: that means all the points fall on the line, and in that case the numerator and denominator become the same, so your R square becomes 1. So if you can fit a line where every observed value falls on the line, you get the perfect correlation and your R square becomes 1; this is the R square defined as explained variation by total variation, SSR by SST. You will find the perfect correlation whether it is positive or negative, it does not matter. But in practice, not all points fall on the line; in general only a couple of points will fall on it. Therefore R square will be less than 1, and you try to find what your R square value is. If more points fall on the line, or you have fit a line where the sums of squared deviations in the numerator and denominator come close to each other, you will find the R square coming closer to 1. So, the objective will be to fit such a line.
So, if SSR and SST become almost the same, and you can get that kind of line, your R square will be very high, almost close to 1, and you have fit a good regression line. But if not, if there is a big deviation and almost none of the points fall near the line, your R square will probably be very small, and very small means your regression is not good. If more points fall on the line, or rest close to the line, your R square value will be high. This is the goodness-of-fit way of calculation through R square. Therefore, R square is nothing but the ratio of explained variation to total variation: in both cases you calculate the deviations from the mean, for the fitted values and for all the observations, and that ratio is your R square. In general, the higher the R square, the better the model fit. Now, let us take one example, and through this example we will go to Excel and see whatever we have discussed today, through the analysis, the reading of the graph, and the entire illustration of today's session on goodness of fit. The example says that a car manufacturer has recently run three days of roadside exhibitions to introduce a new deluxe car model. The number of sales personnel employed at each of the sampled exhibitions is given; count them, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, so there are 10 exhibitions, with the number of salespersons employed at each and the number of cars booked, based on past data. Now, suppose they would like to set up another stall, another exhibition centre, with 10 salespersons employed: what would be your forecast of the number of cars that will be booked?
So, that is what you are going to predict through regression analysis. The regression output itself is not our objective, because in Excel you will immediately get the forecast value; our objective is the illustration of what we have learnt today: calculate the R square, calculate the standard error, calculate the prediction interval. Let us go to Excel and understand this. Here in the Excel sheet I have kept the same data: the number of sales persons for the 10 samples, and the number of cars booked at the previous exhibition centres. We will fit the regression: copy the data, go to Data, and use the Regression tool. We all know it, but let me illustrate. Select the number of cars booked as the dependent variable, the number of sales persons as the independent variable, tick Labels because we selected the first row as the label row, and set the output location. So, you have the regression output now, and we will read it: how the output can be read through regression analysis is what we are going to study now, one item at a time. First, the fitted regression line, y = a + bx (or y = mx + c, or alpha plus beta x): these two numbers are the intercept and the slope. Then what will your prediction be for 10 employees, 10 salespersons? It will be the slope times 10 plus the intercept, and you get 185. So the forecast of the number of cars booked for 10 salespersons is 185. That is done now.
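As a hedged sketch of this step, the snippet below fits a least-squares line with SciPy's `linregress` and forecasts the value at x = 10. The data are hypothetical stand-ins, since the lecture's actual Excel sheet is not reproduced here, so the forecast will not be the lecture's 185.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the lecture's data (10 exhibitions):
# x = number of sales persons, y = number of cars booked.
x = np.array([2, 3, 4, 5, 5, 6, 7, 8, 8, 9], dtype=float)
y = np.array([48, 55, 70, 82, 95, 110, 120, 135, 145, 160], dtype=float)

res = stats.linregress(x, y)                 # least-squares fit: y = intercept + slope*x
y_pred_10 = res.intercept + res.slope * 10.0  # forecast for 10 sales persons

print(f"slope={res.slope:.2f}, intercept={res.intercept:.2f}")
print(f"forecast for 10 sales persons: {y_pred_10:.1f}")
print(f"R^2 = {res.rvalue**2:.4f}")
```

The same three quantities (slope, intercept, R square) are exactly what Excel's regression output reports in its coefficients and regression-statistics sections.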
This was not the objective to discuss, because we found it immediately through the regression line; we would like to read the other parts. What are they? Look here at what we have studied: R square. You understood R square now, right? R square is coming out to be about 0.82, which is good, so you can say that this data has a good relationship. What is adjusted R square? Adjusted R square corrects R square for the sample size and the number of predictors: it applies a penalty that reduces the raw R square a little, so the value better reflects what you would expect if you increased the sample size or simulated more data with a similar pattern. Here it is coming out to be about 0.79, so this is your adjusted R square. People sometimes take that rather than R square, especially when the sample size is small, as it is here, because with a larger sample your R square might change; so we can mark this value as well. What is Multiple R? It is the multiple correlation coefficient; in simple regression it is just the absolute value of r, and it becomes more meaningful in multiple regression, when you add one more variable and go to several independent variables. That we are not discussing now. Next, the standard error. What is standard error? The standard error is the typical size of the deviation of the observed values from the predicted values, the residual standard deviation which I discussed today.
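The adjustment described above can be written as a one-line formula. The sketch below uses the lecture's approximate values (R square about 0.82, n = 10 observations, k = 1 predictor), so the result lands near the 0.79 the lecture reports.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n: number of observations, k: number of independent variables.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Lecture's approximate values: R^2 ~ 0.82, n = 10 samples, k = 1 predictor
print(round(adjusted_r_squared(0.82, 10, 1), 4))  # -> 0.7975
```

Because the penalty grows with k, adding more predictors in multiple regression lowers adjusted R square unless the new variables genuinely improve the fit.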
So, this is the standard error. How much is it? 10.47. Now you can calculate one standard error, two standard errors, and so on; let me mark it in a different colour so it stands out. Now, the number of observations: how many observations did we have? We had 10, so observations equal 10. This first part of the regression output statistics is understood now; come to the ANOVA section. Before I discuss the ANOVA part and the F value, let us go back to the sheet where, to save time, I have already done the calculation. Look, you found the standard error, about 10.5. If you take roughly the 95 percent confidence level, you can go plus or minus two standard errors, and you get an upper bound of about 205 and a lower bound of about 164. What does this interval mean? I am talking about the prediction interval. Your predicted value for 10 salespersons is 185, but what is the interval? If you take approximately a 95 percent prediction interval, the upper bound will be about 205, two standard errors above, and the lower bound about 164. This is how you can calculate the prediction interval through the standard error, which we also discussed. Now, let us come to the ANOVA analysis. ANOVA is nothing but the overall analysis of variance of the data; it signifies the overall test for the fitness of your regression model, and here we use the F statistic. Look at the degrees of freedom: for the regression you have only one independent variable.
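The rough interval computed above can be reproduced directly from the lecture's two numbers (forecast 185, standard error 10.47). Note this is the simplified plus-or-minus two standard errors version, not the exact formula, which uses a t critical value and the full prediction-variance expression.

```python
# Rough ~95% prediction interval from the lecture's numbers:
# forecast = 185 cars booked, standard error ~ 10.47.
forecast = 185.0
std_err = 10.47

lower = forecast - 2 * std_err   # ~164.1 (lecture's lower bound ~164)
upper = forecast + 2 * std_err   # ~205.9 (lecture's upper bound ~205)
print(f"~95% prediction interval: [{lower:.1f}, {upper:.1f}]")
```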
So, the regression degrees of freedom becomes 1, and for the residuals we take n minus k minus 1: n = 10, k = 1, so 10 minus 1 minus 1, which comes out to be 8. That is the residual degrees of freedom, and the total degrees of freedom will be the sum of these two, 8 plus 1 = 9. Then come the sums of squares. For the regression, the sum of squares due to regression, you get 4050, and for the residuals, the sum of squares of residuals, you get 878; if you add them you get the total sum of squares. Then the mean squares: divide each sum of squares by its degrees of freedom, so 878 divided by 8 gives about 109.75, as you can see here. Now take the ratio of the regression mean square to the residual mean square: this ratio is nothing but your F statistic, the F test value. The higher the F value, the better the model. If the F value is very small, say close to 1 or 2, you conclude there is no relationship between the independent variable and the dependent variable, and you can cancel your regression forecast. But if the F value is high, and here the ratio of these values is about 36, the model is good: the higher the F value, the better the model. Now, look at the next part, the P value. The outcome of the F statistic, the F test analysis, is measured through the P value, which is calculated from the F distribution.
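The whole ANOVA row can be reconstructed from the lecture's output numbers (SSR = 4050, residual SS = 878, n = 10, k = 1); the sketch below just carries out that arithmetic.

```python
# ANOVA quantities from the lecture's regression output.
ssr, sse = 4050.0, 878.0    # regression and residual sums of squares
n, k = 10, 1                # observations and independent variables

df_reg = k                  # regression degrees of freedom -> 1
df_res = n - k - 1          # residual degrees of freedom -> 8
sst = ssr + sse             # total sum of squares

msr = ssr / df_reg          # regression mean square
mse = sse / df_res          # residual mean square (~109.75)
f_stat = msr / mse          # F statistic (~36.9, the lecture's "about 36")
r_squared = ssr / sst       # R^2 = explained / total variation

print(f"df=({df_reg},{df_res}), MSE={mse:.2f}, F={f_stat:.1f}, R^2={r_squared:.4f}")
```

Notice that R square drops out of the same table: SSR divided by SST, tying the ANOVA section back to the earlier definition.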
Here, since you are doing the overall analysis, the overall test of the fitment of your regression model, you use the F test. But when you go down to the individual variable level, the corresponding P value is calculated using a t test. That is the difference: for the overall analysis of the regression model you use the F test, and for each individual variable you get a t statistic and the corresponding P value. You will be able to observe this difference better in multiple regression, because there you will have more independent variables in your list, like x1, x2, x3, and you will find a P value for each independent variable against the dependent variable. The lower the P value the better; less than 0.05 is the usual significance cutoff, and if it is below that, you reject your null hypothesis and say there is a relationship. For the overall test you get one significance value for the whole regression, with the same cutoff: P should be less than 0.05. Now, how do you define it, the F test or the individual test? I am explaining both at one time. Suppose you have developed y hat = alpha + beta x, where beta, the slope, captures the relationship. You propose a null hypothesis, beta = 0, which means there is no relationship between the independent variable and the dependent variable, and you propose the alternative hypothesis, beta not equal to 0. Then you calculate your P value through the significance test, and if you find the P value less than 0.05, effectively you reject your null hypothesis and accept the alternative. If you accept the alternative, what does it mean? Beta is not equal to 0: there is a relationship between the independent variable and the dependent variable.
And your analysis suggests that your F test is significant, the P value is quite small, and you keep the fitted regression. But if the P value is quite high, say 0.25, or a very high value like 0.75, then the null hypothesis is not rejected, the alternative is rejected, and you can say there is no relationship between the independent variable and the dependent variable. That is the overall case; you can also do the individual t tests and read the corresponding P values. But when you go to multiple regression, what happens? In that case this changes: you will have, say, beta2 x2 and so on; we will discuss the details there. Just for your summary: the null hypothesis will then be that all the betas are zero, say beta1 = beta2 = 0 if I consider two variables, and the alternative will be that at least one beta_i is not equal to 0. Again you check your P value for the overall test; if P is less than 0.05 you reject the null hypothesis and accept the alternative, that at least one of them is non-zero. That means there is a relationship and you can keep your regression. But how many of the betas, the regression coefficients, are non-zero? That you can find from the individual tests. Here you have only one variable; do not look at the intercept's P value, just see the sales-person row, which represents your independent variable. You can see it is well below 0.05, so it is significant. So it is significant at the individual level and overall as well, and the regression is a good fit. Here you can also see that 95 percent confidence intervals are given for the sales coefficient and the intercept.
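A sketch of the slope t test described above, using hypothetical data in place of the Excel sheet: the t statistic is the estimated slope divided by its standard error, and the two-sided p-value comes from the t distribution with n minus 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical data standing in for the lecture's Excel sheet.
x = np.array([2, 3, 4, 5, 5, 6, 7, 8, 8, 9], dtype=float)
y = np.array([48, 55, 70, 82, 95, 110, 120, 135, 145, 160], dtype=float)
n = len(x)

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr                   # t = beta_hat / SE(beta_hat)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value, df = n - 2

print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# linregress reports the same two-sided p-value directly:
assert abs(p_value - res.pvalue) < 1e-9
```

With p well below 0.05, the null hypothesis beta = 0 is rejected, matching the conclusion the lecture reads off the Excel coefficient table.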
No need to think much about the intercept's interval; just see the confidence interval for the sales-representative coefficient, and you will find it is quite good. So, what has happened now is that you found the overall test of your data through the ANOVA analysis and the F test: F is about 36, which is quite a good ratio, and the corresponding significance P value is good; for the individual variable you also found the P value, look here, which is also less than 0.05. Both are ensuring that your regression analysis is quite good, and your R square is also quite high, about 82 percent, and the standard error will always be there. Here you found the standard error and calculated the corresponding prediction interval; you also found the intercept and slope from the regression output, and you can now make the prediction for the sales representatives. This is the overall understanding of the summary output of regression analysis. Now, we will extend this concept to multiple regression in the session on multiple regression and understand how this summary-output table works there; I have already explained it in a similar way, but I will summarize it again for multiple independent variables. Now, come back to the presentation and what we have done today; let us come back to the particular output cells. We have discussed the different goodness-of-fit approaches: the R square concept, the standard error, the F statistic and the significance value of your overall analysis, and the reading of the coefficients, the intercept and the slope and the corresponding P values of the t tests of the individual variables, which are very crucial here.
These P values should be less than 0.05, and the overall test should also be significant; the forecast is coming out to 185, and the corresponding 95 percent prediction interval is about two standard errors around it. Strictly, though, you have to multiply not by 2 but by the t critical value: with 10 samples, that is 8 residual degrees of freedom, it is about 2.31 at the 95 percent level. Multiply by that and you get the prediction interval. So, this is the summary of today's class. Let us come back to the first slide. What have we discussed today? We have discussed the coefficient of determination, the F statistic, the standard error, and the prediction interval. These are the measures of goodness of fit: how good your regression model is, measured through this handful of estimation processes, these measures of accuracy. This helps you in being confident in establishing your regression analysis for any type of data, whether industrial data or social-science data. I believe you will be able to apply these concepts in your regression models whenever you fit a regression, and you will build your confidence about whether your model is really good or bad, whether it has good strength or not. So, with that, let us conclude the measure of goodness of fit of regression analysis. In the next class we will enter multiple regression.
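For reference, the exact multiplier mentioned above can be obtained from the t distribution; a minimal sketch with SciPy:

```python
from scipy import stats

# t critical value for a two-sided 95% interval with n - 2 = 8 degrees of
# freedom (10 samples, simple regression): the multiplier to use instead
# of the rough "2 standard errors".
t_crit = stats.t.ppf(0.975, df=8)
print(round(t_crit, 3))  # -> 2.306
```

So the plus-or-minus 2 rule slightly understates the interval width here; with more observations, the t value shrinks toward the normal value 1.96.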