Hello everyone, welcome to this session on regression analysis. In the previous class, we introduced the basics of simple linear regression and the least squares method: how to estimate the parameters, and also curve fitting using a trend line, with an Excel illustration. Today we will extend this simple linear regression concept to the basic features of goodness of fit, that is, the measures available to test whether a regression is good or not. So today we will focus on the major measures of goodness of fit: R square, the standard error, the F statistic, and the prediction interval. If you look at this slide, on the left-hand side I have given the summary of the introductory session on regression analysis. At the end of today's session we will also illustrate in detail how to read the regression output. This is the overall regression line: you can see y hat, the forecasted or predicted value, and the regression coefficients. Through these formulas you can calculate them; I will show you again today. So this is the summary of the last session. Now let us extend these points: what are the measures of goodness of fit? Whatever fitting you are doing, whether it is a good fit or not, we are going to test it through some measures. In time series data, we used the mean square error, or RMSE, or the mean absolute percentage deviation; we discussed that through those measures, if you have a minimum RMSE, we say that your forecast through a particular time series model is good. Similarly, here, if you have a high R square, you will say that the model is good, or that the relationship between the independent variable and the dependent variable is explained very well.
So we will study these concepts today, and by the end you will know, once you fit a regression to a data set, how to judge whether the model is really explaining the dependent variable through the independent variable effectively or not. First, the coefficient of determination, R square. It tells you how well the regression fits, or you can say how well the dependent variable is explained by the independent variable. It is the explained variation divided by the total variation, or equivalently one minus the ratio of the sum of squared errors to the total sum of squares; how to calculate this R square value I will show later. For example, if you have a data set x and y and you have fit the line y = a + b x, or alpha + beta x, then you calculate the R square value as explained variation over total variation. Once you get the R square value, the higher it is, the closer to 1, the better the model is. It measures the goodness of fit in the sense that your regression model is fitted well: all points fall on the line, or close to it. Then the standard error of the regression is the second measure of goodness of fit, one more way of measuring the fit of your regression model. What does it say? It measures the deviations, how the variation of the observed points is spread around your predicted line. If it is small, the model is good. For example, suppose you predict at a particular point; then how much do the observations deviate from that predicted value? The calculation is like the way we calculate the standard deviation.
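The "explained variation over total variation" idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's data set: the `x` and `y` values below are hypothetical.

```python
# Minimal sketch of R^2 = 1 - SSE/SST for a simple linear fit.
# The (x, y) data below are hypothetical, not the lecture's table.
def r_squared(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    # Least-squares slope and intercept for y_hat = a + b*x
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    a = yb - b * xb
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
    sst = sum((yi - yb) ** 2 for yi in y)                        # total variation
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(r_squared(x, y), 3))  # -> 0.727
```

A value near 1 means the line accounts for almost all the variation in y; a value near 0 means it explains almost nothing beyond the mean.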
Similarly, here we calculate the standard error, which will also tell you how reliable your model's predictions are: if you predict something for a new input of the independent variable, you will know how the errors are spread, and from the standard error we will calculate an interval. Then the F statistic: it is also one of the tests which helps you in establishing your regression model. It is used through the ANOVA analysis of the regression. Once you do the ANOVA analysis of the overall regression fit between the dependent and independent variables, if your F value is high and the corresponding significance value, the p value (I will show you), is less than 0.05, you can say that the data are well fit: overall, at least one independent variable explains the dependent variable. So this gives the overall significance test of your entire regression model, not a test at the individual variable level; when you go to multiple regression you will get a better sense of the F statistic in the ANOVA analysis of the regression test. Look here: it compares the fit of the regression model to the model with no independent variable, the null model. That means if there is no relationship in the data, say y = alpha + beta x with no relationship between x and y, then beta will be 0. In that case the null hypothesis is not rejected, and the model reduces, to some extent, to a regression-free model, or you can say a simple intercept-only model: there is no relationship between the independent and dependent variables.
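The F statistic the lecture describes comes from the ANOVA decomposition SST = SSR + SSE. A rough sketch, again on hypothetical data (the p value would come from an F table or software; here we only compute the statistic itself):

```python
# Sketch: F statistic from the regression ANOVA decomposition SST = SSR + SSE.
# Hypothetical data; k = 1 independent variable for simple linear regression.
def f_statistic(x, y, k=1):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    a = yb - b * xb
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # error sum of squares
    sst = sum((yi - yb) ** 2 for yi in y)                        # total sum of squares
    ssr = sst - sse                                              # regression sum of squares
    msr = ssr / k                 # mean square due to regression
    mse = sse / (n - k - 1)       # mean square error
    return msr / mse

print(round(f_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 2))  # -> 8.0
```

A large F relative to the critical value from an F table (with k and n − k − 1 degrees of freedom) is what makes the model's overall p value small.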
So the F statistic has good merit in understanding the goodness of fit of your overall regression model. We will discuss this aspect also using the result analysis in Excel. Then another measure is the prediction interval; as I discussed, it comes from the standard error. Just as you use one standard deviation or two standard deviations, similarly you can construct the prediction interval through the standard error and see how far the predicted value can deviate, between an upper limit and a lower limit. If the prediction interval comes out in a very close range, that means your model gives good predictions. How to calculate it, and how it also becomes a part of goodness of fit, we will study today. And then one more aspect: the Pearson correlation coefficient, or simple correlation coefficient, also helps in measuring the goodness of fit between the independent and dependent variables. In general it is used in bivariate data analysis for calculating the relationship between two data sets, whether they really have a correlation or not, but in regression we use it between the independent and dependent variables to obtain the coefficient of determination. If you calculate the Pearson correlation coefficient and take its square, just take a square of it, it will in fact turn into the coefficient of determination of the regression analysis. So the R square that I mentioned can be calculated from the correlation coefficient by squaring it.
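The claim that squaring the Pearson correlation coefficient gives R square in simple linear regression can be checked directly. A sketch on the same kind of hypothetical data as before:

```python
import math

# Sketch: for simple linear regression, the square of the Pearson correlation
# coefficient equals the coefficient of determination R^2. Hypothetical data.
def pearson_r(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    syy = sum((yi - yb) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x, y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
r = pearson_r(x, y)
print(round(r * r, 3))  # -> 0.727, the same value as 1 - SSE/SST for this data
```

Note this equivalence holds only for simple (one-predictor) linear regression; in multiple regression R square is defined from the ANOVA decomposition instead.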
So these are the four or five measures, the approaches for checking whether your regression model is good or bad. I can summarize them: the coefficient of determination R square; the standard error of the regression; the F statistic, through an overall ANOVA test of your entire regression model; the prediction interval; and the square of the correlation coefficient. These measures collectively provide a comprehensive assessment of how well the linear regression model fits the data. So remember these five points; you should test at least one or two of them after setting up your regression model, to see whether they really give a good value, a good merit, to the model. If you find that a couple of these goodness-of-fit approaches give a good outcome, then you can say that yes, you have fit a good regression model. Now let us discuss them one by one through numerical illustration. First we will discuss the standard error, then we will go to R square and the prediction interval, one by one. So let us take one example through which we will illustrate the standard error first. Here we have taken the first example that I discussed: advertisement and sales, two variables. The independent variable x is the advertisement; the more you invest in advertisement, the more, tentatively, the sales will increase. Suppose we have six data points: the amounts of investment in advertisement and the corresponding sales, say in millions. And as in the previous session, we discussed in the previous class the fitting of the regression, the least squares method, and the corresponding coefficients of the regression line.
So from this numerical data we have calculated the b coefficient and the a coefficient, that is, the intercept and slope of y = alpha + beta x, or you can say y = a + b x. We calculated them through these two formulas, along with the means of the data, and using them let us now look at the fit. So we fit the regression: here is the regression line, here is the b value, here is the a value, and here is the estimation. Suppose the spending on advertisement for the next year, the 7th year, is 6 lakhs; you put x = 6 into the fitted line, and the line gives you the tentative or predicted sales. What is the predicted value of sales here? For x = 6 you get, on the line, the predicted value of sales as 3.25. So what we have so far is nothing but the summary of the previous session: if you put in the value 6, you get the sales as 3.25. Now let us go to the standard error calculation. Remember, the standard error is the first goodness-of-fit approach we are illustrating. The standard error is nothing but a measure of how the deviations of the data are spread around your predicted value, the prediction from the line; that interval, or that error, is what we are trying to get. The lower the error, the better the model.
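The slope-and-intercept calculation and the x = 6 prediction can be sketched as below. The lecture's actual six-row table lives on the slide, so the `x`/`y` values here are hypothetical stand-ins; the numbers printed will not match the lecture's 3.25.

```python
# Sketch of the least-squares fit y_hat = a + b*x and a prediction at a new x.
# Hypothetical advertisement (lakhs) vs. sales (millions) data, not the slide's table.
def fit_line(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)   # slope
    a = yb - b * xb                        # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
a, b = fit_line(x, y)
x_new = 6                                  # planned advertisement for next year
print(round(a + b * x_new, 2))             # predicted sales at x = 6
```

The same two formulas (slope from the cross-products, intercept from the means) are the ones shown on the slide from the previous session.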
If you have too much variation in your prediction, then you have not fit a good model; if you have less variation, that means you have fit a good model. Just as the standard deviation is the measure of dispersion, here also you calculate the dispersion, the deviation from your prediction, through the standard error. It is nothing but the square root of the summation of squared errors divided by n minus 2. Here you will get an error for each data point; n is the total sample size, and n minus 2 is the degrees of freedom. For multiple regression we call it n minus k minus 1, where k is the number of independent variables. Here you have one independent variable, because it is simple linear regression with x and y, so k comes out to be 1 and the denominator is n minus 2. In multiple regression, n may be, say, a sample size of 20 or 30, and k could be, say, 3 independent variables, and accordingly you calculate it; we will discuss that in multiple regression. Here we are assuming only simple linear regression, and your standard error is the square root of this quantity, just as you calculate the standard deviation from the variance. So this is how you estimate it, and if you put in the data through the calculation of the previous table that I showed you, you will get the standard error as 0.306, which in terms of this data set is nothing but almost 3 lakhs. So this is your standard error: in the previous data set, for 6 lakhs of investment in advertisement, your sales were predicted to be 3.25 million.
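The formula just described, square root of the sum of squared errors over n − 2, can be sketched directly. Again the data here are hypothetical, so the printed value will not be the lecture's 0.306.

```python
import math

# Sketch of the standard error of the regression: s_e = sqrt(SSE / (n - 2)).
# Hypothetical data; for multiple regression the denominator is n - k - 1.
def standard_error(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    a = yb - b * xb
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # sum of squared errors
    return math.sqrt(sse / (n - 2))       # like taking sqrt of a variance

print(round(standard_error([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 3))
```

Dividing by n − 2 rather than n accounts for the two parameters (a and b) already estimated from the data.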
So in that case, the prediction here is 3.25, but it can have a deviation of about 3 lakhs to the upside or downside. That is your standard error: it is plus or minus 3,06,000 in absolute terms, or 0.306 in terms of this value, whatever scale you use. The deviation can be on the plus side or the minus side; this is how you measure the standard error. I will show you it in more detail in Excel. The lower the standard error value, the better the model. That means the deviations are not widely spread; this also relates to the homoscedasticity aspect I discussed as a definition in the previous class, and that aspect we are capturing effectively here. The data fall inside a reasonable range; you can see that the data have a small amount of variation, and whatever variation there is among the data, you are capturing it through the standard error. Now let us go and see the interval, through the standard error, since the prediction interval goes together with the standard error. I will show you how it is calculated using this example as well as in Excel. Look at the graph here: there is an upper limit and a lower limit around the prediction, together with the data set. You have the sales prediction and the standard error value; hold on to them, come back to the data set, and let us understand the formula first. The interval is nothing but the predicted value on the line, for a given input value of x, plus or minus the standard error.
So it can be one standard error, two standard errors, three standard errors, just like one sigma or two sigma in the way we use confidence intervals with the standard deviation; here, t is the number of standard errors of the prediction interval. That means you might have a prediction interval like this, the range within which you are confident your prediction will fall. It is calculated through this upper limit and lower limit, assuming that the observed points are normally distributed around the regression line. In that case, like the standard deviation calculation, you can use the similar concept here with the standard error; rather than a confidence interval, you call it a prediction interval. For plus or minus one sigma, one standard error, you will get around 68 percent coverage. If you put your desired level at 95 percent, then you take plus or minus two standard errors around the predicted value, and you will get the range within which, with 95 percent confidence, the variation of the data will fall. If you take 99 percent, then your deviations will be plus or minus three standard errors. So let us see this in the illustration of that particular example. Here we found that if your input comes out to be 6 lakhs of advertisement, your sales will be almost 32.5 lakhs. This is your predicted value now, the 3.25 we talked about, or say 32.5 lakhs.
So this is your prediction, but now you have to calculate the confidence interval, or say the prediction interval, through the standard error. We found the standard error, and this is your predicted y hat. Suppose you want to set a 95 percent prediction interval on your forecast data; then your upper limit will be the predicted value plus 2 standard errors, which comes out to be 38.6 lakhs, and if you subtract, you will get the lower limit of your 95 percent range, which would be 26.38 lakhs. So this is the prediction interval: you are 95 percent confident that your predicted value, because it is all about prediction, will not go above about 38.6 lakhs and will not go down below about 26.4 lakhs. That means there is only a 5 percent chance that it may go outside; for the remaining 95 percent, your predicted value will fall in this range. This gives you the confidence of your forecast from the regression line, and the range is like that. Now remember one more part of the formula I have used here: the upper and lower limits are just y hat plus or minus 1 or 2 standard errors, and here we took 2.
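Using the numbers worked out in the example (y hat = 3.25 million, standard error = 0.306 million), the lecture's simplified 95 percent interval is a two-line calculation:

```python
# The lecture's simplified 95% prediction interval: y_hat +/- 2 * s_e,
# using the example's values: y_hat = 3.25 (million), s_e = 0.306 (million).
y_hat = 3.25
s_e = 0.306
t = 2                      # ~95% coverage, large-sample approximation
lower = y_hat - t * s_e
upper = y_hat + t * s_e
print(round(lower, 3), round(upper, 3))  # -> 2.638 3.862, i.e. 26.38 and 38.62 lakhs
```

Note this is the lecture's simplified form; a full regression prediction interval also widens with distance of the new x from the mean of the data, which the lecture sets aside here.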
But taking t directly as 1, 2, or 3 standard errors is valid only if you have a good amount of sample data, say a minimum of 25 to 30 samples or more in your (x, y) data set. In that case you can take this t as 1, 2, or 3 standard errors, get prediction intervals of 68 percent, 95 percent, and 99 percent, and wind up your prediction saying, "I have a good prediction, and this is my confidence interval." But if you have a small sample of data, as in this example: how many samples did we have? Go back to the data; you can see we have only a sample size of 6. In that case, generally, when you have a small sample in your regression data, people do not take this t directly as 1 or 2 standard errors; they relate it to the t distribution, and from there they calculate the value of t. I have listed that for you: the forecast interval is the forecast plus or minus t times the standard error, and this t will not be taken directly as 1 or 2 if the sample size is small; it will be selected from the t table. For example, here the sample size was 6; you go back to the table, and when the sample size is small, this t stands for the number of standard deviations from the mean of the distribution that provides a given probability of exceeding the limit by chance.
So that means you go to the t table and from there pick your corresponding t value. In this case, if the sample size is 6, which is much less than 25 or 30, then for a 95 percent interval we will not select 2; from the table you will select about 2.78, the t value for n minus 2 equals 4 degrees of freedom. If the sample size reduced from 6 to 4, this value would go up further; and if you increase the sample size, this t value will come down. This is the prediction interval through the standard error. So the 95 percent interval is now like this: look here, since the data are few, you are not directly taking the t value as 2 for 95 percent; you are taking about 2.78 from the table, and as the sample size increases this value reduces and comes closer to the 2-standard-error rule. Now suppose you want to consider only 68 percent, only one standard error. In that case t will be 1 if you have a large sample size, but if you have a small sample size, you cannot put plus or minus 1 standard error; since the sample size is small and you want a 68 percent prediction interval on your data, the t value from the table will again be somewhat larger than 1, say around 1.35. You multiply that by the standard error around the predicted value, and you get the 68 percent prediction interval. And going back to the previous data set: if you want 68 percent, one standard error, and you have a good sample size, then you directly put t equal to 1 and you will get the corresponding interval.
So: one or two standard errors directly for a large sample; for a small sample size you go to the table and, instead of 1 or 2, you put the t value from the table. If you have a large sample size, you simply take the standard error and multiply it by one, two, or three times depending on whether you want 68 percent, 95 percent, or 99 percent; if it is 99 percent, you put three standard errors. This way the model will give you the prediction interval of your forecast.
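The small-sample adjustment just described can be sketched with a short lookup of standard two-sided 95 percent t-table values, applied to the lecture's example numbers (y hat = 3.25, standard error = 0.306, n = 6). The exact table values below are standard t-distribution critical values; the function name is just illustrative.

```python
# Sketch: for a small sample, replace the "2 standard errors" rule with the
# t value for n - 2 degrees of freedom. A few two-sided 95% critical values
# from the standard t table are hardcoded below.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 10: 2.228, 30: 2.042}

def interval_95(y_hat, s_e, n):
    t = T_95[n - 2]                      # df = n - 2 for simple linear regression
    return y_hat - t * s_e, y_hat + t * s_e

# Lecture example: n = 6 observations, y_hat = 3.25, s_e = 0.306
lo, hi = interval_95(3.25, 0.306, 6)     # uses t = 2.776, wider than +/- 2 s_e
print(round(lo, 3), round(hi, 3))
```

Notice how the table values shrink toward roughly 2 as the degrees of freedom grow, which is exactly the lecture's point: with a large sample the simple two-standard-error rule becomes a good approximation.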