Hello everyone, welcome to this session on regression analysis. In the regression analysis module we will cover different types of regression, such as simple linear regression, multiple linear regression, and then we will extend the concept to logistic regression as well. Today, to start with, we will concentrate only on understanding simple linear regression, and then we will extend the concept to the coefficient of determination as a measure of goodness of fit, the standard error, R-square, and then multiple regression, one by one, with detailed regression analysis in Excel. So let us concentrate on the basics of regression analysis in today's session. In general, regression is a causal model, or you can say an econometric model, where you define the causal relationship between the independent variable and the dependent variable in your data. Suppose you have data on y and x, where x explains y, or y depends on x; then x is the independent variable and y is the dependent variable, and you have a sample of data points y1, y2, y3, y4. Let me first tell you the fundamental difference between regression and time series data. We are taking a one-variable case because we are talking about simple linear regression today. In time series, what do we do? We have only the data of the dependent variable; there is no independent variable. You study the past behavior of the dependent variable alone. For example, you record the temperature of a location, observe how the temperature moves and how much variation there is, and then you make a forecast of the temperature for the next 2-3 days.
That is a time series forecast: you try to find the relationship with your own past data and you make the forecast for the future. That is what simple time series analysis is, and we are learning different techniques for it. But in regression analysis, the dependent variable is explained by an independent variable. For example, your salary will depend on your performance, or to some extent your placement will depend on your academic record or your past experience. Here the academic record and the past experience are the independent variables; they explain your placement package, or your chance of getting a job in a good company. This is the causal relationship between the independent variables and the dependent variable. In the example I have shown you, X is the independent variable and Y is the dependent variable, and you have observed pairs of data: X1 with Y1, X2 with Y2, and so on, as you will see in the next slide. So you take the collection of data on the independent and dependent variables, and if you plot it on a graph, you may find some causal relationship between them. That relationship is what you want to establish through regression analysis: what linear trend exists between X and Y, and how well it explains the dependent variable. When you come to predictive analytics or machine learning techniques, regression analysis is perhaps one of the most important tools of machine learning and business forecasting. Here you try to establish the strength and the direction of the relationship between the independent variable and the dependent variable.
If you look at the applications of regression analysis, in the operations of any organization or company, in sales planning, production planning, supply chain demand calculation, in economics, finance, marketing, social science, everywhere you need regression as part of your predictive analytics toolkit, to find the causal relationship between the independent variables and the dependent variable. Why do you do it? Because it helps you in the future: if a similar X comes, what would be your Y? That forecast you can make through regression analysis, and how to do it is what we are going to discuss today. As I mentioned, since we are covering many modules in this business forecasting course, we will restrict our discussion to simple linear regression and the different ingredients of regression such as R-square and the standard error, and then we will discuss multiple regression and logistic regression with a couple of applications through cases or numerical examples. In simple linear regression there are four major assumptions which you need to keep in mind. The first is linearity: the relationship between X and Y, between the independent variable and the dependent variable, should be linear, not nonlinear. If it is nonlinear, it belongs to nonlinear regression or polynomial regression, which we will not discuss in this particular module.
The second is independence: the observations should be independent of each other, which to some extent relates to the multicollinearity issue I mentioned; there should not be relationships among the independent variables. The third is normality: the errors, or residuals, which I will show you today, should be normally distributed with mean 0. To some extent they are not related to each other, so they are independent, identically normally distributed. The fourth is equal variance, or homoscedasticity: in basic regression we assume that the variance of the errors is constant. It does not mean that you cannot manage the opposite of homoscedasticity, which is called heteroscedasticity; you can handle that too, but initially we will restrict our discussion to homoscedasticity. What does it mean? Suppose you have data on an independent variable x and a dependent variable y. Look at the data that I am plotting here. If you fit a regression line, say y = a + bx, plus some error term, and the variation of the data around the line is roughly constant, with not many outliers, then wherever you are along the independent variable the spread is about the same. The homogeneity of the variance is maintained, and we call that homoscedasticity, or equal variance of the data, which lets you make a better forecast. But it may happen that at small x you have less variation.
But as you increase x, the variation in y might become high. If there is too much of that variation, we call the situation heteroscedasticity. Now let us go to the calculation of the coefficients. I talked about y = a + bx. What is a, what is b, and how can we calculate them? The epsilon we can understand as the residual part, the leftover variation that is called the error; we will not focus on that now. We will focus on calculating the regression coefficients a and b of your trend line. Here a is the intercept and b is the slope, or you can say the gradient, of the regression line y = a + bx that you draw between x and y. In order to calculate them, we use the least squares method, which is very popular in regression analysis. Sometimes people call the coefficients alpha and beta and write y = alpha + beta x; in that case alpha and beta are what you have to calculate. Effectively, the least squares method helps in minimizing the deviations of the data. Suppose you have actual data points for given values of x, as I will show you; each point has some deviation from the fitted line, the regression line that you want to establish. You do not yet know which line is the best-fitting line; you are trying to establish it by calculating the coefficients a and b.
The least squares method will help you calculate the coefficients and make the best fit. What is the mechanism? It is a simple mathematical technique: you calculate the deviations from your observed data points to the fitted line, and those distances you minimize through least squares. To understand the entire concept, look here: I have captured a few observations of the dependent variable y for given values of the independent variable x. Look at the first observation; this is the deviation, the distance between the observed point and the predicted or expected line that you want to establish. For the second point, here is the gap, the deviation, and so on for each point. The sum of these deviations is what you want to minimize. Now we can think of the pairs of data, which I discussed earlier, on a graph. If all the distances from the observed points to your predicted line were 0, then you would have fitted the best possible line; in that case the correlation in the data would be almost 1. But since the data are scattered, plotted at different locations, you may not be able to fit a line where every deviation is 0. Since all points will not fall exactly on the expected line, you have to minimize the distances, the errors. But how to do it? We do not take just the raw distances and try to minimize them.
In that case, what happens? One deviation could be, say, +5 and another could be -5, so effectively +5 and -5 cancel out. You might then feel that there is no error, that all points fall on the line and it is the best-fitted line, and you would be committing a mistake. So what do we do? We take the square: just as you calculate the standard deviation through the variance, here also you are actually minimizing the variance of the data around your predicted line. Therefore we take the square of the deviations. You could take the absolute value instead, but as I have mentioned here, raw distances are not the best: minimizing the simple sum of residuals is the objective, but we do not take that. We also do not take the absolute value, because even though the absolute value fixes the cancellation of negative and positive errors, since all errors are added as positive quantities, it is still not the best way, because no extra penalty is given. Some points fall closer to the line and some a bit further away, and if a point falls much further from your line, you should give more penalty to that particular distance when you calculate the total. To adjust the penalty, so that whichever point has more distance gets more penalty, we take the square of the error: the higher the error, the faster its square grows, rather than a simple absolute value. There is another advantage of taking the square of the errors and minimizing that. Simple addition will not work, as I told you, and even the absolute value will not work, because it gives equal importance to all the error distances.
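To make this concrete, here is a small illustrative sketch of the point just made; the residual numbers are made up for demonstration and are not from the lecture's data.

```python
# Two residuals of +5 and -5: the raw sum cancels to 0,
# which would wrongly suggest a perfect fit.
residuals = [5, -5]
raw_sum = sum(residuals)                   # 0
abs_sum = sum(abs(r) for r in residuals)   # 10
sq_sum = sum(r * r for r in residuals)     # 50

# Two fits with the same total absolute error (10), but one fit has a
# single far-away point. Squaring penalizes that outlier more heavily.
even_errors = [5, 5]       # both points moderately off the line
outlier_errors = [1, 9]    # one point much further from the line
print(sum(abs(r) for r in even_errors),
      sum(abs(r) for r in outlier_errors))   # 10 10 (absolute value cannot tell them apart)
print(sum(r * r for r in even_errors),
      sum(r * r for r in outlier_errors))    # 50 82 (squares penalize the outlier)
```

So the squared-error criterion distinguishes the two fits while the absolute-value criterion does not, which is exactly the penalty argument above.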
A small distance gets the same weightage as a large one, which is not correct; you should give more penalty to the larger deviation than to the point that falls closer, where the line is more suitable. So we keep this penalty concept in the background and take the squares. It also helps you in calculating the regression coefficients a and b, because you will take the gradient, the partial derivatives, and set them equal to 0, which will help you calculate the coefficients. So, in summary, we consider the square of the errors and we minimize that. The best strategy is to minimize the sum of the squares of the residuals, the deviations from the observed points to the predicted or expected line that we are going to establish; measuring and minimizing that is what we call the linear regression fit, or best fit. Now we will calculate the regression coefficients, these two coefficients, so as to minimize the least squares error. This is the formula you can see, the sum of the squared errors S, and we will minimize it by taking the partial derivatives of S with respect to the two least squares parameters, a and b.
Taking the partial derivatives with respect to a and b and setting them equal to 0, we get the first equation and the second equation, the normal equations, from the data set. If you rearrange them, you will get these two equations; note that the sum of a over all observations is just n times a. After adjustment you get b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²), and then a = ȳ − b x̄, where a is your intercept and b is your slope. So your forecast is y = a + bx: you have found the b value and the a value, the two least squares coefficients of the regression. Once you find the estimated values of a and b, you can get the final forecast for your data set, y = alpha + beta x as some people write it, or a + bx as we are writing here. That line is your forecasting line, the simple linear regression forecast through the least squares method. Now let us see, with a numerical example, how these calculations are done and how the coefficients a and b are computed. Suppose we take a basic example: a maker of golf products has been tracking the relationship between sales and advertising, and using linear regression the company would like to establish the causal relationship between sales and advertising, given the question: if the company invests 53,000 as advertisement cost next year, what could be the sales?
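Before working through the example, the closed-form coefficients just derived can be sketched in a few lines of code. This is a minimal sketch; the function name and the check data are my own illustration, not from the lecture.

```python
def least_squares_fit(x, y):
    """Fit y = a + b*x by minimizing the sum of squared residuals,
    using the closed-form solution of the normal equations:
        b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
        a = y_bar - b * x_bar
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Check on points that lie exactly on y = 3 + 2x: the fit recovers a = 3, b = 2.
a, b = least_squares_fit([1, 2, 3, 4, 5], [5, 7, 9, 11, 13])
```

With the lecture's seven advertising observations (not fully reproduced in the transcript), the same formulas give approximately a = 102 and b = 0.98, so the forecast at x = 53 is 102 + 0.98 × 53 ≈ 154, matching the slide.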
So that is the question, and the company has a past data set of seven instances. For example, earlier when they invested 32,000 they recorded sales of 130, in thousands or lakhs, whatever the unit. These seven recorded pairs are the data set that supports the regression. So you have x, the advertisement, and y, the sales, and we have to fit the regression line; I will show the graph in the next slides. How to do that? The first step: calculate x̄ and ȳ. Then, using the two coefficient formulas we derived in the previous slides for the best-fitted line by least squares, calculate Σxy, Σx², and Σy². All the ingredients are now with you. What is n here? n = 7, the number of observations. The slope B works out to 0.98, and the intercept A comes to 102. If you put them into the regression line y = A + Bx (plus an error term, which we will discuss in the next session), you get the best-fitted line. Now, the company's question: if next year they invest 53,000 for advertisement, what could be the tentative sale, the expected sale? The prediction comes out to be 154. Before I go to the graph to see how the regression line is fitted by minimizing the square of the errors, remember what the B part means: it is the change of y for a unit change of x, so the rate of change is explained through B, the slope or gradient. Here it is 0.98, so if you move from 53,000 to 54,000 you will see the corresponding incremental change in y through the slope. The strength of the relationship between the independent and dependent variables we will measure properly through the R-square value later, which I will show you through Excel, but the slope here already suggests a strong positive relationship between x and y. Now, what is the intercept? A = 102 is fixed: if there were no relationship at all between the dependent and independent variables, you would be left with the horizontal line y = 102, which we call the intercept line. In the next session, when we set up the hypotheses between the dependent variable and independent variable, you will see how the null and alternative hypotheses are defined, and how the p-value or significance is calculated through the overall ANOVA F-test and the t-tests in the regression output, to decide whether the relationship is really significant or not. For the time being, just remember: no relationship means only the intercept line, y = 102. Now look at the graph: based on the observed data set the line is fitted, and for x = 53 you can see the
output here on the fitted line. This is the expected outcome; the standard error part and so on we will discuss in the next session, but you can see the fitted line, and this is your regression line through least squares. That is the overall basic concept of regression. We will go to Excel to understand how the calculations are done and how all these coefficients are computed, but before that, let me show one more example to bring better clarity. In the previous example, if you remember, your independent variable was the advertisement cost and your dependent variable was sales, and you found the causal relationship between x and y. But in practice there are situations where your only independent variable is the time period. You can use simple regression here too, for prediction, bringing out the relationship between the time period and sales or whatever variable you want to observe. The periods may be quarters or weeks; they are just time counts, so you are effectively considering time series data. In period 1 the demand was, say, 74, in the next period the demand was something else, and so on. In that case also, you can take the period as the independent variable and the demand as the dependent variable, set up the regression analysis, the trend line, and find the relationship between the time periods 1, 2, 3, 4 and the demand. You fit the line with the same calculations: x̄, ȳ, Σx², Σxy, and then the formulas for B and A, and you get this line. Look at the graph; look how accurately it predicts. These are the data sets you have now, and for the 8th period, if you want to know what the demand could be, you put it into this trend line, the regression line, and read the demand off the line. One more point I should explain here, which applies both to this data set and the previous one. We took the periods 1 to 7 as x; the black points are your observed data, and then you have fitted the regression line, the blue line, and for a given x, say x = 8 or x = 9, you calculate the predicted value ŷ from the line. Remember one point: it is not only for new, larger values of x on the x-axis that you use the line. Since you have fitted the best line by minimizing the squared deviations, and have found the best a and b, the slope and intercept, then even if an x value that occurred before comes again, you cannot take the old observed value as your prediction. You have to take the point on the regression line as the predicted value, not what actually happened earlier; that has already happened and cannot be considered a prediction. Even if you take an existing x value from the data set as input, your output must come from the regression line. Similarly, come back to the previous problem, where we plotted the graph. What was
the x value? The x value was the advertisement cost, and y was the sales value. I have put 53 at the end, but look at the data before that: 53 falls in the middle of the range where you already have observations. Suppose instead of 53 you consider 45. For x = 45 the observed data showed sales of 148. But you have already got a regression line, so for 45 you cannot take that observed 148 as your prediction; you have to put 45 into the line. If next year, or at the next instance, 45 is your x value, your independent variable, you put it into the equation. You might think "45 matches my past data, so I will consider that observed value as my prediction" - no, that was the past data set. Based on the regression line, the fitted line, you have to take the y value from the line. Clear? So that is the overall basic understanding of regression. Now let us go to Excel and see how quickly you can do this; I will come back to this output slide, with the summary between input data and output data, at a later stage. In Excel I have kept the two examples that we studied today. Let us start with the first example: the advertisement cost and the sales are given, and we have to fit the regression line. Go to Data, then Data Analysis, and select Regression. (If Data Analysis is not available on your laptop, install the Analysis ToolPak add-in first, then come back to Regression.) Select the y range, our sales column, and the x range, the advertisement input cells, and tick the Labels box, because we have included the first row with the headings "sales" and "advertisement"; if you select only the data without headings, you do not need to tick Labels. Then choose the output range where you want to keep the output, and solve it. You have your regression output. Look here: you find the R-square, the adjusted R-square, and the standard error; I will discuss those later, but look at the R-square value, 0.87, which is very good, around 90%. The explained relationship between advertisement and sales has been established through the least squares method quite effectively. We call this a measure of goodness of fit, which I will discuss in detail in the next session. Now look at the intercept and slope, these two values; through them you can write your regression line y = a + bx. For x = 53, let me put the formula here: y equals the intercept plus the slope, 0.98 (or perhaps 0.97), times x, and your prediction is 154. Remember, on the graph also we found 154. So once you fit the line, whatever x you put in as a new instance, you will get the corresponding prediction from the regression line y = a + bx that you established here. More detail about this analysis and the overall hypothesis concept for understanding the significance values we will discuss in the next session. Now come to sheet one, where the other example is kept: here we have considered the demand as the output, the dependent variable,
but we did not have any x variable, so we considered the time period as our independent variable and fitted the line here also. Here too you can see about an 80% R-square value, the square of the correlation coefficient, which is very good. The standard error we will discuss in the next session, but if you look at the intercept and slope, the intercept is about 56 and the slope value is 10.53. The p-value here is also quite significant, less than 0.05, which is very good; I will discuss that later. If you come back to the PPT now, here is the summary of that particular data: you can see the intercept, the slope, the p-value, which is very good, and the R-square value, and the regression line y = a + bx becomes y = 56.71 + 10.53x. So simple linear regression is understood, and now for any given x, whether in between the data points or a new value, you will get the output through the regression analysis. That is the summary of simple linear regression. In the next session we will study R-square in detail: the coefficient of determination, the strength of the relationship between the independent and dependent variables, which is very crucial to understand, through graphs and through the concept. Remember, in regression analysis, if you do not understand R-square, the coefficient of determination, and the best-fit concept in detail, then for any new data set you will not be able to judge whether the regression you have fitted is really a good regression, whether there really is a strong relationship or not; that is measured through the understanding of R-square. Then the standard error is very crucial: how much deviation there is from your expected outcome, which you also have to calculate, along with confidence intervals; we will discuss that, together with the number of observations. Then there is the overall ANOVA test of the regression relationship: when there are more independent variables, how do you test whether the regression as a whole should really be accepted? That is done through the F-test of your analysis, and there also the overall p-value of the test should be less than 0.05 for significance. If it is higher than 0.05, the null hypothesis stands, and in that case you will conclude that there is no significant relationship between the independent variables and the dependent variable. Similarly, for the individual variables you have to test significance through t-tests and find the p-values there also. All this we will discuss in the next session as an extension of regression analysis, and then we will enter into multiple regression and the multicollinearity aspects. Thank you.
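As a small preview of the R-square discussion promised for the next session, the coefficient of determination can be computed from the fitted values in a few lines. This is an illustrative sketch; the function name and the toy data are my own, not from the lecture's examples.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot: the fraction of the variation in y
    that the fitted line explains."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    return 1 - ss_res / ss_tot

# A perfect fit gives R^2 = 1; predicting only the mean of y gives R^2 = 0.
perfect = r_squared([1, 2, 3], [1, 2, 3])    # 1.0
mean_only = r_squared([1, 2, 3], [2, 2, 2])  # 0.0
```

Values like the 0.87 and 0.80 reported in the Excel outputs above sit between these two extremes, which is why R-square is read as a measure of goodness of fit.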