Moving forward with regression, we will talk about the assumptions of regression. As with any statistical test, like the t-test, ANOVA, or any correlational test, we must fulfil the assumptions before we run that analysis. So similarly, there are assumptions that are very important before we run regression on our data.

The first assumption for regression is that the data should be linear, because we are fitting a linear equation; as I told you, for predicting y the formula is y = a + bx. So the relationship must be linear. There are many examples where we have non-linear or curvilinear data: if the dots in the scatter plot follow a curve rather than a straight-line trend, the data is not linear, and running regression on that data could be misleading and could give you wrong results. We can check this assumption easily in SPSS by looking at the scatter plot. The correlation could be low, but the pattern should still be linear.

The second assumption is that, for any two observations, the residual terms should be uncorrelated. As I said, in regression we are predicting a y score, and there is also an actual y score; the difference between the actual y and the predicted y is called the error, or residual. What we want is for the error terms to be independent, which means there should be no autocorrelation. Autocorrelation means that the error made at one observation of the dependent variable is related to the error made at another observation, so the residuals follow a pattern. So for each level of the dependent variable, the residuals should be independent. We usually have to check this in longitudinal and time-series studies, and we can check it in SPSS with the Durbin-Watson test, which tells us whether there is any pattern or correlation in the residuals. I will not go into the Durbin-Watson formula in detail; you can simply request it with a click in SPSS. The rule of thumb is that the Durbin-Watson statistic ranges from 0 to 4, and a value around 2 means there is no autocorrelation, so we are meeting the second assumption. The further the value moves away from 2, roughly below 1 or above 3, the more it suggests that the residuals are correlated.
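If you would like to see the same check outside SPSS, here is a minimal sketch in Python with pandas and statsmodels; this is my own illustration, not something from the SPSS procedure itself. The column names x and y and the scores are hypothetical; the sketch simply fits the regression y = a + bx and prints the Durbin-Watson statistic for its residuals.

```python
# Minimal sketch (illustrative only): fit a simple linear regression and
# check the Durbin-Watson statistic for autocorrelation in the residuals.
# The data and the column names "x" and "y" are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.DataFrame({
    "x": [2, 4, 5, 7, 8, 10, 11, 13, 14, 16],   # hypothetical predictor scores
    "y": [3, 5, 6, 8, 8, 11, 12, 13, 15, 17],   # hypothetical outcome scores
})

X = sm.add_constant(df["x"])        # adds the intercept, the "a" in y = a + bx
model = sm.OLS(df["y"], X).fit()    # ordinary least squares regression

dw = durbin_watson(model.resid)     # ranges from 0 to 4; values near 2 suggest no autocorrelation
print(f"Durbin-Watson = {dw:.2f}")
```

A simple scatter plot of x against y is usually enough to eyeball the first assumption, linearity, before you look at the Durbin-Watson value.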
The third assumption for regression is that, at each level of the predictor, the variance of the residual term should be constant. This is called homoscedasticity. Homoscedasticity is also very important: if we plot the residuals against the predicted values of the dependent variable, the spread of the errors should be roughly the same at every level, and we can easily check this residual graph in SPSS. If the spread is the same across all levels of the dependent variable, we have homoscedasticity. The opposite situation is heteroscedasticity. Heteroscedasticity means that, in the plot of the dependent variable against the residuals, the error variance is larger at some levels and smaller at others. For example, if the dependent variable takes values from 1 to 10, the variability of the residuals is not constant across those values: at one level of the dependent variable the residuals spread out widely, while at another level they are tightly packed, so the plot shows a funnel shape instead of an even band. The spread should be constant, and we can easily check this in SPSS, which we will do in a while.

Then comes our fourth and most important assumption, which we have checked in all the parametric tests, in the F-test and in the t-test: the requirement of normal distribution. Our data should be normally distributed. We have already talked about this: we can check it through graphs, such as P-P and Q-Q plots, and through a statistical test, such as the Kolmogorov-Smirnov test.
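To make the third and fourth assumptions concrete, here is another small Python sketch, continuing from the hypothetical model object fitted in the earlier sketch; again this is my own illustration, not the SPSS output itself. It plots the residuals against the predicted values (a funnel shape would signal heteroscedasticity) and runs a Kolmogorov-Smirnov test on the standardized residuals.

```python
# Minimal sketch (illustrative only): homoscedasticity plot and a K-S test for
# normality of the residuals, using the "model" fitted in the earlier sketch.
import matplotlib.pyplot as plt
from scipy import stats

residuals = model.resid
fitted = model.fittedvalues

# Homoscedasticity: the spread of the dots should stay roughly constant
# across the predicted values (an even band, not a funnel shape).
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# Normality: Kolmogorov-Smirnov test on the standardized residuals.
z = (residuals - residuals.mean()) / residuals.std()
ks_stat, p_value = stats.kstest(z, "norm")
print(f"K-S = {ks_stat:.3f}, p = {p_value:.3f}")   # p > .05: no evidence against normality
```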
These are our assumptions for running regression. Remember that if the assumptions are violated, that is, if the data are not normal, not linear, the error variance is not constant, or there is autocorrelation, then the regression coefficients, which tell us how much variance in the dependent variable is explained by the independent variable, can be exaggerated and misleading. So we have to be careful.

Another assumption, as we talked about earlier in ANOVA, concerns the scale of measurement of the variables. For regression, your independent variables can be continuous or they can be categorical, for example dichotomous. But your dependent variable must be quantitative and continuous, a running score. A lot of students ask me: if we have categorical variables, such as gender, socioeconomic status, marital status, type of family, or family size, can we use them in regression? Yes. As independent variables we can put in all kinds of predictors, whether continuous or categorical. But the dependent variable must be quantitative and continuous, which means it must be on an interval or ratio scale.

Then we have the assumption of no multicollinearity, which comes in when we move to multiple regression. Simple linear regression has one IV and one DV; multiple regression means we have more than one independent variable. Remember that multiple regression and multivariate regression are different terms. In multiple regression we have more than one independent variable but one dependent variable, whereas in multivariate regression we have more than one dependent variable; those are complex analyses that we do not do at the BS level. Multicollinearity means that, when you have more than one independent variable, they may be very highly correlated with each other. For example, if you are looking at the effect of income and wealth on people's well-being, income and wealth are very highly correlated variables. If there is multicollinearity in a regression, the results for the individual predictors can be misleading. So we have to check that there is no very high correlation among the predictors, that is, among our independent variables. If we want to predict mental health and well-being from income and wealth, and the correlation coefficient between income and wealth is higher than 0.8, that means there is a multicollinearity problem. There are different solutions for handling multicollinearity: either you increase the sample size, or, if one predictor is largely redundant, we keep one of them and eliminate the other so that we get clear-cut results. So the predictor variables should not be correlated too highly.

We can check multicollinearity in SPSS. First, we scan the correlation matrix and look at the correlations between the predictor variables: if a correlation is above 0.8, it is problematic; if it is above 0.9, it means one of the variables is redundant and you should definitely delete one of them. SPSS also computes the variance inflation factor (VIF), which indicates whether a predictor has a strong linear relationship with the other predictors, along with the tolerance. So with VIF or with tolerance you can check this assumption. You do not need a multicollinearity check in simple linear regression, because there is only one predictor, but for multiple regression you should definitely request the tolerance and VIF values when you run the analysis in SPSS. If the tolerance is below 0.1, your predictors have a serious multicollinearity issue; if it is below 0.2, there may be a potential problem; above that, it is not a problem. Tolerance and VIF are reciprocals of each other: tolerance = 1 / VIF, so you can focus on the tolerance value alone. Also, if the average VIF is substantially greater than 1, the regression may be biased. So you have to check these values against the cut-offs.

All these assumptions are important because we want to see how well each independent variable predicts the dependent variable, that is, how much variance each predictor explains in the dependent variable. So it is important that you check the assumptions first; if they are not met, either you apply some correction, or you increase the data, or you go for other statistical techniques, the non-parametric ones, which we will also talk about.
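Finally, to wrap up the multicollinearity checks in concrete terms, here is one more small Python sketch of my own. The predictors income and wealth and the outcome wellbeing are hypothetical, and the made-up data deliberately make the two predictors move almost in lockstep, exactly the situation described above; the sketch scans the correlation matrix and then computes VIF and tolerance for each predictor.

```python
# Minimal sketch (illustrative only): correlation matrix, VIF and tolerance
# for a multiple regression with two hypothetical, highly correlated predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "income":    [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],   # hypothetical
    "wealth":    [22, 27, 29, 36, 41, 44, 52, 54, 61, 66],   # hypothetical, tracks income closely
    "wellbeing": [50, 52, 55, 57, 60, 61, 65, 66, 70, 72],   # hypothetical outcome
})

# 1. Scan the correlation matrix of the predictors:
#    r above .8 is problematic, above .9 means one predictor is redundant.
print(df[["income", "wealth"]].corr())

# 2. VIF for each predictor; tolerance is simply 1 / VIF
#    (tolerance below .1 = serious problem, below .2 = potential problem).
X = sm.add_constant(df[["income", "wealth"]])
for i, name in enumerate(X.columns):
    if name == "const":
        continue                      # skip the intercept column
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```

With this made-up data the correlation between income and wealth comes out close to 1 and the VIF values are very large, which is exactly the pattern that tells you to drop one of the two predictors or combine them.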