This video is about the assumptions of regression analysis, or more precisely the assumptions behind the least squares estimation principle. In regression analysis we have a model; in this example it is a model with a single independent variable X. The model generates the data, and the assumptions of regression analysis concern how those observations are generated: the observations are independent, they come from a random sample, and for any value of X they are normally distributed here around the regression line. So that's basically a summary of the assumptions, and now we will take a look at specific parts of those assumptions.

Before we do so, we have to talk a bit about what the assumptions mean, because there are some misconceptions. For example, sometimes students in our classes say that an estimation technique requires that the data are normally distributed, and from that they conclude that we should check whether our data are normally distributed before the analysis. There are a couple of problems with that statement. To sort them out, we need to understand what a model is and what an estimation principle is, and what exactly the assumptions refer to; as we will see later, for example, the normality assumption is about the error term, not about the observed data.

Here is a regression model. Y is modeled as a weighted sum of the X's, the observed independent variables, plus some error term U that the model doesn't explain. Then we have estimation principles. How do we choose the betas? Which set of betas is the best? One good rule is the OLS rule: minimize the sum of squared residuals. So we choose the betas so that the sum of squared residuals, the squared differences between the observed value Y and the fitted value calculated from the betas, is as small as possible. That's what we are discussing in this part.

But that's not the only way of estimating a regression model. For example, we could use weighted least squares. Weighted least squares is the same as OLS, except that instead of minimizing the sum of squared residuals, we minimize a weighted sum of squared residuals. The idea of weighted least squares is that some observations provide us more information about where the regression line goes than others, and in some scenarios weighted least squares is better than OLS. To understand what those scenarios are, we have to understand the assumptions. But that's not all; there are others. There is feasible generalized least squares, which is the same as weighted least squares except that it estimates the weights from the data. That makes somewhat fewer assumptions than weighted least squares, and there are tradeoffs in that. We also have iteratively reweighted least squares, or IRLS. The idea of IRLS is that it reweights the residuals iteratively, so that the weights for the next iteration are based on the previous iteration. It is a good technique when you have outlier observations, which I'll talk about in another video. All of these techniques can be used in different scenarios. They all work reasonably well in some conditions, and in some conditions one of these rules is clearly better than the others. To understand that, we have to understand the assumptions. We can also use different models.
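To make the estimation principles concrete, here is a minimal Python sketch of the OLS rule and the weighted least squares rule as computations. The simulated data, the weight choice, and all variable names are illustrative assumptions of this sketch, not anything specified in the video.

```python
import numpy as np

# Simulated data for a one-predictor model: y = 1 + 2*x + u (numbers are made up)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column

# OLS rule: choose the betas that minimize the sum of squared residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS rule: minimize a weighted sum of squared residuals. The weights below are
# arbitrary placeholders; in practice they reflect how informative each
# observation is, for example the inverse of its error variance.
w = 1.0 / (1.0 + x)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("OLS betas:", beta_ols)
print("WLS betas:", beta_wls)
```

With well-behaved data like this, the two rules give essentially the same betas; the differences only matter when the assumptions discussed below start to fail.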
So the regression model is not necessarily the best model. For example, instead of a regression model, we could apply a generalized linear model, which takes the fitted values of the regression, applies a function to them, and then does not make the assumption that the observations are normally distributed. That's one alternative model. So you can choose either an alternative model or an alternative estimator when your data don't really fit the model and estimation combination that you were planning to use. Here's another one: a multilevel model. This would be applicable when you have, for example, longitudinal data, so you have multiple observations for each company and many companies in the data, and you assume that there are some constant differences between companies that persist over time. Then you would use that kind of model, because you are in violation of the random sampling assumption of regression analysis. So there are different things that you can use. I recommend always going with regression analysis and OLS estimation as the default option. If you have a good reason to use something else, then do that, but start with OLS and the regression model, because it will tell you something about the data that you didn't know before estimation, and it's quick to calculate. Then you go to more complicated things if specific assumptions of OLS don't really fit your research scenario.

Okay, so what are the assumptions? Assumptions are requirements for certain proofs. When we say that OLS requires that the error term is normally distributed, it means that it has been proven that OLS is consistent, unbiased, efficient, and normally distributed as an estimator when, among other assumptions, the error term is normally distributed. So certain proofs require these assumptions, and if we can't assume certain things, then the proof can't be done. For example, if the error term is not normally distributed, then we cannot prove that the OLS estimates are exactly normally distributed in small samples. They could be, but we can't prove it.

These assumptions imply one important thing, and they do not imply another. What they do imply is that the estimator is useful when we are close to these ideal conditions. Regression analysis assumes that the relationships in the data are linear; if they are close to linear but not exactly linear, regression analysis will still be a useful tool. So the assumptions don't have to hold exactly. If they hold closely enough, we will still get good results. What they do not imply is that an estimator proven to be consistent under some scenario is immediately useless in other scenarios. The fact that something has been proven in one condition doesn't mean that it does not work in another condition. But it's important to understand the limitations of these different techniques, and for that we typically test the assumptions after we do our analysis. So now we have understood that the assumptions are something that should ideally hold, but in practice they hold only approximately, and we have understood that a violation of, for example, the normality assumption in regression analysis doesn't necessarily have any severe consequences: it just means that certain things can no longer be proven, and the things that we can't prove could still be true. Let's take a look at the actual assumptions.
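Before that, to make the model choice concrete, here is a rough sketch of what switching from the default regression model to a generalized linear model can look like with statsmodels. The count-type outcome, the Poisson family, and the variable names are assumptions of this sketch chosen only for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated count outcome; the data-generating mechanism here is a hypothetical example.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

# Default starting point: a linear regression model estimated with OLS.
ols_fit = sm.OLS(y, X).fit()

# Alternative model: a generalized linear model that applies a link function to
# the fitted values and does not assume normally distributed observations.
glm_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print("OLS coefficients:", ols_fit.params)
print("GLM (Poisson) coefficients:", glm_fit.params)
```

Note that the two sets of coefficients are not directly comparable, because the GLM models the log of the expected count rather than the count itself; the point is only that an alternative model is available when the default combination does not fit.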
Regression analysis, or more precisely OLS estimation, requires four assumptions to provide consistent and unbiased estimates. The unbiasedness property here refers to any sample size: OLS is unbiased regardless of the sample size, so you can get unbiased estimates with a sample of ten observations. The estimates will be very imprecise, but they are still unbiased. The first assumption is that we have a linear model. That assumption basically just defines the model, and that's all there is to it. The second assumption is random sampling. Random sampling means that all observations are independent and that each observation in the population has an equal probability of getting selected into the sample. This is a feature of your research design and it can't really be tested empirically in a direct way; you can test some aspects of random sampling, and I will talk about that later. Then we have two other assumptions. Assumption three is that there is no perfect collinearity. Perfect collinearity is different from multicollinearity: perfect collinearity means that one or more of the independent variables in the model is completely determined by the other independent variables. For example, if we define a categorical variable with three dummy variables, then knowing the values of two of the dummies lets us infer the third. This assumption requires that every new variable that we enter into the model brings new information about the phenomenon. Let's use gender as an example: we only need to know whether a person is or is not a male, because if he is not a male, then we know that she is a female. Having both a variable for male and a variable for female would therefore be perfectly collinear, because knowing whether a person is a man automatically tells you whether the same person is a woman. That's perfect collinearity. Assumption four is the zero conditional mean. That is a technical way of expressing it, but it basically says that we assume the error term is uncorrelated with all explanatory variables. This is a more complicated assumption that I'll explain in another video, but it is also referred to as the no-endogeneity assumption. If we look at this diagram of regression analysis, assumption number four can be understood as saying that where this distribution of observations is centered does not depend on where we are on the regression line: the distribution is always centered exactly on the regression line, instead of, for example, the line going here and the observations being normally distributed somewhere over here. That is called the no-endogeneity assumption, and endogeneity is a big issue if we want to make causal claims using observational data. I'll return to that in another video. So under these four assumptions, OLS is unbiased and consistent. We still have two more assumptions that OLS makes, required for the consistency and unbiasedness of the standard errors and for the normality of the estimates. Standard errors are unbiased and consistent if the error term is homoskedastic, that is, there is no heteroskedasticity. What this assumption means is that the observations are equally spread out around the regression line. We would have a heteroskedasticity problem if the observations were close to the regression line here but far from the regression line there: instead of observing a band of observations around the regression line, we would observe a funnel shape that opens up, a megaphone shape.
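Going back to assumption three for a moment, here is a tiny sketch, with made-up data, of what perfect collinearity looks like numerically: with both a male dummy and a female dummy next to an intercept, the design matrix loses full column rank and a unique OLS solution no longer exists.

```python
import numpy as np

# Hypothetical gender dummies: 1 = male, 0 = not male (illustrative data only)
male = np.array([1, 0, 1, 1, 0, 0])
female = 1 - male                      # completely determined by 'male'

# Design matrix with an intercept and both dummies
X = np.column_stack([np.ones(len(male)), male, female])

# Perfect collinearity: intercept = male + female, so the rank is 2 while the
# matrix has 3 columns, and the betas are not uniquely identified.
print("rank:", np.linalg.matrix_rank(X), "columns:", X.shape[1])
```

In practice, the software will refuse to complete the estimation in this situation, which is why this assumption rarely needs a separate check.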
That funnel or megaphone shape is exactly what the homoskedasticity assumption rules out. These five assumptions together are known as the Gauss-Markov assumptions, and OLS is efficient under them. More importantly, the homoskedasticity assumption is required for the standard errors to be unbiased and consistent. That matters because the t-statistic that we use for statistical inference, for the p-value, requires that both the estimate and the standard error are consistent and unbiased. Under those conditions the t-value will follow the t-distribution when the null hypothesis of no effect holds, and we get proper p-values. So that's the fifth assumption. Then the final one, the one that most people are probably most aware of, is the normality assumption. This one is also misunderstood. Regression analysis does not assume that any observed variable is normally distributed. Instead, it assumes that the error term, the unobservable part, or how much the observations vary around the regression line, is normally distributed. This assumption actually implies assumptions four and five, and assumptions one through six together are called the classical linear model assumptions. In practice, the normality of the error term assumption can be ignored, because the OLS estimator is what we call asymptotically normal. This means that when the sample size increases towards infinity, the regression estimates will be normally distributed regardless of how the error term is distributed in the population. In practice, the sample sizes that we use, a hundred or a few hundred observations, are enough for this asymptotic normality to kick in. I have tried to construct scenarios where the lack of normality of the error term would be problematic with 50 or more observations, and I have failed, so I cannot think of a scenario where this normality assumption is a practical concern for an applied researcher.

Let's summarize the assumptions. We had six assumptions. First, all relationships are linear. That can be checked after the model has been estimated; how we check it, I'll cover later. Second, independence of observations: they must be a random sample. This is a feature of your research design, and you can check the independence of observations after estimation under certain scenarios. Third, no perfect collinearity and non-zero variance of the independent variables. If that fails, the regression model cannot be estimated. For example, if you are studying the effect of gender on performance in a statistics course and you only observe women, then you have no variation in the gender variable and you cannot estimate a gender effect. Likewise, if you have two variables that quantify the exact same thing, you can't enter both into the regression model. This does not need to be checked separately, because when you run the regression analysis you will know if it fails: the regression doesn't complete. Fourth, the error term has an expected value of zero given any values of the independent variables. In practice, this means that all other causes of the dependent variable that are not included in the model must be uncorrelated with the causes that are included in the model. That's a strong assumption. It cannot be tested directly after least squares estimation, but we can test it with instrumental variables, which I'll cover in a later video. Fifth, the error term has equal variance given any values of the independent variables; this is the no-heteroskedasticity assumption.
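To illustrate the asymptotic normality point from a moment ago, here is a small Monte Carlo sketch: even with a clearly skewed, non-normal error term, the sampling distribution of the OLS slope is already close to normal at a sample size of one hundred. The simulation settings are my own illustrative assumptions, not results quoted in the video.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
slopes = np.empty(reps)

for r in range(reps):
    x = rng.uniform(0, 1, n)
    u = rng.exponential(1.0, n) - 1.0                       # skewed error term with mean zero
    y = 1 + 2 * x + u
    X = np.column_stack([np.ones(n), x])
    slopes[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]     # OLS slope estimate

# Skewness of the sampling distribution of the slope; values near zero suggest
# that the estimates are roughly normally distributed despite the skewed errors.
skew = np.mean((slopes - slopes.mean()) ** 3) / slopes.std() ** 3
print("mean:", slopes.mean(), "sd:", slopes.std(), "skewness:", round(skew, 3))
```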
The homoskedasticity assumption should be checked after estimation, because violations influence the standard errors of regression analysis, and if you have a heteroskedasticity problem, it is easy to fix. Sixth and last, the error term is normally distributed. I typically check this because it is useful to know whether some of the observations are far from the regression line, to identify outliers. But other than that, this is not an important assumption.
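To close, here is a rough sketch of how those last two checks could be run in Python with statsmodels. The simulated data, the choice of the Breusch-Pagan test, and heteroskedasticity-robust standard errors as the easy fix are my own illustrative choices, stated as assumptions in the comments.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data where the error variance grows with x (a heteroskedastic setup)
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
X = sm.add_constant(x)
y = 1 + 2 * x + rng.normal(0, 0.5 + 0.3 * x)

fit = sm.OLS(y, X).fit()

# Heteroskedasticity check: Breusch-Pagan test on the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# One easy fix: heteroskedasticity-robust standard errors
# (the coefficient estimates do not change, only the standard errors do)
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print("robust standard errors:", robust_fit.bse)

# Normality check of the residuals, mainly useful for spotting observations
# that are far from the regression line (potential outliers)
standardized = fit.resid / np.sqrt(fit.mse_resid)
print("largest |standardized residual|:", np.abs(standardized).max())
```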