Let's take a look at the regression analysis that I got from my former colleague, Pasi Kuusava. Pasi is a smart guy and a business researcher, and he wanted to address the research question of what really determines company performance. He was studying return on assets as his dependent variable. His sample size was fairly small, only 100 observations, but on the other hand he had a very rich data set of 50 variables covering all these important concepts. He decided to use a regression analysis, which is what we normally do for this kind of data. So he ran a regression analysis explaining return on assets with these 50 variables.

Pasi's initial results were quite impressive: he got an R-square of more than 50%, so we are explaining more than half of the variation in company performance. If we can do that with just 50 variables, that is probably worth writing a paper about. But there is a small problem in Pasi's analysis, and the thing is that a model with 50 variables is too complicated. You can't really sell that as a coherent theory to a major journal. So what Pasi did then was to think about how to make the model more parsimonious. What if we trim the model down to only those variables whose p-values are less than 0.05, which are highlighted here? We will reduce the size of the model to make it more publishable. So Pasi chose those variables and then redid his analysis.

The results were still impressive: he got an R-square of more than 40% with just about a dozen variables. But still, if you say that performance depends on a dozen different factors, that's not as cool and impressive as saying it depends on, for example, only 5 or 6 factors. So he decided to focus only on the variables that remained at least marginally significant, those with p-values less than 0.10, which some people use as the threshold for marginal significance.
That allowed him to drop a further 5 variables from the model, resulting in about 10 variables being included. So he reran the analysis with this smaller set of variables and got an R-square slightly below 40%, which is still impressive. If you can explain almost half of performance with about 10 variables, that is great. And then Pasi proceeded to write this up.

So what's the problem here? Pasi didn't actually have any empirical data, and he didn't really write a paper about this. Instead, this was a demonstration for his class of how regression analysis can fool you if you do data mining. His data were actually just random noise. He generated the data set as 5,100 independent random draws from a normal distribution (100 observations of 51 variables), which means his data were purely noise. There were no statistical relationships in the population and no causal effects whatsoever underlying the data. He just generated random data that are uncorrelated in the population, gave the variables fancy names, ran a regression analysis, and got a 40% explanation with no underlying structure whatsoever.

This analysis illustrates two problems with regression analysis. One is that if your sample size is small compared to the number of variables, and Pasi had only two observations for each independent variable, then the R-square measure will be inflated. If you go back to the first slide, you can see that the R-square is actually not statistically significant and the adjusted R-square is pretty close to zero. That indicates that the variables don't actually explain the dependent variable. But because R-square is positively biased, it appears as if you were explaining the dependent variable, which you are in the sample, but it doesn't generalize to the population. That is the first problem. The second problem is that if you choose your variables based on what the results are, then all of your statistical tests will become invalid.
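Pasi's demonstration is easy to replicate. Here is a minimal sketch in Python with NumPy (my own reconstruction, not Pasi's actual code): draw 100 observations of 51 pure-noise variables, regress one of them on the other 50, and compute R-square and adjusted R-square.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 50

# 100 observations of 51 pure-noise variables:
# 50 "predictors" and one "return on assets"
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# OLS with an intercept via least squares
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta

ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Under pure noise, R-square is expected to be around k/(n-1), which is roughly 0.5 with 50 predictors and 100 observations, so the model looks impressive even though nothing is going on, while adjusted R-square hovers around zero and tells the true story.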
Remember that the p-value tells you the expected false positive rate when there is no effect in the population. So here, among the initial 50 variables, about five tests were expected to come up as false positives just by chance. If you choose exactly those variables, you will always have statistically significant results even though there is nothing going on in the population. So your estimates and your tests will be biased.

This technique that Pasi applied is called stepwise regression analysis. There are different strategies for how you do it, but it's basically either running a big model first and then allowing the computer to trim the model using some kind of decision rule, or alternatively starting with an empty model with no independent variables and a large pool of potential variables and then letting the computer choose which variables go into the model. The objective is to explain the variation of the dependent variable.

This bullet explains that the significance tests will be invalid if you apply this technique, but it doesn't really caution against the technique. There are others, including me, who take a much stronger stance against stepwise regression analysis. Allison, in his regression book, refers to this as automated variable selection. He takes a negative stance, saying this is not something that you want to do, but he doesn't really explain why. Then Kline's book on structural equation modeling provides the most pointed comments on this technique: he says that stepwise regression is something that the computer does for you, and the computer is not very good at generating models.
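The trim-and-refit procedure can be sketched as follows. This is a simplified reconstruction, not Pasi's actual code: to keep it self-contained with NumPy alone, it screens on |t| > 1.68, which is roughly the two-sided p < 0.10 cutoff at about 50 degrees of freedom, then refits on the survivors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 50
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)  # no real effects anywhere

def ols(X, y):
    """Return R^2 and coefficient t-statistics (intercept excluded)."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta = XtX_inv @ X1.T @ y
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (n - k - 1)          # residual variance
    se = np.sqrt(sigma2 * np.diag(XtX_inv))       # coefficient std. errors
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return r2, (beta / se)[1:]

# Step 1: full model with all 50 noise predictors
r2_full, t_full = ols(X, y)

# Step 2: keep only the "significant" ones and refit
keep = np.abs(t_full) > 1.68   # ~ two-sided p < 0.10 screen
r2_trim, t_trim = ols(X[:, keep], y)

print(f"full model: R^2 = {r2_full:.2f}, {keep.sum()} predictors pass the screen")
print(f"trimmed model: R^2 = {r2_trim:.2f}")
```

Every true coefficient here is zero, yet the survivors of the screen were selected precisely because they happened to have large t-statistics, so they will tend to look significant again when refit. The nominal p-values in the second regression no longer mean what they claim, which is exactly why selecting variables on their test results invalidates the tests.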
So instead of looking at what the computer does for you, you should be using the best research computer in the world, your own brain, to choose what goes into your model. Kline's slogan is "death to stepwise regression, think for yourself," and I think that's a pretty good recommendation, because the problem with choosing variables automatically is that you will be capitalizing on chance and you will include variables that don't have any theoretical grounding. If you just build your model from the data, you will find a large number of false positives, or chance explanations, or at least inflated effects. And if you then theorize afterwards and present that as if it was the model you initially planned, that's very unethical, because it doesn't really tell the reader what you wanted to do. If you try to publish something with a stepwise regression analysis in a good journal, you're likely to be desk-rejected, because using stepwise regression is such a bad idea. And if you do a stepwise regression but present it as if you chose the variables yourself, and don't tell anyone that you used stepwise regression, that is lying, and that's unethical.