In social science research, we typically want to make causal claims. What makes causal claims challenging is that we need to eliminate rival explanations. We can't simply say that because X and Y are correlated, X must be the cause of Y. The gold standard for eliminating rival explanations for a correlation between X and Y is the experiment. In experimental research, we randomize our sample into two groups: the treatment group and the control group. The treatment group receives X, the control group does not, and then we observe the outcome Y after an appropriate time delay. That allows us to make clean causal claims. Unfortunately, in most cases we can't study things experimentally. For example, if we want to study the effect of early internationalization on firm survival, we can't just take firms and randomly assign them to internationalize early. We would have to buy all the firms, and that's impractical. In practice, most of the time we have to work with what we observe. We work with observational data, and we have to use statistical techniques to control for alternative explanations. One of the most commonly used techniques is regression analysis. Different variations of regression analysis cover pretty much 90% of what's published in social science journals. Let's take a look at how statistical controlling for alternative explanations works, how regression analysis works, and what the key ideas are.

The idea of controlling is that when we observe a correlation between X and Y, let's say a correlation between CEO gender and the profitability of a company, we should rule out the potential spuriousness of that correlation. Correlations can exist for multiple different reasons. Singleton and Straits give two examples, and I like the firefighter example. The number of firefighters at a fire scene is highly correlated with the amount of damage after the fire. Can we say that increasing the number of firefighters at the scene causes more damage? Probably not. There is a third variable, the size of the fire, that is the cause of the number of firefighters and also the cause of the damage. If there is a big fire, more firefighters are sent in; if there is a big fire, there will be more damage. The size of the fire creates a spurious correlation between the number of firefighters and the fire damage (a tiny simulation of this appears at the end of this passage). With statistical controlling, we try to eliminate these spurious correlations, or we try to recover the causal effect from an observed correlation that is partly causal and partly spurious.

How do we do that? Let's take a look at our Talouselämä 500 example. This is an interesting finding, just a fact: in 2005, the ROA of women-led companies among the 500 largest Finnish companies was 4.7 percentage points higher than that of men-led companies. Let's assume that we have already ruled out chance as an explanation. So we want to understand whether it is actually the women CEOs that cause this profitability difference. How do we deal with that? We see from our observed data that there's an overlap between CEO gender and performance; the variables are correlated. A bar chart would show that female-led companies' ROA is 18.5% and male-led companies' is 14.1%. These are just made-up numbers, but the difference is roughly in the same ballpark as the 4.7. So how do we deal with these alternative possible explanations? First of all, we need to have some kind of theory of why there would be a correlation that is not causal.
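Here is a minimal sketch in Python of the firefighter example above, with made-up numbers. Both variables are driven by the confounder, fire size, and neither causes the other, yet they end up strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Confounder: the size of the fire (arbitrary units).
fire_size = rng.exponential(scale=10.0, size=n)

# Both variables are driven by fire size; firefighters do NOT cause damage here.
firefighters = 2.0 * fire_size + rng.normal(0.0, 5.0, size=n)
damage = 5.0 * fire_size + rng.normal(0.0, 20.0, size=n)

# The correlation induced by the confounder is strongly positive, yet purely spurious.
print(np.corrcoef(firefighters, damage)[0, 1])
```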
We could, for example, say that there's a third variable, company size, that influences the correlation. We could say that small companies are more likely to have women CEOs and that small companies are more profitable. Therefore, company size causes a spurious correlation between gender and performance. Now, with statistical adjustments or statistical techniques, we want to understand how much of this correlation, marked here with A1, is due to the spurious part C and how much is the causal part A2, assuming that company size is the only factor causing a spurious correlation.

The simplest possible strategy for eliminating the rival explanation of size is to make the companies more comparable by matching. Let's assume that most of the women-led companies have fewer than 250 employees. If men-led companies tend to be larger, then comparing large men-led companies and small women-led companies is not a fair comparison; we don't know whether it's the gender effect or the size effect. What we can do is matching, or analyzing a sub-sample. We could, for example, analyze a sub-sample of companies with 250 employees or fewer and compare. With this sub-sample, we would find that men-led companies are still a bit less profitable than women-led companies, but the difference is not as great as before. Based on this kind of analysis, we could say that yes, there is a correlation between CEO gender and profitability, but with this artificial data it is mostly explained by size differences. This strategy is very simple to understand and very simple to apply, but it is also fairly limited. It is limited because if we have, let's say, five different explanations for the correlation (an industry effect, past performance, a size effect, and perhaps others) and we try to match on many different variables, then we don't really find matches any more. If we want to find two sets of companies that are equal on five different characteristics, our sample will just run out. So in practice, matching works for simple problems; for more complicated problems, we use statistical modeling.

This would be an example of a regression model: we would say that return on assets is some function of CEO gender plus company size. We would code CEO gender as one for female and zero for male, and then we tell our computer to estimate this model for us. The computer gives us estimates of these betas, called regression coefficients, and then we interpret them (a code sketch of this kind of model appears at the end of this passage). I'll talk about regression a bit more later in this video. But these are the two main strategies for statistical controlling: either we try to make the samples more comparable by matching, or we build some kind of statistical model that adjusts for the differences.

To do so, we need to have control variables. Control variables are the alternative explanations for the correlation. If we see a correlation between CEO gender and profitability, and we want to claim that CEO gender actually influences profitability, that is, that the difference in profitability is due to some of those companies having female CEOs and some having male CEOs, we need to consider what a skeptic would claim. If someone does not buy our claim that naming a woman as CEO causes profitability to increase, that skeptic needs to develop a counterargument: for example, that smaller companies are both more profitable and more likely to have women CEOs.
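As a rough illustration of that regression strategy, here is a minimal sketch in Python using simulated data and the statsmodels package; the numbers, variable names, and choice of tooling are my own assumptions rather than anything specified in the lecture. It contrasts the naive model (gender only) with the controlled model (gender plus size):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Simulated firms: smaller companies are both more likely to have a female CEO
# and more profitable, so size confounds the gender-profitability correlation.
log_employees = rng.normal(5.0, 1.0, n)
p_female = np.clip(0.8 - 0.1 * log_employees, 0.05, 0.95)
female_ceo = rng.binomial(1, p_female)
roa = 20.0 - 1.5 * log_employees + 1.0 * female_ceo + rng.normal(0.0, 5.0, n)

df = pd.DataFrame({"roa": roa, "female_ceo": female_ceo, "log_employees": log_employees})

# Naive model: the gender coefficient picks up both the causal and the spurious part.
print(smf.ols("roa ~ female_ceo", data=df).fit().params)

# Controlled model: the gender coefficient is estimated holding firm size constant.
print(smf.ols("roa ~ female_ceo + log_employees", data=df).fit().params)
```

With data generated like this, the gender coefficient shrinks toward its true value once size is added to the model, mirroring the sub-sample comparison described above.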
A skeptic could similarly argue that companies in asset-heavy industries, which tend to be more male-dominated, also have a lower return on assets because the returns are divided by a larger asset base. So we need to consider what the alternative explanations for the correlation are, and those alternative explanations will be our control variables. Quite often when you see an empirical study, you see this kind of section about controls, and this paper by Heckman explains the controls quite well. The authors are saying that these are alternative explanations for the data, and then we rule out those alternative explanations using a statistical model. Importantly, an alternative explanation needs to be correlated with the explanatory variable. So we say that company size correlates with CEO gender, and the control variable also needs to be a cause of the dependent variable. We say that it is not the female CEO that causes the profitability difference; rather, it is company size that causes the profitability difference, and company size is correlated with CEO gender, and that creates a spurious correlation.

Let's take a look at an example of how this works, again from the Heckman paper. We have the correlations and we have the regression coefficients. The regression coefficients tell us what the effect of each variable is after controlling for the others. I'll talk a bit more about regression later in the video, but let's try to understand the idea of statistical controlling. Here we have a correlation between patient satisfaction and the age of the physician. That correlation is 0.09. It is not statistically significant, but we don't care about that now; it is a positive correlation. And then regression analysis tells us that there is a negative causal effect of age. How come? The reason is that there is a spurious correlation in the data. If we look at tenure and age, they are highly correlated at 0.69. Tenure is how long a person has been employed at this medical organization, and age is the age of the person. Obviously, if you are in your late 20s or just 30 and have just graduated from medical school, you can't have much experience; if you are closer to retirement, you typically have a long tenure in the place where you work. So tenure and age are highly correlated for that reason: older people tend to be more experienced than younger people. We can also see that tenure, or experience, strongly affects the dependent variable, patient satisfaction. So now we have a situation where there is a spurious correlation: older people are more experienced, and experience causes patient satisfaction scores to be higher, while age itself has a negative causal effect on patient satisfaction. How do we interpret that? If two physicians have the same amount of experience, patients prefer the younger one; if two physicians are the same age, patients strongly prefer the one with more experience. We can calculate the value of the spurious correlation from this diagram by simply multiplying the regression coefficient 0.33 and the correlation 0.69. That shows there is a spurious correlation of about 0.23 between age and patient satisfaction due to tenure. We can do this for the other variables as well, but if we take this 0.23, the spurious part of the correlation between age and patient satisfaction, and add the estimated causal effect, we get pretty close to the observed correlation of 0.09. Of course there are other variables that can also cause a spurious correlation, but this is the most important one.
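As a worked version of that arithmetic (using the path-tracing rule for standardized coefficients; the final figure is only what the numbers above imply, not a value reported in the paper):

```latex
% Path-tracing decomposition of the observed age-satisfaction correlation:
r_{\mathrm{age,\,sat}} \;\approx\; \beta_{\mathrm{age}}
  \;+\; \beta_{\mathrm{tenure}} \cdot r_{\mathrm{age,\,tenure}}
  \;+\; (\text{paths through other control variables})

% Spurious part running through tenure:
\beta_{\mathrm{tenure}} \cdot r_{\mathrm{age,\,tenure}} = 0.33 \times 0.69 \approx 0.23

% With an observed correlation of only 0.09, the direct effect of age must be
% negative, roughly 0.09 - 0.23 \approx -0.14, if the smaller paths are ignored.
```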
So the idea of statistical controlling is that we take the observed correlation and divide it into two parts: the part that is spurious and the part that corresponds to the causal effect. If we can eliminate all of the spurious correlation from the estimate, then we have a clean causal effect. Regression and other techniques typically use the term "holding constant": control variables are held constant, and this is what Singleton and Straits explain. Holding constant can mean two different things. If we have matching, or some kind of experimental study where we have some control over our sample, then holding constant can mean that if we want to eliminate, for example, gender differences from the analysis, we study only men or only women. Quite often this kind of holding constant, actually making a variable take the same value throughout our sample, is not possible. If we want to eliminate the effects of company size, we cannot sample companies that all have exactly, for example, 100 employees. Another way to understand holding constant is statistical control. In statistical controlling, as in regression analysis, the term holding constant means that we statistically estimate what the difference on one variable would be if all other variables were the same. The other variables are not actually the same, but with statistical analysis we can answer the question of what the difference would be if everything else were the same. For example, what would the difference in ROA between women-led and men-led companies be if the companies were the same size and in the same industry? Statistical controlling allows us to answer these kinds of questions without actually having to observe companies that are all the same size and in the same industry, and this is why it is so useful.

Linear regression analysis, as I mentioned already, is the basic tool for statistical controlling. Many other tools are variants of this technique, and a basic understanding of it takes you a long way in understanding quantitative research and analysis results. The idea of linear regression, in the two-variable case, is that we have the X variable, the explanatory variable, the one we would manipulate in an experiment, and the Y variable, the outcome, the thing we would observe after the experiment. But these are observational data, so we don't manipulate X. What regression analysis does is find the best line that explains the data: it tries to find a line that explains the mean of Y, the dependent variable, for each value of X. In math, we write the regression equation as a weighted sum (see the sketch after this passage). If there are multiple explanatory variables, then the expected value, or mean, of the Y variable is assumed to be the sum of the effects of all these different variables. Of course this is a critical assumption of regression analysis; if the effects act multiplicatively, for example, then we would need a different kind of model. But regression analysis is a good starting point. We assume that the CEO gender effect, the industry effect, and the company size effect are all added together, and that gives us some kind of expected value for the performance.
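Here is that equation written out: first the general form, then the lecture's running example with illustrative variable names.

```latex
% General form: the expected value of Y is a weighted sum of the explanatory variables.
E[y \mid x_1, \dots, x_k] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k

% The CEO gender example (variable names are illustrative):
E[\mathrm{ROA}] = \beta_0 + \beta_1 \cdot \mathrm{FemaleCEO}
  + \beta_2 \cdot \mathrm{CompanySize} + \beta_3 \cdot \mathrm{Industry}
```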
In the example from Heckman's paper, we can say that patient satisfaction is a function, a weighted sum, of physician productivity, physician quality, and physician accessibility. Regression analysis then tells us what the effect of increasing productivity would be if quality and accessibility stayed the same. So regression analysis can be used to answer these kinds of what-if questions: what if one variable changes and the others stay the same? What if we name a woman as the CEO of a company and the industry and size of the company stay the same? That gives us an estimate of the causal effect. Regression analysis is not a magic wand or a magical tool that always gives us valid estimates of causal relations. If you study regression analysis, if you want to become a professional researcher, then you will need to read econometrics books that walk you through the six assumptions needed for regression. But there are really two important things you need to understand to get started in understanding what quantitative research using regression is about.

The first assumption is that all relevant controls are included in the model. If we regress company profitability, ROA, on CEO gender and company size, and a skeptic comes and says no, the correlation that you have is because of industry, then if we don't include industry in our analysis, but industry is actually correlated with CEO gender and an important cause of company performance, the regression analysis is not trustworthy. This applies to all statistical analysis, all quantitative research: having the right controls, the most important alternative explanations, included in the model is critically important. Technically, this refers to the MLR4 assumption.

Another important assumption is that all relationships are linear. So when we increase X by one unit, the effect on Y, the dependent variable, is always the same regardless of the current value of X. That can be true or approximately true in some cases. For example, if we consider the size of a fire and the amount of damage the fire causes, the relationship could be linear: if the building is twice as large and the fire is twice as large, there could be twice as much damage. But it is not always true. For example, if we consider the effects of education, the first nine years of elementary and middle school probably do not make as much of a difference as the final years of your university degree when you are working toward your master's. In that case, the returns to education probably follow an exponential curve more closely than a straight line; not all years are equal (I'll sketch this with simulated data after this passage). So linearity is another important assumption of regression analysis. In practice, when a professional researcher applies these techniques, there are diagnostics that allow us to test for linearity, and there are also diagnostics that help assess whether the relevant variables are included in the model, and so on. But just to understand what regression does and when it is useful, it is important to understand these two assumptions. Regression analysis and its different extensions, for example for the non-linear case, or for the case where observations are clustered, such as students within classes, cover probably at least 90% of the research that is done in the social sciences.
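A small simulated illustration of that returns-to-education point (the data and numbers are invented for the sketch): a straight line forces each extra year of schooling to add the same fixed amount to wages, while modelling the logarithm of wages lets each year multiply wages by roughly the same factor, which matches the exponential shape better.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400

# Hypothetical data where later years of schooling matter more than early ones,
# so the wage-education relationship is convex rather than linear.
years = rng.uniform(0, 20, n)
wage = 1000 * np.exp(0.08 * years) * rng.lognormal(0.0, 0.2, n)

# Straight line: assumes each extra year adds the same fixed amount to wages.
linear = sm.OLS(wage, sm.add_constant(years)).fit()

# Log-linear model: each extra year multiplies wages by roughly the same factor,
# consistent with the exponential shape of the simulated data.
log_linear = sm.OLS(np.log(wage), sm.add_constant(years)).fit()

print(linear.params)      # a constant effect per year, a poor description of the curve
print(log_linear.params)  # slope close to the 0.08 used to generate the data
```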
If you understand regression analysis, then you have a pretty good foundation for understanding other things, because they are simply extensions and variants of this simple technique. The idea of regression is that Y, the dependent variable, is a weighted sum of the X's, the independent variables. So let's summarize. In statistical controlling, the important thing is that we have control variables. A control variable is an alternative explanation for the correlation between the X variable, for example CEO gender, and the Y variable, for example company performance. Controls are held constant either by choosing the sample so that we study, for example, only companies of the same size, or, more typically, they are held constant using statistical techniques. So we use statistical techniques to answer the question of what the difference would be if all the companies were of the same size, in the same industry, and alike on whatever other control variables we have. Controls must be justified instead of just throwing in a standard set: you need to think through why there is a correlation between your important X variable and your important Y variable and then pick controls that rule out those alternative explanations. In practice, the tool we apply to study these kinds of X-Y effects while controlling for other variables is regression analysis. It is the workhorse of causal analysis in non-experimental research. Regression analysis makes two important assumptions. The first is that all relevant controls are included in the model. The second is that the effect of X on Y is always constant, so everything is linear. It is not the case that a small amount of X makes little difference and then, once you have a large amount of X, having more of it makes more of a difference; nor the other way around, where the initial small increments of X make a big difference and later increments matter less and less. Regression analysis assumes that the effect of X on Y is always constant. Many commonly used analysis techniques are simply variations of regression analysis, so if you understand the basics of when regression analysis works and how regression results are interpreted, then you have a pretty solid foundation for understanding other, more complicated techniques as well.