In research we are quite often interested in theory, and in quantitative research we use numbers to either develop or justify theory. To understand the relationship between theory and data, we need to start by defining what we mean by the term theory, and we also need to understand a little bit of philosophy to know why and how using numbers to justify theory is a justifiable way of doing research.

So theory, first of all, is an explanation. A theory is a statement about relationships and how, when and why those relationships occur. It's different from description, which just explains what happens without the why question. Typically when we write research papers it's easy to say what happens and how, but doing the justification, the why part of the theory, that's the hard part.

In Bacharach's presentation, theory is presented as levels. First we have X, the theoretical concept, and when we go from a high level of abstraction to a lower level of abstraction, we have a construct, which is kind of like a more refined version of a concept according to his definition (there are multiple definitions). Then we have some variables or events that we actually observe. Then we have relationships, so we make claims about these theoretical concepts and their relationships based on relationships between the observable variables. Then we have some kind of boundary conditions and assumptions that are required for this theory to hold. For example, we could say that our theory of CEO gender causing companies to be more profitable is constrained to only countries that are similar to Finland.

To understand the relationship between theory and data, let's look at the Deephouse paper. Deephouse's theory is an elaborate description of why firms that differentiate should be more profitable and why too much differentiation should lead to less profitability.
Their data is from banks. They have 159 banks, they have return on assets as the dependent variable, and then the sum of deviations of the proportions of 11 asset categories. Basically what that means is that they calculated what kind of assets the banks held, and then they compared how different a particular bank was from the industry mean, and that was their measure of strategic deviation. Then they found a negative correlation.

So how does the relationship work? Can we, based on this small correlation, come up with this elaborate theory of how strategic deviation or differentiation is related to profitability? Or is it the other way around? Are we using the theory to come up with some kind of data that we use to justify the theory? The second option is correct. You cannot, based on this correlation only, come up with this kind of elaborate theory, because the correlation only tells us that there is a statistical association. It doesn't tell whether it's a causal association or not. And particularly it doesn't answer the important question of why these two variables are related. So theory typically needs to come from someplace else than the numbers.

To understand further the relationship between theory and data, we need to look a bit at the types of reasoning that we do in science. Inductive and deductive reasoning are the two most common forms of reasoning in science. This is a classical example of induction and deduction. We have a major premise, all men are mortal; a minor premise, Socrates is a man; and then we have a conclusion that Socrates is mortal. Deductive and inductive reasoning refer to knowing two of these and then coming up with the third. So how do we infer from two of these to the third?

In deductive reasoning we know that all men are mortal. We know that Socrates is a man. And then we infer that because the major premise and the minor premise are correct, Socrates must be mortal.
So if these two claims are true, then the conclusion also must be true. Deductive reasoning maintains the truth value. If the conclusion is not true, then it also means that at least one of the premises must be false. For example, if we observe that Socrates is a man, and then we observe that Socrates is not mortal, then our assumed rule that all men are mortal is incorrect. So we can clearly say that if Socrates is a man and Socrates is not mortal, then not all men are mortal. We can refute claims using deductive reasoning by observing something that differs from the expected conclusion.

Inductive reasoning works the other way around. We observe that Socrates is a man, we observe that Socrates is mortal, and then we infer from that that all men are mortal. So we go from a specific case to a generalization. That's inductive reasoning. Of course, just observing that one man is mortal doesn't mean that all men are mortal. The problem of induction is that even if we observe things a thousand times, we can't guarantee that we would observe the same thing in the one thousand and first case. For example, even if all the swans that you have seen in your life have been white this far, it doesn't mean that all swans are white. There are actually black swans in Australia. So inductive reasoning is not guaranteed to work. There is some debate in philosophy of science and the history of philosophy of science about induction, but the general understanding is that inductive reasoning is useful, although it doesn't always lead to the correct conclusion.

Then for completeness I'll also explain abductive reasoning. Abductive reasoning is the case where we know that Socrates is mortal, we know that all men are mortal, and we infer the minor premise. Abductive reasoning is reasoning to the best explanation. We are answering the question: we observe that Socrates is mortal. Why could that be the case?
So we infer that maybe Socrates is a man. This is even weaker than induction, and it's not very commonly used except for generating hypotheses, in which basically everything is allowed, as I will explain in a few slides from now.

So we'll be focusing on inductive and deductive reasoning, and let's take a look at the diagram that I showed you a few slides before. We have different levels of abstraction. We have theoretical concepts. We typically want to say something about these concepts, and we can use the term construct to refer to a theoretical concept. Then we have propositions. Propositions are claims about the relationship between theoretical concepts. For example, we can say that a company's CEO's gender causes profitability differences. Profitability is a theoretical concept, and CEO gender is also a theoretical concept. The relationship between these is a proposition, and the proposition is a part of a theory. It's kind of like a statement that summarizes the main claim of the theory.

Then we have empirical concepts. These are things that we can actually directly measure, and we have a hypothesis. The idea is that if we have a theoretical concept and other theoretical concepts that are related, then we should also have some kind of measures of those theoretical concepts that are related according to a hypothesis. So we make a statistical hypothesis that we actually test. For example, we could say that a doctor's assessment of the CEO's gender, which is directly observable, and return on assets from the profitability report to tax authorities, which is directly observable, are related if there is a relationship between CEO gender and profitability. That is our hypothesis.

Then we collect actual data. We have actual observations from actual companies on things that we can observe, and we test whether there is a statistical association. If there is a statistical association, then we can conclude that the hypothesis and the proposition are supported.
So how do induction and deduction work in this kind of framework? The idea of induction is that we go from a statistical association and we say that because two things are correlated, there must be a theoretical relationship. We observe a correlation between CEO gender and profitability, so we infer that CEO gender, or the gender difference, is the cause of the profitability difference. We go from a specific observation to a general theory. That's inductive reasoning in a research context.

Then in deductive reasoning we have theoretical concepts. We state that if these theoretical concepts are related, then the empirical concepts should also be related, and then the measurement results should also be related. If we observe that the measurement result is not what we were expecting based on deduction, then the theoretical relationship is incorrect. That's how deduction works. We infer what we should observe, and if we don't observe it, then we say that the proposition is not correct. Of course there are many different ways this can fail, but the deductive approach is the dominant way of doing research. We go from theory to measurement results, and then based on the measurement results we say something about the theory. That's the general idea.

So how is this justified, and is it ever justifiable to do induction in quantitative research? Some people say that inductive reasoning is not justified, but it's not so straightforward. Let's take a look at the Deephouse paper again. They have a proposition, and the proposition that we'll be focusing on is that moderate amounts of strategic similarity increase performance. Then they have a statistical hypothesis. The idea here is that if the theoretical proposition is correct, then we should observe a curvilinear, concave-down or inverted U-shaped relationship, one that first goes up and then goes down, between the strategic deviation and return on assets.
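One way to picture a concave-down statistical hypothesis like this is as a quadratic regression in which the coefficient of the squared term is expected to be negative. The following is only a minimal sketch under stated assumptions: the data are made up, the names `deviation`, `roa` and `fit_quadratic` are hypothetical, and nothing here reproduces the actual model specification or data of the Deephouse paper.

```python
import random


def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Each design-matrix row is [1, x, x^2].
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented 3x4 system (X'X | X'y).
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(3):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * p for v, p in zip(a[k], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: performance rises and then falls with deviation.
deviation = [i / 10 for i in range(21)]
roa = [1.0 + 2.0 * d - 1.0 * d * d + rng.gauss(0, 0.02) for d in deviation]

b0, b1, b2 = fit_quadratic(deviation, roa)
turning_point = -b1 / (2 * b2)  # where the fitted curve peaks

print(f"squared-term coefficient: {b2:.2f}")
print(f"fitted curve peaks at deviation = {turning_point:.2f}")
```

In this sketch, a negative squared-term coefficient together with a turning point inside the observed range is what an inverted U-shape would look like. A real test would also need standard errors and control variables, which are left out here.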
Strategic deviation and return on assets are two variables that they can observe directly, and this is used as a test for the proposition. And then finally we have the test with data. They have some data from call reports, and then they calculate the strategic deviation. They run the regression model, and the regression findings can be used to test the statistical hypothesis. It was supported in the paper. Therefore they conclude that maybe the proposition is true as well. So the idea of a proposition is that it's a theoretical claim. Then we have a statistical claim derived from the theoretical claim. And finally we have some calculations that actually test the statistical hypothesis.

Now the million dollar question is: can we infer causal theory from a correlation? Let's take a look at some correlation examples to see why that could or could not be the case. These are correlations from actual observed data. We can see that U.S. spending on science, space and technology is correlated almost perfectly with suicides by hanging, strangulation and suffocation over a 10-year period. Can we make a claim that if the U.S. increases spending, then suicides by hanging will increase? Or can we say that the U.S. should decrease science spending so that fewer people would die by suicide? That's an implausible claim, so it's unlikely to be true.

The reason why there is a correlation is that there could be a common cause. For example, it could be that the population is growing, and when there is more population there are more tax dollars. There is more spending because there are more tax dollars, and also because there are more people there are more suicides. Or this could be because of the state of the economy. Let's say that the economy is growing and both of these are growing as the economy grows, and so forth. Or it could be by chance only. 10 observations, 10 years, is not that much.
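Both explanations, a common cause and pure chance, can be illustrated with simulated numbers. The following is a minimal sketch, assuming entirely made-up data: the series `population`, `spending` and `deaths`, the coefficients, and the choice of 1000 candidate series are hypothetical and have nothing to do with the real statistics behind the graphs.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is reproducible
YEARS = 10

# 1) Common cause: a growing population drives both series.
#    Neither series has any effect on the other.
population = [100.0 + year for year in range(YEARS)]
spending = [2.0 * p + rng.gauss(0, 0.5) for p in population]
deaths = [0.5 * p + rng.gauss(0, 0.1) for p in population]
r_common_cause = pearson(spending, deaths)

# 2) Chance: compare one random series against many unrelated random
#    series and keep the largest absolute correlation that turns up.
target = [rng.gauss(0, 1) for _ in range(YEARS)]
candidates = [[rng.gauss(0, 1) for _ in range(YEARS)] for _ in range(1000)]
r_best_by_chance = max(abs(pearson(target, c)) for c in candidates)

print(f"common cause: r = {r_common_cause:.2f}")
print(f"best of 1000 unrelated series: |r| = {r_best_by_chance:.2f}")
```

In the first part the two series are strongly correlated only because both follow the population. In the second part, with only 10 observations per series, searching through enough unrelated series will tend to turn up some large correlations even though no relationship exists.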
So if you have large data sets and you keep on data mining those data sets, like Tyler Vigen, who made these graphs, did, you will find some large correlations. For example, the number of people who drowned by falling into a pool and the number of films in which Nicolas Cage starred are also correlated. Claiming causality from this correlation would be ridiculous.

So can we ever make a claim based on a correlation? To understand that, we have to understand what the hypothetico-deductive method in science is. The hypothetico-deductive method differentiates between two contexts. First we have the context of discovery. The context of discovery is how we come up with new theories, how we generate theoretical hypotheses. In the hypothetico-deductive method the starting point is the hypothesis. It's a guess of what the result could be. So it's a guess that maybe U.S. spending on space and science actually causes deaths by suffocation and hanging. It's simply a guess, and it doesn't really matter how we come up with the guess.

Then we have the context of justification. How do we justify this claim that U.S. spending on space and science actually increases deaths by suffocation and hanging? In hypothetico-deductive reasoning, or research logic, we apply deductive reasoning. We assume that the hypothesis is true. Then we assume some other auxiliary hypotheses, and then we deduce what things we should observe if the main hypothesis and the auxiliary hypotheses are true. These observable consequences are typically presented in a research paper as statistical hypotheses. The logic of hypothetico-deductive research doesn't really say that the observable consequences must be presented as statistical hypotheses, but that is how it's commonly done.

One thing that is a bit confusing here is that in the hypothetico-deductive method, the hypothesis is the theoretical claim. So that's a hypothesis that we think could be correct.
But in practice, when we apply hypothetico-deductive reasoning, we use the term hypothesis for the observable consequence. That can cause some confusion. So we have a hypothesis about what we should observe if the theory is correct and if some auxiliary hypotheses, which we'll cover later, are correct as well.

Then we can test with data. If our statistical hypothesis is not supported, so we don't observe the predicted observation, then we infer that the theory, the actual initial hypothesis, must have been incorrect. It's of course also possible that one of the auxiliary hypotheses is not correct. The fact that we don't know which of the hypotheses is incorrect is referred to as the underdetermination problem of science. But that's how we do it. If we don't observe something, we infer that the theory is probably incorrect. On the other hand, if we observe the deduced consequence of the theory, then we can claim that the theoretical proposition could be correct. We can't claim that it's definitely correct, because deduction doesn't work that way. We can only refute theories; we can't support theories using deductive reasoning. The way the support for a theory comes is that when it has been tested over and over and it has survived many severe tests, then we can say that if this theory can't be challenged, it's probably true.

So how does this relate to the previous figure? We have induction and we have deduction. We can actually apply inductive reasoning, but only in the context of discovery. So we can make theoretical claims based on a correlation. We can see that there is a correlation between U.S. spending on science and technology and deaths by suicide by hanging. But we can't use the correlation that we used to make the claim, the initial guess or initial hypothesis, to justify the claim. Where the initial theory comes from doesn't matter. What matters is whether we can justify the theory empirically.
Inductive reasoning cannot really be used in the context of justification in quantitative research. It would be very unlikely that a correlation that you happen to observe actually fulfills the requirements that we need for making causal claims without it being the result of a research design that is specifically designed to do that. So deductive logic can be used to justify claims, and inductive logic can be used to come up with the claims. But of course when you come up with a claim, you also have to come up with the justification, unless it's a paper where you just present a claim. For example, papers in the Academy of Management Review only present claims and no empirical evidence.

Of course, in the context of justification we have to have auxiliary hypotheses that are true as well. Here we have an empirical concept, let's say return on assets data from the trade register, and the theoretical concept of performance. We have to have the assumption that this empirical concept is a valid measure of the theoretical concept, and also that the measurement result is a reliable measure of the empirical concept, and that there are no other causes behind this statistical association. So we have to have many, many different auxiliary hypotheses. If we don't observe something, it could be that the initial proposition is incorrect, or it could be that an auxiliary hypothesis is incorrect. We don't know which one it is. We conveniently assume that it's always the proposition, but that's not always the case.

So finally, you can make claims based on statistics, on a correlation. You can make a causal claim based on a correlation. But you cannot justify a causal claim only via a correlation. Sometimes you find interesting correlations in your data, and then you start to think about why there is a correlation. You come up with a theory. That's completely okay, as long as you don't present the correlation that made you come up with the theory as evidence for the justification of the theory. So you have to keep the context of discovery and the context of justification separate.
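The refutation logic and the role of auxiliary hypotheses can be spelled out as a small truth table. The following is a minimal sketch, assuming the simplest possible setup: one theory, all auxiliary hypotheses lumped into a single true-or-false value, and one predicted observation. The function name `consistent_worlds` is hypothetical.

```python
from itertools import product


def consistent_worlds(observed):
    """List (theory, auxiliaries) truth values compatible with an observation.

    The deduction is: if the theory AND the auxiliary hypotheses are true,
    the predicted consequence must be observed. Combinations that violate
    this implication are ruled out.
    """
    worlds = []
    for theory, auxiliaries in product([True, False], repeat=2):
        prediction_guaranteed = theory and auxiliaries
        if prediction_guaranteed and not observed:
            continue  # contradiction: the deduced consequence failed
        worlds.append((theory, auxiliaries))
    return worlds


after_failure = consistent_worlds(observed=False)
after_success = consistent_worlds(observed=True)

print("prediction failed, still possible:", after_failure)
print("prediction observed, still possible:", after_success)
```

When the prediction fails, only the combination where both the theory and the auxiliaries are true is eliminated, so the failure alone does not say which of them is false. When the prediction is observed, nothing is eliminated, which is why a confirmed prediction only means that the theory could be correct.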