Welcome to today's discussions on research methodology. In my first session today, I am going to continue from yesterday, where we discussed how to build regression models. Yesterday we looked at how to connect variables: what the relationship between a pair of variables is, how we show that there is an association between them, and in particular what an appropriate methodology is for fitting a straight line. We looked at the problems that can arise when you fit a straight line, and towards the end of the lecture I discussed the errors that can crop up when you take a non-linear model and transform it into a linear one. The final recommendation was that you probably want to use fairly sophisticated software such as Scilab to fit the non-linear formula to the data set directly. So yesterday had to do with working out whether variables x and y are related, how to do this in the context of building a linear or non-linear model, and how to use that model for prediction. Today I want to go one step beyond that. We continue with the same sets of variables, but this time we ask: given a collection of variables, do we have a feel for what is causing what? In other words, what is a cause and what is an effect in the phenomenon we are trying to understand? I have titled this talk causality and experiment design, and I am going to describe the basics of causality and the basics of good experiment design.
Particularly in science, but also in many engineering domains, the theory we are trying to develop is ultimately a statement about causality. We wish to end up saying that there is a variable x which causes an effect y. But "x causes y" is actually a statement of a hypothesis; it is a model that you hypothesize. So the issue ahead of us is this: when we do our experiments and collect measurements of x and y, how do we go from showing a relationship between x and y to saying that x causes y, or for that matter that y causes x? Ultimately we want to make statements like: smoking causes cancer, a genetic defect causes cancer, exposure to radiation causes cancer. The problem is that what we actually have are measurements: of the amount of radiation present, of the genetic background of the individuals in the study, and of the extent to which cancer has or has not developed. So we have measurements, and we need to go from correlations between pairs of measurements to showing that there is a cause-and-effect mechanism. Now, causality is a word I am going to use often, so it is important that you do not confuse the word causal with casual. They are two totally different words with two totally different meanings. In fact, we are very casual about how we misuse the word correlation and also the word causality. When I say causal, it is in the context of a phenomenon which has a cause that ultimately produces an effect. Causal phenomena and causality are the focus of my talk.
So make sure that you distinguish what I say from the word casual, which has a totally different meaning; if you used the word casual when describing the phenomenon you are working on, you would end up totally ruining the science you are trying to explain. Our headache is that when we do our experiments and look at our measurements, we are actually establishing relationships between pairs of variables, and if you remember yesterday's discussion, those relationships are written using an equality symbol. All our algebraic equations had an equality sign in them, and I pointed out yesterday that we really do not have a notation which allows me to say that x causes y or that y causes x. The closest we came to such a notation was conditional probability. With conditional probability you can talk about the probability of x given that y has been seen, which reads as if y has happened first and you are then seeing x. That is the closest we get to an arrow relating cause and effect. So what we get from our experiments is really a listing of relationships between pairs of variables, in other words correlations: a set of ideas about which variables are correlated with which other variables. If you want to show that there is a cause and effect, and that a certain variable is the cause of some other effect, then you are going to have to do some more work. Simply collecting data and showing that there is a straight-line model between x and y does not give you additional information about what caused what. We saw this yesterday: if you write the equation y = A x, you can always rearrange it to write x as a function of y. So there is no hint in such a linear relationship that something is a cause and something else is an effect.
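To make the symmetry concrete, here is a minimal Python sketch using NumPy. The data are made up purely for illustration (the slope 2.5, the noise level and the sample size are all assumptions, not values from the lecture). It shows that the correlation coefficient is the same whichever variable you call x, and that a straight line can be fitted equally well in either direction.

```python
import numpy as np

# Hypothetical data: y = A x + noise, with A = 2.5 chosen arbitrarily.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + rng.normal(0, 1.0, size=200)

# Correlation does not care about the order of the variables.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# A straight line can be fitted in either direction.
slope_y_on_x = np.polyfit(x, y, 1)[0]  # treat x as the "input"
slope_x_on_y = np.polyfit(y, x, 1)[0]  # treat y as the "input"

print("r(x, y) =", r_xy, " r(y, x) =", r_yx)
# For least-squares lines, the product of the two slopes equals r squared.
print("slope product =", slope_y_on_x * slope_x_on_y, " r^2 =", r_xy**2)
```

Nothing in either fit tells you which variable is the cause; both directions are equally acceptable descriptions of the same data.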
All we show is that there is a correlation between the two variables. If you look back in history, several people have struggled with this. Galileo was one of the first to say, in effect: describe the phenomenon first, do not get trapped into figuring out what is a cause and what is an effect, just start collecting data and describe it. Subsequently a whole bunch of scientists came up with all kinds of empirical laws, but when these laws were first proposed they really had nothing to do with cause and effect. If you go back and look at them, they are equality statements; they say that some variable is related to some other variable. For example, Ohm's law says that the voltage drop across a resistor is related to the current flowing through it, but it is not clear what is causing what. These were all empirical laws obtained by observation when they were first proposed. Some time later, David Hume writes about flame and heat, and if you look at his writings carefully, there seems to be the first hint that if there is a flame, that flame causes heat. The cause is the flame; the effect is the heat you experience when you go near a candle. This is the first kind of systematic observation in the literature which suggests a causal connection between two phenomena, in this case the flame and the heat. What he suggests is that such connections are things which, for example, children immediately learn, and therefore he implies that one can reason out and learn causal connections between two variables. In other words, one can work out, for a relationship between two variables, what is causing what. Many people also struggled with how to show what is causing what when it comes to a rooster crowing at dawn and the sun rising at dawn. How do you prove what is causing what? Is there a relationship?
Notice that by and large roosters crow at dawn, and sunrises of course happen at dawn. From a mathematical perspective there is a correlation between the two events, since they happen at the same time. The practical issue is whether one can go from there to saying that one event is causing the other, and a simple system like this caused a lot of trouble to philosophers. Given a set of observations about roosters and about sunrises, how do you work out whether something is causing something else? Let us take something closer to us, an engineering example. Go back to our force, mass and acceleration example. There is a theory which says that force equals mass times acceleration, F = m a. Since this has three variables, given any two of them I must be able to determine the third. That is common sense; we are taught this in school. But it becomes surprisingly difficult if I ask you what is a cause and what is an effect. For example, is force caused by acceleration? If you think back to what you did in school, you will recall being told that if you apply a force, an object will accelerate. So in effect the force is presumed to be the cause and the acceleration is presumed to be the effect, for a particular object with a certain mass. But how do we know this from the equation? If I simply look at F = m a, how do I know that force causes acceleration and not the other way around? For example, if I rearrange it I get F / a = m, and then you are left with the question: does force over acceleration cause mass? Intuitively that is not something we agree with, but the equation does not prohibit that kind of interpretation. Because F = m a is an equality, it does not rule such an interpretation out.
So we have conceptual problems with figuring out what is causing what, given such equality statements. In the early 1900s Karl Pearson, whom I talked about in the context of the history of statistics, advised that when you look at equations with equality symbols between pairs of variables, you should talk only of correlation and not of causation. In other words, limit yourself to comments about what is correlated with what, and do not try to go from there into a discussion of what is causing what. This advice has more or less shaped the way science has been done over the last 100 years. We look at collections of variables in whatever phenomenon we study, and we limit ourselves to saying that there is a correlation between pairs of variables. If you want to come up with a hypothesis about cause and effect, you probably need to go well beyond the experiments you have done. To this day most statisticians, and for that matter scientists in general, are very cautious about claiming a causal mechanism when the data only show correlations, and that is thanks to the advice given by Karl Pearson. But it is also known that if you do additional intervention experiments (I will explain what an intervention experiment is), then, given knowledge of the correlations between variables, you may be able to infer the direction in which cause and effect runs. It turns out that studies of what is causing what require at least two features: consistency and temporality. I will describe each in turn. Consistency concerns the procedure you use to make your measurements; in your domain, with your problem, you will have a certain procedure for making them.
Consistency says that regardless of the procedure you use for your measurements, the experiment and the process should proceed in a consistent fashion. What this is basically saying is that the act of making a measurement must itself not disturb the experiment. If you are interested in finding the cause and effect in a phenomenon and you plan on making measurements, it is critical that the act of measuring does not disturb the very nature of the experiment. The other desired feature is temporality, which is common sense to most of us: the cause happens before the effect. There is a cause, and because of that cause something happens later on. So there is a dependency on time, which is what temporality means. Most people who study cause and effect in their particular phenomena are concerned with these two aspects. First, does the experiment itself change as a result of making a measurement; in other words, will making a measurement mess you up? Second, is there a clear distinction between something happening before something else; in other words, is there a clear influence of time on the outcome of your study? On the first point, quantum mechanics tells you that consistency is not possible. It tells you, for instance, that the very act of trying to measure the location of a particle prevents you from getting a good estimate of its velocity. You cannot pin down both the location and the velocity of a particle at the same time. The more precisely you determine the position of a particle, the less you know about its velocity, and vice versa. So consistency, at least in the quantum mechanics domain, is not possible.
It is also not possible in several other domains. An example I have gone to several times is the clinical trial. In a clinical trial, somebody takes a drug and we try to figure out whether the drug improves their health. Where is the issue with consistency? The question being asked, narrowly, is whether people who take the drug do better than people who do not. But in all of this, the act of testing on a group of patients must itself not influence the health of the patients. Now, what invariably happens in a clinical trial is that patients are pulled out of their homes and put in a hospital in some corner, and there, in a controlled fashion, they are given either the drug or a sugar pill. The act of taking people out of their homes and putting them in a hospital is itself causing some kind of stress on the patients. So there is a hidden variable here: the stress caused by not going about one's routine life and work. It is possible that this other variable influences the outcome of your experiment, when all you were aiming for was to make sure that it was the drug that caused the improved health. Strictly speaking, then, a clinical trial is not a consistent experiment, simply because we cannot avoid the fact that providing a drug to a patient under controlled conditions implies putting the patient in a hospital, which in turn causes stress. Now to the issue of causality and time. Almost every single physical law out there does not single out a direction for time. In other words, it is reversible with respect to time. Go back to F = m a, or think about the position of a ball as you throw it up in the air.
One can always work out the coordinates of the ball relative to the ground as you throw it up in the air from time 0. But the fact of the matter is that one could put a negative time in there and still work out the coordinates of the ball from that formula. There is nothing in there which prohibits you from having a negative time axis, in other words from treating time as reversible. There is at least one law, however, which does involve an arrow for time: the second law of thermodynamics, which basically says that systems steadily decay towards disorder. Systems evolve to further and further disorder, therefore implying that there is a directionality to time. But this sets up a very curious contradiction for the study of causality. If you are studying cause and effect, and systems are decaying as a function of time, then they are decaying towards a final disordered state. The whole evolution of a system is such that it finally ends up in the disordered state, and one could therefore argue that the cause of change in nature is that the final state must be one of disorder. The cause is then in the future, while the evolution or variation of state that we are seeing is in the present. So the effect is in the present and the cause is in the future, and this of course is the opposite of what we required of temporality. The point, therefore, is that there is no hard and fast rule about what features a causal study must have. Ideally one needs consistency and one needs control over temporality; we should have a feel for what is happening before something else, which is what temporality means. But invariably, with certain types of studies, you will not be in full control of both features.
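Returning for a moment to the ball thrown up in the air, the reversibility point can be checked with a few lines of Python. This is only a sketch under stated assumptions: the standard constant-gravity formula with no air resistance, and a launch speed of 12 m/s picked arbitrarily. The function name `height` is purely illustrative.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed constant, no air resistance)

def height(t, v0):
    """Height above the ground at time t for a ball thrown straight up
    with speed v0 at t = 0, using h(t) = v0 * t - 0.5 * G * t**2."""
    return v0 * t - 0.5 * G * t**2

v0 = 12.0  # hypothetical launch speed in m/s

# The formula accepts a negative time without complaint.
print("height at t = -0.5 s:", height(-0.5, v0))

# Time reversal: run the clock backwards and flip the sign of the velocity.
# The formula gives exactly the same heights, so nothing in it picks a
# direction for time.
for t in [0.0, 0.3, 0.8, 1.5]:
    forward = height(t, v0)
    backward = height(-t, -v0)
    print(t, forward, backward)
```

The forward and backward columns agree because replacing t with -t and v0 with -v0 leaves both terms of the formula unchanged.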
So we have said that a correlation between x and y is a relationship between x and y. Another way to word this is that the moment I realize that variables x and y are related, there is some causal model out there, at the moment unknown to me, which explains the relationship between x and y. If I see a correlation, it is because x and y were already related by some cause-and-effect mechanism. There has always been a causal relationship involving x and y, and the data now show a correlation between them, but the precise relationship remains unknown. It is unresolved. Now, there are three possible scenarios the moment you say that the data suggest x and y are related. The first possibility is that x is the cause and y is the effect. Then of course you can turn it around and say that y is the cause and x is the effect. But there is a third scenario which is actually quite important and which hurts us in most of our experiments: both x and y are themselves effects caused by something else. That is a subtle one. We are more comfortable with the first two scenarios, where x is the cause or y is the cause, but in several experiments, unless we pay careful attention, it could turn out that everything we are measuring is actually caused by something else. To clarify this third scenario, I am going to give you a set of examples and ask you to reason along with me in figuring out how to deal with third-party variables. These are to some extent made-up examples, but hopefully the point will be clear. In this particular plot the magnitudes of the data points do not matter, but I am claiming a linear relationship between x and y. I am going to say that x is the number of crimes per year in a city.
Each point on this plot is a city, and for each city I plot on the x axis the number of crimes that occur per year. Then I am going to plot something on the y axis, and most of you, if you pause and think about it, will try to work out different things to plot there. What I will plot on the y axis is the number of temples or churches in the city. Now, if you look at this plot, you will realize that you are immediately heading for trouble in your research if this is the kind of data and relationship you come up with. Think of the interpretation of this plot. The data, by the way, is actually realistic. If you take different cities and count the number of crimes per year, and separately count the number of temples or churches, you will find that a straight-line relationship emerges. So this is realistic, but think of the interpretation: if I plot this data and publish it in a journal, what does it mean? Are we to say that if a large number of crimes happen in a city, that city needs more temples, because people have to go to the temples to confess their crimes? Is that one interpretation? If you do not like this interpretation, where x is the cause and y is the effect, then turn it around. Are we to interpret that because there are a large number of temples in a city, people go and commit crimes in that city? You can immediately see that if I plot y against x arbitrarily, without a prior reason to do so, I could end up in all kinds of tangles over how to interpret the plot. What, then, is the correct interpretation of this data? There is actually nothing technically wrong with the data itself.
The correct interpretation, which hopefully some of you have identified, is that x was not directly related to y. Both the number of crimes per year and the number of temples in a city were actually proportional to the population of the city. If there is a large population in a city, you will automatically have more crimes, and if there is a large population, you will need more temples to cater to all the people. So this is an example where there seems to be a relationship between x and y, but you probably had no business comparing x and y directly. The more scientifically relevant relationships were between x and z, where z is the population, and separately between y and z. The moral of this example is that quite often we collect pairs of measurements, we think there is a relationship between them, and we proceed to write up theory and publish, claiming that we have found some insight connecting x and y. The fact of the matter is that we may have been seeing the influence of something else. It is almost like a shadow puppet show. Those of you who have seen shadow puppet shows will know that there are actually three-dimensional puppets being manipulated by the artist, but what you see on the screen is the shadow. You see the effect as a shadow and you think you can interpret what is going on behind the screen with the 3D objects, but you may occasionally get your interpretation wrong. Fundamentally, this implies that you have got to worry about third variables, or for that matter any other variables, which might influence the interpretation of your data. The question of course is which way causality runs between x and y, and your conclusion about causality could be very misleading depending on how you present your data.
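The crimes-and-temples example can be imitated with simulated data. The following Python sketch is only an illustration under assumed numbers: the population range, the crimes-per-thousand and temples-per-thousand rates, and the noise levels are all invented, and the helper name `residuals` is hypothetical. Crimes and temples are each generated from population alone, with no direct link between them. The raw correlation between them still comes out high, and it largely disappears once the straight-line dependence on population is removed from both.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cities = 300

# z: population of each city, in thousands (made-up range).
population = rng.uniform(50, 5000, size=n_cities)

# x and y each depend on z only; neither depends on the other.
crimes = 4.0 * population + rng.normal(0, 1500, size=n_cities)   # x
temples = 0.2 * population + rng.normal(0, 80, size=n_cities)    # y

def residuals(v, z):
    """What is left of v after subtracting its best straight-line fit on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Raw correlation between x and y: looks like a strong relationship.
r_raw = np.corrcoef(crimes, temples)[0, 1]

# Correlation after removing the influence of z from both variables
# (a simple partial correlation).
r_partial = np.corrcoef(residuals(crimes, population),
                        residuals(temples, population))[0, 1]

print("corr(crimes, temples)               =", r_raw)
print("corr after controlling for population =", r_partial)
```

In real data you can only do this if you have thought of the third variable and measured it, which is exactly the difficulty described above.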
Here is another example where relationships can be misleading. Again, the text on this plot may not be visible to you; that does not matter, hopefully you can just see the red and blue trends. These are trends as a function of time, and what is plotted on the y axis is the net difference in the matches Pakistan won against India since the early 1980s. Most of you will recall that in 1986 there was one famous match at Sharjah where Javed Miandad hit a six off the last ball to win the match for Pakistan. That of course was an important event in cricketing circles, but here is somebody publishing data in a medical journal, claiming that the 1986 match had so much of a psychological effect on both countries' ability to play the game, particularly in matches between the two, that Pakistan started winning increasingly often relative to India over the next few years. The trend is shown only until the late 1990s, and the claim is that because of that one match, that one six, the psychology of cricketers on both sides was affected so much that Pakistan started a winning streak against India. How is that claim being made? Hopefully you can just about see a black arrow at the bottom of the vertical line; that vertical line is 1986. Before that, India and Pakistan were winning pretty much the same number of matches against each other. The red line is for all matches played, the blue line is for one-day internationals, and there is a lower, flatter line for test matches. So the claim is that overall, regardless of one-day or test matches, Pakistan started winning: before 1986 the lines appear flat, and after 1986 there seems to be a linear rise in all the curves.
As you can imagine, the moment this got published it started all kinds of fights on the internet between Indians and Pakistanis, and all kinds of elaborate explanations were pulled out to try to explain the data. But there is one simple explanation for this behavior which does not have to invoke the psychology of the Pakistani cricketers relative to the Indian cricketers. Remember, the hypothesis is that the last-ball six influenced psychology so much that Pakistan started approaching matches with a more positive attitude and started winning. What is the alternative explanation? Is there yet another variable which explains this? The answer is that, in general, from around 1985 onwards Pakistan started playing cricket well regardless of who they were playing against. They played well against India, of course, but also against England, the West Indies and Australia. In that relative sense, there is nothing special about the trend we are seeing here. The moral is that if you selectively pull out data, you can make claims and proclaim trends, but the fact of the matter is that the difference between Indian and Pakistani performance is nothing unique. The influence was not so much that Indian psychology was hurt by that last-ball six, but rather that Pakistan in general started playing better cricket regardless of the opponent. In the published interpretation of the data there was no attempt to involve the third variable, namely that Pakistani performance had improved on average, with the net result that one could claim a very controversial trend like this, which created all kinds of fights online, as I said. There are plenty of examples like this in the literature, where people claim relationships between variables and then go on to claim that there is a cause and effect.
So that last-ball six supposedly caused a psychological effect. Here is another example. There is a study which claims that in coastal towns in the United States, if you plot the number of ice creams sold per month against the number of drowning deaths per month, the two seem to be nicely related. So it looks like there is a relationship between drownings and the number of ice creams sold. For most of us that is immediately senseless, but the data actually show a linear relationship. How do you disprove it, or what could be happening? Once again, this is a case where there is a third variable which nobody is involving in the analysis: during the summer months more people are likely to go to the beach and swim, which is of course when most drownings are likely to happen, and it is also during the summer months that most people are likely to buy ice creams. So there is a neat explanation for why the number of ice creams sold per month relates to the number of drowning deaths per month. In all of these examples, there are relationships between pairs of variables which can lead to misleading interpretations. Here is one final example of how simply working with correlations can be misleading and dangerous. This is a journal article published in the Proceedings of the National Academy of Sciences in the United States. I am not sure the text is clearly visible to you, so I will just read out the title: warming increases the risk of civil war in Africa. What this is basically talking about is a year that promises to be a hot year.
If it promises to be a very warm year, then many fights will break out in Africa and countries will start going to war. The fine print in this study was that if you work out that the next year is going to be a warm year, then because people are going to go to war, you really do not want to pump money into Africa as investment that year. You can see how such a study immediately impacts policy making: warming as a cause, wars breaking out as an effect, and the final consequence is that money is not pumped in as aid to the African countries. What is the problem with this kind of study? Once again there is a simple explanation which need not lead to such a drastic result. Any time you have a very warm year, you are very likely to run out of water, and any time you run out of water you are going to go hunting for it; you are going to go to neighboring countries to see if you can get access to their water supply and their rivers. Fundamentally, people go to war because they need access to water. If you tackle the water supply to all these countries, there is no need to go to war. Rather than focus on that, there is a simplistic analysis here that global warming, and the high temperatures that might result, might lead to fights breaking out, without acknowledging the variable that really needs to be studied: how does the average temperature in a year correlate with water availability? That should have been the more important thing to study.
In a nutshell, then, there are several examples where we limit our analysis to a few variables which we think we can understand, which we can play with, and whose measurements we think we can get. The problem is that in doing so we lose focus on the true relationships with other variables which we have not even bothered to build into the model. So the philosophy is this: as you go about your experiments, either you have the simple scenarios, x causes y or y causes x, or, more likely than not, you have the difficult scenario where both x and y are caused by some other variables, in which case those are the variables you have got to learn about. The question now is: if you are really interested only in x and y, and other variables are present which might mess up your experiment, how do you control the influence of these other variables? These other variables go by several names: hidden variables, lurking variables, latent variables, and confounding variables, confounding as in confusing. They are variables which might confuse the analysis because they have simply not been included in the model being discussed. It turns out that a large number of the experiments we do have to cope with the possibility that hidden variables exist. Given that we do not fully understand the system in most of the models we develop, you end up having to do experiments in a way that minimizes the influence of any other variables that might exist out there. If you look at all the kinds of experiments people have done over the years, they can broadly be classified into two types.
One of these is the controlled experiment, where you deliberately choose values for all the variables that you think might be important in your model and then perform the experiment. The other type was proposed by Fisher about 100 years back: the randomized experiment. So there is the controlled experiment and the randomized experiment. I will describe each of these now in some detail, so that you get a feel for how we actually carry out most of our experiments.