It is indeed marvelous, an irony-free zone where everything is beautiful and nothing hurts, where everybody, regardless of race, creed, color, or degree of inebriation, is welcomed. Its warm yellow glow, a beacon of hope and salvation, invites the hungry, the lost, the seriously hammered, all across the South to come inside, a place of safety and nourishment. It never closes. It is always, always faithful, always there for you. Waffle House really is a source of comfort and safety for many people who live in the Southern United States. I went to college in Atlanta, Georgia, and Waffle House really was always open, no matter how late we were up studying or partying or whatever it was. But Waffle Houses do sometimes close. Sometimes there are storms that make it impossible for them to open, but it takes a really extreme event, because Waffle Houses prepare for disasters. They have their own generators and extra water supplies and stockpiles of extra materials, so they can be open when nothing else is. As a consequence, the United States Federal Emergency Management Agency has an informal Waffle House index. If the Waffle House is closed or only operating at partial capacity, that means it's a serious event, and they mobilize additional resources to the damaged area. Why am I talking about Waffle House? Well, because it's awesome, but also because I'd like to use it to introduce the topic of spurious correlations. So Waffle House is an incredibly dependable restaurant, perhaps one of the most dependable restaurants worldwide, but it is reliably statistically associated with divorce rates. Parts of the Southern United States with a greater density of Waffle Houses per capita also have higher divorce rates. On the graph on the right, I'm showing you on the horizontal axis Waffle Houses per million population, plotted against the divorce rate in each state in the United States.
You'll see that there is a positive relationship, and the states with the greatest numbers of Waffle Houses have some of the highest divorce rates. Now, this is an implausible kind of causal relationship. There's no sense in which 24-hour availability of waffles and hash browns is going to cause divorce, but in the data, it's a relationship that's quite hard to remove statistically, no matter which other variables you try to stratify by. Nature is full of correlations. Correlation is commonplace, and this is one of the things that makes it a really bad guide to causal relationships. Here's one joke correlation from this nice website, Spurious Correlations. I give you the URL at the bottom. It shows the time series correlation between the divorce rate in Maine, which is a state in the northeastern United States, and the per capita consumption of margarine; this is a correlation of 99.26%. It's really hard to find stronger correlations, yet surely this is not causal. The point of these examples, the Waffle House example and this silly margarine and divorce example, is that we have to go beyond correlations, or associations more generally. As I keep saying, correlation is a very narrow and weird measure of association. It's purely linear and it does not generalize well, so we speak more generally of associations. You have to go beyond them and connect them to scientific models to make them interpretable. So in this lecture, what I wanna do is start to introduce that, and this week we're gonna spend a lot more time on this issue of the logical connections between scientific models and how they help us interpret statistical models, because the unfortunate truth, or rather the fortunate truth from my perspective, is that no statistical model can be understood unless there is some separate set of scientific models that allows us to understand the causal forces that may be at play.
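To see how easy it is to manufacture a "correlation" like the margarine one, here's a small sketch. The numbers are made up; the only thing the two series share is an upward trend over time, which is enough to produce a strong association with no causation anywhere. (The lecture's own code is in R; this sketch uses Python.)

```python
import random

random.seed(1)

# Two hypothetical series that share nothing but an upward trend over time,
# standing in for margarine consumption vs. the Maine divorce rate.
years = range(50)
series_a = [2.0 * t + random.gauss(0, 5) for t in years]
series_b = [0.5 * t + random.gauss(0, 2) for t in years]

def pearson(xs, ys):
    # Plain Pearson correlation, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# The shared trend alone produces a strong "association".
print(round(pearson(series_a, series_b), 2))
```

Time itself is acting as a common cause here, which is exactly the fork structure introduced next.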
So in this lecture, I have a modest goal, and that is to introduce you to what I call the four elemental confounds. The word confound often has a very special meaning in statistics, but I'm gonna use it in its normal English sense rather than its narrow statistical sense. A confound is anything that confuses you, and there are basic structural forces in the relationships between scientific models and the data sets that they can generate that lead us to be confounded. The good news is there are only four of them, and I'm gonna spend the rest of this lecture introducing you to these four elemental confounds: the fork, the pipe, the collider and the descendant. Then in the next lecture, I'm gonna show you how we can build much more complicated scientific models, or interpret much more complicated scientific models, simply as being built up out of these four elemental confounds. Let's start with the fork. The fork is illustrated heuristically by the DAG in the upper right there. There's a variable z, which we're gonna call a confounder. This is the narrow sense of confound as it's often used in statistics, and z is a common cause of two other variables, x and y. And we're interested in this example, as we will be in all the others in this lecture, in the causal relationship between x and y. So the thing about the confounder is that it creates a statistical association between x and y, because it's a common cause. I'm gonna start introducing you to some common notation for this. So x and y are associated, and in red there I'm showing you the standard notation for this: the independence symbol with a slash through it, which looks a bit like the sort of thing you find in a parking lot that stops you from driving through. It means "not independent of", so y is not independent of x. That's what that little symbol with the slash through it means. They share a common cause, z, and that's why they're not independent; they're statistically associated.
If you get information about one, you have gained information about the other. That's what association means here. However, once you stratify the data set by values of z, you will find no association between x and y within each of the sub-samples defined by each level of z. If that wasn't clear, don't worry, I've got more slides coming up to show you this graphically, but the notation for this is going to be: y is independent of x conditional on z; that's what the independence symbol without the slash means. So let's think about this a little bit more graphically. Here's the fork, and I'm going to animate this, and you'll see that the thing is that z is a common cause, and so whether the little traveling balls are empty or filled, as a metaphor for different values that z could take, they always transmit the same value to x and y, because x and y are jointly influenced by z. The detailed influence of z on x and y can be different, but z takes the same value in each case, and therefore x and y end up having some association as a consequence of this common cause, z. That's the fork. It may be easier to think about this if I augment this graph a little bit. In all of these DAGs, as heuristic causal models, the idea is that there are always other causes that we have failed to draw that are not associated with the other variables in the graph, and these are noise in a sense. I've added two noise nodes, as it were, to this graph, e sub x and e sub y. These are the statistical error, if you will, in x and y, the unexplained variance, unexplained by z, and so x and y are different from one another. They're not just copies of one another, and they're not just copies of z; they have some independent variation, and this is what allows us to de-confound these analyses. So again, I'll start animating it.
You'll see that z is always the same value, always either filled or empty, as it travels to x and y, but the noise that's entering x and y can be different, and this means that x and y are not perfectly associated with either one another or with z. So now when we stratify by levels of z, we're isolating the noise from e sub x and e sub y. Let me show you what that looks like in an actual worked code example now. So we're looking at the fork. Again, I show you the DAG at the top. Keep in mind that DAGs are heuristic causal models, but we have to fill in functions for these things to really simulate data from them, and so this is what I've done here in the code block on the screen. I've simulated a thousand observations for three Bernoulli variables, which are zero-one variables. The Bernoulli distribution is just the coin flip distribution, heads or tails, but it doesn't have to be a fair coin. In this code, first we generate the confounder z, and I make it a fair coin, so a thousand coin flips, with a probability of 0.5 that it's a one and a probability of 0.5 that it's a zero, and then from that we simulate x and y. They both have the same distributions, but they're independent of one another conditional on z. The way to read this code is that when z is zero, there's a probability of 0.1 that x or y takes the value one, and when z has the value one, there's a probability of 0.9 that x or y has the value one. This is the causal influence of z on x and y, and those coins are flipped, as it were, independently. Now let's look at the cross tabs of this synthetic data set, and you can run this code yourself and play around with examples.
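The code on the slide is in R; as a hedged translation, here is the same fork simulation in Python, printing the marginal correlation and the correlation within each stratum of z. The seed and exact values are arbitrary.

```python
import random

random.seed(42)
n = 1000

# The fork: z is a common cause of x and y.
z = [1 if random.random() < 0.5 else 0 for _ in range(n)]             # fair coin
x = [1 if random.random() < (0.9 if zi else 0.1) else 0 for zi in z]  # depends only on z
y = [1 if random.random() < (0.9 if zi else 0.1) else 0 for zi in z]  # same, independently

def corr(a, b):
    # Pearson correlation, computed from scratch.
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# Marginally, x and y are strongly associated because of the common cause...
print("overall:", round(corr(x, y), 2))

# ...but within each stratum of z, the association vanishes.
for zval in (0, 1):
    xs = [xi for xi, zi in zip(x, z) if zi == zval]
    ys = [yi for yi, zi in zip(y, z) if zi == zval]
    print("z =", zval, ":", round(corr(xs, ys), 2))
```

The overall correlation comes out strong (around 0.6 with these probabilities), while the stratified correlations hover near zero, which is exactly the fork signature described next.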
I just used the table command in R to create these cross tabs. If you're looking at this little primitive table on the screen: when x takes the value zero and y takes the value zero, there are 397 of those cases; when x takes the value zero and y takes the value one, there are 84 of those cases in the synthetic data set; and so on, so the table sums up to 1,000. You'll see that y is not independent of x, and that's because of their common cause; that is, the diagonal values are much bigger than the off-diagonal values. If we just compute the correlation, and this is not the greatest measure of association in the world, but it's familiar to everybody and it'll illustrate the example, the correlation between x and y is 0.63 in this particular example. But now let's stratify by z, and when we do that, y and x become independent of one another, meaning there's no meaningful association between them. So in all the cases where z equals zero, we can do the same cross tabulation between x and y, and now you'll see that the diagonals are not bigger as a proportion than the off-diagonals. The same is true for when z equals one, and if you cast your eyes towards the bottom right of the slide, you'll see the correlations for each stratified part of the data set; the correlations are now very small. It might be nice to look at this for continuous variables too, to help prime your intuitions. There's nothing special about Bernoulli variables; it's just that for some people Bernoulli variables are easier to think about, and for other people, continuous variables. So here's an example with continuous variables. Again, the code on the left will produce the plot on the right, but we can talk about the plot on the right here. We have a continuous variable x and a continuous variable y, and for the total data set, I'm showing you the regression line through all values of x and all values of y; that's the black line.
Again, for all values of x and all values of y, the black line is the best fit regression line through them, and you can see there's a positive association between x and y, and that's because of their common cause z. But the cases, the values of x and y, where z equals zero are shown in blue, and the values of x and y where z equals one are shown in red, and then I have the regression lines only for those subsets, for each stratified set, shown correspondingly in blue and red, and you see that there's no meaningful association between x and y within each stratified sub-sample according to z. And that's the fork: x and y are not independent of one another because they share a common cause, but if we stratify by, or condition on, that common cause, the association vanishes. That's the confounder. Let me give you a worked data example now. We're gonna come back to Waffle House in a sense. We're gonna come back to the divorce rate problem, and we're gonna leave behind the silly Waffle House example, because it's obvious that Waffle Houses don't cause divorce. There must be some spurious correlation there. But let's deal with something that's of a little bit more scientific value and actually try to model divorce rate in the same US states. So there's this curious fact about divorce rate in North America: it's statistically strongly associated with marriage rate. States where people get married at higher rates also have higher rates of divorce. Why is that true? Does marriage cause divorce? It's a funny thing to say. There's a sense in which it's required for divorce to happen, but it's not necessarily obvious that just because people get married more in a particular place, they would also get divorced more. Should we interpret this relationship as causal? How would we measure it? So we're gonna draw the owl here, or in this case, marry the owl or divorce the owl, as it were.
The estimand is the causal effect of marriage rate on divorce rate, as illustrated by that simple DAG in the upper right. And then we're gonna move down our flow chart here and build up a scientific model and then the corresponding statistical model, and then I'll show you the analysis. So let's think about other influences on divorce rate. One of them, the one I wanna focus on for the pedagogy here, is age at marriage. Age at marriage, or median age at marriage, or really the whole distribution of age at marriage, but we'll focus on the median as a focal point. It varies a lot among the United States, and I'm showing you on the right here median age at marriage on the horizontal axis, in years of age, against the divorce rate in each state, and I've colored the southern states of the United States in red here. Those are the states where there are lots of Waffle Houses, just as a visual anchor for you. And again, you'll see, just with your eyes, you can run the regression in your head, there's a strong negative relationship here between the median age at marriage and divorce rate. Places where people get married later have substantially lower divorce rates. So Washington DC, in the bottom right of this graph, has one of the lowest divorce rates in the country, and it has the highest median age at marriage in the country. So this is another cause we might wanna consider, and so let's construct a DAG that has both age at marriage and marriage rate as influences on divorce rate and try to deal with our estimand. So now our problem is the direct causal effect of marriage rate on divorce, but we have this fork. Age at marriage is a potential confounder, because it plausibly influences both divorce rate and marriage rate, right? Why would age at marriage influence marriage rate and not the reverse? If people get married younger, there are more people to get married. That's just a fact about how human populations work.
I'll say that again: if people get married younger, there are more people to get married. Marriage rate is an outcome of marriages. It's not going to cause age at marriage in any sense, but if age at marriage varies, it can influence the rate of marriage, and so the arrow must go from age at marriage to marriage rate. Then both of these plausibly influence divorce somehow, and we want to measure any direct influence of marriage rate on divorce, dealing with this confound. So how do we do that? Well, now we've got our scientific model, and we're going to move on to step three. How do we deal with this confound? We know how to deal with it, because we can identify it as a fork; that is, age at marriage is a confounder of the relationship between marriage rate and divorce rate. So how do you break the fork, as it were, and remove the confound? You stratify by age at marriage. I'll say that again. How do you deal with the confound created by the fork? You must stratify by the confounder: stratify by age at marriage. Of course, we're also interested in estimating the potential causal influence of age at marriage on divorce, and we can do that, luckily, in the same model in this case. Also on this slide I'm showing you the bivariate plots between all three of these variables, and you can see that they're all strongly associated with one another. It's not possible to tell just from these associations what's going on. We need to draw the arrows in the DAG in the upper left. Okay, a little bit about what it means to stratify by a continuous variable. Age at marriage is not binary, and in the simple examples I showed you with x, y and z, I always made z binary, so it was easy to think about, and when you stratify by a binary variable, it's easy to understand what you do.
You just divide the data set and then look at the correlation or association or regression within each sub-sample. But when you have a continuous variable like age at marriage, what does it mean to stratify by it? Well, it means to use in your model whatever functional relationship you've assigned for the confounder's influence on the outcome, and that creates the stratification. It creates the sub-populations. Why? Because it means that for every value of the confounder, you've got a different expectation for the outcome, so you can estimate a different relationship between marriage rate, or whatever cause you're interested in, and the outcome. So in a linear regression, it's as simple as adding to the linear model another term with the confounder variable, in this case A sub i, which is the median age at marriage, and a slope for it as well. And so what happens then is for each value of A sub i you get an additional term in the equation for mu, and that gives you a different intercept. It changes the expected amount of divorce before you even consider the marriage rate, and this creates a stratification. There are other ways to stratify; you can also do interactions, for those of you who know interactions, and we'll talk a lot more about those later. But the right way to do the stratification depends upon the causal model. It depends upon the function you assign by which, in this example, age at marriage and marriage rate influence divorce rate. Now, this is what I just said verbally; I should have advanced, so I'll say it again: every value of median age at marriage produces a different relationship between divorce rate and marriage rate. So you can think about it this way: from the perspective of marriage rate, age at marriage is just something that makes the intercept conditional on age at marriage.
I'll say that again: from the perspective of marriage rate, the age at marriage variable is just something that makes the intercept conditional on age at marriage, and this creates a stratification effect. Okay, let's build the regression. You're familiar with linear regressions by now; we talked about them all last week. So we need priors. Now we've got the structure of the model, the structural part, and now we need some priors. In this case, and this is part of drawing the owl, I'm going to introduce a common practice that is a very good default when working with linear regressions, especially linear regressions with more than one predictor variable: we're going to standardize all the data variables. Let me tell you what that means. When we standardize a variable, we subtract its mean and then divide by the standard deviation, and what this does is turn the variable into a set of z-scores; that is, the values represent numbers of standard deviations from the mean, positive or negative. So for the value of one on a standardized variable, you can look in the graph on the right here, where I've replotted median age at marriage against divorce rate, but I've standardized both of these variables now. You'll see that the value zero is the mean of each of those variables, you can see that eyeballing it I think, and then the values on the axes, the ones and twos and threes and minus ones and minus twos, those are standard deviations above and below the mean. Why would we do this? There are a couple of reasons. The first is that it often makes computation much more efficient inside the computer, and the second, and this is partly a consequence of the first, is that it's often easier to choose sensible priors for standardized variables, because we have some general understanding of how big effects could be in terms of how many standard deviations of change they could create. Let's do some prior predictive simulation so I can show you what I mean here.
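Before that, note that the standardization step itself is tiny. A sketch in Python (the lecture's code is in R; the ages here are made up for illustration):

```python
import statistics

def standardize(xs):
    # z-scores: how many standard deviations each value sits from the mean.
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical median ages at marriage for a handful of states.
ages = [23.2, 25.7, 24.9, 26.4, 28.1, 29.0]
z = standardize(ages)

# A standardized variable always has mean ~0 and standard deviation ~1.
print(statistics.mean(z), statistics.stdev(z))
```

After this transformation, a value of 1 means "one standard deviation above the average state", regardless of the variable's original units.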
Let's put in some off-the-shelf, default kind of priors here and simulate from them, and I'll draw these up in a second. So on the left I'm showing you the model filled out with the priors, and I've assigned Normal(0, 10) priors, so 10 is the standard deviation, which is a variance of 100. It's very common in traditional textbooks on Bayesian statistics to see incredibly broad regression priors like this, with variances of 100 or sometimes 1,000 or even more. These are essentially flat normal distributions from the perspective of the outcome space, and that sounds like a good idea, but I wanna convince you it's not. So the code on the right part of this slide will sample from these priors. We just use rnorm to sample from them, and then I make an empty plot, and then for each of the sampled combinations of alpha and the slope for median age at marriage, I plot the implied regression line, and this is what it looks like. You can see that these are crazy lines, and the reason is that the prior for the slopes is really broad, and so that's saying that most of the prior probability mass is for really strong relationships, either positive or negative. I'll say that again. When you assign a Normal(0, 10) prior to a regression slope for standardized variables, that's saying that before you see the data, almost all the probability mass is for very strongly positive or negative relationships between the two variables, in this case basically impossible relationships. So remember, almost all the data, all the states, are contained within two standard deviations of the mean for both median age at marriage and divorce rate. These essentially vertical regression lines on this prior predictive plot make no sense at all scientifically. It's not hard to do better.
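You can get a feel for this without plotting anything: for standardized variables, just count how much prior mass a Normal(0, 10) slope prior puts on absurdly steep relationships. A hedged Python version of that check (the slide's code uses rnorm in R; the threshold of 2 is an illustrative choice, since a slope of 2 standard deviations of outcome per standard deviation of predictor is already extreme):

```python
import random

random.seed(0)
n_samples = 10_000

# Slopes drawn from the broad Normal(0, 10) prior on the slide.
broad = [random.gauss(0, 10) for _ in range(n_samples)]

# For standardized variables, |slope| > 2 already implies a relationship
# stronger than almost anything seen in real data of this kind.
frac_broad = sum(abs(b) > 2 for b in broad) / n_samples

# A tighter Normal(0, 0.5) prior keeps nearly all its mass in a plausible range.
tight = [random.gauss(0, 0.5) for _ in range(n_samples)]
frac_tight = sum(abs(b) > 2 for b in tight) / n_samples

print("P(|slope| > 2), Normal(0, 10): ", round(frac_broad, 2))
print("P(|slope| > 2), Normal(0, 0.5):", round(frac_tight, 4))
```

The broad prior puts the large majority of its mass on these implausibly steep slopes, while the tighter prior puts essentially none there; that's the whole argument of the prior predictive plot in numerical form.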
You can play around a bit, and I encourage you to, but here are some defaults that work pretty well and still cover the whole scientifically plausible outcome space. So for the intercept, Normal(0, 0.2), I know that sounds narrow, but let me show you what happens, and for the slopes, Normal(0, 0.5). And now when we sample, we get regressions that cover the whole scientifically plausible space, that is, everything from a totally overwhelming effect of median age at marriage on the divorce rate, that is, from the lowest values to the highest values there are regression lines that are essentially upward diagonals on this plot, and the reverse, and a bunch of lines that are essentially no relationship at all. Much better priors. Now of course, as I've said before, with these simple linear regressions you can have really bad priors and get away with it; it won't make much difference. But it's good to practice now at assigning scientifically sensible priors, because when we get to more complicated models, you'll be glad you practiced. Okay, we've almost married the owl. We've got the estimand, we've got the scientific model that gives it meaning, and we've designed a statistical model in light of both the estimand and the scientific model, and now it's time to actually run the code. So here's the quap code. If you have gone through the lectures from last week and you've worked through the book up through chapter four and chapter five, where this example is, then there are no surprises here, but I put the quap code and the corresponding model formula in mathematical notation on the right, so you can see the correspondence. The new thing to note here is that I'm using an exponential distribution for sigma, and I'm gonna be doing this for the rest of the course: typically for scale variables my default prior distribution will be Exponential(1).
An exponential distribution is constrained to be positive, which is what we need for a scale variable like sigma; standard deviations can't be negative, they must be positive, and they can't be zero either. The only information in an exponential distribution is the average displacement from zero. I'll say that again: the only information in an exponential distribution is the average displacement from zero, and so an Exponential(1) means that the average displacement is gonna be about one standard deviation. There's nothing else in it, no other information in this prior, and it covers a wide range of possible standard deviations; if sigma ends up being much bigger than one, it can do that, and it can also end up being much smaller. This is a very low information content sort of prior distribution for a scale parameter. Okay, we fit this model. I'm just showing you the precis plot here; this is the so-called forest plot of the high density regions of the posterior distribution of each parameter. Unsurprisingly, the intercept a is straddling zero. It basically has to for standardized variables, because when the means are zero, then when the x-axis variable has a value of zero, the y-axis variable is expected to have a value of zero; that's the meaning of alpha, and so we get that confirmation back. Then we look at the slopes, and what I want you to see is bM, the slope coefficient for marriage rate, and it's small. The fact that it straddles zero isn't what's so important; what's important is that even if it were at its posterior mean, it would be small. There's some uncertainty in it, it could be above zero, it could be below zero, but in any case it's always small. Don't fixate on zero as some magical point. The fact that the compatibility interval overlaps zero doesn't mean zero is what you should assume; what you should assume is the high density region, and the high density region is just close to zero, it's small, and that's the way to interpret such a thing. And then the coefficient
bA, for median age at marriage: all the high density region is far below zero, nothing close to zero at all. That's that strong negative relationship, which persists after stratifying by marriage rate. Of course, we also get an estimate for sigma, and you'll see that there, but that's just the distribution of the residuals. Okay, what does that mean? What that means is that there's almost no substantial direct causal effect of marriage rate. Median age at marriage was a confound creating that apparent causal relationship between marriage rate and divorce rate. In the book, starting on page 140 through page 144, there's an extended example of how to think about simulating counterfactuals, causal effects, different kinds of causal effects, from this model you just fit, and I think it's very important that you spend some time going through that section and running the code and thinking about it, so that you start to think more naturally about the idea that once you fit such a model, you can compute different kinds of causal effects from different scenarios, for different imagined interventions. So please take some time to do that and take a look at that section. You really need to sort of draw the owl and get an idea of what's going on. Okay, but for now I want to continue on to the next elemental confound, and that is the pipe. The pipe is statistically very similar to the fork but structurally very different. I'll say it again: it's statistically very similar to the fork but structurally very different. Again we have these three variables X, Y and Z, and Z is in the middle, but Z is now traditionally called a mediator. That is, X influences Z, and then Z passes on any influence of X onto Y. As a consequence, X and Y are again associated, but it's not because they share a common cause. There's no cause of X in this graph; of course there are causes of X, but we haven't drawn them in this little DAG. Rather, it's that Z passes on something about X, properties of
X onto Y, and this means Y knows something about X. X doesn't know anything about Y, because X hasn't been influenced by Y, but this creates a statistical association between them. So again, Y is not independent of X in the whole sample; the influence of X on Y is transmitted through Z, is the way you want to think about the pipe. Once we stratify by Z, again it behaves just like the fork: there will be no association, so Y is independent of X conditional on Z. I want to show you again an animation of this idea. You can see that X, shown in red, influences Z, and then Z influences Y. Z is, in a sense, contaminated by X, and that's what the red ring around the blue circle for Z shows. But of course, for the values of Y, you can't separate out which part was caused by Z directly and which part second-hand by X; it's just one set of values, Y. So that's our statistical problem: to pull those things apart when we want to. So let's think about the binary variable example again. This is just like the previous one, to show you the symmetry: the pipe ends up behaving just like the fork, even though it is causally quite different. The scientific differences between a fork and a pipe are really substantial; there's no confounder here, but in the data sets they're going to look and behave the same when we condition on Z. So again we simulate three Bernoulli variables. This time X is generated first, and it's a fair coin flip, and then X influences Z, and then Z influences Y. Again, when we do the cross tabulation, we find that Y is not independent of X; the correlation is about 0.64. And when we stratify by Z, we find that in both sub-samples there's essentially no correlation, only a very small correlation, no meaningful association between the two. Now I'm showing you the continuous variable version of that same experiment. This is just like what we looked at before for the fork, but now it's for the pipe. The code is on the left. In the total sample, values of X and Y are
positively associated; the black line is the regression line through the whole data set. But within each sub-population, the blue sub-population where Z equals 0 and the red sub-population where Z equals 1, there's no meaningful association. Let me give you an example, and this is an example that's in chapter 6 of the book: a plant growth experiment, and in the book there's a full draw-the-owl version of this, where we generate the synthetic data set and then analyze it. So in this scenario, we imagine that there are 100 plants, and these are plants that are troubled by fungal growth, a kind of mildew that often grows on plants, as illustrated on the right here. This is an experiment in which plants have been randomly assigned to receive an anti-fungal treatment or not, and then for each of the plants, in both treatments, we measure the growth and we measure the presence or absence of fungus. Fungus grows in both treatments, but probably less in the treatment with anti-fungal. The estimand is the causal effect of the anti-fungal treatment on plant growth; that's what we want. Now, this sounds simple, and if you've been trained to think that causal inference in experiments is simple, well, I'm afraid to tell you you're wrong, and I want to show you how pipes can cause problems there. So here's a scientific model; let's build it up. This is what we've got so far. For every plant there are three measurements, and I've drawn them up on the slide here with their causal relationships. There's the height of the plant at time zero; that's the start of the experiment, when plants are assigned to treatment groups, and that's our baseline height measure, H0. Then at the end of the experiment, their heights are measured again; this is the outcome variable, H1, the outcome height. And then there's this variable F, which is the fungal status of each plant, measured at the end of the experiment: did it have mildew on it or not? And then we have our treatment, and our
treatment has two plausible causal effects it'll have an effect on fungus that's the target this is an anti-fungal treatment and so the whole goal is that it reduces the growth of fungus and then the absence of fungus allows the plant to grow more but there may also be a direct effect of the treatment the treatment could be toxic to the plant so some anti-fungicides are actually bad for plants as well or it could be good the plant could benefit directly from the presence of the anti-fungal treatment regardless of any effect on the fungus so again we've got this direct and indirect effect what I want you to see is that the path from the treatment T to the fungus F to the outcome of interest heights that this is a pipe so we have our estimate the total causal effect of the treatment this is through both paths so we can probably imagine that the right statistical analysis here is to ignore the fungal status I'll say that again the right statistical analysis here is to ignore the fungal status however many people because they measure fungal status are tempted to put it into the regression model and this would be a mistake why? 
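To see the problem concretely before I answer, here is a minimal Python sketch of this kind of simulation. The book's full version, in R and with different numbers, is on pages 170 to 175; the effect sizes and probabilities below are my own made-up values, and the treatment here has no direct effect at all, only the indirect effect through fungus.

```python
import random

random.seed(1)

# Synthetic plant experiment: treatment T reduces fungus F,
# fungus F suppresses growth, and T has NO direct effect on growth.
N = 10_000
plants = []
for i in range(N):
    t = i % 2                                    # randomized treatment
    fungus = random.random() < (0.5 - 0.4 * t)   # T lowers fungus risk
    growth = 5.0 - (3.0 if fungus else 0.0) + random.gauss(0, 1)
    plants.append((t, fungus, growth))

def mean_growth(rows):
    return sum(r[2] for r in rows) / len(rows)

# Correct analysis: ignore F, compare growth by treatment only.
total_effect = (mean_growth([p for p in plants if p[0] == 1])
                - mean_growth([p for p in plants if p[0] == 0]))

# Post-treatment mistake: stratify by fungus status F.
stratified_effects = []
for f in (False, True):
    t1 = [p for p in plants if p[0] == 1 and p[1] == f]
    t0 = [p for p in plants if p[0] == 0 and p[1] == f]
    stratified_effects.append(mean_growth(t1) - mean_growth(t0))

print(f"total effect of T: {total_effect:.2f}")   # about +1.2
print("effects within F strata:", stratified_effects)  # both near 0
```

If you fit the corresponding regressions instead, growth on T, then growth on T and F, you see the same picture: the estimate for T collapses toward zero once F is in the model, even though the treatment really works.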
Because the path from T to F to H1 is a pipe. If you stratify by fungal status, you block that path. Whatever indirect causal effect the treatment has through fungal status, which is the desired part of the experiment, will be statistically removed from the results. Blocking the pipe is bad here, so you don't want to do it; it means you'll make a mistake in inference. Again, I'm not gonna draw the whole owl in this lecture, but it is drawn very patiently, with complete code, on pages 170 to 175 in the book. Take a look and you'll see a simulation for the synthetic data, and then we fit both the correct regression model, which ignores fungal status and recovers the right total causal effect of treatment, and the regression where we add F, and you can see that it produces confusion: it means you conclude that the treatment doesn't work, even though it does. And why is that? Well, since the treatment works through reducing fungal growth, once you've conditioned on fungal growth there's nothing more to learn from the treatment. It's a statistical artifact of how you process the sample, but it leads you to make the wrong inference.

You might think no one would do this, that it seems silly. People do this. It's a particular kind of statistical bias, a kind of confounding in the general sense of the English word, called post-treatment bias. It arises when you stratify the analysis by, or condition on, some consequence of the treatment, and in the vast majority of cases it's a very bad idea. In this case you end up concluding the treatment doesn't work when it actually does, and that can happen in lots of cases. It can also result in inferring the opposite, that the treatment hurts; all kinds of bad things can happen. There's a great paper, a table from which I show on the right, Montgomery et al. 2018, "How conditioning on posttreatment variables can ruin your experiment and what to do about it," and I encourage
you to take a look at this paper. In the first part of the paper, they survey top journals in their field and find that about half of the papers reporting experimental studies condition on post-treatment variables, probably producing biased causal estimates as a result. So experiments: they're good, you should do them, do them early and often. But you've got to get the stats right. Just doing the experiment, even designing the right experiment, is not enough if you don't have some logic for connecting your scientific model to your statistical models.

Okay, I think that's a good point to take a brief break. Walk around a bit, think about which elements of the lecture so far are confusing to you, and maybe review those before continuing on. When you come back, I'll still be here.

Okay, we've done the fork and we've done the pipe; we've got two to go. The next is my favorite, and it's called the collider. It's my favorite not only because it has a great name, but because of the almost paradoxical results it produces in samples. This is the opposite of the fork, in some sense. Look at the DAG in the upper right. We still have X, Y, and Z. Z in the middle is now called the collider, and Z is influenced jointly by X and Y. X and Y don't share any common causes here. There's nothing about a change in X which can influence Y, and no change in Y can possibly influence X in this graph, but both of them influence Z, so the values of Z are explained by changes in both X and Y. The consequence is that in the total sample, X and Y are not associated, because they don't share any causes; Y is independent of X in the total sample. Once you stratify by Z, however, X and Y become associated, and this is the thing that is often very confusing when you first meet it. The confusing thing about colliders: it makes sense that in the case of the fork and the pipe, when we stratify by Z, we remove the association between X and Y, but how can stratifying by Z create an association between X and Y? We started out, in the total sample, not being confused or confounded into thinking that X could influence Y, because they weren't associated. But now, if we add Z to our regression, or to our analysis more generally, we end up concluding that X and Y are associated, perhaps causally. How can this happen, and how can we avoid this kind of confusion?

Again, let me show you the cartoon animated version. X and Y have different colors because they're different influences on Z; they take different values, but they combine to influence Z. The consequence, as I'm going to unfold in the slides to follow, is that if we learn the value of Z, we necessarily learn something about the ranges of values of X and Y that could have produced it. I'll say that again: when we learn Z, that is, when we stratify by it, we're conditioning on a particular value of Z, and there's a smaller range of values of X and Y that could have jointly produced that particular value of Z. This is weird, I know, but it will come to seem natural.

Let's look at synthetic examples, just as for the previous elemental confounds. So, the collider. Here's some simple code to generate Bernoulli variables for the collider. X and Y are independent coin flips, and then Z is influenced by both, through the sum of X and Y; both jointly produce it. If either X or Y takes the value 1, then Z is likely to take the value 1 in this example. We look at the total sample in the crosstab on the slide, and you see that Y is independent of X: the correlation is very small, 0.027. But now we stratify by Z, looking within each value of Z, and we find the opposite. When Z equals 0, there's a strong positive association, a correlation of 0.43, and when Z equals 1, there's a strong negative association, a correlation of minus 0.31. How could this happen? How does this make any sense at all? I'm going to go through a number of examples of this, but the way to keep it in mind, the explanation, is in a sense simple; it just takes a little while to accept it. When we stratify by, say, Z equals 0, that's like learning Z. We're saying: okay, Z is 0, so there's a constrained range of combinations of X and Y that could produce the value 0, and that means X and Y have to be associated within each value of Z. The different values of Z of course allow different ranges, different combinations, of X and Y that could produce them. In this particular example, because of the function I've chosen to relate Z to X and Y, there's a threshold: if either X or Y equals 1, or both, then Z is likely to be 1. That's the reason there's a positive relationship for Z equals 0 and a negative relationship for Z equals 1. It could be different given different causal mechanisms, but there's a sense in which it's always true with colliders that learning the collider, Z in this case, gives you information about the combinations of X and Y that could possibly have produced it.

Here's the continuous version, to show you again. For continuous X and continuous Y, plotted on the right, in the total sample of both blue and red dots there's no meaningful correlation; that's the black horizontal regression line in the middle. But within each subpopulation stratified by Z, that is, the blue and the red, there's a negative association. This is a typical result for a collider, and it's collider bias, because if you just add Z to a model thinking it was a confound, it would create an association between X and Y, and you could think that was because you removed a confound, when actually you added one. I'll say that again: in a regression model, if Z is actually a collider and not a confound, that is, not a fork but an actual collider structure, then when you add Z, that creates an association between
X and Y, and you could mistake that association for a causal one, thinking you removed a confound and that's why the association appeared. But that would be wrong.

Okay, let me run you through a couple of examples of colliders. We'll spend a little more time on this, and there's a big section of chapter 6 which is all about colliders; you should take a look at that too. Colliders operate in two ways: naturally, in the sense that we often receive data sets which are endogenously stratified by the collider, and by accident, within our own statistical models. I'm going to show you an example of the first kind first, and both of these things happen quite often. When the collider is endogenous to the system, that is, when the sample is inherently conditioned on or stratified by the values of Z, we often call this selection bias; it's a particular kind of selection bias.

So here's a synthetic example, from the start of chapter 6. Suppose there are 200 grant applications. People have written grant applications, spent many late nights writing them, and submitted them. The granting agency is going to review them: there's a panel of people who read them and score them. We're going to assume the grants vary on two dimensions of interest. First, the newsworthiness of their topics: if the projects were successfully conducted, how important would the results be to the public? Newsworthiness is not silly. There are lots of important scientific projects which are not newsworthy, and we should do those too, but the return on public investment matters as well; we want to do science that has an impact. Then there's trustworthiness, the reliability and rigor of the proposed project, which influences how likely it is to produce correct answers. And I've assumed in the simulation that these things are completely independent
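Here's a minimal Python version of that simulation, so you can see the selection effect numerically before looking at the plot. The 10 percent award rate and the equal weighting of the two dimensions are the assumptions I use here; the book's full R code is in chapter 6.

```python
import random

random.seed(2)

N = 200
news  = [random.gauss(0, 1) for _ in range(N)]   # newsworthiness
trust = [random.gauss(0, 1) for _ in range(N)]   # trustworthiness, independent

def corr(xs, ys):
    """Pearson correlation, computed by hand to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

# The panel awards the top 10% by total score (equal weights here).
score = [nw + tr for nw, tr in zip(news, trust)]
cutoff = sorted(score, reverse=True)[N // 10 - 1]
awarded = [i for i in range(N) if score[i] >= cutoff]

corr_all = corr(news, trust)
corr_awarded = corr([news[i] for i in awarded],
                    [trust[i] for i in awarded])
print(f"all applications: r = {corr_all:+.2f}")    # near zero
print(f"awarded only:     r = {corr_awarded:+.2f}")  # clearly negative
```

Changing the weights on the two dimensions shifts the numbers but not the qualitative result: selecting on any additive combination of two independent causes induces a negative association among the selected.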
So you look at the cloud of points on the right, and it's just an uncorrelated cloud; these are both Gaussian variables. The panel scores these things, with some noise added, but they basically get the same cloud back: there's no association in the population. Then the granting agency selects the highest-scoring grants, through some additive combination of newsworthiness and trustworthiness. In this example I've weighted them equally, saying they're equally important; that doesn't have to be true, and the point of the example is robust to changing the relative weights of newsworthiness and trustworthiness. The point is that after selection, among the remaining red points in the upper right, the highest-scoring grants, there's a negative correlation between newsworthiness and trustworthiness. So in funded, awarded grants, the most newsworthy ones are the least trustworthy, and the most trustworthy ones are the least newsworthy. That is, boring science is the best science. But that's not an inherent feature of how the science is done, at least not in this simulation. I won't speak to real science for the moment, but in this simulated science world, this negative correlation in the science that gets funded, which is what the public ends up seeing, would really be there: exciting stuff would tend to be unreliable. Yet it's solely a consequence of how grants are awarded. This is collider bias. We only see the awarded grants, not the unawarded ones, so the data available to us are already conditioned on a collider, and therefore biased, and this is a kind of selection bias.

Here's the DAG version of this. I show you the collider, with newsworthiness and trustworthiness as joint influences on A, which is whether a grant is awarded or not. If we only see the subsample of grants that are awarded, not the ones that aren't, the sample is inherently, endogenously stratified by award status, and then we see a negative association. This kind of selection bias is very common, because in organizational and institutional selection procedures we often don't see the failures; we only see the ones who stay. This happens in all sorts of careers as well. You only observe the individuals who persist in a career, and so if you look at the traits of those individuals, there can be bizarre correlations which mislead you about why those individuals are successful. I'll say that again: in professional careers, only the people who stay in the career can be observed, and so if you look at the traits of those people, there can be bizarre associations among their traits which mislead you about why those people are successful. In fact, it could be for the opposite reasons.

A couple of other examples, to prime your intuition. Restaurants, in many parts of the world, certainly the parts of the world I've lived in, can survive in a couple of ways: by having good food or by having a good location. Of course they could have both, but it's hard to have both; there are only so many good locations, and it sometimes seems difficult for restaurants to make good food. A consequence of this is that bad food can often be found in good locations. Why? Because that's where it can persist. Bad food in a bad location, and the restaurant goes out of business; but if you have a good location, you can survive despite having bad food, because the tourists will still eat there. The flip side, of course, is that if you have really good food, you can have a terrible location and still survive, and so sometimes the best restaurants are found outside the city center, in many places. Okay, another example with the same structure. Actors can succeed by being attractive or by being skilled. There are other ways too, but let's focus on these two: they can be very attractive, so people like to look at them, or they can be very good actors, or both, but most aren't both. Let's face it, talent is scarcely distributed in the
world. As a consequence, attractive actors will end up being less skilled on average, because they're the only less skilled actors who can survive in the occupation. I'll say that again: attractive actors will, statistically, end up being less skilled on average than less attractive actors, but that's solely a consequence of the fact that the other less skilled actors, the ones who are also unattractive, well, they don't have jobs. It's like the grant example.

Okay, so the other way you can get collider bias of this type is through the way you statistically process things: not because the population has already been selected for you, but because you're going to stick the collider in the model. The synthetic examples I showed you earlier were of that type. We're going to statistically stratify by the collider, and it creates this phantom, non-causal association that looks like confounding. There's an example I want to walk you through structurally here, and all the code for it, how to draw the owl, is in chapter 6 of the textbook. The question is the extent to which age influences happiness. Do people get more or less happy with age, and how would we even study that question? So the estimand is the influence of age on happiness, and there are possible confounds. What sort of data are we imagining would let us address this? Well, we survey a bunch of people of different ages and ask them how happy they are in general, how satisfied they are with their lives, and we also add a bunch of questions to the questionnaire about things that might be possible confounds, things that might independently influence their happiness aside from their age. So the idea is that age influences happiness, but other things do too, and some of them might be associated with age and be confounds.

So let's do a synthetic version of this, so I can show you how it works. For these causal inference examples, it's really powerful to do data simulations from a scientific model, because then you can know whether the statistical machinery works. In the real world, of course, we never know the true causes of the variables, never with certainty, and so we can never be sure a statistical procedure is correct. So you train your intuition, and design your statistical procedures, by doing synthetic data simulations, and that's what I show you over and over again in the book. In this particular synthetic data simulation, we're going to suppose that age has no influence at all on happiness. That is, people are born with some level of happiness and keep it their whole lives in this simulation. No, that's not realistic, but it's a powerful way to show you the consequences, the dangers, of just adding things to a regression. On the right we have the real causal model I'm going to assume: happiness influences marital status, which will be the potential confound we focus on, the extra question on the questionnaire, and age also influences marital status. You can probably intuit why we might say that happiness influences marital status: people who are happier are more likely to get married; the idea is that no one wants to marry the unhappy person. And age influences marriage. How is that? Well, older people have had more years of exposure to life, and so they've had many more chances to get married. I'll say that again: older people have been exposed to more years of life, and so they've had more chances to get married, and that's the sense in which age influences marriage. So both of these things end up being associated with marital status, and marital status, as you can see from the graph, is a collider of happiness and age. If we condition on marital status, bad things happen.

Here's the simulation; the code for it is in the book, and you can look at it there. This is the first example in the course where the scientific model, the data simulation, is highly dynamic, and we're just going to take a cross-section of the data at the end of the simulation and do the analysis on that. I'll say that again: this is the first example in the course, and in the book, where the scientific model is dynamic. It's a real population model. Individuals are born on the far left of this graph; they're all born unmarried, and then, starting at the age of 18, they have the chance to get married. It's probabilistic: in each year of life there's a chance it can happen, and so you see the gray open points being converted into red filled points; those are married individuals. Then they march relentlessly off to retirement at 65, and go live happy lives on the southern coast of Spain or something like that. We take any particular year of the simulation as a cross-section; that's the kind of data we'd have available, and we look at the association between marital status and happiness, which is the vertical axis on this graph. You'll notice from the marching red points that they never shift up or down; they just keep marching relentlessly at the same happiness level their whole lives. Again, that's not realistic; it's a highly unrealistic assumption used to create a powerful theoretical example.

Okay, so now you do the analysis. The full workflow starts on page 176 in the book, and I encourage you to take a look at it; here I just want to show it to you visually, because you can basically do this analysis with your eyeballs and see the collider effect. Just look at the plot of the data, age against happiness, with married individuals colored in red. All the points on the screen define the sample, and we want to know the association between happiness and age. If you take the whole sample and ignore the colors, ignore married versus unmarried status, there's no relationship between the two, and the reason is that happiness doesn't change within individuals: the distribution of happiness is the same at every age. But if
we look at the subpopulations defined by the red points and the gray points, that is, if we stratify by marital status, you can see that both slant down to the right. If you fit a regression line through the red points, it's going to slope down to the right: there will be a negative association between age and happiness, which is the opposite of what's true in the simulation. You can see that there are more red points at the top, because happier people are more likely to get married, but the red points as a population shift down and to the right, so there's a negative association within that subpopulation, and that's a consequence of the collider. There are two ways to end up married in this simulation. Either you live long enough to get married, and those are the individuals in the lower right, unhappy but having lived a long time, who ended up married despite their unhappiness; or you can be happy, in which case you can get married even when you're young, and those are the red points starting around age 20 at the top of the graph. That's how colliders work: once you've stratified by the status, that is, the collider, you often find negative associations between the different things that can cause it. The same is true, in a mirrored sense, for the unmarried individuals. There are two ways to be unmarried: either you haven't lived long enough yet to get married, or you're unhappy. If you fit a linear regression through the gray points, it'll slope downwards too. Okay, do take a look at the detailed code in the book for that example. I apologize for not stepping through it in this lecture, but we've gone through structurally similar examples and there are no new tricks in the code, so I'll push that off onto your study time, with some apologies.

Let's look at the last elemental confound, the descendant. The descendant is kind of a special case, because it's not really a different kind of elemental confound from the other three; it's related to them, but it's also not obvious that it happens unless you talk about it, so I want to talk about it. In the graph in the upper right, the variable A is the descendant. There are some other variables with a relationship; in this case I've created a pipe, X to Z to Y. It's not important that this is a pipe; it could also be a collider or a fork. What's important is that A is a descendant of the Z variable, and the consequence is that if we stratify the sample by A, or condition on A in our analysis, that tends to have the same effect as stratifying by Z. So what Z is determines what stratifying by A does. Taking this particular example, where Z is the middle variable in a pipe, that is, a mediator, stratifying by A will tend to act like stratifying by the mediator, even though A is not a mediator itself; it's merely a descendant of Z. It has information about Z, and so when we stratify by it, it's like weakly stratifying by Z. If Z were a collider, this would be like conditioning on a collider; if Z were a confound, it would be like conditioning on a confound.

I show you an example of the descendant effect here. I can make a descendant that carries a lot of information about Z, one strongly associated with Z. This is a pipe, so we get the same sort of example as before; I won't spend a lot of time on this slide, and you can run through the code to get the idea. In the total sample, Y is not independent of X, because of the pipe. But if we stratify within each level of A, it's like conditioning on Z, only not as powerful: notice the correlations really get reduced, from about 0.6 in the total sample down to about 0.3 in both subpopulations. It's a weaker version of stratifying on Z. This is important, and it's important to understand that descendants exist, because lots of the measurements we take are not really the things we want to measure. I'll say that again: lots of the measurements we take in science are not really the things we want to measure. They're proxies, the things that are convenient or possible to measure. In a causal graph, we should probably not draw the graph as if the real causes of interest were measured; instead, we have descendants throughout our graphs. There are lots of statistical procedures which have as their goal dealing with this unfortunate reality: factor analysis, measurement error models, and some kinds of social network models. Not enough of the social network literature cares about this, unfortunately, but good social network models account for the fact that we haven't really measured the network; we must infer it, because what we observe is a proxy of the network, not the network itself. This means that if we enter these proxies into our model, we might want to do something a little different than just stratify by them normally; we might want to take their proxy status into account, and much later in the course I'm going to say more about this.

You see the DAG on the right-hand part of this slide. What's going on here is a case where there's some causal relationship between X and Y, and that's our estimand; we want to estimate that arrow between X and Y. But we know there's some confound U, which is unobserved; we can't measure it. What we can measure, however, are some descendants of it, illustrated here with the letters A and B. It turns out that if you program this DAG as a statistical model, as a Bayesian graph, you can reconstruct U with enough precision, in many cases, to de-confound the relationship between X and Y, even though you can't measure U directly. But you need to do something like a factor analysis; you can't just put the variables A and B in the model. That doesn't work, not usually; I guess you could get lucky. This brings up unobserved
confounds. That's where we're going to pick up in the next lecture: thinking about bigger graphs, which are combinations of these four elemental confounds. So here's a thought experiment, something to turn over in your brain before you see the next lecture. Suppose we're interested in the direct effect of grandparents on their grandchildren; this is the DAG I've drawn on this slide. G is the grandparents, in a number of families, and grandparents may have direct and indirect effects on their grandkids for some outcome, say educational attainment or income; education is maybe the easier one to think about. So G measures the educational achievement of the grandparents, or their attitudes towards education, and P is the parents, that is, the children of those grandparents. The grandparents had some direct effect on their own children, and they spend time with their grandkids too, so there may be some direct effect of the grandparents on their grandchildren. There is also an indirect effect: the grandparents may have raised their own children to value education, and those children then transmit that to the grandkids; that's the pipe through the parents. Our interest is the blue arrow. That's our research question: is there a direct effect? Unfortunately, there are likely to be lots of unmeasured confounds between parents and their kids, like the neighborhoods they live in, and shared temporal and spatial exposures that influence their educational attainment or their incomes. And not all of those will be shared with the grandparents. So in this case, given what you've already learned, you might think that to measure that blue arrow, you need to stratify by the parents. That is, block the pipe, right? Condition on parents' educational achievement.
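If you want to play with this puzzle before the next lecture, here's a hypothetical Python simulation of that DAG you can experiment with. The structure (G into P, G into C, P into C, plus a shared unobserved influence on P and C but not G) is from the slide; the variable names U and C and all the coefficient values below are my own made-up choices.

```python
import random

random.seed(3)

N = 2000
# G -> P, G -> C (the blue arrow), P -> C, plus an unobserved U
# (say, neighborhood quality) that influences both P and C but not G.
b_GP, b_GC, b_PC, b_U = 1.0, 0.0, 1.0, 2.0   # b_GC = 0: no direct effect
G = [random.gauss(0, 1) for _ in range(N)]
U = [random.gauss(0, 1) for _ in range(N)]
P = [b_GP * g + b_U * u + random.gauss(0, 1) for g, u in zip(G, U)]
C = [b_PC * p + b_GC * g + b_U * u + random.gauss(0, 1)
     for g, u, p in zip(G, U, P)]

# Exercise: estimate the effect of G on C, first ignoring P and then
# stratifying by P (e.g. regress C on G, then C on G and P), and check
# whether stratifying by the parents recovers the true b_GC above.
```

Try it with the puzzle in mind before the next lecture: does conditioning on P behave here the way the pipe logic alone would predict?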
What I want you to think about is what actually happens when we do that, in this larger graph that contains more than one elemental confound. And with that little puzzle, I'm going to move on, and we'll pick it up at the start of the next lecture. Okay, to remind you where we are: we're in week three, talking about causes, confounds, and colliders. This corresponds to chapters five and six in the book, which you really should take a look at, because they patiently explain all of these examples with the code. When we pick up with the next lecture, we'll talk about bigger graphs and how to analyze them. And this will set us up to talk about other topics in the future, including, in week four, the relationship between prediction and causal inference in more detail. I'll see you there.