Welcome back, everybody, to the sixth lecture. I've got a lot to get through, and the first thing I want to say is that this is a point in the course where it starts to get harder, because a lot of concepts collide. We've just taken ordinary regression and done nothing but add some more terms. It seems simple, but suddenly it's a lot more complicated. So if you're feeling confused, that is only because you're paying attention, and you should feel perfectly happy about that. It's a good sign that you're confused; if you weren't confused, I would be worried.

So we're all in this together, and you should realize that if you don't understand something in lecture, a lot of this knowledge comes from performing it. You have to go through and run the examples. It's like learning a martial art, to use a wholly inappropriate metaphor. You can't just watch Jackie Chan movies and learn kung fu, right? I've tried; it doesn't work. You should watch Jackie Chan movies, but you're not going to learn kung fu from them. You have to actually throw some punches, and take some punches as well, and learning statistical inference is like that. I should find a better metaphor than martial arts, but I think you know what I'm trying to say, right? So don't be concerned if you're confused. That's perfectly normal. I've spent decades being confused about quantitative methods, and look where it got me, right?

Okay, so we're going to pick up exactly where we left off, with this mechanical but very important issue of how to include unordered categorical data in a regression. I ended last time by presenting dummy variables, or indicator variables, which are the traditional way to do this. If you use an automated tool in R or any other statistical package, it will probably code dummy variables for you automatically. There are a lot of downsides to using dummy variables. If you're coding the model yourself, it's nearly always better to use an equivalent but alternative form called an index variable.
So you take your unordered category (here it's just sex, and there are just two categories within it, male and female) and you make it into an index. An index is a variable that starts at one and counts up from there, and each unique number in it corresponds to a different categorical value. This has a lot of advantages. The first advantage is that you can assign the same prior to each category, so you can have the same pre-data expectation for both males and females in this version of the model. You can't do that with a dummy variable, because one of the categories is coded as a difference from the other, and then you have to put a prior on the difference, and that makes setting the prior harder. The other reason to do it this way is that as you get more and more categories, the model looks exactly the same. You don't have to change anything except how many values there are in the index variable. So if you're doing something like a regression on countries (I don't know how many countries there are in the world; probably nobody does, it changes every day), there are a lot, and you don't want a dummy variable for each of them, right?
That's madness. So the index variable approach grows really nicely. It's also the foundation of multilevel models, and so if you get used to doing it this way, you can slide very naturally into using random effects when we get there later in the course. But this is perfectly mathematically equivalent, aside from the issue of the prior, to the dummy variable approach. And you can code this in quap just by using bracket notation, once you've created the sex variable, which is an index: it contains ones and twos for the anonymously labeled sex 1 and sex 2. The bracket notation means an alpha for each sex in the linear model, and then in the prior as well you say you want an alpha for each sex. That creates a vector of alphas, two of them. So when you run the model you see on this slide and you get the precis output, you'll get two alphas: there's alpha 1 and alpha 2. If you had a hundred different sexes in there, well, you could have a hundred alphas, and the model code would stay the same. You don't have to change anything. This is a really convenient feature of this way of doing things.

The awkwardness, of course, is that you want to make inferences about differences between categories, so you need to compute those. These are called contrasts in psychology; in biology we just call them differences. Once you've approximated the posterior distribution, you can convert to any other parameterization of the model. You don't have to rerun the model. So in this case, if we want the posterior distribution of the difference between the two sexes (this will be the average difference in height; this is height data), you just extract samples from the posterior. That's the top line of code on this slide. In the second line, I compute the difference by subtracting, from each sample of alpha[,1], the corresponding sample of alpha[,2]. Why is there a comma there? This is for every sample from the posterior, right?
That's what the indexing means: the number after the comma says which of the alphas, and the number before the comma says which sample. So if you leave it blank, you get all the samples. So this code says: for every sample, subtract the second alpha from the first, and store that difference in a new symbol called diff_fm. And I stick that in the post, in the posterior, because there's a posterior distribution of this difference. And then I can just pass that whole thing to precis. precis takes a whole bunch of objects; it likes things like that. It's a utility I wrote for myself years ago to summarize stuff, because I hate p-values, and I wanted a summary function that doesn't show them, right? They burn my eyes. Really, they just burn my eyes. No, I joke. I don't like them because they contain no information of use, so I just clean up my screen. The other thing is, I don't like a million decimal places; two decimal places is usually fine. And so when we look at diff_fm here, the posterior mean is about minus 0.8, and then you get a compatibility interval there as well. And this difference is often what we want to make inferences about. And so if you had a hundred categories, you could calculate this difference for any pair of them you like, right?
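The course code is in R with the rethinking package; as a language-agnostic sketch of the same contrast computation, here is a pure-Python version. The posterior draws below are invented stand-ins for the real alpha[,1] and alpha[,2] columns (the means and spreads are hypothetical, chosen only so the contrast lands near the minus 0.8 mentioned above):

```python
import random
import statistics

random.seed(1)

# Hypothetical posterior: 10,000 draws for each category intercept.
# These means and sds are invented for illustration, not a real fit.
n_samples = 10_000
alpha1 = [random.gauss(134.9, 1.6) for _ in range(n_samples)]  # category 1
alpha2 = [random.gauss(135.7, 1.7) for _ in range(n_samples)]  # category 2

# The contrast: subtract, sample by sample. The resulting collection of
# differences IS the posterior distribution of the difference. No refit.
diff_fm = [a1 - a2 for a1, a2 in zip(alpha1, alpha2)]

print(round(statistics.mean(diff_fm), 2))  # near 134.9 - 135.7 = -0.8
```

The point is that the difference is computed draw by draw, so its spread correctly reflects the joint posterior, which is not the same as subtracting two summary tables.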
They're all there in the posterior already. All the comparisons already exist inside the posterior; you just have to compute the ones you're interested in. Okay. In the chapter, there's an example where we look at the primate data again, and I show you how to do this with multiple species, so that'll help you understand what happens when there are more than two. There's a code example, and I also show you how to include more than one index, so you can have... I forget what I did. I had species, and then I had Hogwarts house or something like that, and Slytherin has a lower mean, something like that. So please look at the code, and we'll be using this kind of coding in future examples quite a lot. It's much more natural, I think; it's easier to set priors for; and if you ever need the difference, you can still get it back out. Unfortunately, it's just not the default in most statistical software.

Okay, let me switch gears to the meat of today's lecture. I'm very excited about this lecture, because there's a lot of new content that I have not subjected an audience to before, and I'm going to have fun with this. So, I think it's an interesting feature of the scientific literature that there seems to be a negative correlation between surprising things and true things. Newsworthy studies seem to turn out to be untrustworthy at a very high rate, and the reverse is also true: the most trustworthy science is incredibly boring. Another cell biology paper on some ion channel is probably right, but it puts you to sleep; you're halfway through the abstract and you're asleep, right? Meanwhile, here's my favorite example. It's a paper from the prestigious journal PNAS. The P stands for prestigious. No, it's Proceedings. It's titled "Female hurricanes are deadlier than male hurricanes." There was a response; you can read the title of the response for yourself here. This is almost certainly not true. What's the idea?
Hurricanes don't have gender, right? Think about that. But they get names, and the National Weather Service in the United States likes to name hurricanes, because it's easier to refer to them by names, I guess, and they have this convention of alternating male and female names. I forget when this started; there's a list, actually, and they just go down the list, but it alternates male and female. And so you can regress... these data are in the rethinking package if you want to play with them, data(Hurricanes). I'm here to entertain you. And if you do a terrible regression, yes, there's a correlation with the gender of the name: female-named hurricanes have killed more people historically. But it's not robust to the model specification. And secondarily, there's no plausible mechanism by which this could actually work. The idea in the paper is that if you name a hurricane with a female name, people don't take it as seriously, and so they don't evacuate. This is historically false stuff. But this got a lot of press; it was written up in lots of places and spread around the internet very quickly, because it's like, wow, big if true, you know that phrase. So a lot of science is transported by the big-if-true things, and a lot of these things turned out not to be true.

Now, this is a bit silly, and that's why I use it as an example, but there's lots of less silly science that's done as rigorously as possible, and still it seems like the newsworthy stuff turns out to be false at very high rates. A lot of epidemiological work is like that, right? It seems like every week coffee is either going to make you live forever or kill you. I don't know if cell phones are good or not. Whatever. Here's a study that I think is quite interesting, and some of you will have heard of this: the arsenic life paper, which came out a couple of years ago now. So this is a lake; those of us from California know it: Mono Lake. Mono Lake has very high naturally occurring levels of arsenic.
Arsenic is bad if you're a living thing. If you're a cyborg, maybe you don't care as much. Why is it bad? Because arsenate mimics phosphate, and phosphate is extremely important for all kinds of cytoplasmic reactions; it's also part of the backbone of DNA. If you get arsenate into your cells, it replaces phosphate, and then things break. It's very bad. This is why arsenic is such a deadly poison; it's a cellular mechanism. Mono Lake has really high levels of arsenic, and there are lots of life forms there, and they can get along and reproduce. And the question is, how do they do it? What's the evolutionary process by which they're adapted to arsenic, to toxic environments? So Dr. Felisa Wolfe-Simon published a quite newsworthy study which had evidence that there were bacteria in Mono Lake that were actually using arsenic to build their DNA. NASA got really, really excited about this, because it means that you could potentially get life in places with low phosphate concentrations, and arsenic is, galactically speaking, much more common than phosphate. And so this was a big deal. Since then, it's turned out probably not to be correct, but it was a rigorous study. There was no scientific misconduct here; it's just probably not true. But it would have been really big if it were true. There's lots of stuff like this; it seems to be the case. And then the flip side, I won't give you an example, but the flip side is, you know, yet another fruit fly paper, which is probably true but really boring. So what's going on here?
Lots of people have theories about why this happens. I don't think we need any elaborate theory to explain this impression that we all have, which I think is true, although I don't have a big data set to prove it. All you need to get a negative correlation in the published literature between newsworthy things (that is, things that are clickbaity) and trustworthy things (which means things that turn out to be true in the long run) is peer review. That's all you need. So bear with me. Here's a simulated example; chapter 6 begins with some code, and you can play with this on your own. Imagine that, either at journals or at the grant review panels that fund the work in the first place, you care about both the impact of the work, that is, its newsworthiness: the public will care, and it will make a difference in the field. That's what makes things newsworthy. Yeah, big if true. And you also care about rigor; you care about trustworthiness. We care about both of those things, and we should. There's nothing wrong with caring about newsworthiness; we should be doing science that matters. I don't think there's anything wrong with that motivation. If you care about both, then they're compensatory, and a study can get published or get funded if it is sufficiently trustworthy or if it is sufficiently newsworthy.
So even if there's no correlation between these things in the production of science, post-selection there will be a negative correlation between them. And so this is the simulation that I plotted here. This is a perfect Gaussian cloud of random imaginary studies, say grant proposals, in which there is no correlation between trustworthiness and newsworthiness at all; the code's in the chapter, and I'm sure you could play with it. And then there's a threshold of the sum of newsworthiness and trustworthiness above which you get funded. That's what makes it compensatory: one can compensate for the other, because both matter. The blue points are the ones that get funded, and the line is just a regression line drawn through them, and the correlation here is minus 0.8. You can get extremely negative correlations between these criteria. But if you think this is a causal thing, that it's telling you something about the nature of how science is generated, you're wrong. Well, actually, I don't know why this is true in reality; this is just an imaginary example. It's just that you can't necessarily know from this correlation what's happening generatively, because of the selection effect. Now, why am I telling you this story? It's not to depress you. Rather, it's to set up a fundamental lesson about how multiple regression works. This happens inside of multiple regression models. When you stratify on a predictor, it's like a selection effect. You're creating subpopulations, and those subpopulations can have spurious correlations like this. This is a spurious correlation: it's not telling you how science is generated.
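The chapter's simulation is in R; here is a minimal pure-Python sketch of the same selection effect. The sample size and the top-10% funding cutoff are my own choices, not the chapter's:

```python
import random

random.seed(2024)

# 5,000 imaginary proposals: newsworthiness and trustworthiness are
# generated independently, so the true correlation between them is zero.
n = 5_000
news = [random.gauss(0, 1) for _ in range(n)]
trust = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    """Pearson correlation, written out so the block is self-contained."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Compensatory selection: fund the top 10% by the SUM of the two scores.
threshold = sorted(a + b for a, b in zip(news, trust))[int(0.9 * n)]
selected = [(a, b) for a, b in zip(news, trust) if a + b > threshold]

print(round(corr(news, trust), 2))   # ~0 in the full population
print(round(corr([s[0] for s in selected],
                 [s[1] for s in selected]), 2))  # strongly negative
```

The second correlation comes out strongly negative even though nothing causal connects the two scores; the selection rule alone manufactures it.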
It's telling you about a selection process. Conditioning on a variable in regression is a selection process. And that's what I hope to transmit to you today: some understanding of that, and why it is our responsibility to think about these possibilities. We shouldn't just add things to regressions and hope it all works out, because sometimes it won't. This effect is called the selection-distortion effect, or Berkson's paradox. It's extremely common; it happens a lot, and it happens organically, if you will, inside of regression models.

So I want you to think about regression. Regression is an incredible tool, these geocentric models. It's valuable, and we should never give up on it. It does a lot of really good heavy lifting in inference. It's an oracle: it can see stuff that we can't. It automatically finds the most informative cases, it ignores the ones that aren't informative, it deals with all this partial correlation structure. It's actually amazing that the universe is designed such that this works out. It's incredible. It's an oracle, but it's like a historical oracle, the Oracle at Delphi. The Oracle at Delphi could ruin your civilization at any moment; the Oracle could get in a mood and decide, no, the prince has to die. Just whatever the Oracle wants to say. The Oracle is not benign, even though the Oracle is wise, and this is how regression is. Or, if you want to think about genies, in the Middle Eastern tradition:
The genie is powerful, but you have to phrase your wishes very, very carefully, because it will take them extremely literally, and that's how regression is. It will answer the question you pose to it, but it will take your question very, very literally, and so you have to understand the language of the genie, of multiple regression. That's what I'm aiming to help you do here.

So I want to contrast this kind of cautious approach, worrying about selection effects and such, with the way I think regression is sometimes... well, I don't want to say it's taught this way, but it's practiced in the sciences quite often: well, we've got all these potential confounds, so let's just add them all in. And you see these big regression tables in papers. I call this Table 2. Table 2 in each paper is all the controls they've tried. Sometimes coefficients change, and it's all an uninterpretable causal salad. And that's bad. Adding variables to a model can create confounds, not just remove them. So it's not harmless to add variables, and you need some justification to add a variable to a model. That's what we're going to talk about today. The sink on this slide is here because of the expression "the kitchen sink," right?

Okay, so let me try to give you some framework, and I'm going to fill this in throughout the rest of the lecture and give you some examples. The chapter has all the code; I'm not going to show you all the code in the lecture, but the chapter has all of these examples computed out for you. And we're going to use a lot of simulated examples, because those are the only cases in which we know what's true. So, actually, there's this benign fact that there are only four kinds of confounds at all. This is nice, right?
This is a provable thing, at least in this simplified causal structure that I've been showing you, this DAG view of things, where you've got directed acyclic graphs representing a causal model. There are only four kinds of confounds. That's it; there's no other kind of confound that can ever arise. Well, there's residual confounding; we'll talk about that later. So sorry, there's more, but ignoring things like measurement error, these are the only kinds you get. And they are the fork, the pipe, the collider, and the descendant. I'm going to explain each of these to you, give you an example, and show you what they do. And then at the end of the lecture, we're going to look back at this slide, and I'm going to explain how you de-confound each of them. There's a method for de-confounding each, and if you know the causal graph, which of course you don't, but hey, this is still better than nothing, right? If you know the causal graph, you can either de-confound, that is, know which variables you need to add to the regression to remove the confounding, or you can conclude that it's hopeless and you can't de-confound. And that's good: knowing that you can't get the answer from the data set is an important achievement in science, rather than making stuff up.

So we're going to come back to this, and I'll define each of these in order. Let's start with the fork. The fork is the most famous confound. In many cases, it's the first and last example of confounding that people receive in their statistics education. This is a variable that is a common cause of two others. We've had an example of this already: the age-at-marriage confound, where median age at marriage created a fork. It causally influenced both divorce rate and marriage rate, and so it created a spurious correlation, a confound, between marriage rate and divorce rate. You remember that from Monday.
It was a long time ago, before the snows began. And so this is why I draw it up here. This is the fork. The idea is you've got three variables: X, Z, and Y. In all of these figures, the idea is that you're interested in the association between X and Y, in particular the causal influence of X on Y. Z is the third variable that could be a confound. And here, if Z is a common cause of X and Y, it'll look like there's an association between X and Y in a regression, unless you include Z, condition on it. So in the fork, you de-confound by conditioning on Z. I'll explain later what this does: it shuts the fork. There'll be more of this language later. It's a weird expression, to shut the fork, to break the fork; I don't know, we need a better metaphor. But it removes the confounding. It stops information flowing from X to Y; it shuts the path. We're going to talk about paths in these diagrams later on.

There's this notation you'll see, and we won't use it very much, but I want you to recognize it at the bottom of this slide: this X and then this weird science-fiction symbol, ⊥. That means "independent of." So X is independent of Y, or d-separated, d for dependency, dependency-separated from Y, conditional on Z. So once you learn Z, there's no remaining association between X and Y. That's what the notation means. You'll see this in a lot of papers, especially in epidemiology, and it's a bit confusing until you know what's going on, but that's all it means: X is independent of Y, conditional on Z.

The second one is a pipe. The pipe is a lot like the fork. We haven't had an example of this yet, though, so I'm going to explain it to you, and then we're going to have an example. Okay, so the pipe is a case in which there's mediation. The psychologists are going to know this; a lot of biologists do, but psychologists, this is like your first thing.
It's mediating variables, mediation analysis. So here we're interested in the causal impact of X on Y, but in reality it's mediated by a third variable, Z. If we condition on Z, then we don't notice the causal impact of X on Y; it actually knocks it out. You might want to do that if you're testing for mediation, if you want to see if it's there. Or you might do it by accident, which I think happens a lot in regression, in an observational study where you don't know the causal graph. If there's a mediating variable and you control for it, then you will control away the real causal variable. That's an example I'm going to give you in a second. So "X causes Z causes Y" is the way to read this thing. Or: Z mediates the association between X and Y. If you condition on Z, the middle variable, then, just like with the fork, you remove the statistical association, the dependency, between X and Y. Notice that with data alone you can't tell the difference between a pipe and a fork. You can't see the causal graph in the data. You always need something more than the data to tell what's right. That's why the machines will never replace us, I hope. Does this make sense? Yeah, okay. Of course, in reality, pipes and forks behave very differently; they're very different causal relationships.

Let me give you an example of a pipe and a major inferential threat, the kind of confound that arises both in experiments and in observational studies, but probably more often in observational studies, where people just start throwing in covariates to control for stuff. This example is in the book. I'm not going to give you code here, but I'm going to walk through it conceptually with you. The confound that gets created sometimes goes by the name "post-treatment bias." What does that mean?
Post-treatment is Z; the treatment is X. It's something you've done to a system, because you've experimentally manipulated it, or nature has manipulated it, and you want to assess the causal influence of that treatment on some outcome of interest: X and Y. Post-treatment variables are variables that arise as a consequence of the treatment and are on the path, on the pipe, to the outcome of interest. There could be lots of things happening, right? You add arsenic to a lake, lots of cellular processes kick in, eventually cells die, but there's lots of mediating stuff you can measure along that path about what happens. So post-treatment bias is the regression confound that arises when you're not aware of this causal relationship, that some mediating variable is on the path, and you control for it thinking you're testing for a confound, a fork, but you're really knocking out the mediator, and you end up inferring that the treatment doesn't work. And I think this happens an embarrassingly high percentage of the time in the sciences, because people don't have a causal model of what's going on.

Let me give you an example where it's fairly obvious, I would think, and you would never do it, but just so you understand. All the code to simulate this example is in the text, and I encourage you to walk through it, because there's also a regression model there, and you can think about parameterizing it and choosing priors on that regression model. That's also worth doing, but I'm not going to do it in lecture; I apologize, just to save time. But it's good practice, and you should do it. So let's imagine a greenhouse experiment with plants. I used to be at an agricultural school, and so I had lots of collaborators in greenhouses, and I did lots of stats consulting with people doing greenhouse experiments. So that's its history; this is where it comes from.
And I love this about plants: they sit there and they wait to be counted. It's much nicer than studying monkeys, which don't sit still. So, a big problem in these greenhouses is fungal growth. They're humid places, right? Fungus loves it. And so you need some antifungal treatment to help the plants grow. So we've got some hypothetical antifungal treatment, and we're going to randomly assign plants to control and to the antifungal treatment. The causal model that I present here is: initial height and the presence of fungus cause the final height, H1. H0 is the height of the plant when we assign them to the treatment groups. Fungus reduces growth, and plants that are taller initially are also taller later; two arrows enter that variable, and they potentially interact. And the antifungal treatment is upstream, causally, from the fungus: it influences fungus, but it doesn't influence height directly. It could, but I'm asserting it doesn't. There could be a direct impact of the treatment on height, because it could also affect the plant's growth directly.

What happens in a regression is this. If you measure fungus, and you should if you're doing this experiment well, you'll measure the intermediate variable. You'll measure fungus, because this is how you test for mediation, right? You want to know this first arrow, from T to F, because that tells you the direct impact on the fungus. But of course what you're interested in, if you're going to use this in your greenhouse, is the full path from T to H1. That's the causal impact. If you condition on F, it'll look like the treatment doesn't work: the correlation between treatment and plant height will vanish. The code to do this is in the book, but I think you can probably see from the diagram why that happens. If you condition on the fungus here?
You block the pipe; you close the pipe. Information can no longer flow along the path from T to F to H1. And so in this case you would never do this, right? You would measure F, because then you do another regression, just looking at T to F, and you say, oh yeah, it reduces fungal growth; look, the amount of fungus on plants in the treatment group is much lower. And then you do another regression, leaving fungus out, to get the treatment effect, which is going to be lower than that, because some plants without fungus still don't grow very well; there are other things going on with the plants. Does this make sense? Yeah. In observational studies, the terror is real, because it's not this clear. There's no experiment; you're not sure what the mediating paths are; it's hard to measure things. And then the causal salad temptation comes in: you've got a big list of stuff that you know about the cases, and you can add them into the regression, and if any of those is a mediating variable, it'll knock out your variable of interest, and you might conclude that, no, that thing doesn't actually matter; it's really this other thing. These debates happen.

Let me give you an example that is potentially a little bit controversial; that's why I have fun with it. There are lots of debates about the wage gap, the gender wage gap, racial wage gaps, in the U.S., and about what the causal diagrams are. A frustrating thing for a statistician in these debates is that no one wants to put up a causal diagram of what's going on. And what you'll hear lots of times is that if you condition on career choice, there's no wage gap. Which is basically true; I mean, there are still some differences, but conditional on occupational choice and hours worked, there's basically no wage gap. That's true. But that doesn't mean that gender and race are not causal of the wage gap, because there are streams; there are a bunch of arrows in this.
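As an aside, the greenhouse logic is easy to simulate yourself. The chapter's version is in R; here is a minimal Python sketch with invented effect sizes, built so that the treatment influences growth only through fungus:

```python
import random

random.seed(71)

# Hypothetical greenhouse: treatment T affects final height H1 ONLY
# through fungus F (the pipe T -> F -> H1). All numbers are invented.
n = 1_000
h0 = [random.gauss(10, 2) for _ in range(n)]   # initial heights
treatment = [i % 2 for i in range(n)]          # randomized 0/1 assignment
fungus = [1 if random.random() < (0.5 - 0.4 * t) else 0 for t in treatment]
h1 = [h + random.gauss(5 - 3 * f, 1) for h, f in zip(h0, fungus)]

def mean(xs):
    return sum(xs) / len(xs)

growth = [b - a for a, b in zip(h0, h1)]

# Total effect: treated plants grow more, because they get less fungus.
gt = mean([g for g, t in zip(growth, treatment) if t == 1])
gc = mean([g for g, t in zip(growth, treatment) if t == 0])
print(round(gt - gc, 2))   # clearly positive: the treatment works

# "Conditioning on fungus": compare within the fungus-free stratum.
# The difference vanishes, because all of T's influence flows through F.
gt0 = mean([g for g, t, f in zip(growth, treatment, fungus) if t and f == 0])
gc0 = mean([g for g, t, f in zip(growth, treatment, fungus) if not t and f == 0])
print(round(gt0 - gc0, 2))  # near zero: looks like the treatment is useless
```

Stratifying by fungus here plays the role of adding fungus to the regression: it blocks the pipe, and the real treatment effect disappears from view.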
It's just like the greenhouse experiment. You don't then conclude there's no discrimination, or there's no problem, just because there's some downstream thing that knocks it out. Does this make sense? So here's an example I use a lot, to do with grant funding. There are big differences in the amount of grant funding by scientific field, right? Psychology gets way less than cell biology; newsflash. So if you look at the funding rates of men and women in the sciences, say in the ERC, just naively, women get way less grant money than men. But if you condition on field, men and women are funded at about the same rate, because it's downstream: gender influences field choice, the choice of the field you're in, and then some fields receive less funding than others. And so there is inequity in outcomes, but you have to figure out the causal diagram to know what to do about it. Or, if you want to do something about it; not everybody cares about that issue, but I do. I think if you want to fix that and make outcomes more equitable, then you have to intervene upstream of grant review. Does that make sense? Sorry, this has been my little sermon for the day.

A quick question. The question was whether each arrow would be, you said "significant," and I refuse to use that word, but: there could be. You can make a simulated example, like I do in the text, where the arrows really do exist, where there are causal influences at every step. Conditioning downstream means that you statistically won't see the upstream effects, because there's no information remaining to learn from them once you know the downstream one. You won't see them: once you condition on job, in this sort of example, all of the upstream effects have been taken into account; they're already in job. And then job affects your wage, but all the other stuff that led you to the job, those are the treatments, right, in this analogy. Does that make sense?
Yeah, so all the arrows can exist and be real, and you'll still get this statistical effect. Okay. Another question, if it's quick: it seems like in the pipe situation, you potentially misrepresent the relationship between X and Y by conditioning on Z, in the mediation case. But isn't the opposite also true: wouldn't you be misrepresenting X and Y by not conditioning on Z? Let's come back to this. I know what you're asking, and there's a denouement coming where I try to tie all this together. We're going to do something called the back-door criterion that puts all of this together. And I know everyone's thinking of a song now, at least the Americans are. The Doors, maybe.

Okay, let's do the collider. The collider is explosive. It's great; it's my favorite confound. This is where the selection effect at the beginning of lecture comes from. So the collider is a case... it's like the fork, but reversed. In the fork, Z is a common cause of X and Y. In the collider, Z is a common result of X and Y: both X and Y influence Z. It's a collider of X and Y. So you could say X and Y jointly cause Z. X and Y are really independent; they have no causal linkage between them. But if you condition on Z, it creates a statistical association between X and Y, a spurious one, if you're trying to make a causal inference. This happens all the time, I think, and it's very dangerous. There are lots of cool effects that arise from conditioning on colliders. One of them is the example I gave you at the beginning, with trustworthiness and newsworthiness: selection for a grant or publication is a collider of trustworthiness and newsworthiness.
Selection arises from both. Trustworthiness and newsworthiness can have no statistical association in the general population, but conditional on z, on being published, there is a statistical association between the two. It's spurious, which means it doesn't inform you about the causal relationships. I'll give you more examples. Colliders are the hardest conceptually, but they're also the most hazardous, and they're really worth understanding. And they're common; they have to happen. It just depends on which variable you're looking at; they will happen in all causal graphs eventually. Another way to think about this, as I put it at the bottom: learning x and z reveals y. The thing about colliders is this finding-out effect: once you know the collider, learning one side tells you about the other. Let me give you some heuristic examples to help. Imagine we've got a light, and it's controlled only by a light switch and the presence of electricity. Now you're thinking, but there's a light bulb. I know there's more, but this is a conceptual example for learning. Now imagine I tell you that the switch is on and the light is off. Can you tell me whether the electricity is working or not? Of course you can, because it's logic. This is how colliders work; this is the finding-out effect. It works in continuous systems as well. It's just harder to give examples that are this crisp, but it works for continuous variables too. And on the other side: if the electricity is on and the light is on, then the switch is on, because otherwise the light wouldn't be on. You see how colliders work. So when you get confused about colliders, think of the light switch, or some other example that you like. But this is the basic logic of how they function.
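The light-switch logic carries over to continuous variables. Here is a minimal sketch (my own toy example, not from the course code): x and y are causally independent, z is their common result, and slicing on z manufactures a correlation between them.

```python
import numpy as np

rng = np.random.default_rng(1977)
n = 100_000

# x and y are causally independent; z is their common result (a collider)
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

# Unconditionally, x and y are uncorrelated
r_marginal = np.corrcoef(x, y)[0, 1]

# "Finding out": within a narrow band of z, knowing x tells you about y
band = np.abs(z) < 0.5
r_conditional = np.corrcoef(x[band], y[band])[0, 1]

print(round(r_marginal, 2), round(r_conditional, 2))
```

Within the band, a high x forces a low y (and vice versa) to keep z near zero, so the conditional correlation comes out strongly negative even though no causal link exists.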
So here's the trustworthiness and newsworthiness idea. Newsworthiness and trustworthiness both influence being published, and selection is compensatory. So if I tell you there's some study that's not very trustworthy, but it's been published in Nature, can you tell me how newsworthy it is? Yeah, you know it's clickbait. You know it's on BuzzFeed already. I shouldn't pick on BuzzFeed; it's definitely not the worst part of the internet. There are lots of effects like this that happen all the time. Here's one that I quite like. It comes from Matt Hahn, an evolutionary biologist, on Twitter; collider effects are an obsession of his, like they are with me. We're here in Europe, but everybody here has heard of basketball, right? This weird sport we play, with a ball and a court, dribbling the ball and throwing it. It's a great sport. And being tall is, causally speaking, definitely an advantage in basketball, because there's this hoop and it's elevated above the height of any normal human being, so the taller you are, the easier it is to score field goals (they're called field goals). Nevertheless, conditional on being a professional player, there is no correlation between height and shooting percentage. But that's conditional on being selected, because the shorter players are compensating: they're awesome in other ways. Almost certainly what's going on here is that the shorter players are amazing in other ways, and the taller players are less amazing in those ways, but they've all been drafted because they're amazing players. This is post-selection. If you condition on selection, you cannot make causal inferences about the connections between the criteria that led to selection. They have been distorted by the selection effect.
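Matt Hahn's basketball point is easy to reproduce in a sketch. All numbers here are made up: height and skill are independent in the general population, teams draft on their sum, and within the drafted the association turns negative.

```python
import numpy as np

rng = np.random.default_rng(23)
n = 200_000

height = rng.normal(size=n)   # standardized height
skill = rng.normal(size=n)    # everything else that makes a good player
# being drafted is a collider: selection on the combination of both traits
total = height + skill
drafted = total > np.quantile(total, 0.99)

r_population = np.corrcoef(height, skill)[0, 1]
r_drafted = np.corrcoef(height[drafted], skill[drafted])[0, 1]
print(round(r_population, 2), round(r_drafted, 2))
```

Among the drafted, a short player can only have made the cut by being highly skilled, so height and skill trade off post-selection even though they are unrelated in the population.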
This is what happens in regression models, and that's what I want to show you next. In the basketball case, nature does the conditioning for you; nature has conditioned on a collider, and we don't get to see the shooting percentages of players who are no longer playing basketball. But sometimes we do the selecting inside the regression model, and it fools us just as strongly. Well, in this case nobody's fooled: you see this graph and you know immediately that you've been tricked, because of course being taller is an advantage. All of these players would like to be taller. Okay, let's do an example where a statistical model conditioning on a collider creates a confound. This is a simulated example, but it comes from an empirical literature, and I want you to imagine the causal graph at the bottom. We're interested in happiness. Say we're happiness researchers; we're very happy, and we want to know what makes other people happy, and why people are sad, because we want to make them happier. That's a noble goal, I think; this would be good science. So imagine it's true (this is a thought experiment to show you how tricky colliders can be) that happiness causes marriage: happier people are more likely to get married, because they're nice to be around, maybe. And also age causes marriage. What does that mean? Well, the more years you're alive, the more chances you have to get married. So age causes marriage: the older you are, the greater the probability of being married. Those are two causal effects, and they operate together. Now our question, as happiness researchers, is whether there's any causal impact of age on happiness. Do people get sadder as they get older? This is something you'll see in this literature.
This is how it came to my attention: people trying to interpret the effect that there's a negative correlation between age and happiness. Here's a simulation where it's totally spurious and arises from this causal graph. The code to do this is in the book. Most of the simulations in this chapter are quite simple; I just simulate with something like rnorm, pulling out random deviates. This one's different, and I wanted to show you something different: this is an agent-based simulation. Here I create a population of little happy people, and they age, and they get married, and they live their little lives. And then they don't die at age 65; I send them all to Spain. Nobody dies in this simulation. Spain just fills up with happy people who never age past 65. It's true, right? I've been to Spain. So here's the algorithm for the simulation. Twenty people are born each year. In this population there's uniform happiness at birth, and your happiness never changes. What do I mean? Happiness is distributed between zero and one; when you're born, you get assigned a number between zero and one, and that's your happiness. Of the 20 people born each year, each person gets an evenly spaced happiness value, so there's a uniform distribution of happiness in the population, and your happiness never changes. Now, I know this is not realistic, but what I'm showing you is the hazards of inference. Reality is more complicated, and it would be even harder to figure out what's going on.
That's the point of these examples. At 18 years old you become eligible to marry, and then you have your first coin-flip chance. Your probability of getting married in this population is proportional to your happiness, which, remember, is constant. But your age is not constant. Age itself, the number, doesn't cause marriage, but each year you're alive you have another chance to get married, and that chance, which is constant, is a function of your happiness. Does that make sense? Married people remain married unto death; there's no divorce. That'd be a secondary process; the code's here, so if you want to play with it, feel free. And then everyone moves to Spain at age 65. So there's a function called sim_happiness in the rethinking package which does this. You can peek at the code; it's not that complicated, actually. It's just a big loop over these steps. I set the seed to 1977 just so you can replicate my exact numbers; there's nothing special about that seed, it's just so you can get my exact numbers back. And I simulate a thousand years so the population is at equilibrium. There's no burn-in effect here; this is at equilibrium, a steady-state age distribution and everything. Then we get a data frame: 1300 observations, which means there are 1300 people in the population, and we have three variables. This is a cross-sectional sample of 1300 people after a thousand years. And then we run regressions on them, because we're social scientists; that's what we do. Here's the statistical model. We're going to run a regression of happiness, because we're interested in the association between happiness and age. Happiness on the top line is the outcome, and then in the linear model for mu we've got a slope bA for age, where A is the individual's age. That's our target of inference.
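The book implements this as sim_happiness in the rethinking R package. Below is a Python sketch of the algorithm as the lecture describes it; the per-year marriage rate constant (0.2) and the exact even spacing of happiness are my assumptions, so treat the numbers as illustrative rather than a reproduction of the book's output.

```python
import numpy as np

def sim_happiness(seed=1977, n_years=1000, max_age=65, n_births=20,
                  a_marry=18, rate=0.2):
    """Sketch of the agent-based simulation: happiness is fixed at birth,
    marriage chances accumulate with age, and 65-year-olds leave for Spain."""
    rng = np.random.default_rng(seed)
    age = np.empty(0, dtype=int)
    married = np.empty(0, dtype=bool)
    happiness = np.empty(0)
    for _ in range(n_years):
        age = age + 1                                  # everyone ages one year
        # 20 newborns, happiness evenly spaced and constant for life
        age = np.concatenate([age, np.zeros(n_births, dtype=int)])
        married = np.concatenate([married, np.zeros(n_births, dtype=bool)])
        happiness = np.concatenate([happiness, np.linspace(0, 1, n_births)])
        # each year, eligible singles marry with chance proportional to happiness
        eligible = (age >= a_marry) & ~married
        married = married | (eligible & (rng.random(len(age)) < rate * happiness))
        # at max_age everyone retires to Spain and leaves the sample
        keep = age < max_age
        age, married, happiness = age[keep], married[keep], happiness[keep]
    return age, married, happiness

age, married, happiness = sim_happiness()
print(len(age))  # 20 people per cohort times 65 age cohorts = 1300
```

At steady state the marginal age-happiness association is zero by construction, while within the married subpopulation it is negative, which is exactly the trap the regression below falls into.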
That's the exposure that we're interested in. But we also think: well, we know the marriage status of people in this population, so we should control for that. That seems like the right thing to do, doesn't it? (It was the wrong thing to do, said the narrator.) But I want to show you that that's the case. So I've created an index variable, mid, which is the marriage index: it's 1 if you're single and 2 if you're married. There could be other statuses too; we could have divorced as a third status, and so on. And then we put that in as a control. Seems reasonable. This is a multiple regression. Look what happens. It turns out single people, which is index 1, are less happy on average: the posterior mean is negative, and the whole compatibility interval is also negative. Single people are less happy on average in this population, even though you know from the simulation that happiness never changes. Why are single people less happy? Because happiness causes marriage; it runs in the other direction. It's not that marriage causes happiness. This is why I kept saying, starting at the beginning of this week, that regression models don't have arrows; they just measure associations. The causal model is separate; the arrows are something that's not in the Bayesian network. The model on the screen is a Bayesian network, but it doesn't have directionality. That's what the DAG does: it imposes directionality so that we can interpret what's going on. Then a[2] is the intercept for married individuals, and it's very positive: married individuals are on average happier, and you know why. Because happiness causes marriage. The arrow is going the other direction, but the model picks up the association, no problem; the Bayesian network has done its job. Now, the target of inference, the slope with age, is negative. Solidly negative; that whole posterior distribution is below zero. The old are unhappy.
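Here's a sketch of that fatal regression, with plain least squares standing in for the book's Bayesian model and a compact stand-in for the agent-based data (again, my own toy numbers). An index variable becomes one intercept column per marriage status, plus a shared age slope:

```python
import numpy as np

rng = np.random.default_rng(1977)
n = 5000

# stand-in data: constant happiness; marriage chances compound with each
# year past 17, with per-year probability proportional to happiness
age = rng.integers(18, 65, size=n)
happiness = rng.uniform(0, 1, size=n)
p_married = 1 - (1 - 0.2 * happiness) ** (age - 17)
married = rng.random(n) < p_married

# Without the collider: happiness regressed on age alone, slope near zero
bA_naive = np.polyfit(age, happiness, 1)[0]

# mu_i = a[mid_i] + bA * A_i : index variable -> one intercept per category
X = np.column_stack([~married, married, age - 18]).astype(float)
a_single, a_married, bA = np.linalg.lstsq(X, happiness, rcond=None)[0]
print(round(bA_naive, 3), round(a_single, 2), round(a_married, 2), round(bA, 3))
```

The naive slope is essentially zero, as it should be; adding marriage status as a "control" produces a married intercept above the single one and, spuriously, a negative age slope.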
It's a very sad south coast of Spain. No, it's not. There's no such effect; this is a spurious correlation that arises from conditioning on a collider. So let me explain it to you; I'm not done trying to explain this. I know how weird this is. What has happened? Okay, let's look at the population. You know this is spurious because I gave you the algorithm; you can simulate it yourself. You know happiness never changes, so happiness cannot decline with age. But conditional on marriage status it definitely does. That is, if we stratify on marriage status and look within each group, married or unmarried, there is a negative correlation with age. Here's the plot to demonstrate this. This is the simulated data at year 1000. Each point on this graph is a person. Remember, every year 20 are born, so there are 20 individuals in each column, and every year they move one step to the right. At the last step they age off to Spain, and 20 new ones enter the population. Happiness is uniformly distributed and constant, so you just move to the right with every time step. But sometimes you turn blue, which in this plot means that you're married. The blue filled circles are the married individuals, and the open points are unmarried individuals. So you'll see, starting early at age 18, the blue points are only at the top, because it's the happiest individuals who get married first; their marriage rate is highest. But over time, individuals who are less happy, say median happy, will also get married, because they've got a lot of years to find the right also-median-happy person. And by 65, when everybody's ready to go to Spain (sorry, I'm getting kickbacks from the Spanish tourist bureau)...
It's very nice in Spain. Most of the population in the simulation is married by then. Now, if we draw regression lines through both of these subpopulations, you can see there's a negative correlation between average happiness and age. The average happiness of married individuals at age 18 is very high; the average happiness of married individuals by age 65 is right in the middle, at the average happiness of the whole population. Likewise, the average happiness of single people starts near the population average, but then it declines, because the happier ones are migrating into the married subpopulation. But the distribution of happiness has not changed in a single person in this population, at all. We have been tricked by the collider. So let's look at the diagram again and try to bring this home. Here's the causal diagram again. Remember, marriage status is a collider of happiness and age: there are two arrows entering and no arrows leaving. If we condition on it, we allow information to flow from age to happiness, and we end up concluding that the arrow with the question mark on it is real and exists. But it doesn't. How do I know that? Because I wrote the simulation. In reality we never know, but we have to entertain these scenarios and use information external to the data set to present persuasive causal arguments about these effects. Does that make sense? I think collider bias is super cool; it's not just here to terrify you. Though a little bit of fear is good; it keeps life fresh. Let me give you another example. I like colliders so much that I have another one. This is one I call the haunted DAG. Colliders are so powerful that they can occur even when you haven't measured one side of the collider. It can be an unobserved confound, and you can still get collider bias.
Isn't that cool? Let me show you how that works; it's like a haunting. In my subfield, human evolutionary ecology, we're really interested in alloparental effects, that is, care from individuals other than the parents, such as grandparents. It's a very important question: what is the material benefit of having grandparents? Humans stack generations together; typical human families have three or four generations living together, and there are resource flows and information flows between the generations, and we want to figure out how important those things are to human welfare. People in other fields are interested in these things too, in terms of educational and wealth transmission effects. You don't only inherit from your parents but also from your grandparents; your grandparents can influence your attitudes towards education and thereby affect your wages. All these things are plausible. How do you figure this out in an empirical study? Imagine a situation where we've sampled triads of individuals, and we're looking at something like educational outcomes: how many years of education they complete. We have G, grandparents; P, parents; and C, children; and we consider all these arrows possible, and we want to measure each of them. There's an indirect effect of grandparents, because grandparents educated their own children. They influenced their own children's education and attitudes towards education, by having books in the home or whatever it is that you think makes this work, and the parents can then pass that on to their own kids. That's the indirect path from G to P to C on this graph.
You see that? And then there's also a potential direct effect, where grandparents do babysitting and tell their grandkids how important it is to study, and all these other things that the grandkids won't listen to. It's possible that that arrow exists and really matters. The problem is trying to measure it. This is a big problem. One thing you get if you run a regression like this, in the literature, is that you sometimes find negative effects of grandparents. It will look like grandparents are either doing nothing or actually hurting the educational outcomes of their grandkids. That is, the direct path from G to C will have a negative path coefficient on it. What could that be, other than grandparents being toxic? It could be collider bias. So let me show you how it could be collider bias. It's plausible that parents and their children share unobserved confounds that are not shared with the grandparents. We write this down as a variable U; U in these diagrams means unobserved, some unobserved confound. And you have to imagine, whenever you do an observational study at least, but also in many experiments, that there are Us all over the place, potentially a bunch of them, and you want to think about how they can interfere with your inference. In this case, this would be something like the neighborhood you live in. Do you live in a neighborhood where the neighbors are really into education, a good neighborhood with a good school? That's a big effect. Both in Germany and the United States, school effects and neighborhood effects are really powerful influences on educational outcomes. And even though you haven't measured U, it makes parents into a collider. Do you see that? Two arrows enter P. Now, what happens if we condition on parents?
Say you're trying to measure the direct effect of grandparents on their grandkids, and you realize, of course, that there's this indirect path through parents. So you want to condition on parents, to control for parents. But when you do that, you create collider bias. You lose either way; there's no winning in this graph, I'm afraid. Sorry. I'll build you up eventually; we'll come around and you'll feel better. So P becomes a collider, and we condition on it. We simulate this: 200 triads. And bGC is set to zero at the top; you see that I'm assuming the direct path from grandparents to kids is zero. They have no effect; I'm assuming that purely for rhetorical advantage here. You can make it any number and you'll still get a distortion, but it's clearest when I set it to zero. Then I simulate from this graph as if it were real, and then run a regression. What happens is shown at the bottom. This is a multiple regression of child outcomes that includes both parents and grandparents, and you end up concluding that grandparents hurt their grandkids. There's a very strong negative partial regression coefficient between grandparents and kids, but it's entirely a result of collider bias. How does this work? This is where we bring it all together. Conditioning on a collider opens a path. For the other two kinds of paths, the fork and the pipe, when you condition on the middle variable you close the path. In a collider, conditioning opens the path; it's closed by default. So conditioning on P opens a path from G, through the unobserved variable (the fact that you haven't measured it doesn't mean it doesn't matter), to C, and this creates a spurious correlation. Isn't that cool?
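A Python sketch of the simulation just described (the book's version is in R; the coefficient values bGP = 1, bPC = 1, bU = 2 are my assumed numbers, and the direct effect bGC is fixed at zero):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200  # triads
b_GP, b_GC, b_PC, b_U = 1.0, 0.0, 1.0, 2.0

U = 2 * rng.integers(0, 2, N) - 1              # unobserved neighborhood: -1 or +1
G = rng.normal(size=N)                         # grandparent education
P = rng.normal(b_GP * G + b_U * U)             # parent: collider of G and U
C = rng.normal(b_PC * P + b_GC * G + b_U * U)  # child: true G -> C effect is ZERO

# multiple regression of C on G and P, i.e. conditioning on the collider P
X = np.column_stack([np.ones(N), G, P])
_, bG, bP = np.linalg.lstsq(X, C, rcond=None)[0]
print(round(bG, 2), round(bP, 2))
```

The coefficient on G comes out strongly negative even though the true direct effect is zero: within a stratum of P, a high-G family can only match a low-G family's parental education by being in a worse neighborhood, and the neighborhood then drags the child down.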
It is; it's super cool. One way to see what's happening when you stratify by parents is to look at the simulated data. On the left I'm plotting grandparent education against grandchild education. Bad neighborhoods are shown in black, good neighborhoods in blue. And then I've filled in all the points where the parent, who's not on either axis, falls in a particular stratum of education. That's what conditioning on parents does: it stratifies by different levels of parental education, and then the regression only looks within each stratum. That's how multiple regression works, recall. And if we look at parents between the 45th and 60th centiles, which is what I've shown, and draw a regression line through those points, it's negative. That's why you're getting a negative effect of grandparents. Why is it negative? I know: if you're confused, that means you're paying attention, and I thank you for that. Why does this happen? Okay, let's focus only on the parents in this narrow range of educational outcomes. For parents in the good neighborhoods to be within this range, they must have had less-educated grandparents. Likewise, for parents in the bad neighborhoods to have the same educational outcomes as those parents in the good neighborhoods, they must have had more-educated grandparents; otherwise, they wouldn't have the same educational outcomes. There's no way this batch of parents could be tied, despite their neighborhood differences, unless they had different sorts of grandparents. And that's where the spurious grandparent effect comes from. It's because of a collider: there are two ways to end up being a highly educated parent. Either you're in a good neighborhood, or your own parents, the grandparents here, were educated. Both of those arrows enter the P box, the parent box, and that's why it's a collider. Does this make sense? It makes a little bit of sense, right?
So you'll go home, make a cup of coffee or crack a Red Bull, sit with this example, and work through it; you can understand it. These effects are incredibly commonplace. Let me try to bring this all together; I know this is confusing. There is a framework that unites all these examples, and it's called the backdoor criterion. This is due to Judea Pearl, a computer scientist at UCLA, whose book Causality, published in 2000, lays out this framework. The backdoor criterion is the idea that if you want to figure out the true causal impact of some exposure on some outcome, to deconfound the graph, then you need to shut all the backdoor paths from that exposure to the outcome. There's a forward path, the so-called front door. In this example at the bottom, we're interested in the path from E to W, but there's some unobserved confound that creates a backdoor path, because there's an arrow entering the back of E. We have to shut that; we have to slice that arrow off somehow in order to infer the true causal effect. When you do an experiment, that's exactly what you do, because you set E by playing god. That's how experiments work, and that's why they shut all the backdoor paths: when you do a proper randomized experiment, by definition you cut all the arrows entering the back of the treatment variable. But in observational studies, and in bad experiments, which is, let's face it, a lot of experiments... because if you do human science, let's face it, you can't randomize most things about people. It's unethical; you just can't do it.
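The experiment-cuts-the-backdoor claim can be sketched directly. All the structural numbers here are mine (a true effect of 1, a confound weight of 2); the point is only the contrast between the observational and randomized worlds.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Observational world: unobserved U opens a backdoor, U -> E and U -> W.
# The true causal effect of E on W is 1.
U = rng.normal(size=n)
E_obs = U + rng.normal(size=n)
W_obs = 1.0 * E_obs + 2.0 * U + rng.normal(size=n)
b_obs = np.polyfit(E_obs, W_obs, 1)[0]        # confounded: well above 1

# Experimental world: we "play god" and set E ourselves, cutting the arrow
# into the back of E; the backdoor is shut without ever measuring U.
E_exp = rng.normal(size=n)
W_exp = 1.0 * E_exp + 2.0 * U + rng.normal(size=n)
b_exp = np.polyfit(E_exp, W_exp, 1)[0]        # close to the true effect of 1
print(round(b_obs, 2), round(b_exp, 2))
```

Randomizing E doesn't remove U's influence on W; it just makes that influence independent of the treatment, which is all the backdoor criterion asks for.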
There's lots of stuff with people you could never do ethical experiments on. So these backdoor paths exist, and you want some set of criteria for which variables you could include that would shut all the paths. That's what the backdoor criterion tells you. And this brings us back to ye olde causal alchemy, the four elemental confounds. These are the only ways that variables interact in these graphs, and you know how to shut each of them. The only one I haven't explained so far is the descendant, but I'll explain that very quickly. Looking at the upper left, the fork: this is an open path unless you condition on z. So if this is a backdoor path, you need to condition on z to shut it. Likewise with the pipe: that path will be open unless you condition on z, so if it's a backdoor path, you want to condition on z. Does that make sense? Then with a collider, the path is closed until you condition on z. So if that's a backdoor path, leave it alone, because it's already shut. Does that make sense? And then finally the descendant: if you've got a descendant, here a, coming off of z, and you condition on a, it's like weakly conditioning on z. It's not exactly the same, but since a and z are correlated, they share information, and if you condition on a descendant, you can partially shut (or partially open) the path. So you have to be prudent about that too. What is a like in practice? It's like a proxy. You can't measure the thing you really care about, but you measure something that you think is correlated with it. You have to be just as careful about conditioning in those cases. Does this make sense? Okay, we're running out of time here. So instead of rushing through this material, I've saved some examples where we're going to play with these ideas. I'll show you some causal graphs when you come back on Monday, and we're going to exercise the backdoor criterion on them.
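The descendant-as-proxy idea can also be checked numerically. In this sketch (my own toy fork, with no true x to y effect), conditioning on a noisy descendant of z removes only part of the confounding, while conditioning on z itself removes all of it:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

Z = rng.normal(size=n)          # fork: Z -> X and Z -> Y, no X -> Y arrow
X = Z + rng.normal(size=n)
Y = Z + rng.normal(size=n)
A = Z + rng.normal(size=n)      # descendant of Z: a noisy proxy

def slope_on_x(controls):
    """OLS coefficient on X from regressing Y on X plus the given controls."""
    M = np.column_stack([np.ones(n), X] + controls)
    return np.linalg.lstsq(M, Y, rcond=None)[0][1]

b_open = slope_on_x([])         # backdoor open: spurious positive slope
b_proxy = slope_on_x([A])       # proxy: path only partially shut
b_shut = slope_on_x([Z])        # conditioning on Z itself: slope near zero
print(round(b_open, 2), round(b_proxy, 2), round(b_shut, 2))
```

The proxy lands between the confounded and deconfounded answers, which is exactly the "weak conditioning" behavior the lecture describes.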
We're going to look at them and ask: are there confounding paths here? What's the minimal set of variables we need to include to remove all the confounds? We'll run through some examples together in class on Monday. All right, you have some new homework; let me jump forward to it. Okay, there's new homework, and it's already online. It involves foxes and a causal graph. That's a fox, and a causal graph, and I'm going to ask you to make inferences in a multiple regression in light of the graph. Next week I'll finish up with the backdoor criterion, and then we'll start talking about overfitting. Okay, have a good weekend. Enjoy the snow.