Thanks everyone for showing up today for this workshop on causal inference. The goal is to give you a basic introduction to formal causal inference in statistics and the sciences. As the title slide here suggests, I think of this as the pre-statistical stage of a scientific project: you sketch out your causal assumptions first, and then this enhances everything that comes after it in the workflow. I want to start with this metaphor of constellations, though, as a way to maybe make this more interesting than statistics — because, really, I hate statistics; I only do it because I have to. I'm in this for the science. So let's think about constellations first. Constellations are important in all human societies, and typically societies create mythologies about how the constellations exert causal forces on human lives. Now of course, most of us here probably believe that they don't, but it's not a silly idea. All sorts of weird things happen in our lives, and the stars are these mysterious objects in the sky that attract everyone's attention. It's not insane to build theories that they might be causal. And so we get these really elaborate versions of this, with things like horoscopes. One of my favorites is the horoscope of Prince Iskander, who was the grandson of Tamerlane — you know, conqueror of basically all the known world at the time — and born on 25 April 1384. This horoscope, which was converted into a giant tapestry, predicts his whole life from the positions of the stars, the constellations, at the time of this individual's birth. It's not something that happened only to royalty, of course; probably a number of you have had your horoscopes done too. If you haven't, I recommend it, it's fun. But of course, how could this possibly work? As scientists we're supposed to develop skepticism about these things, and of course I respect that, but there's a metaphor here for how statistics intersects with the sciences that I want you to hold in your frontal lobe for at least the next 30 minutes. It's highly implausible that these sorts of horoscopes can provide any kind of detailed prediction, and the reason is that very little goes into them. The premise of these things is: you tell me your birthday, and the conditions on that date will let me chart out your entire life. That's a tiny bit of information in for a ton of information out — information that's supposed to be unique to the special case it's asked for. If you read the horoscopes in newspapers — there were these things people used to read called newspapers; I'm not sure people read them anymore, but they used to have horoscopes in them — the horoscopes are incredibly vague, because if all you share with another human being is a birthday, then only the vaguest sorts of predictions could be true. Now, statistics in the sciences has a similar feel. I call this the horoscope syndrome with statistics, and it goes: you tell me your data, and I, as the consulting statistician, will tell you how to extract a significant result from it. I'm joking a bit here, but only a bit. You recognize the motivational problem.
Now both of these things are superstitious, because it's not plausible that, knowing only the data, you can tell someone how to process it. The next three hours are basically a protracted essay on this, and on what additional information needs to be put in, in addition to the data, so that we can process it in a way that reveals — well, scientific facts, hopefully. The other metaphor we get from constellations is, of course, that there are objects in the sky which are not stars: the wanderers, the planets, from the Greek word for wanderer. What you see here is Mars. That S shape in the middle of the screen is Mars. This is a time-lapse photo of Mars on different nights, taken at the same time each night in a sequence, and it traces out this S shape. This is called retrograde motion. Those of you who've done some astronomy are familiar with this phenomenon. A large part of modern physics grew out of trying to explain this phenomenon and understand the structure of the solar system — why Mars wanders in an S shape in the night sky. The other shape, a little bit to the left here — can you see my cursor? Yeah, wonderful, this is going to go much better — this is Saturn. The other planets also do this, but it's less pronounced; Mars is the one with the really wild shape, because it's closer to us. Now of course, Mars has an elliptical orbit and so do we, and there's this period where we're close to it and then we pass it, and that makes the S shape, and so on. You learn this in secondary school at some point, and then better things happen to you and you go on to do other kinds of science. Now someone's saying no, they never learned this. Well, I'm happy to have told you now; it's a fascinating thing. But the point I want to make here is that it's very easy to construct a highly accurate mathematical model of the path of Mars in the night sky that is causally incorrect, and that's what human societies did for millennia. The so-called Ptolemaic model, which has the earth at the center of the solar system and the sun and everything else circling the earth in orbits, is incredibly accurate. It's also incredibly wrong from a causal perspective. The figure on the right-hand side of this slide is a cartoonish version of a Ptolemaic model, which uses circles on circles — epicycles — to model an imaginary physical system that would produce the observations we see in the night sky, incredibly precisely. So the Ptolemaic model, which had, I forget how many, something like 30 of these circles on circles, is extremely accurate. You can use it to predict where Mars will be. That is not a problem. But it's wrong. And this is a fundamental point of friction in scientific modeling: statistical tools themselves don't have any direct contact with causal forces. If all they have contact with is data, then you get a Ptolemaic model, which is not bad — as long as you understand it's Ptolemaic, and by Ptolemaic I mean it's just predictive. It doesn't teach us anything about the actual structure of the system we're studying. So, again, the next almost three hours are a protracted essay on what we can do about this, and how we can tell the difference between a Ptolemaic model and, say, the Keplerian model, which actually has ellipses, and so on.
Okay, so what we're not going to talk about is what we might call the statistics wars of the late 20th century — the grand fight between Bayes, represented here by Godzilla on the left, and frequentism, or Fisherian statistics, represented here by King Kong on the right. I don't think this is where the big problems are in statistical practice in the sciences; it's not about this distinction. There are methods in both of these approaches which are very capable of solving all the problems I'm going to nominate for you today. I prefer Godzilla, because let's face it, he's cool — I mean, a nuclear-powered giant dinosaur — and I have some more substantial reasons for liking Bayes too; if you take my class I'll tell you those. But I have colleagues who can do all the same stuff with frequentist approaches, and there are costs and benefits both ways. The problems come from misapplication of causal inference, which is the big thing we need to focus on — and which of course destroys both of these approaches. Neither Godzilla nor King Kong produces sensible results unless you get the causal inference straight ahead of time. Okay, I hope at least some of you know this meme, so you don't just think I'm a crazy person now. This is the world's most famous dog. I can tell from facial expressions that some of you think I'm crazy now; that's okay. Okay, what do I mean by causal inference? Causal inference is more than measuring associations, which is actually all statistical models can do. Statistical models measure associations among variables — and by association I mean more than just a correlation; an association is a generalized kind of correlation, and correlations are linear associations of a special kind. So statistical models can do a lot more than correlation, but they're still measuring associations, which is mutual information among the variables, and causal inference is more. There are two kinds of metaphors you can use to grapple with it, even though they're really the same thing, and I'll try to convince you of that today. The first is the idea that causal inference is about a special kind of prediction: predicting the consequences of doing something, predicting an intervention. Think of Mars: Ptolemy could predict where Mars would be, because he wasn't going to push it. If he pushed Mars, he wouldn't be able to predict it anymore; something very weird would have happened. Or if you want to launch a probe to Mars, you've got to get the physical structure right — that's an intervention. But for just observing the system, you don't have to get the causes right. So causal inference is about accurately predicting the consequences of an intervention. That's the forward view of causal inference. And then there's the retrograde view, if you will — I prefer to call it the explanatory view: causal inference as the imputation of missing observations, things that could have happened but that we didn't observe. Let me spend one slide on each of these, just to get your imagination going. So, on the interventionist view: think about an everyday occurrence, like the wind blowing the leaves on a tree. These things happen together with very strong statistical association — the wind blows, the foliage on a tree moves. Now, you know that the wind is causing the leaves to move. But if you're in your house and you're not feeling the wind and you look out and you see the leaves moving...
You predict wind. So prediction can go with causation or against it. I'm not telling you anything you don't know. But if you were to intervene in this system — say, get you and, you know, thousands of your closest friends to climb all the trees and shake them violently — I assert it would produce very little wind. A little bit in the neighborhood of each tree, but it will not generate much wind otherwise. And that's because the causal arrow goes from wind to tree and not from tree to wind. This distinction makes some sense. We're interested in more elaborate systems where there are potentially more moving parts, but the basic principle still holds: if you don't understand the directional nature of the causes, then you can't correctly predict the consequences of interventions. So one way to think about this view of causal inference is as an answer to the question: what if I do this? And in fact there's a tradition I'll introduce you to today in causal inference in statistics, where we study causal models through a mathematical framework called the do-calculus — because you're doing something to the system, it's called do-calculus. We won't actually do calculus, but I'll show you some do-calculus in cartoon form. Another perspective is causal imputation. Imputation here means using information to infer something that's missing. Under this view, it's like the alternative-history view: if you know the causes of something, that means you would be able to reconstruct unobserved counterfactual outcomes. For example, if the Soviets had gotten to the moon before the Americans, as in this image — alternative histories. This perspective uses the same mechanics as the previous one, but it has a very different purpose, and I think a purpose very close to those of us who study human evolution: we're trying to explain the world we're in and why it didn't turn out differently. That's the counterfactual perspective — causal inference as explanation and imagination, as opposed to the applied view of trying to predict what would happen if we did something. Does this make some sense? Structurally they're the same in what you do, but the purpose you put them to is different. Okay, just a couple of quick slides to try to convince any of the holdouts that this is for all of us. Experiments are no refuge from the requirements of causal inference. The most basic reason is that just to understand why experiments work — and why sometimes they don't — you have to understand how causal inference works. Why does randomization let you uncover causes? I often ask this question; if this were a more interactive session I would now open the floor and we would get a variety of answers, I assert — I have done this before. Understanding this requires understanding how causes work and why this bizarre thing of using random numbers actually works to produce data that help us infer causes — and it does work, I'm not denying that. Of course, sometimes experiments don't work, and understanding why they don't work also requires the same set of logical tools. There are a bunch of sub-questions that I know the experimentalists fight with all the time. Should you be testing for balance on covariates? People disagree about this. It'd be nice if there were some principles we could use to resolve those debates — there are. What if the treatment is imperfect?
Sometimes you assign the treatment and people don't cooperate — you give people pills to take and they don't take them. This is a famous problem in public health. There's intention-to-treat analysis, where, even though it's an experiment, you have to statistically process it in a different way, because there's non-compliance. Should you control for anything? Everything? Why not — you only live once. The major topic of today, which I'll talk a lot about, is how you choose what to control for. Something I won't talk about today, because there's simply not enough time, is how we would actually predict the causal effect for the target population and not just for the sample. That is a different kind of issue: it's easy to study a sample; it's hard to study a population. All the answers to these things depend upon causal assumptions, it turns out. The second thing: descriptive research is incredibly important to me and, I think, to all anthropologists — just getting the description of cultural diversity right is our first mission. And this is also a causal problem. I'm just going to assert this today, and maybe some time later I'll give a seminar on it, because I actually have a whole talk on this one slide, blown out into a one-hour talk; I could do that for you some other time. Just take it as a provocative promise for now. Getting description correct depends upon describing how the sample differs from the population you want to describe. And those are causal assumptions — causal assumptions about how the measurements work, about why some observations are missing, and so on. Okay. So here's our agenda today, with our gentle dog, Doge. In part one, which we're about a third of the way through, actually, we're going to talk about causal salad, which is my playful term for contemporary practice in applied statistics in the sciences, where people do non-causal statistics and then interpret it causally. I want to give you a couple of examples of what can go wrong there. In part two, we're going to start to fix the issues; we're going to talk about causal design, and part two is the biggest part of today's time together. I'm going to introduce you to formal causal inference, and we're going to do it graphically, by drawing our assumptions as diagrams, and then I want to teach you how to analyze these diagrams with your eyeballs, essentially. In part three, I'm going to give you just a peek into the world of full luxury Bayesian inference. This is my term; don't Google it, you won't find it anywhere else. This is what I teach in the course I typically teach in the winter. The causal design in part two gives you a way to design statistical procedures to get at the causal queries of interest, but you still have challenges of estimation, and part three is about how you can actually go about getting useful estimates from finite samples. Okay, I want to take all three of those parts using two common examples, which are applied statistical problems. The first I'm going to call the two moms problem. Here are two famous moms — or at least one famous mom and her daughter. Some of you will recognize the person on the left as Bette Davis. Nobody? Anybody? You're making me feel old here. "Bette Davis Eyes" — there was a song and everything. No? Yeah, okay.
So, there's a bunch of questions in evolutionary anthropology which have a structure about understanding family influences on life-history events — how many kids you have, when you start reproducing, who you marry, and so on — and that's where the inspiration for this comes from. It's going to seem a little silly, but it's actually structurally very similar to lots of important questions in the literature as I read it. Consider the situation where the data are pairs of mothers and daughters. The things you've measured about them are their completed family sizes — so you know how many sons and daughters each of these women had — and you know each woman's birth order, whether she was first-born, second-born, and so on. The research question is to estimate the causal effect of mom's family size on the daughter's. The idea is that daughters don't know how many kids they should have, so they get advice from their moms; we'd like to estimate the strength of that advice. It's a cultural-transmission kind of question. The second example — I'm just going to introduce these now; we're going to come back to these two stories over and over again throughout the next three parts — is a pure bias question. A different sort of question, but one close to all of us, I think. Here we're going to think about situations where individuals apply for various statuses. These could be grants, or peer-reviewed papers, or job applications. So the data are a large set of applications. What we have for each is the applicant's social category — some category that we suspect might be a target of discrimination. It could be cultural background, it could be health status; it could be a large number of things with this structure. We also know the field the individual has applied to — this could be academic field, department, and so on. And then we know whether the application was successful or whether it was a failure. Our goal here is to learn if there is discrimination by social category. This is a causal question, as I think you'll see. In the first example there's a causal question about the mom's influence on her daughter's reproduction, and here there's a causal question about whether the people rating the applications are biased against a certain social category or not. So what I'd like to propose right now: here's a summary slide for those two things I just put up, and I'd like to take a 10-minute break right now. I know it's like we just got started, but I want you to think about this. You can walk around and get a coffee or do whatever you want — you can completely ignore me — but I'm just going to leave this slide up for the next 10 minutes. And while you're walking around, think about how you'd analyze this. You just had Roger's course last week, right? You did lots of regression stuff. Just think about: if you had these data sets, what would you do? Okay, and I will see everyone back here at 1:36. Okay, good. So, hopefully you've given some thought to this. I'm not going to poll you on how you'd analyze the data; those are your own private truths. As we move through the material you can reflect on your answer against, well, what I'm going to show you. I don't think these are easy questions, to be honest. These are realistically difficult kinds of research questions. So let's move into the meat of part one here, which I call causal salad.
And this is just a playful phrase I use a lot to talk about the way standard statistical tools are used to imitate causal inference without actually doing it, and I think it's important to show you illegitimate approaches in addition to legitimate ones, at the same time. We have all done illegitimate things in research, and it's a sociological conspiracy that causes this to happen; you're not uniquely to blame for doing these things — just like I accept no blame for the English language. It's a monstrous thing, a terrible language, but what can I do? I'm sorry. Okay, so: on the screen are delicious cinnamon buns, and some of them are dog tails. You probably don't have any trouble telling them apart, right? You can glance at this and see which of these you would bite and which not. Computers find this to be a very challenging problem. There's a range of adversarial image grids that people who study artificial intelligence and visual recognition systems use to test various algorithms — whole sets of things like this which are quite difficult for computers to do but trivial for human children, very, very easy for human children. And it's an interesting question why, in general, computers are very good at stuff that people are terrible at, like arithmetic, and really bad at things that human children find easy. I think there's a deep research topic there, actually, which I will say nothing more about today. But in this case, what's happening is that the artificial intelligences that do these visual-recognition tasks — that try to categorize images as dog tails or cinnamon buns — are purely statistical engines. What they don't do is recognize objects in the way that you would. When you view an image, one of the things you're doing — not the only thing — is seeing the image as something that's cast by objects in the scene; you find objects and then you interpret what you see in light of those objects. Computers don't do this at all. They have edge recognition and other things, but they don't construct an internal state where they think there's an object creating the scene. As a consequence they can be tricked in very comic ways, ways that people can't. So here's just an example — there's a whole literature on this in computer vision, called adversarial examples. A very large neural network can be trained on a corpus of animal photos, and then you show it some new animal photos it hasn't seen before and ask it what kind of animal this is. This is important for a lot of us even at this institute, because we have camera-trap data, and we'd like it to tell us what the animals are. So here's a panda, and this particular neural network thinks it's, you know, about a coin flip that this is a panda. This is a good neural network — it gives you a confidence, which is a nice thing about it. And it turns out — and this is true of a wide range of neural networks; it's not special to this one, though this one is really good, so it's a good example — you can add random noise. You add a tiny amount of random noise to this image, weighted by 0.007 — the noise you see in the middle. If you feed only this noise to the neural network, it'll tell you it's a nematode with confidence 8%. So it doesn't know what it is — it's a nothing worm; it's just guessing.
If you mix them, you get the image on the right, which to a person looks identical — you can't tell them apart; they look like the same thing — but now the neural network is really sure this is a gibbon. This is not a special example; there's a citation at the bottom here, and there's a literature on these so-called adversarial images. This is very important for self-driving vehicles, because little bits of noise on a stop sign can cause them to run the stop sign. Bad news, right? So why aren't you tricked when the computer is tricked? Well, I'd assert a big part of the problem is that the robot is cause-blind. It's just doing edge detection and other small things; it doesn't think about the causes of the scene. It just uses statistical structure. So — hopefully a memorable metaphor — the panda that would be a gibbon is a memorable metaphor for this thing I call causal salad, which is a set of informal heuristics that are actively taught, I think, in the sciences, for using statistical tools to imitate causal inference tasks. And sometimes this can work, so I want to start by saying it's not that it never works; it's that it carries no guarantee, and it can easily go wrong. So I'm going to run through a list here, and then I'm going to show you examples in the context of the two moms and pure bias examples. Okay. Here's causal salad in a quick definition. We take as our ingredients some vague query — there's an interesting topic we'd like to understand. Then we find some variables that are related to it, and/or we measure them. We add all the variables to a multiple regression and let it sort it out. We've all done this. And we pretend there's no confounding — we don't talk about it; if there's confounding, it must be something we've measured. And we pretend there's no measurement error, or if there is, we say measurement error is not super important here. The pattern of missing data is of no concern — we just do some arbitrary operation on the missing data; we may not even describe it in the paper. And we pretend that AIC and p-values will pick the scientifically correct model. We've all done this. None of this is logical for causal inference, and the goal today is to show you why — but what I'm going to do first is show you the consequences of doing these things, and why they don't reliably recover causes. And then I couldn't resist having a slide on multiverse analysis, because I guess this is becoming a thing. I don't like multiverse analysis — and maybe this talk isn't about what Richard likes; there are lots of things I don't like, and we don't have time for that today — but multiverse analysis can be a very reasonable thing to do. So here's a definition I take from a 2016 paper on the topic: the process means performing all analyses across the whole set of alternatively processed data sets, corresponding to a large set of reasonable scenarios. So the idea is you have a single data set, and there are lots of things you could do to code variables or drop cases or whatever. And people do all kinds of different stuff — sometimes the same researcher in different papers will process the same data set in different ways — and this seems a bit illegitimate.
And so, as third parties, we want to understand how much we should trust those results, and we might do a multiverse analysis: take all the data-processing decisions that this group of researchers has made, try them all, and see how it changes the results. Not a bad idea. There's one on the right-hand side here — I don't expect you to read it; it's just a list of a bunch of different data-processing decisions from a particular literature that's studied in the 2016 paper cited at the bottom of this slide. This isn't going to uncover causes either. It's sort of taking causal salad and stacking salad on salad. It's no surprise that statistical results are sensitive to model structure and to how we process the data; that shouldn't surprise anyone. What we need is some logical framework to make sense of that variation — why the results change. And in particular, the word "reasonable" in the definition is doing a lot of work. What does reasonable mean? As I hope to show you today, there are ways to at least be transparent about what "reasonable" is, and in particular the reasonable things are causal assumptions. These are often unstated, and if we can state them, then we don't have to argue about whether they're reasonable or not, because they're transparent and we can invite our colleagues to agree or disagree. Okay. So the takeaway message is going to be: if you don't put causes into your analysis, you don't get causes out. You have to make causal assumptions to make causal inferences. There's a long tradition in philosophy of science and philosophy of statistics defending this view. In particular, the quote "no causes in, no causes out" comes from this book by Nancy Cartwright, Nature's Capacities and Their Measurement. She mainly does physics examples, because she was a physicist before she became a philosopher, but obviously this applies to biology and psychology and anthropology and everything else as well. The basic point is that statistical models are insufficient because they simply do not contain causal information — they don't contain arrows saying whether it's the wind that blows the trees or the trees that blow the wind. That information has to come from some other model, distinct from the statistical model you use to process the data. So multiple regression cannot distinguish between causes and confounds; that's something you have to do with information you put into the design of the statistical model itself, and I'm going to show you how to do that today. P-values, of course, are by their very definition not causal statements; they rely upon the whole framework being set up so that the difference they're attached to can be interpreted causally. And AIC and related predictive criteria are just predictive. In and of themselves, they don't carry any causal information. AIC would happily choose Ptolemy's model of the solar system, because it does a really good job of predicting things. Okay, let's come back to two moms, and I can substantiate some of my wild claims for you. Just to remind you: the situation is we have many pairs of moms and their daughters, and for each woman we know her completed family size, and we know her birth order — whether she was first-born, second-born, and so on. We'd like to use these data to get some estimate of the effect of mom's family size on her daughter's — whether there's some direct causal effect, through imitation in particular.
So, how would you approach this? Well, you don't have to say — there is a literature studying questions like this, so I know how people approach it. We have these family-size variables; let's call them M for mom's family size and D for daughter's family size — I'll use these variable labels throughout the rest of our workshop today — and birth orders B1 and B2, where B1 is mom's birth order and B2 is the daughter's. To make this easy to think about, you can just treat B1 and B2 as indicator variables for whether the woman is the first-born in her family: the variable is one if she's a first-born, and zero otherwise. That'll be fine for the conceptual lessons we have today, though of course in general it could be continuous, and we could ask all kinds of subtle questions about birth order. So now the question is: how would you construct a regression to estimate the influence of M on D, the causal query? We have data, and we are all superpowered regression scientists here; we can do this. Let's make some regressions. I'm sort of betting that most of you know the tilde notation for regression formulas, but in case some of you don't, I'm going to quickly explain it just on this slide, and then I won't again — I hope that's okay. The key question here is whether we include all the variables or only some, and of course how to include them as well. The first regression equation we have here, M ~ D, means we're going to model M as a function of D — sorry, I've got this reversed; it's D ~ M in the rest of the slides, which is correct, I hope: daughter's family size as a function of the mom's family size. If you use R to do your statistics, which is not a bad choice, then you'll enter this in the formula syntax, and I think Roger had you do this last week, right? There's extra notation for random effects and so on that you can use. And then you can start adding things to the formula: you could add the mom's birth order, you could add the daughter's, and so on and so forth. As we move through these, how do we decide which of them to use? There are a bunch of different ways to go about this, and to know if we get it right, we're going to need a simulation — because the horrible thing about reality is that we never know what's true. I'm sorry, it's just, you know, you're all adults; you know this is how it is. So we have a simulation, and the code for it is in the repository next to the slides. We don't need to focus on the details; we're going to unfold all the assumptions of the simulation as we move through the slides. What we're going to assume is that mom's family size has no influence on the daughter's. We assume first-borns have higher fertility than later-borns — perhaps because they inherit more stuff, they're given preferential treatment, or they receive the assistance of their siblings in raising their kids; there could be a bunch of reasons for this. And then I'm going to simulate 200 mother-daughter pairs, so we have a bigger sample than almost any anthropologist ever gets. And here I've got the formulas in the right direction, thankfully: we're predicting daughter's family size with mom's. To make this simple, think of it as a simple linear regression. It's fine for the sake of the example, even though a Gaussian linear model pretends family size could be negative.
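Before we look at the results, here is roughly what a simulation with these assumptions could look like in R. This is a minimal sketch of my own — the coefficient values, noise scales, and seed are invented for illustration; the script in the repository is the authoritative version.

```r
# Two-moms simulation sketch (illustrative values, not the repository script)
set.seed(1)
N  <- 200                       # mother-daughter pairs
B1 <- rbinom(N, 1, 0.5)         # mom is first-born? (1 = yes)
B2 <- rbinom(N, 1, 0.5)         # daughter is first-born?
M  <- rnorm(N, 2 * B1)          # mom's family size: birth order raises fertility
D  <- rnorm(N, 2 * B2 + 0 * M)  # daughter's family size: NO effect of M, by assumption
```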
Right — but that's not what's important about this, and that isn't where this goes wrong. I've fit three models. The first is one that ignores the birth orders. Remember, I've assumed birth order matters: a woman's birth order affects her fertility. So we've got a model that ignores it, and the simulation assumes there's no causal effect of the mom's family size on her daughter's. We run that model, and we get essentially the right answer. In this case it's a vague answer, but it straddles zero; it's essentially the right answer, and you wouldn't make any strong conclusions from it. And then you'd report that — you should report that. One thing I will say about my literature is that they love null results for things like this, so we don't have a bias against null results in this literature. Then we have a model where we add the mom's birth order, and this has a dramatic effect on the estimate. Now there's a huge range of possible values that it covers, and if you're tempted to say, oh, it still intersects zero, it's the same — that's the wrong instinct. This is a completely different estimate from the first one, because it also intersects minus 0.3, which is a large effect on this scale. So now it's consistent with the idea that — well, it's not a precise estimate, but it could be a large negative effect; it could be that mom's family size actually reduces her daughter's family size, or it could be nearly nothing and it's just not a precise enough study to say. But it's very different from the first result. And then, here's the interesting thing: imagine you added the daughter's birth order instead. That only makes the estimate more precise and closer to the truth. So why does mom's birth order screw up the analysis while the daughter's birth order helps it? What is going on here? This is just an example, and you can run this code; it's in the script in the repository. It's a provocative example, and I'm going to try to reveal to you why this happens as we move through it. This isn't a standard teaching example for me, but it's the kind of thing that happens in even relatively simple regression problems: the regression doesn't have the causal structure. It's like Plato's cave — you know this metaphor, right: the scientists are staring at the cave wall, and the actual causal model is what's being cast against the wall. The causal model is reflected in these results, but you can't recover the causal model from them. Does that make some sense? I will explain to you today, I hope, why this does what it does. Okay. Now let's repeat this and add a little bit of realism to the two moms example. Suppose there's an unobserved confound here — a common cause of mother's and daughter's fertility, such as wealth or education. This is almost certainly true in all of these studies: there are some unmeasured common factors which influence both the mother and her daughter. They may live near one another; they have similar incomes, educational levels, and so on. So any covariation between M and D is contaminated, as it were, by these unobserved common causes. Now let's repeat the analysis with this new simulation, in which all I've done is add this common cause — all the other causes are still there.
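Continuing the sketch from above, the three regressions and the confounded variant look like this — again, my own illustrative parameterization, not the repository code:

```r
# Three regressions on the confound-free simulation
confint(lm(D ~ M))["M", ]       # straddles zero: essentially the right answer
confint(lm(D ~ M + B1))["M", ]  # wider interval: mom's birth order hurts
confint(lm(D ~ M + B2))["M", ]  # tighter interval: daughter's birth order helps

# Confounded variant: U is an unobserved common cause (wealth, education, ...)
U <- rnorm(N)
M <- rnorm(N, 2 * B1 + U)            # U raises mom's fertility
D <- rnorm(N, 2 * B2 + U + 0 * M)    # ...and daughter's; still no effect of M

confint(lm(D ~ M))["M", ]       # now biased away from zero: the confound
confint(lm(D ~ M + B1))["M", ]  # even worse: B1 amplifies the bias
confint(lm(D ~ M + B2))["M", ]  # more precise, but still biased
```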
Now, of course, the estimate of mom's influence on her daughter is contaminated by that common cause, and it's strongly positive — notice the low end here is 0.4; zero is nowhere on this graph. And in fact that's what confounds do: they create the illusion that two things are causally related. But that's not the lesson here — if we simulate a confound, we get a confound; that's not a big deal. The lesson, again, is this effect of B1 and B2. B1 still hurts the analysis, and this is a general feature of variables that have the structure of B1 — I'll explain what that structure is in today's later slides — they actually hurt things. The confound's bad enough, but then adding what seems like a perfectly innocent statistical control variable, the mom's birth order, makes the bias worse. Substantially worse: it effectively more than doubles how far the estimate is from the truth — remember, the truth is zero. And B2 doesn't; B2 actually makes things more precise. It's still wrong, because the confound is still working, but it's B1 uniquely that does something awful. Okay, this is just an example to stick in your mind, and I hope it bothers you; I will unbother you in some slides to come. Okay: AIC does not fix this problem. AIC is just asking about a non-interventional prediction. Remember my definitions of causal inference — the first one is about "what if I do something": if I intervened on mom's family size, how much would daughter's family size change? That's a causal-inference question. AIC doesn't address questions like that. AIC addresses questions like: if I do not intervene in the system, and I collect more data from exactly the same process, which model will predict it best? AIC loves confounded models, because confounds contain real covariance structure in the data — but they mislead you about causes. So what you get with an AIC analysis of a causal question is that the more stuff you add, the more it likes it, up to some point, because of course AIC penalizes model complexity — but that's still a purely predictive issue. And what we see here is that the models with bad estimates, the ones containing B1, which always makes the estimate worse — AIC prefers those to the models without B1. This is adult talk. I know some of you knew this, but it's nice to say it over and over again. This is not against AIC: if you have a purely descriptive goal and you want a good model for prediction — like you're trying to predict which Netflix film someone wants to watch next — then AIC is a great metric to use. But for scientific questions, I'm not so happy about it, usually. So this is the explanation for why a model can make good predictions without knowing the correct causal structure, like Ptolemy's model of the solar system. Association is not causation. And of course, as I say here: don't ask about p-values, please don't. P-values have the same problem, and p-values aren't even designed to choose model structure. At least AIC is designed to choose model structure; p-values aren't even designed to do that — they're designed to control the Type I error rate.
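To see the AIC point concretely in the sketch, compare the models fit to the confounded data — same caveat as before, this is my toy parameterization, not the workshop script:

```r
# AIC comparison on the confounded simulation
AIC(lm(D ~ M),
    lm(D ~ M + B1),
    lm(D ~ M + B2),
    lm(D ~ M + B1 + B2))
# Within each pair, the model containing B1 gets the lower (better) AIC,
# because B1 genuinely improves pure prediction of D --
# even though it makes the causal estimate for M worse.
```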
Okay, a summary of all this, and then I'll move on to the next example. We assumed birth order influences family size — that was in the simulations, and if you go look at the code you'll see that it does. Including mom's birth order hurts inference, whether there's a confound or not. Including the daughter's birth order helps inference: it creates more precise estimates. AIC doesn't fix this. AIC loves the thing that makes the analysis worse, which is B1 — and again, we only know it's worse because we know the generative model of the data, because we wrote it. And when M and D are confounded, B1 really exaggerates the confound. This is something I'll talk about in the next part of the workshop: there's a special statistical phenomenon that exaggerates confounds. But we'll get to that; you don't have to understand it now. I just want to set the stage here. Now, pure bias — a simple example; let's address it with regression. Different things are going to arise here. To remind you: we have a sample of applications. We know the applicant's social category, which may be a target of discrimination. We know the fields they're applying to — or the context, among multiple contexts, they might be found in — and we know whether the application succeeds or fails. So this could be graduate-school applications, grant applications, and so on. And we want to use statistical tools to figure out if there's discrimination; there's a big literature on this. Just to repeat the structure: the variables we're going to have in our example, to keep it simple, are the applicant's category, which I'm going to call X — I'm very creative — and X is our potential cause, the one we're focusing on. We have the field or department the individual applies to, which I call E. And then the outcome Y: whether the application succeeds or fails. The question is how you construct a regression to estimate the influence of X on Y. There's a really big literature on this, and let me summarize it here with a whirlwind tour — and I mean whirlwind; it'll be like 20 seconds of this literature. There's a statistical literature on this going back to the 1960s, and the key finding is that when you include a variable like field or department in the model, it radically changes the inference — often reverses it, sometimes flips the sign. But what there is no agreement on in this literature is why. So let me show you some cases. Here are three papers that show you how alive this statistical question is, and how old it is. On the left we have a famous 1975 paper published in Science magazine — which was not as powerful at the time as it is now; it was a big deal, but it wasn't like it is now. This is a paper by three statisticians analyzing applications to the University of California, Berkeley's graduate schools; I think the data come from 1973. They walk you through how adding department to the analysis radically changes the results. Then we have a pair of papers published in the Proceedings of the National Academy of Sciences in 2015. The first one, at top, is the original paper; it used data from NWO in the Netherlands — these are senior grant applications — to look for gender discrimination in the award of scientific grants. They used a statistical procedure we'll talk about as we go, and they found evidence of discrimination. And below, we have someone saying: actually, if you add field, that completely reverses the result — which is true.
Now, where I want to enter this debate is to say that the data alone are insufficient to say what this change means. That means we have to do a lot of extra thinking to decide whether this statistical estimate indicates discrimination or not. I'm going to unfold this — this is a literature I have worked in a bit too, with my colleague Cody Ross in the department here, though when we work on it we think about policing instead, which is structurally the same problem. I know this is going to be like a record-scratch, freeze-frame moment — how did we get here? — but graduate-school applications and biased policing are structurally similar statistical problems. The issue is that an application is like an encounter with the police. You may have heard that American police can be rough with people of certain cultural backgrounds, and there's a scholarly literature trying to figure out what has gone wrong here, so that we can address interventions at the right scale. This is a very hot literature right now, and the debates are deeply causal, because the same thing goes on as in the applications literature: if you include a control variable — like the encounter with the police — it really changes the estimate, but it's not clear what that means in terms of discrimination. Okay, I won't talk about policing again; I'll just talk about applications. But keep in mind, if you care more about policing than applications: structurally, statistically, they're the same kind of problem. Okay, so, pure bias — back to examples. I'm going to simulate 500 applications — and again, I'm simulating because it's the only way we can know the right answer — 500 applications in two different subjects, just so it's easier to think about. Say there are two subjects in all of the university: the humanities and physics. Let's just say those are the two subjects — and, you know, that would be hell, but that's all we've got. These subjects vary in their average acceptance rates: one accepts a much smaller proportion of its applicants than the other. And it's no mystery, for example, that the social sciences and the humanities accept a smaller proportion of their applicants than the sciences do. At least it's famous in North America that social psychology programs accept less than 10% of their applicants, while physics programs will often accept half. So we start off already with a situation that's got interesting statistical structure. Now, let's consider category X, which is whatever social status you're most interested in — in this literature it's often gender or cultural background, sometimes age; there are a large number of possibilities. Let's suppose it's not a target of discrimination: the people who read these applications are not biased against people based upon it at all. And we can ensure that, because we're simulating the data. Then we simulate outcomes Y: accepted or rejected. Let's consider these two regressions, just to highlight what goes on, and I'm going to show you the reversal that happens in these papers. We do a GLM — this is a logistic regression; I imagine Roger did this with you last week, or if not, you've done it before. The important thing is the formula structure inside: we're asking about the causal effect of X on Y, so we regress Y on X, and we get this coefficient here on X that is reliably negative.
This is consistent with the idea that individuals of status X are accepted less — their applications are accepted less often than those of individuals who are not of status X. Clear? Now we add the field or department to the model, and the result goes away. This is the reversal. It even ends up with the weight of evidence on the positive side, although you wouldn't say this was actually favoritism. There are data sets, though, like UC Berkeley's, where it ends up flipping reliably to the other side, and it looks like the status is favored after you condition on department. This is the reversal, and I'm not going to explain it for you right now — we're going to draw it out as a causal diagram in the next section, so you can understand what's happening. Let me just say that the intuition to focus on, and I think this will make it click for you, is that the departments aren't equal: one of them is harder to get into. And status X is associated with department. That is enough to create the statistical illusion of discrimination in this data set. I say it's an illusion here because I programmed the simulation not to have discrimination; this is not a claim about the world. To make it clear that it's not a claim about the world, let's do the opposite now: a simulation where there is discrimination. I'm going to show you that you can get exactly the same reversal. So: 500 applications, two subjects, subjects vary by average acceptance rate again — but category X is now a target of discrimination: in both subjects, individuals of status X are less likely to be accepted, all other things equal. Again, the code is in the script file in the materials for this week. We simulate outcomes Y, we run the same two models, and we get essentially the same result, even though there's discrimination now. And I promise you, by the end of today you will understand why. This will not tell you what's going on in the real world, but it will convince you that you probably need experiments to answer this question, and observational data are not going to work very well — although I'm not completely sure about that; I've got some ideas. I've always got ideas. I think in the policing case, for example, you can tell from other data that there's discrimination, but it's hard for applications. Okay, I apologize — I know I'm moving quickly through some conceptually deep stuff, but we're going to revisit these examples over and over again. I've drawn out the narrative; we keep coming back to the same theme. So let me summarize the pure bias example before we move on. We assumed there's no discrimination against X. The model without subject E shows discrimination — at least, a coefficient consistent with discrimination — and then we add E and we find no evidence of it. Assuming there is discrimination based on category X, with a different simulation, we get exactly the same pattern: the model without subject finds lower success for X; the model with E finds no evidence of lower success — but there is discrimination in that simulation, and it's actually quite strong, as you'll see if you look at the code, which I'll show you later in the slides. In the real data, of course, we don't know the truth.
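Here is a minimal sketch of the no-discrimination version in R. The application rates, acceptance rates, and sample split are numbers I made up to reproduce the qualitative pattern; the script in this week's materials is the real version:

```r
# Pure-bias sketch: no discrimination, yet a 'discrimination' coefficient
set.seed(2)
N <- 500
X <- rbinom(N, 1, 0.5)                      # social category
E <- rbinom(N, 1, ifelse(X == 1, 0.8, 0.2)) # X applies mostly to the hard field
p <- ifelse(E == 1, 0.1, 0.5)               # field base rates; X plays no role
Y <- rbinom(N, 1, p)                        # accept/reject

coef(glm(Y ~ X,     family = binomial))["X"]  # reliably negative: looks like bias
coef(glm(Y ~ X + E, family = binomial))["X"]  # about zero: the reversal
```

The companion simulation — the one with discrimination built in that still produces the same pattern — needs a bit more structure than I want to guess at here; it's in the workshop script.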
We're going to have to appeal to something beyond just the data and the statistical models to make sense of it. That's my point. Okay — at this point I need to rest my voice, and the rest of you probably need a rest from my voice. So I suggest we take 10 minutes; we'll return at 20 after two. During those 10 minutes you can do whatever you like, but of course I would appreciate it if you would think about these examples a bit. If you have an office mate, or an enemy, you could discuss this with them. And then I will drink some water and come back for the grand finale as we finish out. This won't be the last break, I promise. I'm not a horrible person; I'm merely terrible. Okay. All right, see you in 10 minutes. So, where have we been so far? We have finished part one. In part one, my goal was to show you that something is rotten in Denmark, as they say: standard statistical tools are wonderful things for describing associations among variables, but there's nothing inherent in them for discovering causes. As a consequence, in even relatively simple causal structures containing a handful of variables — like the two moms case and the pure bias case, cases I invented but which reflect serious research questions — simple statistical models can behave in rather mysterious ways, given the underlying causes. The goal is to understand how: to have a real formal framework for taking causal assumptions and understanding how they induce behavior in statistical estimates. That framework exists, and it's been known for a long time, but it's not usually taught as part of a statistics course. I'm going to give you an introduction to it, and there's only so much I can do in the time we still have together — I think we've got an hour and a half left, and there's going to be a break, so not actually a whole hour and a half; I'm a humanitarian, so we won't do a full hour and a half straight. But I think I can give you a meaningful introduction to it and what it looks like. When I teach my course in the winter — the schedule hasn't been announced yet, but it'll go out to the Institute and to iDiv — I teach a lot more about this; it's a 10-week course. Anyway, this second part I want to call causal design, and the whole idea is a bit brutalist, which is why I have the brutalist architecture behind it. By which I mean the causal models we're going to look at today are very bare-bones; they're as simple and heuristic as possible, to get across the basic principles of, well, causal architecture, if you will. It turns out you can do a lot with concrete. And I'm actually a fan of brutalist architecture — I might be the only person, but I quite like it; there are some great East German buildings, right? So, let me start with an example that I know is known to some of you, certainly people in my department, and I show it to give you a sense of the consequences of these things. There's this paper on the causes of cross-cultural differences in religious traditions that came out in — was it 2019? yeah — it's at the top of this slide: complex societies precede moralizing gods throughout world history. The details of the story aren't what's relevant here, really.
The point is, this was a big paper, and it got a lot of attention — it was in one of those magazines with a high impact factor — and then it got retracted after publication. Papers get retracted all the time; it's not a big deal, and there's nothing shameful about having your paper retracted — we all make mistakes, and I think we have to normalize the idea that retraction is okay. The reason it got retracted reflects the causal-salad nature of practice that is in no sense unique to this particular paper, but which exposes us to unnecessary risks. And I don't mean professional risks — again, I want to normalize retraction; it should be okay to get papers retracted — but risks of tricking ourselves. Those are the risks I care about. So here's the paper that led to the retraction: treatment of missing data determined conclusions regarding moralizing gods. I'm not going to go deep into the structure of this particular problem — it's a problem I detail to some extent in my textbook, so I say more about it when I teach my class. I want to point out that the problem isn't, as I think many people have unfortunately taken it, that there's a lot of missing data. What you're looking at is a scatterplot of the primary data used in the study. The horizontal axis is time in years, where zero is the beginning of the Common Era, and the vertical axis is the logarithm of population size; the points are societies at different historical periods. What's being coded here is a particular aspect of religious traditions: moralizing gods, present or absent. The X's on this plot are cases where no code is available, because there's no evidence about the status. This is 60% of the primary outcome in the paper, which is a lot of missing data. But the point I want to make is that the problem isn't the amount of missing data — sometimes it's okay that there's a lot of missing data, and you can just ignore that. In this particular case that's not true, and that's what led to the retraction. It's not the amount of missing data that's the problem; it's what causes it. Missing data is part of how we do our data treatment, and the appropriate way to figure out what to do with missing data is also a causal problem. This is my one slide on it, just to stimulate your mind. There are a bunch of ways that people process missing data, and it's often just a matter of convenience — sometimes your software just does it blindly without you noticing. R does this, unfortunately: if you use the built-in regression functions in R, it will completely silently drop all of your cases that have missing data. This is something you should be doing deliberately, I think, and when I teach my course I try to give you principles for how to make those decisions. There are a bunch of methods here, and each of these methods is appropriate in a different circumstance. But to determine whether a method is appropriate or not, you need something beyond the statistical model and something beyond the data: you need a theory about what caused the missing values to happen. Okay. We'll return to an analogy of that near the very end.
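A quick aside to make that silent-dropping behavior concrete — a couple of lines of base R, with made-up toy data:

```r
# R's default na.action silently drops incomplete rows
d <- data.frame(x = c(1, 2, 3, 4, NA),
                y = c(2.1, 3.9, NA, 8.2, 9.9))
fit <- lm(y ~ x, data = d)
nobs(fit)  # 3, not 5: two rows vanished with no warning
# To force an explicit decision, make missingness an error instead:
# lm(y ~ x, data = d, na.action = na.fail)
```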
So let's use that to motivate getting into what this would look like: how would you draw your assumptions, and then use them to design a statistical analysis? This is the procedure I call causal design, a related set of methods for constructing and then analyzing causal models — and by analyze I mean logically deduce the implications of the model you've written down; those implications can then be used to design statistical models. So what does causal inference require? It requires some model outside the statistical model; we could call this a causal model or a generative model. The details of how that model is built — whether it's just a drawing on a piece of paper, which is fine sometimes, or a set of ordinary differential equations, or a giant agent-based simulation — are not the important thing. There are differences, but that's not the key issue. The issue is that you have to have it. You have to have it to decide what the statistical procedure should look like. So, step one in my two-step guide to doing causal inference: you make a causal model. And step two: you analyze the model to design both data collection and statistical procedures. You can also test the model with its implications as well. Okay, so let's get into actually making models, and we'll revisit our two moms and pure bias examples to do it. The fundamental component of a causal model is a function: some operation that determines how some variables are influenced by others. You can think of the influences as inputs into this thing we call a function — a little machine that processes inputs into outputs — and the output is the variable, the measurement. Variables here mean measurements, things that you can observe. These are the fundamental units; these are the things you have to decide on to construct a causal model: you think about the measurements, and then you link them up through these functions. In the simplest case, which is what we're going to work with today, the functions are represented by arrows in diagrams. This is why I wanted you to have some paper, because we're going to draw a bunch of these — or rather, I've drawn them on my slides, and when you see one, if you would indulge me, I'd like you to draw it on your own paper. Copying something down helps you think it through, and I'm going to ask you to do some operations with these as we go; in my experience trying to teach this, it works better if I get you drawing. I had to learn this stuff from papers, and I had to copy all the diagrams out of the papers and manipulate them to try to understand it, so on the idea that my psychology is similar to yours — because we're the same species — I'm going to suggest you do the same thing. Daniel's skeptical, but his skepticism is justified. Okay, the diagram in the middle of this page is an example. There are three variables, creatively named X, Z, and Y — that's "zed," or "zee" as Americans call it, if you haven't heard it before — and the arrows indicate directions of causal influence. In this case, the variable Z influences both X and Y, and X and Y do not influence one another. I'm going to talk about the consequences of a diagram like this in the slides to follow.
So using just arrows and variable names, you can build up quite big causal diagrams. We're not going to do really big ones today, but if you look at epidemiology journals, you'll sometimes see causal diagrams that cover whole pages. Think about a complicated disease like HIV: there are a bunch of risk factors, and they're trying to model it in its natural social ecology, so you get really big causal diagrams. So, bad news first — the good news comes later. The bad news is that there's really no method for making causal models other than science. And the footnote to that is that there's really no method to science other than honest anarchy. This is my opinion, and a particular philosophy of science maybe, but anarchy is not bad: we're transparent about our propositions and we debate them, hopefully with dampened authority, and we try to arrive at common understandings. So there's no algorithm that's going to make the model for you. You need background knowledge to do that; you have to put causes in, make assumptions. But the assumptions have implications, and we can test some of those implications — that's where the data comes in later. The first step is building the model. Okay. These simple letter-and-arrow causal models are usually called directed acyclic graphs, or DAGs, and we'll just call them DAGs. DAGs do not make any claim about how a variable influences another one. The only thing they say is that particular variables influence other particular variables; they're so-called non-parametric causal models. So when you analyze them, you're analyzing them in general, for any functional interaction of the arrows entering a particular variable. I'll draw this out in a bit — it's a bit strange to get used to. One of the consequences is that some of the decisions that are incredibly important in writing regression models, like interactions, don't appear: DAGs don't show interactions, because an interaction is a particular kind of function. A DAG says: let's postpone the decision about exactly how these variables interact to produce an outcome, and just say that they do. What can we decide based upon only that information? That's what we're going to do today. Obviously, how the variables combine matters for lots of scientific questions, but DAGs don't show it. Does that make sense? It's a decision you have to make later. Let's go over this point so it's clear. Okay, examples. World's simplest example: it rains, the ground gets wet. You know which way the arrow goes, right? It doesn't go the other way. Puddles don't make it rain; rain makes puddles. Exactly. And the data can't decide which way the arrow goes. So this is a DAG — let's analyze it. I'm sort of joking, but imagine an intervention: you could intervene to make it rain, and it would make the ground wet. But if you throw a bucket of water on the ground, it doesn't make it rain. That's the causal structure. You can keep adding variables to these things if you want to measure more stuff, and there can be branches: multiple arrows can come off a single variable and go to different other variables. So when it rains, it also turns out that you see people carrying umbrellas — and there's a causality to this, in a certain direction. Also the price of umbrellas changes, quite sharply, when it rains. And the ground being wet has consequences, like wet shoes. And so on. Does that make sense?
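If you like to check these things in code, the dagitty package in R will represent a DAG like this and list its paths for you — a minimal sketch, with my own made-up variable names:

library(dagitty)
# the rain example: rain causes wet ground and umbrella-carrying;
# wet ground causes wet shoes
g <- dagitty("dag {
  Rain -> WetGround
  Rain -> Umbrellas
  WetGround -> WetShoes
}")
paths(g, "Rain", "WetShoes")   # one directed path: Rain -> WetGround -> WetShoes
plot(graphLayout(g))           # quick automatic drawing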
So — those of you who've left your cameras on and are nodding, thank you; it makes me feel less insane. I actually like lecturing to big audiences, because you get all that feedback. Okay, a slightly more complicated example, one that foreshadows some of the lessons to come. Think about a case where there are multiple causes of a single outcome of interest — something like a lamp. I have a lamp on my desk that I use to read, and there are a bunch of things that have to be true for it to work, and they need to be true simultaneously: the power has to be turned on, and it needs a working bulb, for example. Both of these things combine — interact — to make the lamp turn on. But this DAG doesn't show exactly what that interaction is like. Here it's a very strong interaction: both conditions have to be satisfied for the lamp to turn on, and the DAG doesn't show that structure. You can go beyond the DAG, specify the functions, and analyze that as well — a bit later I'm going to show you an example of that for the two moms case, what we can do when we actually assume particular functions — but let's postpone that for now and just think about conceptual causes and directions. Okay, an example with the same structure that's more timely: someone's infected with some disease — I don't know which — and they need to be exposed to it, and their vaccination status matters. Good. So, for your own research areas — I'm not going to ask you to do this now — I bet you could sketch some DAGs. My students do this when they start their PhD projects, and they fill pages very fast, because they know a lot about the domains they work in; then I have to say, well, hang on, which part of this do we care about? And that's part of what we'll do today: obviously the whole world is one causal system, but you don't have to analyze it all at once. Okay, let's return to our two examples and do this exercise: let's build some DAGs for the two moms case and the pure bias case. I want to start by saying that multiple DAGs are possible for both of these scenarios, so I'm not going to claim that I've absolutely got the DAG right; but I'm going to construct something I think is reasonable, and I'm going to use it to motivate an analysis, just to show you how you go from a DAG to its consequences. And the simulations that I've run reflect the DAG — that is, the simulations obey the DAG. When you have a generative model like a DAG, you can produce data from it, and then you can run it in reverse, so to speak, and analyze the data too. We're going to do the forward version right now. So, to remind you of the structure of these problems: in the two moms case, the variables we have measured are the family sizes of the mother and daughter, called M and D, and their birth orders, B1 and B2, which for cognitive simplicity you can think of as first-born status or not. The question is how you would connect these variables. And for the pure bias case, we have three variables — the field, the social category of the individual, and the outcome Y — and again, how would you connect them? Let's take these one at a time. There is an algorithm for doing this, if you know which measurements are important to the system, even if you haven't measured all of them.
Right — you might know that there's a variable that's important to a system, and even if you can't get data on it, you want to put it in the DAG. So you nominate those variables, and then you go through them one at a time and ask: which arrows enter this variable? Then you connect all those sub-graphs, and you have a glorious DAG. One that your peer reviewers can trash. No — one that you can have transparent and honest scientific debates about. That's the whole goal here. And there are literatures, like epidemiology, where this is completely normative and DAGs are incredibly common. So for the two moms case, I'm going to put the four variables at the top, and we'll take them one at a time. Let's start with mom. What influences mom's family size? Well, I asserted in the setup that her birth order does, so we have an arrow from mom's birth order to her family size. Now we take the daughter. The daughter's family size is influenced, possibly, by her mom, so we draw an arrow from mom's family size to daughter's family size; we want to measure the strength of that arrow — that's the research question. And then, since it's symmetric — every mom was once a daughter; is that true? Yeah, it's got to be true, you can tell I'm a biologist — there's also an arrow from B2 to D. Now, with B1 there's this question — wait, hang on, where did my pointer go? Yes, B1: does anything influence B1? Nothing we have measured here does. Probably there are things that influence mom's birth order, but we don't have any of them measured. To give you a hint about what I mean, think about B2: is there anything in this graph which might influence B2? I think the answer is yes. It's not important to the structure of our problem, but it helps to think about how you work through these things. The daughter's birth order has to be influenced by the mom's family size: if mom only has one child, then we know the daughter's birth order with certainty; for very big families, the range is different. So there is a causal influence of mom's family size on the daughter's birth order. This is not going to be the key structural feature of this problem, but it's important to think it through, because you know things about these variables other than just their values. Because you can measure them, and because you understand human biology, you can say things about the data that are not in the data themselves — and that's how you build causal models: with your scientific knowledge. One of the things I love about this approach, putting the causal inference before the statistics — not that we're not going to do statistics; oh yes, we will — is that we do this ahead of time, and it puts the scientist in the driver's seat. You're not the victim of some fickle statistician telling you to do some Wilcoxon test; you regain control. Okay, now, a very healthy thing to do, once you've drawn all the lines among the variables you can think of, is to think about unobserved variables that could create confounds between pairs of variables in your diagram. In this particular case, there's almost certainly a confound between the mom's family size and the daughter's family size. We don't know how big it is, and we don't know what it is — it could be educational background, cultural background, common exposures that affect their health, any number of environmental variables — something other than the mom's direct influence on her daughter. The other pairs of variables could have confounds too, and you might have ideas if you want to think it through, but this particular confound is the one I want to focus on today, because we can solve this confound problem with this causal diagram. We don't even need to measure the confound, which I've called U. Let's talk about U. Usually, when there's a letter U in these diagrams, it means an unobserved variable — U for unobserved — and often you'll draw a dashed circle around it, which indicates that it's unobserved: we haven't seen it, we don't have it measured, but we believe it influences the other variables in the diagram. This is like what psychologists call a latent variable; it's very common in structural equation modeling in psychology. And reality has these things. You'll also often see a perfectly equivalent notation, which is just one of these dashed double-headed arrows: it means there's some variable U that we haven't measured that's a common cause of the two variables. It's just equivalent notation.
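Just so you can see it in code as well as on paper, here's my sketch of the two moms DAG in dagitty, with U marked as latent — the encoding is mine, so check it against your drawing:

library(dagitty)
two_moms <- dagitty("dag {
  U [latent]
  B1 -> M
  M -> D
  M -> B2
  B2 -> D
  U -> M
  U -> D
}")
adjustmentSets(two_moms, exposure = "M", outcome = "D")
# returns nothing: with U unmeasured, no set of observed variables
# blocks the backdoor path M <- U -> D
instrumentalVariables(two_moms, exposure = "M", outcome = "D")
# B1 shows up as an instrument -- the kind of trick that lets you
# handle U without ever measuring it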
Okay, now the pure bias case. We're going to set the two moms DAG aside for the moment — I hope you've got it on a piece of paper and you love it — and draw one for pure bias; then we'll come back and do some work with both. So, only three variables now, same routine. Let's think about X. Is there anything that influences X? Not in the usual way we think about social categories: the idea is that these are not things we could intervene on. Obviously X has some cause, but none of the variables here are causing your gender or your cultural identity. Instead, X is a cause of the other things. So the department or subject, for example, is plausibly influenced by X: your cultural background or your gender influences your choice of subject. Or, if it's not volitional, there are structural social forces which guide people of certain social categories into these choices. There's a range of sociological hypotheses compatible with this arrow. And then Y is influenced, hypothetically, by X. This would be consistent with the discrimination arrow: do applicants of some social category have better or worse chances of being accepted because of that identity? And the subject influences success rates, because there are more open slots in some subjects than others. Again, a North American example — sorry, I was at the University of California for a while, so that's where my university experience comes from. Their social psychology and communications programs were super popular; they couldn't accept even 10% of the applications they got — they accepted a lot of students, but they got thousands of applications every year. Whereas physics was accepting half, and they wanted more applications. But no one was applying — because physics is miserable. No, sorry — there are no physicists here to get angry at me, I hope. Oh, I'm recording this. Oh my God, what have I done. Anyway. Okay, so now we think about this diagram again. Let's think about potential unobserved variables on this graph, and again the convenient way to do this is just to go through the pairs of variables and ask whether there might be some confound on that edge.
And in this case there's a very plausible one, I think, and it's a big deal in the literature. It's this edge here — can you still see my cursor? Good, that will be very useful now. The edge that connects department or subject to the outcome Y almost certainly has an unobserved confound, which is something like the quality of the applicant: their capability, their suitability for the program. We rarely observe this thing directly — figuring out a way to estimate Q is the holy grail of psychometric testing, and no one is sure how to do it, but everybody thinks they know it when they see it. We usually don't have it in these data sets. We don't have randomized data sets; we don't have deep records from the person's birth up to when they apply to UC Berkeley, or submit the grant application, or whatnot. But we know quality influences the chance that the application is accepted, and it may very well influence the subject they choose as well. One idea here — which I get from an economist who studies gender discrimination in labor markets, Erin Hengel, in Liverpool — is that this is a big effect in economics. She has a great paper called "Publishing while Female," about gender discrimination in the field of economics, and she argues that the field is so discriminatory, at least in Anglophone economics, that the women in it are on average better than the men — because those are the ones who can stick it out. So you get this quality difference that is itself influenced by the field. The same has been found in other discriminatory environments; it's been argued for chess: female chess players are far above average relative to a random sample, and the argument is that in a discriminatory environment, if you're not excellent, it's just not worth the bother. Again, I get this from Hengel's papers; I think it's a really interesting idea. It's difficult to know if it's true, but there's lots of evidence consistent with it. So entertain that argument for my example: it suggests there are features of the applicant which simultaneously influence which field they choose and the probability that they're accepted.
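And here's the same exercise in code for the pure bias DAG — again my own encoding, calling the field D and marking Q as latent:

library(dagitty)
pure_bias <- dagitty("dag {
  Q [latent]
  X -> D
  X -> Y
  D -> Y
  Q -> D
  Q -> Y
}")
adjustmentSets(pure_bias, exposure = "X", outcome = "Y")
# returns the empty set {}: no arrows enter X, so there are no backdoor
# paths, and the total effect of X needs no controls at all -- in particular,
# conditioning on D would open the collider path X -> D <- Q -> Y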
Okay — analyzing DAGs. Now we've got some DAGs, and we want to derive implications from them. Those implications can be used for a number of different things, from testing the assumptions of the DAG to deriving statistical procedures that will credibly give us estimates of the thing we want. I'm going to focus on the latter — deriving statistical procedures — rather than testing assumptions, but again, in my course I say more about this, and I'll have some book recommendations at the end as well. No matter how complicated a DAG gets, the rules for analyzing it are the same, and the rules are very simple; now I'm going to teach them to you. The lucky thing is that no matter how complicated a DAG gets, no matter how many variables it has, you can decompose it into triadic relationships among variables, and there are only three possible triadic relationships. See, the universe is not always hostile: sometimes complicated things have a really satisfying, simple internal structure. I think this is a wonderful fact. I call these the elemental paths, and I'm going to teach you each of them and its properties. When you understand each of them and its properties, you can analyze big graphs, because you can break a graph apart into these triadic relationships and do cool stuff with it — and that's what I'm going to show you today. The first of these is called a fork. It's called other things too, but I'm going to call it a fork and stick with that. It's called a fork because you've got three variables and the middle one is a cause of the other two, so it branches out like a fork — if you want it to look more like a fork, draw the X and the Y up a bit. The word is a metaphor that's supposed to help you remember it: Z influences X, Z influences Y, and X and Y don't influence one another directly. But, as you can probably guess, they're going to share information, because they share a common cause. Next, the pipe. In the pipe, causes flow in one direction among the variables: some variable X — the names are meaningful but arbitrary — influences an intermediate variable Z, which in turn influences Y. It's a pipe because causes flow through it. You can also call this a chain; sometimes it's called a chain in the literature. And finally the collider, which is like the opposite of the fork: in the collider, Z is influenced by both of the other two variables, neither of which directly influences the other. Now, the fork and the pipe actually behave quite similarly in aggregate data, although they're causally quite distinct, and our job is going to be to distinguish them with the science — that's going to be a focus of what we talk about. The collider behaves quite differently. The problem is that if we only have X, Y, and Z, it's not clear which situation we're in, and analyzing these structures is what tells us which statistical operations are required to figure out which world we're in. The best thing, of course, is to have scientific information that tells you how to draw the arrows — that's the best idea. Okay, let's take each of these in turn and do some work. Let's start with the pipe. Remember, the pipe is X to Z to Y: X influences Z, Z influences Y. Here are the properties to learn — I'll put them in words now, show you a picture on the next slide, and then say why again. When we ignore Z, X and Y are associated with one another: in a linear system that would mean they're correlated; in a nonlinear system they have mutual information, which means if you learn one, you learn something about the other. They're statistically associated. But when you stratify the sample by Z — you take each value of Z, and then you look at all the X and Y values associated with that particular value of Z — X and Y are no longer associated within each value of Z. The reason is that all of the information Y has about X is transmitted through Z. After you've learned Z, there's nothing else about X which helps you know Y; once you've learned Z, learning X will not teach you anything extra about Y, which means they're not associated.
They don't have any mutual information after you've accounted for Z. That's what stratifying by Z does in a statistical model. It's like: I've become aware of Z — now, is there anything extra about Y that I learn by getting X too? And the answer is no, in a pipe. Good. Exciting. I can see you trembling with excitement out there in the audience. Okay, let me show you the picture. Same thing, simulated; the code to do the simulation is in the scripts, at the bottom of the script called the d-separation examples. We simulate the pipe, and I've made Z binary so it's easy to see what's going on. Larger X values are associated with Z being one, and smaller X values with Z being zero, which I've colored to show you: the red points are Z = 1 and the black points are Z = 0. You'll notice that red is associated with larger X values — that's the influence of X on Z. Then Z influences Y, and larger Z values — that is, Z = 1 — create larger Y values. The dashed trend here is a linear regression on all of the data, and you'll see that there's a positive relationship: X and Y are associated, ignoring Z. But when we stratify by Z — that's what these other two lines are, a linear regression through only the X and Y values where Z = 1, and another through only those where Z = 0 — there's no association in either stratum. And this happens, again, because all the information Y has about X is transmitted by Z. Once you've learned Z, there's no additional value in X, and that's what the lack of association after stratifying by Z means. Good — a burning sensation in your brain? That means the medicine is working. Okay. Now the fork. The fork is similar to the pipe in the data, but causally it's really distinct, and often your science will tell you that you need a fork instead of a pipe. The information on this slide is going to look the same, because in data these two things are indistinguishable. Ignoring Z, X and Y are associated — but now for a different reason: they have a common cause, which is Z. X and Y both carry information about Z. They may carry different information about Z, but some of the information they carry will overlap, and that makes them associated. When we stratify by Z — meaning we learn Z — X and Y are not associated: for each particular Z value, looking at the X,Y pairs, there will be no correlation in a linear system, and no mutual information in a nonlinear system. That's because after you learn Z, learning X doesn't tell you anything extra about Y. So let's look at the picture. I promise you this is a different simulation — the code is in the script and you can prove it to yourself — this one is actually simulated from a fork. Again, larger X values are associated with Z = 1; I've made Z binary just for the sake of making it easy to see, but Z could be continuous and all of this would still be true. If we do a linear regression on all the data, there's a positive relationship: X and Y are associated, ignoring Z. Once we learn Z and look within each Z value, there's no association between X and Y — or at least, in a large enough sample, there's no reliable statistical association; in a finite sample you can get sampling variation.
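The real versions are in the repository script; here's a minimal sketch of both simulations, with my own made-up effect sizes, so you can feel the difference for yourself:

# pipe: X -> Z -> Y
set.seed(1)
N <- 1000
x <- rnorm(N)
z <- rbinom(N, 1, plogis(2 * x))       # larger x makes z = 1 more likely
y <- rnorm(N, 2 * z)                   # y depends on x only through z
cor(x, y)                              # clearly positive, ignoring z
cor(x[z == 1], y[z == 1])              # about zero within each stratum
cor(x[z == 0], y[z == 0])

# fork: X <- Z -> Y, indistinguishable in the data
z <- rbinom(N, 1, 0.5)
x <- rnorm(N, 2 * z)
y <- rnorm(N, 2 * z)
cor(x, y)                              # positive again, ignoring z
cor(x[z == 1], y[z == 1])              # and about zero within strata

Same statistical behavior, two different causal worlds.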
Let me re-emphasize the point I already made: the fork and the pipe look the same in the data alone. You need something else to tell them apart — some other set of measurements, or some other kind of scientific knowledge that helps you tell which is which. The collider is quite different. The collider, remember, is the inversion of the fork. If we ignore Z, X and Y are not associated, because X and Y are independent causes of Z: they're independent of one another in the world. Z depends upon both of them, and Z carries information about both X and Y — but X doesn't carry any information about Y, and Y doesn't carry any information about X. Now, when we stratify by Z — when we learn what Z is — X and Y become associated. But it's not because they're causally linked. And this, yes, my friends, is the deep dark truth about causal inference: there are very powerful mechanisms, even in triads of variables, for making strong associations that are not causal. You're used to the idea of a fork doing this, and you can get tricked, but this is a much trickier one, and I'm going to spend a few slides trying to explain it to you. Some of my colleagues have expressed a deep sense of betrayal when I teach them about colliders — why wasn't I taught this 20 years ago? And I don't have an answer for that. But you were not betrayed; your instructor was told to teach you something else. So this is what the collider looks like in simulation, and the code to do the simulation is at the bottom of the script in the repository. What you want to think about here is that X is influencing whether Z is zero or one — again, I made Z binary so we can split the data and see what's going on — and Y is also influencing it. Larger values of both X and Y are more likely to produce red dots; smaller values of both are more likely to produce black dots. But there's a compensatory effect here: if you've got a smaller Y, a bigger X can compensate and you can still get a red dot; and vice versa, with a small X but a sufficiently large Y, you can compensate and get a red dot. X and Y have a compensatory causal relationship in producing Z in this model. And this matters in causal systems, because it creates all kinds of confusing behavior; we'll see it in some of our examples.
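Again, the repository script has the real version; a minimal collider sketch, with toy numbers of my own, looks like this:

# collider: X -> Z <- Y
set.seed(2)
N <- 1000
x <- rnorm(N)
y <- rnorm(N)                          # independent causes of z
z <- rbinom(N, 1, plogis(2 * (x + y))) # compensatory: a big x makes up for a small y
cor(x, y)                              # about zero, ignoring z
cor(x[z == 1], y[z == 1])              # clearly negative within a stratum
cor(x[z == 0], y[z == 0])              # negative here too

No association overall, strong association once you stratify — the mirror image of the fork and the pipe.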
So let's take a moment — just two more slides — to understand this before we get back into the two moms and pure bias examples. Why does learning, or conditioning on, or stratifying by the outcome Z induce an association between the causes X and Y? That's the puzzle; that's exactly the relationship I showed on the previous slide: within the stratum where Z equals one, X and Y are negatively related, and strongly so, even though they're not causally related to one another. Why does this happen? There's a very good reason, and you can develop an intuition for it. Let me describe it statistically, and then on the next slide I've got an example — one I actually think is true, in a cartoonish way. An association between variables like X and Y indicates mutual information. It doesn't indicate a cause; it means they share information, and they can get that shared information through different routes. What mutual information means is: if I learn a variable — X, for example — then I also learn something about Y, even though I haven't observed Y. If X and Y share mutual information, learning one tells us something about the other. And what I'm saying is that in a collider, for any given value of Z, X and Y have mutual information — because they're compensatory in causing Z. This is the so-called finding-out effect. It will click for you, hopefully, on the next slide — bear with me a second; I know this is a deeply weird sort of thing, but your life is full of colliders. I promise you it is. As soon as you understand this, you're going to see it everywhere; the natural world is full of collider examples. Okay, here's the summary: for any given value of Z, learning X tells us what Y might have been, because it constrains the range of Y values that could possibly have produced that Z. Here's the example. Let's think about restaurants — something we all have some experience with, well, maybe not recently; I don't know if people are going to restaurants again. The dollar sign here — excuse me, I'm Californian — represents profit, making money. And I assert that there are many things that influence profit, but two really important ones are the location of a restaurant — is it at the city center where many tourists see it, or out in some neighborhood no one wants to go to? — and the food: is the food any good? Both of these things jointly influence how much money a restaurant makes and whether it can stay in business. If a restaurant doesn't make enough money, of course, it goes out of business and it no longer exists in your city. This happens constantly in Leipzig: restaurants pop up like mushrooms and then go out of business again. There's a storefront on my street, in my neighborhood, that has a different restaurant every six months; this is very ordinary in the restaurant business, I'm told. So here's what happens, because this is a collider, and it structures your experience of restaurants in powerful ways: the world is populated with the restaurants that can stay in business. A restaurant in a good location can make money even if it has bad food. For example, Vapiano. Sorry to those of you who like it, but I don't. It's in a very good location — prime real estate — and the food is an offense to the country of Italy. Sorry: it's perfectly fine food, but it's not good food. The train station location is amazing, and so they can make a lot of money. There are lots of restaurants like this, where location is everything. Are you at the train station? That's a great way to make money as a restaurant — be at the train station — and you don't need good food. The opposite version of this is that if you have really great food, you can survive even in a terrible location, way out of the way, far from a tram stop: people will walk or bike to get there for that really good food. You're probably thinking of a restaurant like this right now. Now of course there are restaurants that have both — that's absolutely true — but the ones that had neither are dead. They're gone. The only ones that survive have at least one of these things to a sufficient extent that they can stay in business.
And so, in the population of surviving restaurants, there ends up being a negative correlation between how good the location is and how good the food is. But it's not because the location causes bad food, and it's not because having a good chef causes a bad location — that is not what's going on. It's just a consequence of the collider. And this is called selection bias. Only those restaurants that are sufficiently profitable can survive — they're selected into the population — and this induces a statistical association among the common causes that influence selection. It happens in almost any selected population, and this goes for sampled data as well: you can get spurious, non-causal associations among variables through this mechanism. Good — I see concentrating faces. This is good. Okay, I love colliders. Ask my department — no, I show them colliders all the time; they're sick of it.
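You can watch selection do this in a few lines — a minimal sketch, with my own toy numbers:

# survival is a collider of location and food quality
set.seed(3)
N <- 10000
location <- rnorm(N)
food     <- rnorm(N)                        # independent in the full population
profit   <- location + food + rnorm(N)
survives <- profit > quantile(profit, 0.7)  # only the most profitable stay open
cor(location, food)                         # about zero overall
cor(location[survives], food[survives])     # negative among the survivors

Nobody made the good cooks pick bad locations; the selection did it.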
So now you've got the three elemental triadic relationships in mind, and in a sense, from what I just taught you, you know everything you need to know. But I'm going to draw it out for you in a useful context: we're going to see how to use that information to perform something called the backdoor criterion. This is a way of drawing out the implications of the DAG for deciding which control variables we need to add to a regression model. Here's the flow-chart version of it, in two steps. First, you derive the implications of the DAG, and from those you can get statistical procedures that — assuming the DAG is true — will give you the desired estimate, the causal estimate you want. And second, which we're not going to do today, you can use the same criteria learned from these triadic relationships to design tests of the causal models. For example, if you believe some variables are related as a collider, you can test for what happens in the data: you can check whether they become associated within each level of the collider variable in the middle. You can see what becomes associated or not. Okay, so let's start with the backdoor criterion. It's a strange term. Basically it's for answering: which variables do we need to stratify by, or condition on, to block the non-causal paths in the causal network? We want to design a statistical procedure that takes into account the causal relationships in our causal model, and the backdoor criterion is a way to do that. If you Google it, you're going to get a mathematical definition, which is confusing; this is my attempt to do better, although it's vaguer than the mathematical definition. The backdoor criterion tells you that a valid causal estimate is available for you to derive statistically if it's possible to condition on — which means to stratify by — variables such that all backdoor paths are closed. What's a backdoor path? A backdoor path is a non-causal path that enters the cause rather than exits it. We'll have examples of this in a second, so hang on. The diagram at the bottom is where we're going to start: X is the cause of interest. The back door is an arrow entering X — that's its back door — and its front door is an arrow going out; causes are transmitted out the front door. And the way you can think about this is: if you intervene on X experimentally, it destroys all the arrows entering it. So backdoor paths are the things that experiments automatically block. Again: backdoor paths are the non-causal paths in a natural system that an experiment automatically blocks, because it removes them — all the things that would influence your treatment assignment, you're stopping them through randomization. I'll have a diagram that shows this coming up, but I think this is a powerful thing to understand about why experiments work — and why sometimes they don't, because sometimes your treatment assignment doesn't work, and then you haven't blocked all the backdoor paths. Let me show you some examples. Very simple example here. There are two paths — and now I'm going to try to draw; oh yeah, this thing. There's this path — can you see my blue? — which is a causal path, a front-door path, from X to Y. We're interested in the influence of X on Y: the arrow exits X and goes into Y. And this other path is a backdoor path, because this arrow right here enters X. It's also a non-causal path, you can see, because if we changed X, the change would only propagate through the front door. Causes do not flow against arrows; causes flow with arrows. But mutual information can flow against arrows. So this is a confound: Z here is a common cause — it's a fork with X and Y — and it contaminates the association between X and Y. It's a non-causal backdoor path. An experiment would obliterate this arrow, because we would be assigning the X values and Z would have no influence on X; then we could get a causal estimate of X on Y. In a non-experimental system you can also get a causal estimate, but you need to be able to condition on Z: conditioning on Z will block this path. We're going to have many examples of this, and I'm going to walk through them slowly — you're going to get bored of the examples; the rest of our time together today will be examples of this. Because if you have a fork, like this path across the top, you know you can block it: you can make X and Y independent by conditioning on the variable in the middle. Remember, that's how forks work — X and Y are associated in a fork until you stratify by Z, and when you stratify by the middle variable in a fork, you block any information from propagating along it, statistically. So in this particular example, if this is the causal system, you want the model Y as a function of X plus Z — in regression formula terms, y ~ x + z. Z is there just to block the contaminating path. Does this make sense? I'm going to show you examples where you don't want to add variables too — that's coming, so don't get too excited — but I thought I'd start with something familiar. I hope this is familiar.
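In regression terms, the fix looks like this — a minimal sketch with a made-up true effect of exactly zero, so the bias is easy to see:

# fork: Z confounds X and Y; conditioning on Z removes the bias
set.seed(4)
N <- 1000
z <- rnorm(N)
x <- rnorm(N, z)
y <- rnorm(N, z)             # y does NOT depend on x: the true effect is zero
coef(lm(y ~ x))["x"]         # clearly positive: the backdoor path is open
coef(lm(y ~ x + z))["x"]     # about zero: conditioning on z closes the path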
Okay, next example — wait, I've got to close this annotation window; clear all, close. Yeah. So, here's how the backdoor criterion works, and I just sort of did it for you in a cartoonish way on that previous slide. You identify all the paths connecting X and Y. In a simple DAG like the one we just did, there are only two, but I'm going to give you examples in a moment where there are more than two, and that's okay — you can find them. You will train your eyes; you will analyze these things with your eyeballs. Now, once you've listed all the paths that connect X and Y: the ones that have arrows entering X are backdoor paths, and they can contaminate causal inference. They don't necessarily — so hang on for a second, and we'll have some examples — but those are the backdoor paths. The important thing, as I said before, is that causes do not flow against arrows, but association does. So we need to find all of the backdoor paths that are open, transmitting a false signal of causation, and close them. And you know how to do this, because you understand the three elemental triads: you can close a pipe by conditioning on the middle variable Z, and you can close a fork by conditioning on the middle variable Z. What about colliders? Well, they're closed by default, and what you must not do with a collider is open it. Colliders create confounds when you condition on the middle variable, the collider variable — I'll show you how this works. This is an important point, because it's not harmless to add control variables: there are bad controls in the world, haunting your data set, and sometimes adding things contaminates inference. If you remember nothing else today, remember that. Okay, let's have some examples of what I just said, since I showed you how to close these things. For the fork: if you condition on Z, you close the path — so if there's a nasty backdoor path into our cause X, you condition on Z, gold. Same for the pipe: if a pipe were part of a backdoor path into Y, you condition on Z and it's closed. For the collider — and you'll see how colliders can be relevant to non-causal paths in the examples to come — you do not want to condition on Z, because that would open the path. The path is naturally closed: you can think of the two arrows as hitting one another in the middle, which stops information from flowing in either direction. But as soon as you condition on Z, remember, it creates that finding-out phenomenon, like the restaurants: you say, oh, this restaurant is still in business and it's got good food — well, it's probably in a bad location. And this happens in all sorts of data sets. Okay, examples. I'm going to do two examples for you, drawing on my screen, and then I'm going to have one for you to do, and we're going to take a break — and you're going to do it. So a break is coming; we're probably somehow on time, for the last time in my life. Our goal here is to list all the paths connecting X and Y, and then decide which ones need to be closed and how we can do that. So again, there are two paths. There's this one — simple enough, that's a front-door path; in this example it's the only causal path, and it's the one we're interested in. And then we've got this fork here, the path on top. This is a backdoor path, because the arrow enters X, so we need to close it by conditioning on Z. This is exactly the same example as before. Good. The next example gets more fun — now I'm going to layer stuff on for you. All the paths connecting X and Y: there's one — ugh, I cannot write with a mouse, I need a tablet — two, three. How do you like those numbers? So: path one is a front-door path, no problem. What about the others? There are no backdoor paths here — I know, you saw that coming; I told you I'd do this one for you — because there are no arrows entering X. Path two is a front-door path as well, and it's part of the causal effect of X. We don't want to condition on Z, because part of what X does to Y is through Z. If we condition on Z, we do close this path, but then we're not getting the total causal effect of X, just the partial causal effect of X holding Z constant — which is not what the natural system does. If you manipulated X in the system this DAG represents, Z would change, and that would induce some change in Y; the total change in Y comes through both paths. I know psychologists are familiar with this — what do you call it, mediation? Moderation? Mediation, sorry; you've got all these M words and none of them make sense to me. Throw it in PROCESS, right — I can make SPSS jokes.
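Here's that mediation point as a sketch — toy effect sizes of my own, where the total effect of X is 2 and the direct effect is 1:

# pipe as part of the causal effect: X -> Z -> Y plus X -> Y directly
set.seed(5)
N <- 1000
x <- rnorm(N)
z <- rnorm(N, x)             # the mediator
y <- rnorm(N, x + z)         # total effect of x: 1 direct + 1 through z = 2
coef(lm(y ~ x))["x"]         # about 2: the total causal effect
coef(lm(y ~ x + z))["x"]     # about 1: only the direct effect, z held constant

Both models are fine; they just answer different questions.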
And the path on the bottom is also not a backdoor path — it's a collider on C. In this case as well, you must not condition on C, because if you do, you will create a bias. This path doesn't create any association between X and Y until you stratify by C. Once you stratify by C, it's like selecting on the population of successful restaurants: X and Y acquire an additional covariation which is due to the structure of C. So this is a collider, and conditioning on colliders opens paths rather than shuts them. Now, the thing about colliders that's pernicious is that often your sample — like in the restaurants example — has already been conditioned on a collider: you only have data for certain values of C, or of Z in that example. And then, well, you have selection bias; this is one of the mechanisms by which selection bias arises in samples. Okay, here's your example, and I want us to take a 10-minute break. Obviously I can't force you to do it, but for some portion of these 10 minutes, I think it would be good for you to sketch the DAG on this screen onto your own paper, write out every path connecting X and Y — there are a few — and then answer the same questions: which ones are backdoor paths, which ones should be closed, and which variables should I stratify by to close them? Let's take 10 minutes, which means I'll come back to you at 27 after three, and then I will work it through for you. But I really want you to try this on your own, so that you get a sense of the burn and you can practice it. If you feel a little frustrated with it, if you haven't gotten it yet — that's fine. It takes some practice, like anything; it's like learning to play the accordion or something like that. You need to practice a little bit, but I think you should practice. So please, let's do this: 10 minutes, I'm back at 3:27. So — this does take some practice. I remember learning this, and you do get good at tracing these paths with your eyes and seeing the structures. I chose this particular example because it highlights the fact that different paths can intersect, and this can create interesting patterns. So let's start by just listing all the paths. I'll use my cursor to highlight them, and then I think you'll see the issue — I'm sure a lot of you found it, but if you didn't, that's why the exercise is here; that's just how it goes. First, of course, there's this path from X to Y; call that path one. It's a front-door path, and I won't say much more about it. There's a path along the bottom down here — call that path two — and this is a backdoor path. C is kind of a classic confound; it's a fork: C is a common cause of X and Y, and it contaminates the association of X and Y.
Even if there weren't a direct causal arrow from X to Y, it would make X and Y associated. So we're going to need to condition on C, and I can start listing the things we need to stratify by over here on the right: C is one of them. Then we've got a path right here. This is also a path like the one through C, but it has Z in the middle; it's a common confound, a backdoor path. Z is a confound and this path is a fork, so we need to condition on Z, because it's a non-causal path that creates an association between X and Y. And there's another path — this one across the top. There are also sub-paths, like this one and this one, but we can focus on the very top path, because the sub-paths are subsumed in the conditioning that we're going to end up doing here. This top path is a backdoor path — you can see it's a backdoor path because the arrow enters X — but it's not open by default. It doesn't do anything bad to us, because there's a collider in the middle, and colliders close paths until we condition on them. But we are conditioning on Z, which is the collider. We have to condition on Z; otherwise we can't close that other path. Having fun? This is why I chose this example. So we're going to condition on this bad boy right here, and that's going to close one path, but it's going to open this one. And so now we have to condition on either A or B to close it — or both, but we only need one to close the path across the top. It turns out it's better to condition on B. There's a detailed explanation that I won't have time to give you today, but the intuition is that conditioning on B increases the precision of our estimate, because B is a direct cause of Y: it's creating variance in Y, and if we can stratify by B, that gives us a more precise estimate of the effect of X on Y. So if you've got a choice between something that's a direct cause of your outcome and something that's a direct cause of the exposure, choose the one that causes the outcome. There's a detailed mathematical version of this argument where you can actually prove that this is true, but that's why I would choose B here. In principle, either will work — either will block the confounding; the choice between A and B is just about precision, not about bias in estimation. Good. Was this an okay example? You feel the burn? You like it? You're all welcome to stop by my office; I'll write a random DAG down for you, and you can try to use the backdoor criterion on it.
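And if you want to check your answer mechanically, here's my reconstruction of the exercise DAG in dagitty — I'm reading it off the slide, so check the encoding against your own sketch:

library(dagitty)
ex <- dagitty("dag {
  X -> Y
  C -> X
  C -> Y
  Z -> X
  Z -> Y
  A -> X
  A -> Z
  B -> Y
  B -> Z
}")
adjustmentSets(ex, exposure = "X", outcome = "Y")
# { A, C, Z } and { B, C, Z }: C and Z must go in, and because conditioning
# on the collider Z opens the top path, you also need A or B -- B for precision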
Okay, let's move on — we're in the home stretch now; we've got a half hour together still, and I'm going to make it count. I want to teach you a few things that all emerge from the backdoor criterion and are actually useful in thinking through your own work — and perhaps even more importantly, in interpreting the work that you read and the work that you criticize. The important part of knowing these things is that it improves your peer review, in addition to improving your own research. So, I said I'd come back to this example about what experiments do. For this particular graph, the effect of randomizing X is to remove all the arrows that enter X, because when you randomize X, you are the cause of X and nothing else is. That radically simplifies the graph. And now you don't actually have to condition on anything to get an unbiased estimate of X on Y — but you still might want to condition on B, and Z, in order to improve precision, because they're both direct causes of Y. You don't have to, though: you'll get an unbiased estimate of X on Y with just X in the model; you don't need any controls to do it. And that's the magic of experiments: in a truly randomized experiment, the choice of controls is just about precision. It's not about making inference possible. But causal inference in observational systems is possible too, because what we're doing with the backdoor criterion is deciding how we can statistically mimic an experiment. To do that you need to know the causal structure, and there's just nothing else to say about it: you've got to make assumptions about the causal structure. But if you believe those assumptions — if you're willing to trust them, at least hesitantly at first; you'll test them later — then you can statistically imitate an experiment. And that's the magic of this stuff. Now, of course, in my business — I'm an anthropologist — there are many interesting phenomena that simply cannot be studied experimentally, and so this is very good news. But you have to put in the work of thinking about causal structures, having alternative DAGs, having open and transparent communication about them, debating them. Okay. Let me transition into a series of examples which all use the backdoor criterion in a particular empirical example, to illustrate something about how we report statistical estimates — a bad thing about how many of us report statistical estimates; I've done this too. It's called the Table 2 fallacy. What is Table 2? Table 2 refers to the convention, in many scientific fields, that the second table in a paper is a set of coefficients from a regression, or some other kind of statistical analysis, often with p-values and the like. The Table 2 fallacy refers to the interpretation of that table in which every coefficient is interpreted as a causal effect on the outcome. That's often fallacious — you have already learned why, and now I want to show you that you've learned it. And not only that, but you've learned how to interpret those coefficients, if you have a DAG or some other causal model; it doesn't have to be a DAG, but a DAG is enough. There's a great paper from which this phrase comes, and I cite it at the bottom: "The Table 2 Fallacy." I think it's a three-page paper. It's worth your time — curl up with it at bedtime tonight, something like that; everybody should read this paper, I think. They use this example, and I'm taking all these figures from the paper. So, the research is about HIV — human immunodeficiency virus — and its relationship to stroke. For people in my generation, HIV was a big deal: we all had friends who had it. Antiretrovirals are very effective now, I guess, but people of my generation all learned from examples using HIV, because it was such an extensive epidemic. So, colored in red is the causal query. There are other things that we believe also influence these variables: things like age and smoking behavior, which are known to influence stroke. Age is a major risk factor for stroke, as you all know, and smoking — you're not surprised — is also a major risk factor. But these things are also related to one another.
Age also influences your probability of acquiring HIV, because your immune system declines with age; smoking also suppresses your immune system and makes it more likely that you'll become HIV positive; and age is related to smoking — in some places it's positive and in some places negative, but some age groups smoke more than others. So you've got arrows all over the place. Sorry, this is the real world; this is what it's like. Given this causal diagram, we can use the backdoor criterion to get an unbiased estimate of the effect of HIV on stroke. We can, and you can probably see it — you can trace it with your eyes. There are two backdoor paths. There's the backdoor path through smoking to stroke — this is what you get good at: you do the calculation with your eyeballs; my mom used to say, use your eye-brain, Richard. And then there's another backdoor path through age: since age is a common cause of both, it's a confound path, a fork — you can see that. So we've got our two backdoor paths, and they're both forks: one passes through age in the middle and one through smoking. If we condition on age and smoking, we can block both of these backdoor paths, and we can get an unbiased estimate of HIV on stroke, conditional on this DAG — you've got to believe the DAG, but for the sake of this conceptual exercise we're going to assume it's true. There's obviously other stuff that might be going on, but it would just add more structure. So you could run a regression predicting stroke as a function of HIV, which is the thing of interest, with age and smoking added as control variables. What this does in a regression model is stratify by age and smoking, which is what conditioning means in the DAG. I pause for suspense.
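Concretely, the model the backdoor criterion licenses here looks something like this — simulated toy data with made-up effect sizes, not the paper's numbers:

# stroke as a function of hiv, with age and smoking as controls
set.seed(6)
N <- 5000
age     <- rnorm(N)
smoking <- rbinom(N, 1, plogis(age))
hiv     <- rbinom(N, 1, plogis(-2 + age + smoking))
stroke  <- 0.5 * age + smoking + 2 * hiv + rnorm(N)  # continuous risk score, for simplicity
round(coef(lm(stroke ~ hiv + age + smoking)), 2)
# hiv recovers its causal effect (about 2); age's coefficient is only its
# direct effect (0.5), not its total effect through smoking and hiv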
If we conditioned on these other things, we would have blocked front-door paths: there's a front-door path through smoking, and there's a front-door path through HIV. And so we only end up with this one arrow, which is not the causal effect of age; the causal effect of age acts through all of the arrows. Now, if you want only this arrow, then conditioning like this is the right thing to do. But this coefficient is not the causal effect of age. Of course, the causal effect of age is a weird thing anyway, because you can't manipulate age. Age is just time; it's the stuff that happens to you. That's the causal effect of age. But we're used to weird proxies like that. Does this make sense, what I'm saying about interpretation? There are lots of epidemiologists who will actually refuse to report the coefficients on control variables, because they don't want readers to make the mistake of interpreting the coefficient on age the same way as the coefficient on HIV. And that's what the Table 2 fallacy is about. It's about the interpretation mistake; the table's not the fallacy, the interpretation is. Smoking, similarly: its coefficient is not the causal effect of smoking, because smoking acts through HIV as well, and in fact the effect through HIV could be bigger than the direct effect. It's possible. So it's only the direct effect, which is only a partial causal effect. So again, smoking and age are control variables, and their coefficients are not interpretable as total causal effects of those variables. And people do this all the time, in all kinds of fields. In biology this is rampant: multiple regression, some random-effects structure, Table 2 hits, and then you've got a discussion section that interprets every coefficient as a cause of the outcome. And maybe that's okay, but you would need a DAG in which none of the control variables interact with one another, and by interact I mean have arrows between them. Which is kind of implausible in a biological system, I think, but I'm willing to indulge your DAG. Bring me your DAG. Okay, let's make this a little spicier. There are likely to be a bunch of lifestyle factors which are confounds between smoking and stroke. I don't need to list them; you can imagine them. So we don't have them measured, but we know they're there. That's okay. We can still get an estimate of HIV on stroke in this model using the backdoor criterion. Why? Because we can condition on smoking. Now, this U here biases any causal estimate of smoking, but that's not what we want. Are you with me? I'll say it again: U biases any causal estimate of smoking on stroke, but that's not what we want. We want an estimate of HIV on stroke, and this doesn't hurt us there. We're still going to condition on smoking, because it blocks this backdoor path, and we're still going to condition on age, because it blocks the other backdoor path. What's going to happen, though, is that the existence of this confound will radically change the interpretation of the coefficients on smoking and age, and now neither will be causal at all. Okay, wait, the Table 2 fallacy; hang on, I was supposed to explain it with the diagram. So let me trace it for you. Here we go, let me get my pen back out. Let's think about age first; it's the most complicated one, and if you get the one for age, you'll get the one for smoking.
For the coefficient on age: we condition on age, and there's this path to smoking, so we condition on smoking too. But notice that smoking is a collider on the path up through the confound. So when we condition on smoking, we open this confounding path for age, and now the coefficient on age is contaminated by the confound up here that we haven't measured. Good times, yeah? It doesn't hurt our estimate of HIV on stroke, if this is the correct DAG, but it means that the coefficients on age and smoking are neither of them causal, because they're contaminated to some unknown extent by an unmeasured lifestyle confound between smoking and stroke. Neither is causal now. But if you put them in a table, and report them in exactly the same way as the coefficient on HIV, with p-values and everything, many readers will interpret them as total causal effects. They're just control variables. And this is what I say on the next slide, so let me go to it: your statistical model is designed to estimate a particular causal effect. For different queries about the same DAG, say you want the causal effect of age, or you want the causal effect of smoking, you need a different regression that applies the backdoor criterion uniquely to that question. Does that make sense? Your backdoor paths start in a different place, and you may condition on different variables. And so, for any particular regression, only the causal query of interest is interpretable, or justifiable, and the others may be contaminated. Well, you can interpret them, but the interpretation depends upon the DAG. So there is a way to do this responsibly, as a reader and as a researcher. But typically there are no causal models in papers, and typically all coefficients are reported as equal. This is the Table 2 fallacy. All of this comes from the backdoor criterion and this kind of analysis. And it holds for more complicated causal models, with differential equations and things like that. You get the same sorts of effects, where you're conditioning on something merely to block a confounding path, and the parameter estimates on that confounding path are not necessarily total causal effects.
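To make the Table 2 fallacy concrete, here is a minimal simulation sketch of the simpler version of this DAG. All the effect sizes are hypothetical numbers I made up for illustration; they are not estimates from the paper:

```r
# Simulate the DAG: Age -> Smoking, Age -> HIV, Smoking -> HIV,
# and all three -> Stroke. All coefficients are hypothetical.
set.seed(1)
N       <- 1e5
age     <- rnorm(N)
smoking <- rnorm(N, 0.5 * age)
hiv     <- rnorm(N, 0.3 * age + 0.4 * smoking)
stroke  <- rnorm(N, 0.2 * hiv + 0.6 * age + 0.5 * smoking)

round(coef(lm(stroke ~ hiv + age + smoking)), 2)
# hiv     ~ 0.20 : the causal effect, as the backdoor criterion promises
# age     ~ 0.60 : direct path only; the total effect of age is
#                  0.6 + 0.5*0.5 + 0.2*(0.3 + 0.5*0.4) = 0.95
# smoking ~ 0.50 : direct path only; the total effect is 0.5 + 0.4*0.2 = 0.58
```

Same regression, three coefficients, three different interpretations. Only the coefficient on HIV answers its causal query.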
Okay. Now, in the remaining time, and I think this is the last bit, and then I won't say much about Bayes today, but that's okay, let's talk about bad controls. This won't take but maybe five minutes; we'll do a few examples. As you've probably seen by now, adding control variables can actually create problems in estimation. And it's unfortunate, I think, that people aren't taught this in their first statistics class. You get introduced to multiple regression and it's just like: oh my god, you mean I can just add variables and block confounding? Okay, but you can also add variables and create confounding. That's what colliders do, and I've just shown you a couple of examples of this happening. There's a great paper called "A Crash Course in Good and Bad Controls," and I just want to show you some examples from it, as a way to nudge you to go read the paper too, or at least look at the examples. This is just a slide to give you some notes on things that I don't think you should do; a set of heuristics people use to choose their controls. Often people will just add anything in the spreadsheet, any and all variables; you only live once, right, go for it. Sometimes people test for collinearity and then add the ones that aren't highly collinear. This is also not valid; there's nothing in the backdoor criterion about collinearity. It's not a logical criterion. It's a weird biology tradition; I don't know where it comes from exactly. And sometimes people will say: any pre-treatment measurement, or baseline variable. This is also not principled, although it's a lot better than the other two; there's some logic to it, but there are pre-treatment things that can create confounds too. So let me show you some examples. Here's the fundamental DAG: X causes Y. We start here. I'm going to keep the X-to-Y effect as our focus in each of these, and just run through some cases where the control variable Z is bad and you should not condition on it. You may have all three variables in your data set, but you should resist adding Z, and the backdoor criterion can tell you why. Here's one that's called M-bias in the literature, because the DAG kind of looks like a letter M, if you draw it like this. The idea is that there's a potential control variable Z. It could even be measured at baseline, and it's neither directly caused by nor a direct influence on X or Y, but it shares a common cause with X and a common cause with Y. Now if you condition on Z, since it's a collider, it opens the path across the top, and that contaminates your causal estimate. This is why pre-treatment variables are not automatically safe: they may sit inside some complicated structure with unobserved variables. I think political scientists worry about this quite a lot. I won't go into the details, but it's related to social network analysis: social networks are sort of inherently selected, and so M-bias is a big concern in that literature. Here's one we call post-treatment bias. The idea is that Z is a, what do you call this, mediator, on the path between X and Y. Sorry, I've never had a psychology class; I just have to pretend I understand these words. And there's a version of this with and without a confound. Without the confound: if you didn't know Z was a mediator and you add it to the model, it knocks out X and makes it look like X is not a cause of Y. But X is a cause of Y: if you change X, it'll change Y. This is a pipe; if you change X, it'll change Z, and that will change Y. X is a cause of Y. But if you condition on Z, you block that, and you learn the wrong story. You've got to get the diagram right. When there's a confound between Z and Y, it's even worse: now you get the wrong estimate of Z as well, so you don't even get the mediator effect right. Nevertheless, if you just measure the association between X and Y, you get the right estimate. This is called post-treatment bias, and it's a concern in the political science literature.
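Both of those failure modes fit in a few lines of simulation. A minimal sketch, with coefficients and true effects I chose for illustration:

```r
set.seed(2)
N <- 1e5

## M-bias: Z is pre-treatment, but it's a collider of two unobservables
u1 <- rnorm(N); u2 <- rnorm(N)
Z  <- rnorm(N, u1 + u2)
X  <- rnorm(N, u1)
Y  <- rnorm(N, u2)                 # true effect of X on Y is zero here
coef(lm(Y ~ X))["X"]               # ~ 0, correct
coef(lm(Y ~ X + Z))["X"]           # clearly negative: bias created from nothing

## Post-treatment bias: Z2 mediates the effect of X2 on Y2
X2 <- rnorm(N)
Z2 <- rnorm(N, X2)                 # the pipe: X2 -> Z2 -> Y2
Y2 <- rnorm(N, Z2)
coef(lm(Y2 ~ X2))["X2"]            # ~ 1, the total causal effect
coef(lm(Y2 ~ X2 + Z2))["X2"]       # ~ 0: the mediator knocks out X2
```

In the first case a "harmless" baseline variable manufactures an association that isn't there; in the second, a real cause is made to vanish.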
There's a great paper on this, Montgomery et al. 2018, and I put this slide here in case you want to collect these things. They're political science examples, but they're experiments, and this is why I emphasize it: conditioning on post-treatment variables is a really routine way to ruin an experiment. There are two parts to the paper. In the first part they survey top political science journals, and they show that about half of the experiments are being ruined by conditioning on post-treatment variables. In the second part they explain why this is bad and what kinds of things come from post-treatment conditioning. It's a great paper, and I highly recommend it even if you're not a political scientist, because you'll recognize the structure. Whew. Sorry, I know this is going fast, but we've only got 10 minutes of quality time left together today. I love this material, but three hours is a lot of concentration for me; I'm suffering more than you, I promise. So, selection bias. This is the collider bias example; it's exactly a collider structure, that's all it is, so I won't spend any more time on it. Conditioning on the collider induces selection and contaminates the causal estimate. Don't condition on colliders. And here's the subtle, sneaky version. This is also collider bias, but now Y is not a cause of Z; instead, Z and Y share an unobserved common cause, and so Z is effectively a collider. When you condition on Z, you open the path through the unobserved variable. It's like Christmas today, isn't it, all these great presents I'm giving you. Now a couple of quick examples that are interesting and related to our final two examples, two moms and peer-review bias, which I'll have to rush through in the last 10 minutes. There's this thing where you have a variable that's a descendant of the outcome; it's caused by the outcome. If you condition on it, that distorts estimates too, and this is called case-control bias. You're removing variation in the outcome, and that biases your estimate of the causal effect on Y, because there's less for X to explain. This is bad news; you shouldn't do it. But you need a DAG, a causal diagram, to know not to do it. A more subtle one is what I call the precision parasite. And this one is live in the two-moms example: Z is like birth order, X is the mom's family size, and Y is the daughter's family size. This one is different because when you condition on Z, it removes variation in X, and then you get a less precise estimate of the effect of X on Y. It doesn't bias estimation, but it makes estimation less efficient. That's why I call it a parasite. And it gets even worse when you have a confound between X and Y, as we do in the two-moms case: now adding Z amplifies the bias. That's what we saw in the regression examples I showed you. So this is another canonical example of a bad control: you have a confound, you add Z, and it actually makes the confounding bigger.
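Case-control bias and the precision parasite in miniature; again a sketch, with a hypothetical true effect of 0.5 (the bias-amplification case shows up in the two-moms simulation coming later):

```r
set.seed(3)
N <- 1e4

## Case-control bias: Z is a descendant of the outcome
X <- rnorm(N)
Y <- rnorm(N, 0.5 * X)
Z <- rnorm(N, Y)
coef(lm(Y ~ X))["X"]                # ~ 0.50
coef(lm(Y ~ X + Z))["X"]            # ~ 0.25: biased toward zero

## Precision parasite: Z2 causes only the treatment
Z2 <- rnorm(N)
X2 <- rnorm(N, Z2)
Y2 <- rnorm(N, 0.5 * X2)
confint(lm(Y2 ~ X2))["X2", ]        # unbiased, tighter interval
confint(lm(Y2 ~ X2 + Z2))["X2", ]   # still unbiased, but wider: precision eaten
```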
Okay, let's transition to graph analysis now; the previous slides were really examples of bad and good controls. Let's return to the peer-review bias example and analyze its DAG with your newfound powers of d-separation. Just to remind you where we were: X is the social category, and we're interested in whether it has a direct effect on Y, potentially through discrimination. But X also influences the field or subject or department the individual applies to, E. And then there are these unobserved qualities of the applicants, Q, which influence both the probability that their application is successful and their choice of subject. Now, the problem, of course, is that when we condition on E, it's a collider on the path from X to E to Q to Y. Conditioning on E is something we need to do to estimate the direct effect, which is the target of inference. But when we do it, we create a confound, so E is a bad control. We create a confound through Q, which we have not measured and therefore cannot add to the model to block the confounding path. This sort of thing is extremely frustrating in literatures like this. And as I asserted earlier, there are lots of problems with this kind of structure: individuals are selected into some competitive arena, they are selected by some process, and there are hidden qualities of those individuals which influence both their entry into that process and what happens to them once they're there. We want to understand some partial causal path in this process, but we cannot, because of the hidden qualities. So the change in estimates from adding E could be self-deception, because of the confound path through Q, or it could be the truth. We don't know, unless we can do something to estimate these potential Q variables. To summarize the peer-review bias literature, and things with a similar structure: what we get from analyzing the graph is an understanding of why the regression models behave as they do. Controlling for E opens a non-causal path. As I said, it's not a backdoor path, but it is non-causal; it enters through the front door, and conditioning on E opens up this sort of sub-backdoor path through E. So backdoor paths aren't the only threats, and that's the general theme of this graph-analysis section of the workshop: backdoor paths are the first thing you analyze, but there are other threats, and this is one of them. Bad controls extend beyond backdoors. So what can we estimate? Not all is lost, because we can estimate the total causal effect of X on Y, through both paths: Q is not a confound if you don't condition on E. However, this is not what anybody wants to know. You cannot sell any audience on the idea that the total causal effect is what is of interest for social policy or reform. What we cannot credibly estimate is the direct path from X to Y, at least not in an observational study, not without some data that helps us estimate Q. I'm going to come back to the peer-review bias example in the section after this one. Now to the other example, two moms, for some further graph analysis. As I said, the theme of this section is that there's more than backdoors: when you analyze the DAG, or the causal model more generally, in your context, you can potentially do a lot more. So here's an example. The two-moms case seems hopelessly confounded. We ran the regressions, and adding variables doesn't help; in fact, it hurts. And we have a confound between the treatment, the mom's family size, and the outcome, the daughter's family size. However, what I want to show you is that if you're willing to make stronger assumptions about the functions in the DAG, to make something called a structural causal model instead of just a directed acyclic graph, then you can analyze the graph and potentially do a lot more. And here's a structured example of that, with a structure that's fairly general across a number of literatures.
The takeaway point, just in summary, is that not every statistical analysis is going to look like a regression model, or a multiple regression model. Sometimes you need other things, and you can derive the structure of your statistical analysis from the structural causal model itself. Let the causal model drive. Don't assume that what you're going to end up with is a regression, because I'll show you that in this case what you end up with is not a regression. Okay, here's the DAG from the two-moms problem, just to remind you. We're interested in the causal effect of mom's family size M on the daughter's family size D. We've got their birth orders, B1 and B2, and there's almost certainly some unmeasured confound, or a whole suite of unmeasured confounds, between M and D: income, educational background, cultural background. Now let's make stronger assumptions. Let's assume this is a linear system, the easiest sort of system to analyze. You can also analyze nonlinear systems, but I want this to be the simplest possible example. We put coefficients on these paths, which represent the causal effect of each variable on the one it points to. So in the case of the birth orders: being a first-born increases family size by some amount b on average, and we assume that effect is the same for both the mom and the daughter. And then little m is the causal effect of mom's family size on the daughter's; that's what we're trying to estimate. And then there's the confound, which means we can't estimate m by just regressing D on M. But what can we do? Well, it turns out that in linear systems, the covariance between two variables can be calculated as a chain of products of the path coefficients connecting them. The simplest case: the covariance between B1 and M is cov(B1, M) = b * var(B1). The intuition you want here is that you start with the variable B1, the originating cause. Its variance determines how much variation there is to transmit, and that variation gets multiplied, or shrunk, by the path coefficients on the way to the variable of interest. In this case there's just one arrow, so it's just b times the variance of B1, and that is the covariance you would observe in a data set between these two variables. We can do the same for the covariance between B1 and D. What we can't use is the covariance between M and D, because it depends upon the confounding path, and we don't know the strength of anything on that path and can't measure it. So from cov(M, D), which is the relationship we actually care about, we can't estimate m. But here's the trick: we can calculate the covariance between B1 and D, because the confounding path is not open along that route. By the same rules, cov(B1, D) = b * m * var(B1). So now we know two covariances in this graph, and we have two unknowns, b and m. That's a system of two simultaneous equations, and we can solve for m. To give you the intuition: take the equation I'm circling with my cursor, cov(B1, M) = b * var(B1), and solve it for little b. You get b = cov(B1, M) / var(B1), which is a regression coefficient, for those of you who've had a traditional linear modeling course.
Now substitute that solution for b into the other equation, and solve for m, isolating m on the left. What you get is m = cov(B1, D) / cov(B1, M). And this is a valid, unbiased point estimate of the causal effect m. Of course we need more than a point estimate; we need some interval around it, and there are a number of procedures for that. In the last section of this workshop I'll show you a general way to get uncertainty bounds on estimates like this, but you could also bootstrap here; there's a wide range of procedures people use. The point is that this is not a regression estimate in the traditional sense. It depends upon simultaneous equations and their analysis.
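To make that algebra concrete, here is a toy simulation sketch. The coefficient values are my own inventions; in particular I set the true m to zero and made up b and the confound strength k, so only the logic, not the numbers, carries over:

```r
set.seed(4)
N  <- 1e5
b  <- 0.8; m <- 0; k <- 1          # hypothetical true values
U  <- rnorm(N)                     # unmeasured confound of M and D
B1 <- rbinom(N, 1, 0.5)            # mom is a first-born, yes/no
B2 <- rbinom(N, 1, 0.5)            # same for the daughter
M  <- rnorm(N, b * B1 + k * U)             # mom's family size
D  <- rnorm(N, b * B2 + m * M + k * U)     # daughter's family size

coef(lm(D ~ M))["M"]               # confounded: far from the true m
coef(lm(D ~ M + B1))["M"]          # worse: conditioning on B1 amplifies the bias
cov(B1, D) / cov(B1, M)            # ~ 0: the simultaneous-equations estimate
```

With m set to zero, the naive regressions confidently find an effect that isn't there, while the covariance ratio lands near the truth.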
Okay, let me summarize a bit, zoom out, and then talk about estimation more. The whole point of this business is that when you have a causal model (and you have to put in a bunch of scientific, conceptual work to get one) you get a tremendous amount out, if you're willing to analyze the graph. What you get out are the implications of the model, which are not apparent to almost anyone just from stating the assumptions themselves. When the causal models are DAGs, you can say very general things by analyzing them. And not all causal models are DAGs, or linear equations like the example I just gave. But all causal models are generative, so they all have implications, and when you analyze them, using whatever tools are appropriate for that form of model, you can do the same sorts of things we've done here: derive tests of the model structure, and derive an appropriate statistical procedure for challenging the model with data. So, the last part of this, and in many ways the most expansive part, ties together all the other things. We need a robust way to actually fit arbitrary causal models to data, and there are a number of ways to do that. My preference is what I call full luxury Bayesian inference, and that's just my playful term. It's Bayesian inference, which is just probability theory, really. The "full luxury" part is a bit of a joke, to say that all luxuries have costs: you get a tremendous amount out of this approach, but you do have to do some work to make it work. That's the price of the luxury. To motivate this, I want you to think about the world of regression, or generalized linear models more generally. In previous parts of this spring school you've heard about generalized linear models and generalized linear mixed models, GLMMs. The thing about all these models is that they're just regressions. And just like cats are all cats (that's the point of this meme), whether it's a small domestic cat or a big tiger, they all behave the same. There are different sizes, but if you were shrunk to a two-hundredth of your size, both of these cats would eat you. And this is what regressions will do. The fact that GLMMs are bigger and can accommodate fancier effects doesn't change the fact that they're fundamentally non-causal machines. They don't understand causes, and if you're not careful they will eat your analysis, just the same as a standard GLM. Causal analysis is about much more than regression; we need some generalized framework for confronting our causal models with data, one that is not in what I've sometimes jokingly called econometrics jail, where you think of everything as a regression problem with some named procedure and estimator. We do have general frameworks for doing this, and the one I prefer, because it lets you just write the causal model as a statistical model, is Bayesian inference. By Bayesian inference, all I mean is probability theory. Sometimes people introduce Bayesian inference by saying: oh, Bayesian inference is distinct because it has priors. But the real difference is that Bayesian inference is just generalized, permissive probability theory. It lets you do what you want, as long as it obeys the laws of probability. So, for example, you can take a generative model, program it as a Bayesian network, and Bayesian inference can extract information from the data, if there is such information in the data; and if there's not, it'll tell you that too. The generative model itself is neither Bayesian nor anything else, frequentist or whatever; there are other paradigms beyond those two. A generative model is a generative model, and Bayesian analysis is just the use of the laws of probability, which everybody agrees are correct, to extract information using the implications of that model. But the truth about full luxury Bayesian inference is that the luxury has a price tag. What you get is this: if you've got a complicated causal model, and you're not very good at graph analysis or any other kind of analysis, probability theory can analyze the model for you, and automatically recognize the consequences of its structure for the evidence. I'm going to show you examples of this in the slides to come. But the price of all that luxury is that you have to do the programming, and the computation can sometimes be quite challenging. For some causal models, in fact, even if there's an available causal estimate in principle, in practice there won't be, because you will not be able to get stable numerical estimates of it. So the theoretical possibility of a causal estimate is no guarantee that you can derive one. Okay, let's take a case where it does work, though. We're back to the two moms. What we're going to do is take the DAG for the moms. I've drawn it here at the top, flattened so it fits on the slide better, but let me rehearse it for you real quick. This is the structural causal model for the two moms. I've drawn the unmeasured confound now as an explicit variable U; it's a common cause of M and D, and we haven't measured it. And then we have these path coefficients, b, b, and m. Now, to practice full luxury Bayesian inference, you need to translate this structural causal model into the set of functions it implies. Remember, every node implies a function, which takes the arrows entering it as inputs and outputs the variable that emits the arrows out. So you can think of it as: for every variable, we can write down a function that determines it, according to the structure of our DAG. So let's go one at a time and work through this. First there's M, mom's family size. This is a function of all the arrows that enter it, which are B1 and U. Remember, we haven't measured U, but the model says M is a function of it, and so we write it that way.
So we have M as a function of B1 and U. I've just written this function down; we haven't said what it is yet, but it's going to be a linear function, you know that. That's not the important part, though. The point is the structure: M is the output of some function f_M, mom's function. Now the daughter, same sort of thing. The daughter is a function of three things: she's a function of U and B2, just like the mom (it's symmetric), but also, potentially, of mom's family size. That's the research question here. So the daughter's family size D is a function f_D whose inputs are M, B2, and U. And then B1 and B2 are the birth orders, and they're each a function of whatever determines birth order, some unnamed processes; they're essentially random in this graph, because there are no arrows into B1 and B2. The same goes for U: whatever determines the unmeasured confound, like economic background, is not modeled in this DAG. So there's some function which creates a distribution for U, just as there's some function that creates a distribution for the birth orders, but it has no inputs we've measured. This is the essential tactic you use to program what's called a Bayesian network. Once you have this version of the model, you make these functions (f_M, f_D, f_B, and f_U) concrete: you choose probability distributions for them, so that there's a stochastic process that takes the inputs and outputs the family size. In this case it's a linear system, so we have normal distributions for the family sizes of mom and daughter. I made the birth orders Bernoulli, which just means they're zero/one indicators of whether the woman is a first-born daughter. For the unmeasured confound we can't say anything about its scale, so I just make it a Gaussian variable; it's a set of z-scores, essentially, on some latent scale. The psychologists listening will understand what I mean: it's a standard latent variable. And then, to make probability theory function, you've got to define a measure for all the symbols, so we have priors at the bottom. In my full statistics course I say a lot about priors, and about how they don't have anything to do directly with the analyst's beliefs; they have to do with scientifically known constraints on the variables. I don't have time to rehearse that whole argument now, except to say that you figure out appropriate priors using scientific knowledge, just like you figured out the DAG using scientific knowledge. Both of these processes are subjective in a sense, which means they depend upon expertise, but they're not about your beliefs. You don't have beliefs about the parameters; you have beliefs about the observable data, and in my course I teach you how to connect statements about parameters to knowledge about the observable variables. Okay, if you want to run this, here's what the code ends up looking like, if you use something called a probabilistic programming language like Stan; this is from my R package, the rethinking package. All of these languages essentially have you write statements like this, where you write out the function definitions; here, linear combinations that determine the average family sizes.
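As a sketch in the spirit of the rethinking package's ulam function (not the exact code from the slide or repository; the names and priors here are illustrative, and it reuses the simulated two-moms variables from the earlier snippet):

```r
library(rethinking)

# with a big N this gets slow; a few hundred pairs is plenty for a demo
dat <- list(N = N, M = M, D = D, B1 = B1, B2 = B2)

flbi <- ulam(
    alist(
        # mom's family size: birth order plus the latent confound
        M ~ normal( mu , sigma ),
        mu <- aM + b*B1 + k*U[i],
        # daughter's family size: birth order, mom's size, same confound
        D ~ normal( nu , tau ),
        nu <- aD + b*B2 + m*M + k*U[i],
        # one latent confound value per mom-daughter pair
        vector[N]:U ~ normal( 0 , 1 ),
        # priors
        c(aM,aD,b,m) ~ normal( 0 , 0.5 ),
        c(k,sigma,tau) ~ exponential( 1 )
    ), data = dat , chains = 4 , cores = 4 )

precis( flbi )   # the marginal posterior for m is the target
```

The only new ingredient relative to the regressions is the vector of latent U values, one per pair, which is what lets the model average over the unmeasured confound.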
So the code works, and you can go to the repository at the bottom of the slide and run it. What happens when you run it? Okay, remember, before, I showed you that if we just run ordinary regressions on this, trying to predict the daughter's family size with the mom's, it's biased. That's unsurprising, because there's that confound U, and adding B1 makes it worse, because B1 is a bias amplifier. Add a control for B2, and that reduces noise, but we still have the bias; now you have a more precise biased estimate. When you use the Bayesian network, it tries to estimate the U's. You're basically telling it, because you programmed the DAG, to assume there's a confound. It averages over the unknown confound, and it concludes that not much can be said. Essentially: there may be a moderate positive effect of mom's family size on the daughter's in this particular sample. Again, this is just one sample, but this is the pattern you're going to see on average if you run the code over and over. No strong conclusion can be made: it may be a moderately sized effect; it could even be slightly negative. So it's not getting tricked. And in a very, very large sample (take the code and increase the sample size, bigger and bigger) this point estimate will move towards zero. Okay, so that's an example of full luxury Bayesian inference. And you can do the same general thing with the peer-review bias example, but now the issue is that we don't know Q, so we can't de-confound. But what if we had some proxies of it? I don't know: test scores, letters of recommendation, anything you might theoretically think matters, each of them imperfectly associated with Q, which is something we can't observe directly. Well, maybe test scores are a bad example, because you list those on your application. But there are other kinds of things, activities that are not part of the application, which may give us information about whatever is simultaneously influencing your ability to write a good application and your choice of field. So let's say you have two proxy variables, R1 and R2, and they're influenced by Q. But they're not Q. So, what if we just added them to the regression in this example? To remind you, we did this many, many slides ago in this workshop. We were looking at a model predicting Y using X, and we got a coefficient that shows evidence consistent with discrimination. But the problem, of course, is that this is both paths: it's the total causal effect of X that's measured by this model, given that this is a simulation with discrimination built in. Then we add E, and the evidence of discrimination seemingly vanishes: no strong claims can be made; perhaps there's discrimination, perhaps there's actually some advantage to being category X in this system. If we add R1 and R2, it has barely any effect. It nudges the estimate slightly towards the correct answer, which is that there's discrimination, but not far enough. I think the true effect in the simulation is about minus 0.8; I forget, you have to look in the code, and I give you the URL at the bottom. So what if we do the Bayesian network version of this, full luxury Bayes, and just take this DAG and program it as a statistical model?
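Again as a sketch only: a simplified, normal-outcome version of what such a model might look like in the same ulam style. Every name here is my own invention rather than the code at the URL, and E is treated as numeric for brevity:

```r
library(rethinking)

dat2 <- list(N = N, Y = Y, X = X, E = E, R1 = R1, R2 = R2)

flbi_q <- ulam(
    alist(
        # outcome: direct effect of X, department E, and latent quality Q
        Y ~ normal( mu , sigma ),
        mu <- a + bX*X + bE*E + h*Q[i],
        # two imperfect proxies of the unobserved quality
        R1 ~ normal( Q[i] , sR ),
        R2 ~ normal( Q[i] , sR ),
        # latent quality as standardized scores, one per applicant
        vector[N]:Q ~ normal( 0 , 1 ),
        c(a,bX,bE) ~ normal( 0 , 1 ),
        c(h,sR,sigma) ~ exponential( 1 )
    ), data = dat2 , chains = 4 , cores = 4 )
```

In the real example Y would be a yes/no award and E a category, but the structural move is the same: Q enters as an explicit latent variable, informed by R1 and R2.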
So in this code, Y is now a function of E, and X, and the unobserved variable Q. Again, we haven't measured Q, but it's a variable, so you can just introduce it to the Bayesian model as a variable, and it has some unknown effect, h. Then there's a model for Q down here. We don't have any causes of Q in our model, so Q is just some latent z-score thing, a standardized normal, like a standard psychometric latent variable. Its scale is arbitrary, so there's no loss of generality. And then R1 and R2 are caused by Q: on average their mean is Q, but they're not Q, they're not copies of it. And then we run this, and it nudges us towards the right answer. It's not magic, because R1 and R2 don't measure Q perfectly, so we can't totally remove the confound, but it gets us some distance. And now zero is excluded from the interval, although you shouldn't be too excited about that; there's nothing special about zero being excluded. The point is that we have moved in the right direction. If you ran this example over and over (this is only one particular simulated sample), this is the pattern you'd get. Okay. Let me try to summarize. I know this is a bit much, and there's magic going on behind the scenes. What I want you to take away is that Bayesian inference is extremely useful, because it lets you focus on the causal model instead of wrestling with weird estimators and other such things. But it's only as good as the model you put into it; it's just probability theory. It's not going to do any magic if the model is bad. There's nothing there if the sample is bad; it's not going to invent information. And then, of course, you need robust numerical algorithms to do the Bayesian calculations, and these days there are really capable ones, like the Stan language, or the Turing libraries in Julia. You can feed in your causal model as a Bayesian network, and they will take care of the details for you in a wide variety of cases. There are still challenges, so I don't want to tell you everything is soft and easy; sometimes you have to fight with the machine. However, conditional on the model, in principle no other method is going to find information in the data that the Bayesian approach won't. It's just that the Bayesian approach is quite often computationally challenging. A great thing about full luxury Bayes is that routinely in research we have problems like missing data and measurement error. Remember the retracted Nature paper example from not so long ago. Most observational studies have missing data, and many, many surveys and experiments do as well. And almost everything has some kind of measurement error, and measurement error is not benign.
The effects of measurement error depend upon what causes it, and the effects of missing data depend upon what causes them. But you can add causal models of missingness and measurement to your DAG, to your structural causal model, and then program them the same way you would any other set of assumptions in the statistical model, and thereby honestly pull information out of the data, so that you're not overconfident, and also so that you don't miss inferential opportunities. Remember, this isn't just about accepting a burden so that your inferences are less powerful; that's not how this works. Remember that in the two-moms case, graph analysis showed us we could do something that could not be done with traditional forms of analysis. Okay, a final thing I didn't want to finish this content without mentioning. There are common sorts of scientific problems, like social networks and phylogenies, whether genetic phylogenies or linguistic, cultural phylogenies. These things are never observed. They're social constructions, and that doesn't make them less useful, but there is no real social network for a society, and there is no real phylogeny for a set of languages. They are ways of taking high-dimensional data sets and summarizing their connections, with some minimal causal structure imposed. These things are never observed in data; they must be inferred. And you need a causal model of those observations. In all of these situations, where you cannot directly observe the social network or the phylogeny, you must infer it, and you will have uncertainty about the structure of these things. I talk about this in my class, and there are many books and papers about it. Nevertheless, there are still lots of older analytical methods for working with networks and phylogenies which ignore this fact, and I would caution you, well, to stop using those things, because they're creating problems for you. All right. And sorry, I forgot this slide was coming, so let me come back to it just to remind you: this is a case where there's a lot of missing data, but the problem isn't that there's a lot of it. The problem is the cause of the missingness, and the cause of the missingness in this case is one of the primary variables itself. We can't observe this, because writing strongly covaries with population size. And so we just don't have evidence about what non-literate societies' religions were like, and therefore nothing can be done to fix this. This is not a problem that statistics can solve. But it is, in a sense, a problem that graph analysis can solve, because it tells us that these data alone are insufficient, and therefore it tells us to look elsewhere. And that's progress: knowing that the question of interest cannot be answered by the available data is scientific progress. Okay, let me bring this to a close. You've been very patient. The whole point of this is to say that to do statistical inference responsibly and productively, you have to put the science before the statistics, and that means building a causal model that precedes your statistical model, and analyzing those causal models so that the statistical procedures serve them. Causal inference requires a causal model that is distinct from any statistical model we use. We analyze these causal models for their implications, and that analysis is logical.
And "logical" means that every objective observer will reach the same conclusions about the implications of the model. Those implications can be used to design research (you know which data you need to collect, and how much, and so on), they can be used to test the structure of the causal model, and they can be used to design an estimation strategy. In this workshop I mainly focused on the third of these, but there's a lot of literature on the first two as well. And then, as an additional challenge, you need some robust numerical framework for doing the estimation, and I've proposed Bayesian inference as a solution, because there's minimal friction between focusing on the causal model and doing the estimation. There's only one kind of estimator in Bayesian inference, and that's the posterior distribution, so you as a scientific analyst can focus on the causal model and not worry about weird things like analyzing custom estimators. It greatly reduces the amount of work in practice, although it often creates computational headaches; nothing's free. And I've said this a few times, this last bullet point at the bottom of the slide. I haven't really proved it to you, but I've asserted it, and there's a literature on it: both descriptive and experimental research are no exceptions to anything I've said. Descriptive research depends upon causal assumptions: you can't describe the target population unless you can describe why, causally, the sample differs from the population. The same goes for comparisons among samples. Even if you say you're just describing the difference, you've got to think about sampling, and the samples are caused by the behavior of researchers. And experimental research: I hope I've convinced you that there are things like post-treatment bias which you should really be concerned about, but also that lots of experiments are imperfect; you can't make patients always take their medicine. So you can't escape causal inference, and I've really only introduced you to some initial, hopefully stimulating, ideas about it, just as a sampler. When we compute treatment effects: in the examples I've used here, the models are really simple, so you can summarize a treatment effect or causal effect with a single coefficient. This is not true in most applied contexts, and the reason is that treatment effects depend upon the distribution of the other variables in the population. So calculating average treatment effects, so-called marginal effects, requires additional steps, and this is something that, again, I teach in my course. The population version is the big case of this: we've got some sample from a population, and we know the sample may differ radically from it. It's not a random sample of the population, and often that's good, because non-random samples can focus on small groups and get good, robust data on them. But then, when we calculate the effect for the population, there's a secondary statistical procedure required, and this is called post-stratification. There's a big literature on this, in cross-cultural research, for example, but also in political science and public polling, and many other areas of the sciences. Partial identification: sometimes the analysis of your graph tells you that it simply is not possible to get an unconfounded estimate of the causal effect of interest. But that doesn't mean you stop.
Because it is possible to do a sensitivity analysis and ask how much of the total causal effect is plausibly produced by the confounding paths. This is partial identification, and it's often extremely productive, because you're able to say things like: in order for the causal effect to be totally explained by the confound, the confound would have to be of such-and-such strength. And then you can use the scientific literature to discuss the effects on the confounding path and how strong they plausibly are in general. That pushes the research along in a good way. If you need a causal effect, you may really, really need it, and the fact that you can't get an uncontaminated one is no reason to stop doing the research. And finally, of course, responsible research design depends upon having a causal model before you collect the data, ideally. If the causal thinking and the statistical planning begin only after the data arrive, then you're potentially going to encounter situations that cannot be repaired. So having the causal model sketched out before you design your research is the best thing. This is of course how experiments are designed: why are you doing an experiment? Because, in principle, you've done a heuristic causal analysis and you know you need randomization. The same goes for observational studies: if you really hope to get causal estimates in an observational setting, you need to think about the confounding structure before you collect the data. Okay. I think it's often useful to give people workflows to think this through. Of course it takes time to learn how to do these things; you have to work examples and fight with the computer, and so on. That's how everybody learns this stuff; it's just time invested. But here's a cartoon version of the causal workflow for full luxury Bayesian inference, and it's not too wrong. Step one: derive candidate causal models using, and I put it in scare quotes here, "science." I don't want you to think there's some single method for doing this. Use your scientific expertise; do scholarship (scholarship is the better word here). Talk to your colleagues, communicate your DAGs to them, listen to their criticism, and so on. Step two: program the candidate causal model or models as generative simulations. This is a way to see what the data would look like if this were really what was producing the phenomenon; a way to explore its implications, especially for people who are not so confident with algebra and calculus and other such tools. It's a very general way to engage with your theories. Step three: design your research and validate your statistical analysis using the generative simulations, using the implications of the model. Once you're sure that your statistical procedure works on the generative simulation (and that's a minimum standard for a statistical analysis, that it works assuming the model is correct), then you're ready for step four: confront your model with data. Sometimes the conclusions will not be what you were hoping for, but that's still a result. And the reason is that steps one, two, and three justify the answer in step four. They make it reasonable, something you can argue other people should believe. Step four is the step that, unfortunately, many people do without steps one, two, and three, and then no productive debate can happen.
This is why I say we celebrate wins and losses equally: everything is progress for the community. And then step five: revise and repeat. This loop goes on until the heat death of the universe. And now, jokingly, here is what I call the full sadness non-causal workflow. At the beginning of this I talked about causal salad, which is my joke term for the helter-skelter, unplanned way that people use traditional statistical tools to imitate, but not actually perform, causal inference. You could write down the workflow that people implicitly follow, although of course, if it's helter-skelter, there's no real workflow. But it's useful to think about, and I hope you'll see that I'm poking fun without blaming people, because this is a sociological problem. We learn our scientific procedures in communities, and these things are often quite implicit. So I don't mean to blame anybody, but I think you'll recognize some of the problems in this list. First item: find or collect some variables that are conceptually, but not necessarily logically, relevant to a phenomenon. This goes on in tons of research: you've got an introduction that says, oh, there's this phenomenon, it's super important, and we've got some data that kind of sounds like it. What does not happen is a causal model showing that the things they've measured are actually logically related to the phenomenon, given some causal process. Step two: probe the data any way you can. Anything goes, because if you can get an asterisk, you can get it published. Step three: tell some hopeful, and typically causal, story about what the asterisks, the significant values, mean. But this is the first time causal storytelling appears, and it should have appeared much earlier and justified the analysis in the first place. The analysis is then merely interpreted as causal, rather than being a causal procedure that permits us to interpret the statistical results. And then fourth, very important: never state the assumptions that license your story, because if you do, that opens you up to peer criticism. Very bad. And fifth, you get to revel in your truly magnificent h-index, because nothing else matters. I'm being cynical here, I'm joking with you; obviously not all scientists are only after their h-index. But this is a set of implicit procedures which serves metrics, and does not serve the accumulation of valuable scientific knowledge. I promise I'm almost done. I want to say something, though, about this sociology bit, because there are lots of selfish reasons to adopt a properly causal workflow, where you focus on causal models first, and there are also moral ones. And I'm sorry to talk about morality. No one wants to hear that in a lecture, but I think we have to see that science is a moral system, and our behavior is bound to one another through ethical obligations. What a causal workflow does is constrain your behavior in productive, honest ways. Here's a very recent paper; I think it came out like two weeks ago, or even last week.
It's a survey of researchers in animal cognition, asking their beliefs about the state of research in their field: the way people analyze data and do studies and so on. It's very interesting if you're interested in animal cognition research; I give you the citation at the bottom. I'm just going to show you a couple of the survey results, to highlight the point I'm trying to make. Here's what they asked people: "When submitting a paper, I…" and, separately, "…others…": do they make weaker claims than warranted, appropriate claims, or stronger claims than warranted. On the top here, responding for themselves, 86 or 87% of people said they make appropriate claims. And the 5.8% who say they make weaker claims than warranted: I want to be your friend. These are good people. I like you, I like your dark energy; that's exactly how I am. But for others, the respondents said: well, more than half the people in our field make stronger claims than warranted. These two things cannot simultaneously be true. And I don't mean to pick on animal cognition research, because it's not unique in this way, but you can see that this is a self-serving bias. If you think that most people are exaggerating their claims, but you're not, then you're probably exaggerating your claims too. The thing about the causal model approach is that it constrains you to say exactly what assumptions license the claims you're making. It makes it harder to make stronger claims than warranted, because what warrants your claims is the causal model. I hope that makes some sense. One more thing from this paper: "QRPs performed by…" myself versus other animal cognition researchers. What's a QRP? A questionable research practice: things like p-hacking, or hypothesizing after the results are known. What's meant are these sorts of statistical procedures which are in a sense normative, but not actually licensed by any statistical philosophy; p-hacking, for example. And again we get the mismatch. For themselves, about 70% of respondents say they never or only rarely engage in QRPs: p-hacking, dropping outliers, and the like. And we've got some honest people up here too; cheers to them. But for other researchers in animal cognition, respondents say about 50% of their colleagues sometimes do these things, and about 30% often do. So again, these two things cannot simultaneously be true. And again, the causal workflow is not going to solve all problems with questionable research practices. But when your statistical procedure is derived logically from the generative model of your research context, that precludes the possibility of doing many of these things. And that's a big advantage: there's an ethical dividend that comes from this. Okay, I'm going to stop there, but I want to finish with a couple of recommendations about where to go next. The first is the book on the left. I think this is a wonderful book. It's thin, and there's essentially no statistics in it: Causal Inference in Statistics: A Primer, by Judea Pearl, who is of course one of the greats of the modern development of formal causal models, and his colleagues.
I highly recommend it. It teaches you how to think about simple structural causal models, DAGs, and how to analyze them. There's essentially no statistical content; it's really focused on the connection between your science and writing generative models. I think you can download the whole PDF from Pearl's website, actually, if you Google around a bit. And then there's my book, which I recommend with no shame, because it's a book I wrote as a labor of love, for the scientific community, for my students and my colleagues. It's a journey of trying to do better and to justify the analyses I was doing. It has a strong connection to Bayes, of course, but Bayes is in service of scientific models, represented as causal models, throughout the book. And it deals with many of the more elaborate cases: measurement error and missing data problems, social networks, phylogenies, and the like. I hope you found something valuable in this presentation. For those of you here live, I'm always available to talk about these things, and for those of you who aren't, you're free to send me an email. If I can be of help with your particular problem, I will try to do so. Thank you.