Let's get the machinery working and the recording on. Welcome back. There's a lot to get through here, and I don't want to rush, so I'm going to move at a measured pace and it'll take however much time it takes. This is content that I personally find deeply confusing and spent a decade or so trying to sort out in my own mind, so I'm sympathetic to the idea that it's got a natural pace to it; we let it unfold, and we can't just rush through it. So let's continue from what I didn't finish last week, which was causal inference. We're not going to finish causal inference today either. No one will ever finish causal inference; causal inference is an unsolved problem. Hume told us that, right? We're never going to solve causal inference. But you can solve causal inference by assumption. That is, if you can assume a DAG is true, then you can say a lot of really powerful and important things about when science can work. That is a great, great achievement, and I'm trying to give you an introduction to that framework. I went through a number of examples last week of cases where confounding arises by conditioning on variables, or rather where confounding is solved by conditioning on other variables. And there is a framework that unites all of these examples; it's called the backdoor criterion. If we want to make a causally valid inference about the effect of some treatment or exposure on an outcome, then we need to shut all of the so-called backdoor paths into the exposure. What does that mean? It means paths whose arrows enter the back of the exposure variable. I'll give you some examples here. And since there are really only three different ways that variables can meet in a DAG (you learned them before; I'll remind you on the next slide what they are), there's a specific way to open and close each of them by conditioning. You've got all the tools you need, if you can assume the DAG. And this is progress.
It doesn't win Hume's war, but it does give us a lot. So, to remind you, here's what I call the four elemental confounds. Really there are three here, and then there's the descendant, which lurks in all of them; you have to be aware of the descendant, it's always there. They are: the fork, in the upper left. The fork is sort of the most basic thing, the one we're usually taught of as a confound. It's a common cause of two things that creates a spurious correlation between them. So Z is a common cause of X and Y; it generates a statistical correlation between X and Y unless you condition on Z. So this is a path: information flows from X to Y until you close the fork. And the metaphor breaks right there; no one has ever closed a fork. Break the fork, yeah, that works better: break the fork by conditioning on Z. The pipe is a flow of causation with mediation by Z. So X and Y are correlated, not because X directly causes Y, but because X causes Z and Z causes Y. If you want to block the pipe, you condition on Z. If you don't want to block the pipe, and sometimes you don't, then you don't condition on Z. And then finally the collider, my favorite. The collider is terrifying but powerful. The collider is when Z is a common result of X and Y, and then X and Y are not correlated with one another until you condition on Z, and that opens the path and lets information flow between them. It's like the light switch; remember the light switch? If the light is on and you know the electricity is working, then you know whether the switch is on, right? That's what a collider does. And then the descendant is lurking in all of these. If there's something, like A in this graph here, that is a descendant of any variable, then conditioning on the descendant is like weakly conditioning on the variable itself, because A has information about Z. So if you condition on A, you're partially blocking the path. Does that make sense? Conditioning on a descendant of a collider is like conditioning on the collider, just a little weaker, depending upon how strongly correlated the descendant is with its parent. So let me give you some examples now. We're going to want to assemble these things together into bigger causal frameworks, and any DAG, no matter how big, is just made up of those four things. That's all that's possible in a DAG; there's nothing else you can do by the logic of DAGs. Now, nature can do other stuff, but we're not going to talk about nature right now. We're just talking about DAGs, okay? That's enough to understand. So let me give you just the most basic kind of confound, the classic confound. We've got some exposure E, say education (E is for education), and some outcome W, wages. Why do people go to school? Well, I went to school because I wanted to learn how the universe works, but most people go to school to make more money, because they're rational. So a lot of research effort is put into figuring out what the returns on education are. We're interested in this direct path: what's the causal influence of completed education on wages? But there are a lot of confounds, unobserved confounds U, that mean a simple regression can't tell you that. This is the basic confound. What kinds of confounds? Neighborhoods, family environment, personality characteristics that have nothing to do with education. If you're a determined and hard-working person, you're more likely to finish school and you're going to make more money, but it has nothing to do with the education. And likewise, if you're super lazy, you're less likely to finish school and less likely to make money, but it has nothing to do with the education effect. It's just a common confound, and those sorts of things lurk in all of this. So how do we deconfound this DAG using the backdoor criterion? What I want you to see is that there are two paths from E to W in this graph. The first is the direct path, the one we're interested in estimating, from E to W. That's the front door, from education straight to wages.
Isn't that nice terminology? If you don't like it, blame Judea Pearl. And then there's the other path, from E to U to W, and that's the backdoor path, because it enters the back of the exposure, E. And notice: to think about these paths, you can walk against the arrows. Information will walk against the arrows, no problem. Arrows are about causation; they're not about statistical information flow. And that's where confounding comes from. It comes from the fact that information flows in a Bayesian network both against arrows and with arrows, but causation in the real world only flows with the arrows, and we want to know the direction of the arrows. Does that make some sense? So these backdoor paths: you can walk against the arrows, they're still paths. So how do we shut path number two? Well, look at path number two and tell me: what kind of path is that? That's not a rhetorical question; I want some audience participation. Does somebody know? No, it's not the collider, but thank you for participating. It's the fork. It's the opposite of the collider, right? This is the fork, and how do you close the fork? You condition on U, which is unobserved. So we can't condition on it here. We're screwed. We need some other approach. So this is a sad story right now.
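To see what that fork confound looks like numerically, here is a small simulation sketch. All the coefficients are made up for illustration: U causes both E and W, education has zero true effect on wages, and yet the naive association between E and W is strong until we condition on U.

```python
import random

random.seed(1)

def pearson(xs, ys):
    # Plain Pearson correlation.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

n = 10_000
U = [random.gauss(0, 1) for _ in range(n)]       # unobserved trait (e.g. grit)
E = [u + random.gauss(0, 1) for u in U]          # U -> E; E has NO arrow into W
W = [2 * u + random.gauss(0, 1) for u in U]      # U -> W only

print(round(pearson(E, W), 2))   # strong spurious correlation (around 0.6)

# "Conditioning on U": correlate the parts of E and W not explained by U.
# Since this is a simulation, we can cheat and use the true coefficients;
# with real data you would regress E and W on U and take residuals.
resE = [e - u for e, u in zip(E, U)]
resW = [w - 2 * u for w, u in zip(W, U)]
print(round(pearson(resE, resW), 2))   # near zero: the confounding is gone
```

If U really is unobserved, that second computation is exactly the one you cannot do, which is the point of the sad story above.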
We need a sad trombone sound, if anyone can make one. But this is a good thing that it can tell you: in this design, if this is the true causal network, you cannot get an unbiased, unconfounded estimate of the effect of education on wages unless you can measure the confound. And that's important; that would end a bunch of newspaper articles right there. Imagine the second line of a newspaper article: "a new study of such-and-such; unfortunately it's hopelessly confounded and cannot possibly give us a legitimate estimate of the causal influence; end of article." Lots of scientific articles would end right there, and that's an achievement; we'd waste less time. But you can also think of it as a scientific program: you realize this, and then you set out to measure those confounds. If the graph tells you that you have to do that, then you find a way to do it. There's another way to solve this problem that I'll teach you later in the course. It's called an instrumental variable, but we're going to need more statistical tools before I can tell you what that is. And it's not the backdoor criterion that solves it; it's something else called do-calculus, which sounds even weirder. Okay. Here's an example that I terrified you with last week, or thrilled you with, I should say. You weren't terrified; you were thrilled. It's like a roller coaster. This is the grandparents, parents, and grandchild effects-of-education example, where we have a common neighborhood confound on parents and their kids. If we want to estimate the direct causal influence of grandparents on their grandkids, we should note that there are three paths from grandparents to the kids in this graph. There's the direct path; there's the path through parents; and there's the path through parents and neighborhoods. Do you see that? You can trace it with your eyes. You get good at this; you start doing it as you stare at graphs. Sometimes there are lots of paths. Your computer can find all the paths very quickly for you, and in the notes I tell you about an R package called dagitty which will automate this, and I'll give you some examples of how to do it. So now, here's the thing. If we condition on P, that closes the second path, which is what we want to do: if we want the direct effect of grandparents on grandkids, we want to close the path through parents. Since that's a pipe (grandparents to parents to grandkids), you close the pipe by conditioning on parents. But that opens the other path. When you condition on parents, parents are a collider on the other path, path three. They're a collider between grandparents and the unobserved neighborhood effects. That path was closed, until you cleverly conditioned on parents, and now it's open. So you're ruined one way or the other in this graph, and again, there is no way to get a valid causal inference unless we can measure the neighborhood effects, or use an instrument, or something like that. This is an achievement. This is happy news, because now we know we were being fooled. Formerly, we were fooled; we were recommending the wrong policy intervention because we were confounded and overconfident. This is an achievement: to calibrate ourselves to what nature is actually like. Yes, right, because there's a collider along the path. Colliders are closed unless you condition on them; information doesn't flow until you condition on the collider. Pipes are open, forks are open, until you condition; colliders are closed until you condition. So the two arrows entering P mean that nothing flows there; it gets stuck, unless you condition on P, and then information flows happily through and you're confounded again. Does that make sense? Yeah, a little bit? Okay. It's like the light switch thing, right?
There's no statistical association; you can't learn anything about the electricity by knowing the switch, unless you know whether the light is on. The light turns on only if the switch is on and the electricity is running. You get no information about the electricity by knowing the switch is on, until I tell you whether the light is on. That's conditioning on the light: finding out about the light. So your parents are like the light. When you know the state of the parents, and you know one side of the collider, you get information about the other side. That's why it opens the path; it lets information flow through. Does that help? This takes time. Again, I told you, I spent a decade fighting with myself and feeling miserable for not understanding things in this literature. So don't beat yourself up if something is confusing. As I keep saying, if you're feeling confused, it's because you're paying attention, and I thank you for your attention. Okay, something more fun. Let's build these up. As I told you, any graph of any size is just composed of these little elemental confounds, glued together into terrifyingly large and wonderful natural structures. Here's a fun one. This graph can produce lots of fun things. What is going on here? We've got some exposure X and some outcome Y of interest. We want to know the causal effect of X on Y, the blue path in this graph. But there's lots of other stuff going on: not only do we have confounds, we have confounds on the confounds. This is nature. This is why we're all enraptured by it, right? So we've got an unobserved cause U of X on the left, and then we have three measured variables, A, B, and C; those are covariates. So imagine you were given this data set: you've got a column for X, a column for Y, and columns for A, B, and C, and your supervisor tells you to run some regressions and report what's going on. In the absence of the causal graph, it's terror here. I mean, you just throw them all in the multiple regression and report some confusing coefficients; that's what most people do, and journals are full of papers like that. You can have a whole productive career that way. Here, I'm giving you the graph, and I'm asking: what do you actually need to condition on, and what should you absolutely not condition on? I'll let you think about it for a moment. The backdoor criterion is sufficient to figure it out in this case. What's the procedure? You find all the paths from X to Y, and then for each of them you figure out whether you need to open it or close it. So before I turn to the next slide and reveal the paths, maybe you can see what they are. How many paths are there from X to Y in this graph? I gave you one for free: it's blue. There are three, exactly; you guys are good. So there it is, there are only three. You can go directly from X to Y; or you can go up through U to A and down through C to Y; or you can go up to U, down through B, and over through C to Y. Just three paths. So what do we need to condition on?
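The path count can be checked mechanically. Here's a sketch in Python, with the edge list written from my reading of the slide's graph (X to Y, U to X, A to U, A to C, C to Y, U to B, C to B); since paths, including backdoor paths, can run against the arrows, the search treats the graph as undirected.

```python
# Enumerate every simple path between two nodes, ignoring arrow direction.
edges = [("U", "X"), ("A", "U"), ("A", "C"),
         ("C", "Y"), ("U", "B"), ("C", "B"), ("X", "Y")]

# Build an undirected adjacency map.
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def all_paths(start, goal, visited=()):
    """Depth-first search for simple (non-repeating) paths."""
    if start == goal:
        return [visited + (goal,)]
    paths = []
    for nxt in sorted(adj[start] - set(visited) - {start}):
        paths += all_paths(nxt, goal, visited + (start,))
    return paths

for p in all_paths("X", "Y"):
    print(" - ".join(p))
# X - U - A - C - Y
# X - U - B - C - Y
# X - Y
```

This is what packages like dagitty do for you, along with the harder part of classifying each non-causal path as open or closed.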
Condition on either A or C. You could condition on both, but that's redundant; you just need either A or C, and nothing else. If you condition on B, everything is ruined, because it's a collider, and that path is closed already, until you include B in your regression. So this is why I've been annoyingly echoing, over and over again, that you don't just want to add things. This causal-salad approach, where people say "and then we controlled for a whole bunch of stuff," oh, that's reassuring. That's great. Because it generates confounds as often as it removes them. You have to be very careful, especially because the covariates themselves flow into one another in any interesting system. If you just throw them all in the model, now you've got colliders of colliders; who knows what's going on? You need to think very carefully about how this works. [Audience question.] No, the path through B is closed. You can't go through, because it's a collider, unless you put the information about B in the model. Then information will flow through: once you condition on B, information about U leaks into C and vice versa, even though U is unobserved, and so information flows all the way from X to Y. [Audience question.] So the question was, if I understand it: is there any reason to prefer A over C, or are they completely symmetrical here? In this purely heuristic view, where we're not worrying about things like measurement error and a batch of stuff that I call residual confounding, they're equivalent and it won't matter. In reality, it will nearly always matter, because one of them might be measured more precisely than the other, and then you pick the one that's measured better. You may not know which that is, but in reality this is true, and it's a good prompt for me to say something I want to say over and over again as we go through the course: this causal inference business is useless without a sufficiently robust estimation procedure. There are two cooperating halves of this business. You've got to be good at actually getting the information out of the data and respecting measurement, and you've got to have a causal framework that you're thinking with, and either alone is not sufficient. They're both necessary. And the estimation stuff is sometimes really important, like when we get to instrumental variables. You've got to do that right, and that's no joke, which is why it's in the second half of the course. We're going to use Markov chains and the multivariate normal distribution to do it. It's going to work; it'll be like magic. You'll be amazed. But about fifty years of statistical research were required to produce the model I will show you, so it's no joke. This is no joke either. But we put them together and we get a lot; they enhance one another. Brett, is that a hand or a scratch? [Question.] Yes, if we could measure U, we could condition on U, but we haven't. I'll keep using U to mean we haven't observed it. Sorry, Lauria. [Question about conditioning on both B and C.] Well, then if you condition on C, you block that path. Yeah, you can block it. Yep. Okay. Waffles. You remember waffles, Waffle House, from the beginning of last week? Does everybody want waffles now? I should have brought my waffle iron in and actually made waffles. So it's true: there is a correlation between Waffle Houses per capita and divorce rate in the United States, state by state.
I showed you that at the beginning of last week. Let me assert a causal network here, which is not totally silly, about how the variables we've seen in the divorce example connect to one another, and then ask you: what do we need to control for to remove this spurious correlation between Waffle Houses and divorce? Or rather, to estimate the true causal impact of waffles on divorce. What would it be? Let me explain this graph very quickly. These are the same elemental confounds you've seen before. At the far right here we've got W to D. That's the path of interest; we want to estimate any direct causal effect of Waffle Houses on divorce. There's a backdoor into Waffle Houses from being in the South, S. How does being in the South cause Waffle Houses? Waffle House started in the South; it's a southern business, and it mainly stays in the southern states. It's never gone across the Mississippi, I don't think. Well, there are a few west of the Mississippi, just not very many. Being in the South is correlated with lots of other things as well, and so there are other arrows coming out of S. Notice that this is where the science comes in, because you know about the variables: there are no arrows entering S. Marriage doesn't cause a state to be in the South. Because you know what the variable is (it's a geographic location), you get information about causation from that. You do this all the time when you do statistics, of course; this is just a way to formalize that knowledge. So we get causal arrows coming out of S into age at marriage, A. There's normative pressure in southern states, having lived there myself, to get married at a young age rather than shacking up. That's what it's called, shacking up, when you live with your partner and you're not married. In the South there are lots of places where that's not approved of, and so there's pressure to get married at an earlier age. There may also be a direct effect of the South on marriage rate, M, itself, because marriage is a valued institution there. And then I argued in the lesson last week that there's a plausible causal effect of age at marriage on marriage rate, because if there are more young people, and people get married more when they're young, then that will increase the aggregate marriage rate, just as a side effect. Finally, there may be direct effects of age at marriage and marriage rate on divorce, D, and that bottom triangle is the thing we studied last week. Remember, the classic confound: we saw that conditioning on age at marriage showed that marriage rate has no plausible direct impact on divorce, or at least not much. So I ask you: what do we have to do in this graph to estimate the causal impact of Waffle Houses on divorce? There are multiple answers. You could condition on A and M; you'd have to do both, because there are several paths. What are the paths? How many paths are there?
Well, all of them start with W, and when we walk backwards out of W, they're backdoor paths. We come out the back door of Waffle House; this is where the metaphor pays off. Okay, so we go to S, and then we can go straight down to A and then over to D; or we can go from S to M and then to D; or we can go S to A to M to D. Those are all the paths. It is sufficient to condition on S, and that will close all of them. You could condition on A and M too; that would do the same job. Or you could just condition on S. So it would be sufficient to have a very simple regression here, where you just put the location of the state in the model, and that should, if this graph is right, remove the whole spurious effect and tell you whether there is any direct causal impact of Waffle House on divorce. I leave it to you, for your own fun at home, to do this with the data set and see what happens. Yeah, Brett? [Question.] That's a valid path, but it's blocked by a collider. There's a collider there, right? But if you condition on that collider, you could open it, and that's why you don't do it. [Question: does conditioning on S close all the backdoors?] I think it closes all of them, I think so. It's in the graph, but all the paths coming out of it will be blocked. You can't go through: A is a pipe, so if you condition on it, you close it; A is in the middle of a pipe. In this graph, Waffle House is always a mediator for the South. That's okay, though; it still has a causal impact, right? Mediators cause things. Conditioning on A and M should do it, and conditioning on S should do it. Make sense? This is fun. You should have fun at home: just draw up a crazy DAG. It's like a party game, right? Have your friends draw a DAG for you, and then you have to tell them all the backdoor paths. If you're drunk, it'll be better. [Question about stepwise variable selection.] Am I familiar with it? That's madness. That's madness, and you should never do it. I know it's taught in this building, and you should never do it. It has no statistical legitimacy. That's all I'll say. [Question about selecting variables by collinearity.] Collinearity arises from the causal graph, so you can't read it raw; it's the conditional independencies that matter. There is no logical framework in which looking at the collinearity between pairs of variables tells you anything interesting about causation. That's just a fact, and I think this framework lets you deduce that, if you buy the framework. And I know that's been taught; you should forget you ever learned it and never do it. If you see it in a paper, you should tell the editor that it should not be done. [Question.] Would you say that again? Because it was grammatical, but it sounds crazy. Yes. Yes: being in the South is like a treatment that produces Waffle Houses. Well, there are a lot more causal steps in between, but that's basically true. There's an economic development story, and Waffle Houses are independently owned businesses that buy licenses and all that, sure, but they're confined within the South almost entirely. Yes, it's a proxy for Southernness; that's why, exactly. Well, after you condition on the South here, you block all these other paths, so if you still get a big coefficient on Waffle Houses, that should address that, right? [Question.] I don't follow that. If this is the graph, then absent residual confounding (and we're assuming for now that there's just this graph), you're just blocking paths. We will talk about measurement error and other interesting kinds of residual confounding later in the course, and then maybe we can look back at this example and talk about it. In reality, yeah, you can end up studying different stuff, but given this graph, no: you're just blocking paths, the paths are blocked, and that's it. And then we tell stories, absolutely, about Southernness. Have I gone to college in the South?
I know all about Southernness. I was dabbing before anybody knew what it was. Okay, I don't think people dab anymore. That was a thing in Atlanta when I was going to college there: Freaknik, and dabbing. Okay. So, I want to get to overfitting today. Good news first. There are a number of packages, both in R and in other programs, where you give them a DAG, you just write the arrow network in (there are examples of this in the chapter), and then you can have it algorithmically tell you things about the graph. And one of the things you might want to do with these graphs is test them; you shouldn't just trust them. You can test your DAG. You don't have to accept it blindly, like it's a religious edict. The way you test a DAG is to inspect the implied conditional independencies. That is, the structure of a causal network implies that some variables are independent of others after you condition on other variables, and this is just a logical consequence. For any DAG, your computer can spit out all the implied conditional independencies, and then you can test them using regressions. So, if you want to test this DAG, what are the implied conditional independencies? Here's how you do it with this package, dagitty; the full code, including specifying the DAG, is in the chapter. First: A is independent of, or d-separated from, W, conditional on S. "D-separated" means "dependency-separated"; it's a really awkward way to say "independent," but this is what it says in the papers in this literature. So A and W are d-separated conditional on S: there should be no correlation between age at marriage and Waffle Houses after conditioning on being in the South. You can see that in the graph, right? You block that path; it's a fork, a fork from South. Second: D should be independent of S after you condition on everything else. This is a hard one to see, but it's a prediction, if this DAG is true.
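Implications like the first one (A independent of W given S) are easy to check by simulation. Here's a sketch with made-up linear effects and S as a binary South indicator; none of these numbers come from the real Waffle House data.

```python
import random

random.seed(7)

def pearson(xs, ys):
    # Plain Pearson correlation.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

n = 5_000
S = [random.random() < 0.5 for _ in range(n)]     # True = southern state
A = [-1.0 * s + random.gauss(0, 1) for s in S]    # S -> A (earlier marriage)
W = [2.0 * s + random.gauss(0, 1) for s in S]     # S -> W (more Waffle Houses)

# Marginally, A and W are correlated through the fork at S...
print(round(pearson(A, W), 2))   # clearly negative (around -0.3)

# ...but conditioning on S (demeaning within South / non-South) removes it.
def demean_by_group(xs, gs):
    m1 = sum(x for x, g in zip(xs, gs) if g) / sum(gs)
    m0 = sum(x for x, g in zip(xs, gs) if not g) / (len(gs) - sum(gs))
    return [x - (m1 if g else m0) for x, g in zip(xs, gs)]

print(round(pearson(demean_by_group(A, S), demean_by_group(W, S)), 2))  # near zero
```

With the real data, you would run the analogous regression and see whether the conditional association is in fact near zero; if it isn't, some part of the DAG is wrong.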
That should be true in your data. And then finally, M should be independent of W; that is, marriage rate should be independent of Waffle Houses after you condition on the South. This is like the first one, the top one: you block those paths. So you can test these and see if they're true, and I encourage you to give this a go with the Waffle House data set and play around with it. Parts of the DAG might be wrong; if one of these fails, it doesn't mean the whole thing is wrong. You just have to think, and you have to use science. And this is a good lesson: the answers require data, but the answers aren't only in the data. The answers are also in the metadata, the things we know about the variables, and the causal background that gives us the motivation to study the phenomenon in the first place. So this brings me to a summary of this part of the material. The good news is: causal inference is hard, but possible. So, there was this Scottish fellow, David Hume, preserved here in bronze with his very shiny toe. You can go touch his toe. If you haven't done the visit yet, you should go touch his toe. His toe is shiny because people touch it all the time. Definitely go touch his toe; take a journey and touch his toe. So Hume, I would say, did more to sort out causal inference than any other thinker, and his major contribution was to say: look, correlation is not enough. Lots of philosophers thought it was, that if you experience enough, that suffices. His lesson was: no, experience is not enough. You have to make assumptions to make causal inference, and there's no way out of that. The whole approach to causal inference afterwards has been the Humean view of things. We haven't really solved Hume's problem here, but I think Hume would be pleased with what we've built, because we've taken this to a really elaborate logical extreme, so that we can be disciplined about what we do. Given a DAG, you can demonstrate whether a study is capable of making a causal inference, and that's a huge victory. Also, what I'm really excited about is that it shows experiments are not necessary to do causal inference. Experiments are great. Why do experiments work? Experiments close all the backdoor paths by setting variables' values. You play God and you set them, and that closes all the backdoor paths, because you remove all the causes on that variable. You're the only cause on that variable in an experiment; a good experiment, that is. A bad experiment, and lots of experiments are bad; we'll talk about that when we do instrumental variables. Most experiments are actually like instrumental variable problems, because your assigned treatment is only correlated with the actual treatment, right?
You have this intent-to-treat, which is the thing that actually happens. But we don't need experiments if we can use the backdoor criterion and other tools of do-calculus to figure out when causal inferences are legitimate. And this is great, because some things, especially in the human sciences, cannot be done experimentally. They're either impractical or unethical, or both. I study human evolution, and most of the questions I want answers to cannot be studied experimentally. Absolutely cannot. So if experiments were necessary, we'd be done; we should shutter the institute and move on. But we know that lots of progress can come from observational studies, and now we can be more disciplined about it. Also, the thing about experiments is that you have to choose an actual intervention, and this is a drawback compared to an observational study. Why? Because interventions change a lot of variables. They don't just change your target variable; they change a number of things. So imagine, for example, trying to experimentally manipulate obesity. You can't do it directly. There's no lever on a person that lets you adjust their body weight. You have to do it through some change in exercise or diet, and which of those you choose will have different causal effects on other things in their lives. You won't be studying obesity; you'll be studying the network of effects that ripple out from the intervention you chose. And that's true of all experiments. In an observational study, though, you can study obesity directly, if you can close the backdoor paths through other things, like activity. Does that make sense? This is a huge advantage of observational studies compared to experiments. Okay. As I mentioned, there's a lot more than just the backdoor criterion, and later in the course I'll give you examples of two things which are part of this framework of thinking about causal inference but which don't rely upon closing paths. They rely upon something I'll explain later: exploiting covariance structures. The first is called the front-door criterion. It's a case where we use the existence of a mediator to remove the confound between the exposure and the outcome. Psychologists have probably seen something like this before; it's called the front-door criterion in this literature. And the other is instrumental variables, which are super popular in both behavior genetics and economics, and I'll show you how to use those as well. I'll just leave them as mysterious, weird diagrams for the moment, like hieroglyphics, but I'll show you how to use them. Again, lots of assumption is needed, but you can break on through to the other side. All right, that was a long walk to that point, but I got there. You shouldn't be overconfident, though. I'm going to move quickly through this; there's a section in the chapter where I recite all of these cautions as well. You shouldn't get cocky. Assumptions are still necessary. DAGs are models; they're heuristic models. And then there's all this residual confounding stuff we'll talk about later on, like misclassification. What if you classify the outcomes wrong? That's a kind of measurement error, and measurement error is always there. You can put measurement error, and its extreme form, missing values, into a DAG. We'll get there near the end; I'm going to show you how to run measurement error models and get that in. This is important if, like me, you're an anthropologist: the record is devastated, and measurement error is just part of your life. Half of your measurement is error. Think about a carbon date; any archaeologists in the audience? A radiocarbon date, it's like all error. So you've got to deal with this and be responsible about it. Phylogenetic trees, what are those?
It's a big nest of error, and you just have to deal with these things, to be honest about it. If we can't get that information about uncertainty into the model, that's residual confounding. And the last thing I want to say: you shouldn't let DAGs stop you from making a real model. If you had a real model, you wouldn't need a DAG. What's a real model? It's a real causal model, a nonlinear dynamic system model. So, like, physicists don't need DAGs. Why? Because they've got real models. People who do pharmacokinetics don't need DAGs. Why? Because they have real models of the system, and those models have causal implications. They're not just multiple regressions, and they use those models as statistical models as well. And that's where we should be driving towards. If there's an astronomer in here: you don't need a DAG. You've got a real model of the system to make predictions with. That's what you want to drive towards, and push yourself past this dependency on the heuristic thing. Okay, and with that, let me talk about astronomy. All right, I think I have 30 minutes to give you the intro to overfitting, which is a companion set of problems to causal inference. So this is Nicolaus Copernicus, who was a Polish astronomer and a lawyer, I believe, an ecclesiastical lawyer, if I remember the history right. The best kind of lawyer. He's very well known, of course, for arguing for the idea that the Sun lies at the center of the solar system. And then, you know, there's this whole heroic story of a revolution in understanding and the overturning of the Catholic Church's views on things. And Galileo had this fight with them, but that was long after this. Copernicus publishes a book and dies. Much later, Galileo has a big fight with the church over it; that comes much later. What's usually missing when this story is told is that Copernicus's model was terrible. Just really awful. No better than the Ptolemaic model.
They made exactly the same predictions. They were using the same data, they made exactly the same predictions, and they used exactly the same Fourier-series approximation system. Copernicus used circular orbits. It was only Kepler, later, who realizes: oh wait, orbits aren't circular. We can solve the whole problem if we just allow ellipses. But that comes much later. Copernicus still believes in things like circles, the music of the spheres. There are going to be these circles out there in space. And if you're committed to circles, you can't make the solar system work unless you start stacking circles on circles, and so Copernicus had epicycles too. This part is left out because it makes it a lot less romantic. Think about it this way: this was not a huge victory. It was an equivalent model in terms of predictive accuracy, but with the sun at the middle instead of the earth. And so people were like, yeah, big deal. You're going to make me fight with the church over a model that just makes the same predictions? No, thank you. Now, of course, it's still an achievement to show that you've got equivalent systems. There is a way in which these models are different, however. Even though they make the same predictions, the Copernican model doesn't need as many circles. I forget how many fewer, but slightly fewer. You need more epicycles in the Ptolemaic model than you do in the Copernican model. It's simpler. And this is one of the things that Copernicus argued for in his book.
He says, yeah, it makes the same predictions, but it's simpler, and therefore it is more beautiful. There's this argument that simpler things are more likely to be correct, and scientists will often invoke this thing called Occam's razor, named after an eccentric monk, William of Occam. And having done, you know, a full ten minutes of research on this, all I could ever find for a quote was somebody else saying that he said this thing, in Latin of course, because that's what monks spoke: that plurality should never be posited without necessity. This is not really a fully developed scientific research program, and people invoke it heuristically. We need something more statistically substantial if we're going to really decide between models based upon their complexity. And I don't think the focus on simplicity alone is really sufficient. Usually when we're comparing models, we have to make trade-offs between complexity and accuracy. The Copernican case is misleading, because you've got two models which are the same in their predictive accuracy, their fit. They fit the solar system, the data we have at hand, exactly the same, but one is simpler than the other. Usually that's not what we're deciding between. We're deciding between models which are more complicated but make better predictions, hopefully, because they fit the data better, and models which are less complicated but fit the data worse. That's usually what we're trading off in science. And so Occam's principle is incomplete. It's only one side of this: prefer simplicity. But what about the accuracy part? Don't we care about that too? How much loss of accuracy should I be willing to tolerate for a unit of simplicity?
Ask William. Right, and William's long dead, so he will not tell us. So let me tell you another allegory. I want you to think about the voyages of Ulysses. Sorry, I was a classics minor, and so there's all this stuff bleeding out of my head from having taken years of Latin and Greek and read stories about monsters and stuff. Ulysses probably needs no introduction, so let's just get to the point in his journey where he fights a monster and a whirlpool. He gets near Sicily at some point, and there are these two monsters, Scylla and Charybdis, that eat most of his crew, I think the story goes. And here it is pictured. We have on the left Charybdis, the whirlpool Charybdis, which sucks ships down to their certain doom. You have to avoid that when you're sailing through this narrow strait. And then there's the many-headed monster Scylla on the rocks, which gobbles up sailors. You can see some sailors taking a wild ride right there. And I want to use this as an admittedly silly, but hopefully memorable, metaphor for how complexity and accuracy trade off. There are monsters on both sides, and they have different properties, and we can characterize those properties. We want to choose models which manage to navigate between these two monsters, but the monsters are always there. Information theory requires that they will be present. Before I get into the details of that, I want to say that in the wilds of the sciences, the standard method for choosing which variables you'll include in a regression is something that I want to call stargazing. It's called stargazing because you run a regression and then you pick out the asterisks, and you keep the coefficients you want. You want a model where everything is significant, right? And you should never do this again. You should not do this.
There is nothing about p-values that solves this problem. Now, you know I don't like p-values anyway, because they don't answer a question that I think is relevant. But whether you use them or not, the design of p-values is not to solve this problem. It's not what they're designed to do, and so they do a bad job at it. If you were going to choose a model that made good predictions, it would often include terms that are not significant, and a model that makes good predictions will sometimes exclude significant terms as well. Statistical significance is not a criterion about prediction. It's just not what it's about. It's about controlling the type one error rate, and that's it. It's not about predictive accuracy. And I probably shouldn't have to say this, but I will: five percent is arbitrary. It's like a scientific superstition that we care about five percent. It's bizarre. It just has to be because certain bony fishes had five cartilaginous rays in their fins, and they got up on the land, started walking, and turned into people. And so it's this, you know, five thing. I know it's bizarre, but it's true, and if aliens ever colonize the earth and save us, they will wonder about this. The bony fishes: the tyranny of the tetrapoda, I call it in the text. So don't stargaze, please. Okay, what are our goals for the rest of this week? I want to help you understand these two monsters. They're not actually called Scylla and Charybdis.
They're called overfitting and underfitting, and they always happen in statistics, and you have to worry about them. I want to introduce you to a phenomenon called regularization, which is the procedure of teaching statistical models to expect overfitting and guard against it. This is something that's done in machine learning nearly always, and done in the sciences only rarely, but we should do it more. Cross-validation and information criteria are tools you can use to cope with overfitting. They don't solve it, but they measure it. They help you estimate predictive accuracy out of sample, they estimate the overfitting risk of a model, and they help us understand the relationship between model complexity and the monsters on the cliffs. And I want to emphasize throughout all this material that finding a model that makes good predictions is actually a different task than finding a model that gives you valid causal inferences. You can have a nonsense model that makes really good predictions, but if you then decide to do an intervention designed upon your understanding of it, you'll be in big trouble. So think about how Netflix predicts your viewing habits. You may have heard about these competitions, where machine learning people tried to use the Netflix database to make good recommendations for people. No one understands how those recommendation systems work. Now, you may say that they're not very good. Yeah, I agree, they're not very good. Like Amazon too: why are they recommending, you know... you buy a lamp and they're like, would you like another lamp? No, I have enough lamps today, thank you. But that aside, nobody understands how those systems work. They're just big engines that spit out predictions. And that works: you can make good predictions without understanding anything about the system. In the basic sciences, we usually want to understand the system, because we intend to intervene. That's the distinction.
Nevertheless, these tools are very useful. They're companions to causal inference; they're just not the same thing. Okay, let me give you an introduction. I think I can excite you about this topic for your return on Friday. Think about a contest between different models, where these models include different structures and combinations of parameters and variables, as a race between horses. Each horse is a model, and on any given race, that is, any given sample, one horse will win: it will do the best, it will fit the best. And the distance between the horses gives us some information about the relative performance of these horses on average, across tracks. You want to make a bet on the next race, not this race. You just lost money on this race; okay, I'm sorry, but now how are you going to bet on the next one? The quantitative differences in the finishing times are information that you want to use, and this is a situation where all we've got to predict the next race is the performance on the current race, or the past races. It's imperfect information, but it's what we've got. And you can imagine that the finishing times won't be exactly the same on the next race. So what can you do with this? What you should not do is merely always choose the horse that ran the fastest.
Moving past the horses for a second, there's this basic problem with parameters: if you add a parameter to a model, it will fit better. It'll fit the sample better. There's an asterisk at the bottom which is incredibly important, and we'll come to it later in the course, but for all the models you've seen so far in this course, every time you add a parameter, the model will fit the sample better. Guaranteed. And so you cannot use fit to sample as a measure of anything useful. All it does is tell you how many parameters your model has. That's all it does. The basic problem here is that there are two things going on, two monsters. One is underfitting: models that are too simple don't learn enough from the data. Then there's overfitting: models that are too complicated, with a bunch of predictor variables that don't actually matter and extra parameters. They learn too much from the data. They essentially encrypt the sample with increasing accuracy. If you have one parameter for every data point, you can completely encode your data set. It'll be encrypted in a different coding scheme, and the model can spit it back out as predictions exactly. It's just an encryption scheme. That is not what you want, but in the limit you could do that: you could have a parameter for every data point. But it'll make terrible predictions, and that's what I want to show you. So what we say is that the goal is to learn the regular features of the sample. The regular features are the features that will generalize to other samples coming from the same process, and the struggle is to figure out what those regular features are. You can't just use the most complex model, because it will tend to learn irregular features of the sample as well. The asterisk at the bottom is that multilevel models do not behave this way: you can add parameters to a multilevel model and it will reduce overfitting.
In fact, that's why we use multilevel models. They're more complicated, but they're less likely to overfit, and that's why I always use them. Some of you know I have this project that I finished that has 27,000 parameters, and it overfits very little, because of all of this hierarchical structure that reduces overfitting. Okay. What's the problem with parameters? Let me give you a toy data set. It's real data, but this is a toy example to help you understand. Again, I'm an anthropologist, so the examples are going to be boring to you if you're not an anthropologist. Or maybe they're exciting: humans evolved once, right, and we're trying to figure out how that happened. One of the things that's interesting about humans is we have big brains. So if we look at body mass against brain volume for humans and other extinct bipedal apes, the hominins, there's this scatterplot here, for just a curated example of specimens: afarensis and africanus in the lower left, then habilis, boisei, rudolfensis, and ergaster, and then ourselves, sapiens, at the top. There is some association between body mass and brain volume. Bigger species tend to have bigger brains, but we're a real outlier on this. What's the statistical relationship? If you wanted to fit a model to this, what would you do? And how would you decide if the model is good or not? The most common, and abused, measure of how good a model is, is how much variance is explained. It's this thing called variance explained, R-squared. R-squared in a linear regression is just the relative difference between the variance in the outcome and the variance in the residuals: one minus the ratio of the variance in the residuals to the variance in the outcome. What's the variance in the residuals? Remember, the residuals are those leftover line segments after you make your prediction: there's the model's expectation, and there's the actual data point, and that distance is a residual. What's the variance in the residuals?
That's the amount of noise still to be explained. And then there's the amount that you had originally: that's the variance in the outcome. So if there's no variance in the residuals, R-squared is one, which means you explained it all. Congratulations, give yourself a hand. And what I want to show you is that you can always get to R-squared equals one. It's trivial. First I'm going to show you how to do it, and then you can be famous. I mean, this is a bit of a joke, but I've seen Nature papers where people have displayed regressions with R-squared equal to one in a graph, in a Nature paper. So there's a lot of statistical naivete out there, even in the best journals. Okay, so let's start with something simple: simple linear regression. Actually, I don't think this is a bad model. This is a place I would start. I mean, I can do a little better, because there's going to be allometric scaling here. We've talked about that, actually: the solutions to the second week are up, and I talk about allometric scaling in the solution to one of those. Hopefully you'll find it fun. I assert that people are cylinders, and then you can figure out a lot about the relationship between weight and height from that fact, just from the equation for the volume of a cylinder. Okay, so linear regression here isn't bad. This is the linear regression between body mass and brain volume. R-squared is 0.5. We've already got half the variation in the data, just with a straight line. That's really good. There's something going on here, right? But why stop there? I showed you how to use parabolas. Now you're drunk on power, right? And so you turn down the lights and cackle and fire up the parabola, and what you get from this is even better. You've gained a few extra percent of R-squared now. You see it curves down now. It leaves us out, but it works better for the others, because it curves.
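To make the R-squared recipe concrete, here's a minimal sketch in Python with NumPy. The seven mass and brain-volume values are illustrative stand-ins approximating the hominin example from the chapter, not an authoritative dataset, and `np.polyfit` with degree one is just ordinary least squares for a straight line:

```python
import numpy as np

# Body mass (kg) and brain volume (cc) for seven species; these are
# illustrative stand-ins for the chapter's hominin example, not an
# authoritative dataset
mass = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])
brain = np.array([438.0, 452.0, 612.0, 521.0, 752.0, 871.0, 1350.0])

# Ordinary least-squares straight line: brain = b0 + b1 * mass
b1, b0 = np.polyfit(mass, brain, deg=1)
resid = brain - (b0 + b1 * mass)

# R^2 = 1 - var(residuals) / var(outcome)
r2 = 1.0 - resid.var() / brain.var()
# A straight line already accounts for roughly half the variance
```

Nothing fancy: all R-squared does is compare the leftover residual variance to the variance you started with.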
It does curve down for the non-human species at the end. You with me so far? There's no reason to stop at a parabola. There's nothing special about parabolas. We can go all the way to a sixth-order polynomial here, and then we run out of data points. Let me show you what happens as we progressively march through higher-order polynomials. Here are the two I just showed you, the linear and the second-order: we go from R-squared 0.51 to 0.54. Put in a cubic. Cubic polynomials can turn twice. Good times. This is even better: now we're up to 0.69. This is pretty good. We got a big jump here when we put in the cubic. Probably it's cubic! Brain evolution is cubic. I mean, you get heavy enough, your brain collapses. Right, that's exactly what happens. And what about the fourth-order polynomial? This can turn three times. Now we're up to 0.82. This is even better. Wow, actually, you get heavy enough, it's going to skyrocket. We should try to get heavier. Obviously, this is absurd. This is an exercise in absurdity. We go up to the fifth-order polynomial. This is getting really good now: R-squared 0.99. We've almost passed through every point exactly. Almost: there's just one little point down there, you see, that's just off the curve. And then finally we reach nirvana, the singularity. All variance has been explained, because we have a parameter for every data point. R-squared is one. You can publish this in Nature. Right? No, of course you can't. This is absurd, and I show you the zero line here to show you that there's nothing stopping this thing from predicting negative brain volumes. Clearly you wouldn't accept this, but if all you do is choose models based upon R-squared.
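The march through polynomial degrees can be reproduced in a few lines. Same illustrative stand-in values as before; the predictor is standardized so the high-order fits stay numerically stable. In-sample R-squared never goes down as the degree climbs, and with seven points the sixth-order polynomial interpolates every point exactly:

```python
import numpy as np

# Illustrative stand-in values for the seven-species hominin example
mass = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])
brain = np.array([438.0, 452.0, 612.0, 521.0, 752.0, 871.0, 1350.0])

# Standardize the predictor so high-order polynomial fits are well conditioned
ms = (mass - mass.mean()) / mass.std()

def r_squared(degree):
    """In-sample R^2 for a least-squares polynomial of the given degree."""
    coefs = np.polyfit(ms, brain, degree)
    resid = brain - np.polyval(coefs, ms)
    return 1.0 - resid.var() / brain.var()

# Degrees 1 through 6: R^2 only climbs, and the degree-6 polynomial
# (7 parameters, 7 data points) drives it all the way to 1
scores = [r_squared(d) for d in range(1, 7)]
```

The climb to one is guaranteed by the parameter count alone; it says nothing good about the model.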
This is the trap you're walking into. It's obvious here, because you know what the data mean, and you would never do this in this case. That's why I use it as a lesson. But in a bigger data set, you can't look at the data this easily. In a multiple regression it's not going to be obvious what's going on, and the same hazard exists. This is the problem with parameters. This is the monster Scylla. You're walking into its jaws here. So let's think about underfitting and overfitting in a principled way now. What does underfitting mean? I'm going to use it to mean that the model is insensitive to the details of the sample. It's overly insensitive; that's why it's underfitting. So let's take the linear regression, for example. We can repeat the linear regression, deleting one species at a time, and draw a bunch of regression lines on the graph. And that's what I've done here. I give you the code to do this in the text; it's easy to do. We've got seven species, so we can get seven regression lines, leaving each one out, fitting on the other six, and drawing a line. We see the lines don't move very much. There is a line that deviates a lot: when we drop Homo sapiens, that's the line at the bottom. When you drop Homo sapiens you get a big drop, but the other ones don't make much difference. This model is very insensitive to the sample. That's underfitting. Well, we don't know if this model is underfit, actually, but this is how underfit models behave. They don't care about the details of the sample, because they're not learning much from it. Maybe too little. Overfitting is the opposite. This is when the model is incredibly sensitive to arbitrary details of the sample, like this fifth-order polynomial.
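The drop-one-species-at-a-time procedure just described can be sketched directly. Same illustrative stand-in values; the comparison is how much the seven refitted curves disagree for a straight line versus a fourth-order polynomial:

```python
import numpy as np

# Illustrative stand-in values for the seven-species hominin example
mass = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])
brain = np.array([438.0, 452.0, 612.0, 521.0, 752.0, 871.0, 1350.0])
ms = (mass - mass.mean()) / mass.std()
grid = np.linspace(ms.min(), ms.max(), 50)

def drop_one_curves(degree):
    """Refit once per left-out species; return each fitted curve on the grid."""
    curves = []
    for i in range(len(ms)):
        keep = np.arange(len(ms)) != i
        coefs = np.polyfit(ms[keep], brain[keep], degree)
        curves.append(np.polyval(coefs, grid))
    return np.array(curves)

# Worst disagreement (standard deviation across the seven refits) anywhere
# on the grid: the straight line barely moves, the degree-4 curve swings wildly
spread_line = drop_one_curves(1).std(axis=0).max()
spread_poly = drop_one_curves(4).std(axis=0).max()
```

The underfit model barely notices which point you removed; the overfit one reorganizes its whole shape around every individual point.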
I think this is the fifth-order polynomial, or the sixth-order. If you delete any one point, the whole curve just flies out of control, all over the place. This is why I tried to urge you not to use polynomial regressions back in chapter four: the whole curve moves whenever any parameter changes, and it just flies all over the place, and they're very hard to control. So this is a classic kind of overfit model. And what you want are regular models. You want the regular features of the sample. How do you achieve that, given we don't have crystal balls? There are multiple strategies, and you can use them all together. The first thing is some sort of regularization procedure, and in Bayesian statistics one way to do that is through the prior. Prior distributions, those prior predictive simulations I was forcing you to do earlier, regularize inference, because they're skeptical of impossible relationships. They create a regularizing force on your inferences. You can be even more aggressive than those priors and get more regularization, and I'm going to show you with some simulations how this works. In the non-Bayesian approach, you'll see things called penalized likelihood, which is mathematically identical, in the models you've been using so far, to using a prior. Machine learning people use those all the time. Scientists seem to have some problem with doing this. But why do machine learning people use it? Because it makes better predictions when you regularize, and that's why they always regularize. Cross-validation is what I just showed you. Cross-validation is a case where you drop observations, you fit on the remaining ones, and then you predict on the ones you left out. So you're testing the model's ability not to fit things but to actually predict things from the same process, and I'm going to show you how to do that as well. Cross-validation doesn't induce regularization, but it tells you if a model is overfitting, and
lets you compare models on a performance measure that matters. Fit to sample isn't what matters; it's fit to the future. You don't have the future, so you fake it: you leave out some data and you treat that like the future. It's not perfect, but it's certainly better than using R-squared. Information criteria are a theoretically based approach that estimates the cross-validation performance without actually doing the cross-validation. It's a way to use information theory to say, in theory, what the predictive accuracy of a model will be out of sample. And it works. It's amazing that it works as well as it does, and I'll show you how that's developed. These are things like the Akaike information criterion. Then finally, science. That's the good part. There's lots of science involved here: you need iterative learning in groups. That's science. Okay, I've got one minute, so let me get to a point where I can set up something exciting. The road we're going to journey down when you come back on Friday leads to cross-validation and the information criterion called WAIC, the widely applicable information criterion, which has replaced AIC. AIC is a hero of a past war and should be buried with honors, but it is completely replaced by WAIC now. The journey to these approaches requires some setup, because I want you to understand why we're making the choices that we do, and I'm going to give this to you in a fairly heuristic fashion. There's a lot more detail in the chapter. The first thing we have to answer is how we're going to measure accuracy, and this is not a small problem.
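The leave-one-out cross-validation described a moment ago looks like this in code: fit on six species, predict the seventh, and average the squared prediction errors. Same illustrative stand-in values as in the earlier sketches. The straight line, despite its worse in-sample R-squared, predicts left-out points better than the wiggly polynomial:

```python
import numpy as np

# Illustrative stand-in values for the seven-species hominin example
mass = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])
brain = np.array([438.0, 452.0, 612.0, 521.0, 752.0, 871.0, 1350.0])
ms = (mass - mass.mean()) / mass.std()

def loo_error(degree):
    """Leave-one-out CV: fit on six points, predict the left-out seventh,
    and average the squared prediction errors."""
    errs = []
    for i in range(len(ms)):
        keep = np.arange(len(ms)) != i
        coefs = np.polyfit(ms[keep], brain[keep], degree)
        errs.append((brain[i] - np.polyval(coefs, ms[i])) ** 2)
    return float(np.mean(errs))

# In-sample fit always improves with degree; out-of-sample error does not.
# The flexible polynomial encrypts the sample and predicts the future badly.
simple, flexible = loo_error(1), loo_error(4)
```

This is the performance measure that matters: fit to the faked future, not fit to the sample.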
There are lots of bad ways to measure the accuracy of a model, like receiver operating characteristics, things that you should not use. There is an actual gold standard for measuring the accuracy of predictions, and I want to motivate that for you first, because that's what we want to develop. And then, once we've got it, we have to measure distance to the target. That is, we've got a model, and it's an approximation of some true process that we don't know. If we've got multiple models and they're all approximating it, how do we decide how close they're getting? This is not a trivial question, actually, and I'm going to show you that there's, again, a principled way to answer it that comes from information theory. And then we need a practical way to estimate that distance, once we've decided in principle what we should be estimating, and I'll show you how to do that as well. And then I want to show you how to develop these instruments, like cross-validation and WAIC. So with that, I'm going to put up this slide and say: on Friday, I'll resume right here with a crash course in information theory. You'll love it. All right, thank you for your attention, and I'll see you on Friday.