What's this thing they do with the Hollywood movies now where they do a teal shading over basically the whole environment? I can't not see it now in every Hollywood movie. Okay. Hello everybody. Welcome back. We've got a lot to do with the causal terror and finishing up chapter five, so I'm going to get right into it. When I left you last time, we were looking at ways to plot posterior predictions from a multivariate regression. To remind you, there are lots of options, and you're not constrained by anything except your imagination and the bounds of decency. So our goal with posterior predictions is to see what the model thinks about cases we have observed, the ones we used to train the model, and potentially cases that we have not observed. Usually when we do the posterior predictions, we're looking at what the model thinks about the cases you used to train the model. You'll notice it doesn't exactly predict them. That's not bad. This is worth saying. It is trivially easy to make a model that will perfectly predict the sample. That's really, really easy. How? Give it a parameter for each data point. You're done. Right? But that's not our goal. Our goal is to do better out of sample. And so we're going to spend all of next week on that particular problem and that framing. But for now I just want you to keep in mind that if you notice that the model is not exactly retrodicting (that is the word) the sample, that doesn't mean something's wrong. But the pattern of error in the retrodictions can inspire your imagination as a scientist, given your contextual knowledge about the topic. So that's the idea. When we get to multilevel models later on in the course, the whole point of multilevel models is that they do not fit the sample as well, and as a consequence, they make better predictions out of sample. So you actually expect them to have a particular pattern of mis-retrodiction on the sample. That's a feature, not a bug. So we'll come back to these issues as we go. So when I say check model fit, you don't want it to be exactly right. But the pattern of mismatch, of mis-retrodiction, is informative. And sometimes you'll notice that it's completely, totally off, and it means that your computer did something wrong, or you did something wrong, or both. That's part of what we're diagnosing. There are malfunctions. And we want to keep all of the uncertainty in this, so we use the whole posterior distribution to make predictions. Predictions have uncertainty, and that uncertainty arises from the uncertainty in the parameter estimates, which is embodied in the posterior distribution. So it's the same story as always. Take samples from the posterior. For each set of samples, compute a prediction. That gives you a distribution of predictions, and that's what you want to put up on the graph. That's a rehearsal of what we did on Wednesday. Okay. Here's another way to plot the posterior predictions, the posterior predictive check, which will get us to the same lesson about some states. So now what I'm looking at, these are the distributions of residuals for each state — posterior distributions of residuals for each state — and I've ordered them by the magnitude of the residual. So what does this mean? Think about states that have a so-called negative residual. This is a case where the state has way less divorce than expected by the model. These are the states at the bottom of this graph.
I'm showing you here where each row — this is another one of these so-called caterpillar plots, or dot-and-line charts, something like that — each row is a state, and the horizontal axis is just the magnitude of the residual. You'll notice there are error bars on this because there's actually a distribution; these are predictions. And so at the very bottom we have Idaho, which has way, way less divorce than the model thinks. Why? Because the model didn't know something about Idaho that everybody who has lived in Idaho knows, which is that there are a lot of members of the LDS church, and divorce is difficult in that cultural environment. There are other states which have less divorce than the model thinks as well, like New Jersey and Minnesota, North Dakota, and Connecticut. And Utah actually comes in down there, at around the same as New Jersey. There are positive residuals way up at the top as well. So Maine has way more divorce than the model expects; that's way up at the top. Each of these is some interesting feature, and if you've lived in any of these states, you probably have a rich story you could tell about why that's true — but then you'd have to test it against new data. And I want to caution you, of course, that there are a whole host of dangers in this exercise of explaining everything, and you do have to exercise some self-restraint about that. This topic will come up again and again in the course. If you've got a fixed data set and an infinite amount of time to make models of it, eventually you will find something interesting, but it will probably not be real. This will come up next week — I call this the Curse of Tippecanoe in the book — that there's always a pattern in every sample, and that doesn't mean it's real. So if you stare at a pattern of numbers long enough, you will find religion in it. This is what human societies do. And science is a human society, and it will find religion in every sample. So you have to guard against your natural human talent for superstition to some degree. And we'll talk about how to do that. I think the first-order injunction, though, is just to be honest about what stage you're at. Where did this hypothesis come from? Did it come from the data set by looking at posterior predictions? Or was it something that came from an external theory before you saw the sample? Just being honest publicly about that distinction is the first step, right? Otherwise, you risk doing these things that psychologists will recognize — because psychology is much more mature about this than biology at the moment — things called p-hacking and HARKing. HARKing is hypothesizing after results are known. HARK. It sounds like... right? It's the sound of HARKing. It can get you promoted. So, final example: correlations remain in the residuals in this model. It continues to be true, interestingly, that waffles per capita is correlated with divorce after accounting for all the other predictors. So you take the residuals plotted on the right here and you plot them against Waffle Houses per capita, which I will assert is proportional to waffles per capita. That may not be true. There is still a positive correlation, with almost all the posterior probability for the slope being greater than zero. I still continue to believe that Waffle Houses do not cause divorce. So there's still something else that has to do with the historical pattern here, right? Some historical accident going on.
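By the way, if you want to build a residual plot like the one a couple of slides back, here's a minimal sketch, assuming the rethinking package and something like the chapter's divorce model — the priors and variable names follow the book's m5.3, so treat this as an illustration rather than the exact lecture code:

    library(rethinking)
    data(WaffleDivorce)
    d <- WaffleDivorce
    d$D <- standardize(d$Divorce)
    d$M <- standardize(d$Marriage)
    d$A <- standardize(d$MedianAgeMarriage)
    m <- quap(
        alist(
            D ~ dnorm(mu, sigma),
            mu <- a + bM*M + bA*A,
            a ~ dnorm(0, 0.2),
            bM ~ dnorm(0, 0.5),
            bA ~ dnorm(0, 0.5),
            sigma ~ dexp(1)
        ), data = d)
    mu <- link(m)                  # posterior samples of mu: one column per state
    r <- t(d$D - t(mu))            # residual D - mu for each sample, each state
    r_mean <- apply(r, 2, mean)    # mean residual per state
    r_PI <- apply(r, 2, PI)        # 89% interval of each state's residual
    o <- order(r_mean)             # order states by residual, as in the caterpillar plot
    # then plot r_mean[o] with r_PI[, o] as the horizontal error bars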
This is the danger: no matter what you control for, there may still be spurious associations, for a number of reasons. And we're going to spend the rest of the day talking about some of those additional reasons. Before we move on to the dangers of adding variables, which is what most of the day will be about, let me introduce you to something I'm very fond of, which is understanding statistical procedures and causal inference through synthetic data. By synthetic data, I mean fake data. But synthetic sounds a bit more noble, right? Or simulated data. And we do this so that we understand how a data set arises through some generative process, so that we can understand how our statistical methods do or don't work on it. This, in my experience, dispels a lot of superstitions about these things. So if we wanted to manufacture spurious association, here's what I think is the simplest example of it. This is in the book. It's box 5.15. We're going to simulate 100 cases where there are two potential explanatory variables. These are called X variables in applied statistics, right? And the joke in statistics is that you know someone's an applied statistician because they call the outcome variable Y. That's because they have a predictor variable called X. A theoretical statistician has exactly one variable, and it's called X. And that's how you know whether someone's a theoretical statistician or an applied statistician. That joke is only funny to statisticians, as it should be. But it's basically true. So you know I'm an applied statistician because I have X explanatory variables and a Y as an outcome. So we have two X variables, one I call x_real, and the other is x_spur, for spurious. I simulate x_real as just a random normal, a unit normal. So what causes this explanatory variable? That's not part of the model, but something causes it. And x_spur is partly caused by x_real, and this is meant to be shown by the path diagram at the bottom. There's x_real in the middle, and it influences x_spur, but there are other things that influence x_spur; it gets its own error. It's simulated as normals with a mean of x_real. So each value of the spurious explanatory variable is the x_real value plus a little noise, and that's how you generate it. So they're correlated, but they're not the same numbers. Makes sense? So x_spur has another cause as well, but one of its causes is x_real. And then Y is just like x_spur: it's correlated with x_real because each outcome value Y was centered on x_real, but then you added some noise by simulating a normal deviate from it. Makes sense? Yeah? So these path diagrams may or may not be familiar to you, but they're very common in causal inference, and structural equation modeling uses things that look the same but mean something slightly different. These path diagrams don't necessarily mean that things are linear, for example. They just mean cause. So now you put these in a data frame, and I encourage you to analyze this using your powers of multiple regression to see that the classic problem arises. You get spurious correlation. x_spur will easily predict Y, right? But if you put them both in, then x_spur gets knocked out. That is, its beta coefficient is close to zero, straddling zero. And this illustrates the case and shows you how it works.
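Here's the whole simulation in R so you can run it yourself — essentially the book's box, with lm() used as a quick check (quap versions behave the same way):

    N <- 100
    x_real <- rnorm(N)            # caused by something outside the model
    x_spur <- rnorm(N, x_real)    # caused by x_real, plus its own noise
    y <- rnorm(N, x_real)         # caused by x_real only, not by x_spur
    d <- data.frame(y, x_real, x_spur)
    summary(lm(y ~ x_spur, data = d))           # x_spur looks predictive on its own
    summary(lm(y ~ x_real + x_spur, data = d))  # together, x_spur straddles zero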
So there are a number of boxes in the book that go through these kinds of simulation examples. They're usually in optional boxes, but maybe they shouldn't have been optional. They're kind of the core of understanding how these methods work, and they help you understand the limits of statistical inference as well, because often you can easily construct causal scenarios where it's just impossible to figure out what's actually driving the outcome. It's not hard at all. Okay, so I think we've got one more example of a case where regression does save the day before we get to the cases where it blows up the world. So: masked association. This is a very common feature. It's a little different from spurious correlation. Often there is a real, meaningful causal association between some outcome and some predictor variable, or multiple predictor variables. But the problem is, if you don't have both of those predictor variables in your model, you can't tell that either of the causes is a cause. And the reason is that they have opposite influences on the outcome and they're correlated with one another in some way. So I'm going to show you some examples of this. I call this masked association. It's pretty routine in natural systems and even in experiments. It's easy to get in what you think are highly controlled laboratory experiments. It tends to arise, as I said, when there's some other predictor that's associated with the outcome in the opposite direction, and then the two of them in the system tend to cancel one another. But if you can measure them and they're not perfectly correlated, you can pull these things apart. You can also get it when there's just noise in the predictors; this can also mask association. So measurement is fundamental, right? And getting the measurements right is job one. Sometimes what the model thinks about the causes is just a matter of which variable has been measured most precisely, and then it's entirely your measurement technology that is influencing your causal inference, not what is actually driving the system. That's something to keep in mind. It's called residual confounding in statistics. I think it comes up again next week. Okay, so let me give you the data analysis example quickly from the book on this. Since I'm an anthropologist, I think a lot about the evolution of brains. And an interesting thing about primates is that they have pretty big brains for mammals — wastefully large brains, in a sense. And it's not clear what they do with them. This is a continuing mystery in the field. Endless back-and-forth debates about why primates have big brains, and there never seems to be any resolution, because it goes back and forth forever. Brains are puzzles because they're incredibly expensive. They require a lot of energy, especially to grow. So this may be one of the reasons that primates grow up so slowly: if you're going to have a big brain, you have to grow up slowly, because your body growth is slowed down by the fact that you're spending a lot of energy on the brain. So we can do comparisons across primates to see what the relationships are between brain size and energy demands. And a nice thing about primates is, since they're mammals, when they're young they get essentially all of their nutrition from their mothers, from milk. And so the composition of milk tells you something about the energy demands of the offspring.
So let's take a look at the associations among these things as an example of thinking about how we deal with inferences about associations and causal relationships among variables. So here's one of my favorite primates, a lemur. The milk energy is about 0.5 kilocalories per gram of milk. You'll have comparisons to that in a second. And one of the ways that encephalization is measured in mammals comparatively is to talk about the proportion of the brain mass that is neocortex, which is, in many people's estimation, the part of the brain that's interesting. Why? Because humans have a lot of it. Let's face it, that's the only reason we measure it: humans are different from the other mammals in that we have a lot of this neocortex. There are lots of mammals with bigger brains than us, but proportionally they have less neocortex. So we get excited about neocortex because we have a lot of it. That's basically it. It's not quite that shallow, but it's almost that shallow. But these are the data available. And it may be right that there's something interesting about neocortex. I'm using it right now. So 55% of the lemur brain by mass is neocortex, which is massive for a mammal, by the way. A really extraordinary amount of neocortex for a mammal. In humans, we're 75% neocortex by mass, and our milk is slightly more energetic per gram, 0.7 kilocalories. Pretty energetic milk. And then another one of my favorite monkeys — I've chosen all my favorite primates here — Cebus: very energetic milk, more energetic per gram than humans, slightly less neocortex, about 70% neocortex by mass. So if you collect a big sample of primates and you know these variables, you'd like to say, okay, is there something real going on here? Do species that have more neocortex for their brain mass need more energetic milk? Is that what the evolutionary pressure is? Is there some comparison across species that can support this hypothesis? So let me give you sort of the minimal analysis here, with the three variables that you'd have to have to look at this. These are all data that come from a colleague of mine, Katie Hinde, who's now at Arizona State. She put all this data together by combing through various other published papers, and then she just kindly emailed it to me and I put it in my book. So Katie did all the hard work and then I made some terrifying graphs. So this is what's called a pairs plot. And this is useful for looking at the bivariate associations between variables in data sets. So in the top row, you're looking at kilocalories per gram on the vertical axis against the other two variables on the horizontal axis. You'll see right away that if you just look at the bivariate scatters for the variable of interest — we're going to try to explain the energy in milk — there doesn't seem to be much. It's just like a random cloud against the other two variables, which are the log body mass — you need to control for this because obviously milk energy is going to be used for growing the size of the animal as well, so you'd think you might want to control for that — and neocortex percent, which is the variable of interest here as an explanation. There's not much in either bivariate relationship. Now I want you to notice that log body mass and neocortex percent, however, are pretty strongly correlated across primates. You don't need statistics to tell you that, probably. You'll see that there's a nice diagonal. Trust your lying eyes. In this case, they're correlated.
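The pairs plot itself is one line once the data are loaded — a sketch assuming the milk data that ship with the rethinking package:

    library(rethinking)
    data(milk)
    d <- milk
    d$log.mass <- log(d$mass)   # body mass on the log scale
    pairs(~ kcal.per.g + neocortex.perc + log.mass, data = d)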
So what happens when we start doing models? Well, that was the animation to highlight those two. Sorry — I have all this fancy animation and I forget that I do it. So we have the top row, and then we look at the correlation between the two. Sorry. So, a little bit of statistical housekeeping to worry about. I put this in here also to caution you about lots of automated regression tools. There are missing values in these data. Those of you who have done science for very long have encountered this. There are often missing values. If you do questionnaires, sometimes people just don't answer some items on your questionnaire. Why? They skipped it. They were thinking about something else. They were offended by your question. Whatever. In this case, it's because for a number of the species, no one's ever measured their neocortex percent. We just don't have the data. This is true for a lot of lemurs — no one seems to care about them as much. I love lemurs, but other people seem to think that they're boring. They're wonderful animals. So we don't have as complete a sample for them as we do for, say, the apes. Apes are equally wonderful, but people love them way more. So we have good data for apes, bad data for small-bodied primates. So what do you need to do? What we're going to do right now is drop cases with missing values. At the end of the course, we're going to return to these data, and I'm going to show you how to do something better than that. But for now, we're just going to drop them. I'm making you do this manually in this course. Most automated tools, in R or any other program, will do this automatically for you, and silently. You will get no warning that there have been cases with missing values; it will just return results for you. This is really dangerous, and we'll talk a little about why next week. But the first order of danger is that you would like to know that this happened, because it changes the sample size dramatically. And then people start comparing models with some variables added and removed, with different numbers of cases, and now you're not comparing the same data sets. Lots of really bad inferences can be generated that way. It's a historical mistake that for some reason applied stats software decided it would just automatically delete cases with missing values and not tell the user about it. It's a really bad historical mistake. So I'm here to complain endlessly about this thing. We'll come back to it. And again, you can do better. It's not hard to do better than this. But the lesson for the day gets across anyway. So here's just to show you that in the bivariate scatters, there's some relationship, but they're not particularly strong. On the left, kilocalories of milk energy against neocortex percent: slightly positive, but look at that bow tie. Right? Really nothing to bet on there. This would be worth publishing just to say, look, there doesn't seem to be anything going on. And that should be in the literature. That's the sort of thing you want to say. Log body mass has a weak and highly uncertain negative relationship with milk energy in this data set. It's driven by, as you can imagine, some of the heaviest species — the great apes are the ones driving that line. So now what happens when we put both of these variables into a model?
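Here's a sketch of both steps — dropping the incomplete cases manually, so you actually see the sample size change, and then the multivariate model. The standardized variable names and the priors follow the book's version of this model, so treat the details as one reasonable choice rather than the exact lecture code:

    dcc <- d[complete.cases(d$neocortex.perc), ]   # drop species with no neocortex measurement
    nrow(d)     # 29 species in the full data
    nrow(dcc)   # 17 species remain
    dcc$K <- standardize(dcc$kcal.per.g)
    dcc$N <- standardize(dcc$neocortex.perc)
    dcc$M <- standardize(log(dcc$mass))            # log mass pre-computed
    m_milk <- quap(
        alist(
            K ~ dnorm(mu, sigma),
            mu <- a + bN*N + bM*M,                 # two slopes, two predictors
            a ~ dnorm(0, 0.2),
            bN ~ dnorm(0, 0.5),
            bM ~ dnorm(0, 0.5),
            sigma ~ dexp(1)
        ), data = dcc)
    precis(m_milk)   # both slopes move away from zero, in opposite directions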
And what I want to show you is that both of those relationships get stronger, owing to the fact that the two predictors are correlated with one another and they have opposite effects on the outcome. This is the classic masking problem. So you need both of them to get an inference about the associations. So this is the multivariate regression. All we're doing here is just like the previous one. The action is in the linear model: there are now two slopes and two predictor variables. One is neocortex percent, that's N sub i, and the other is log body mass, M sub i. Yeah. And in code, this is what it looks like. Probably not confusing. Notice I've pre-computed the log mass. You could also put the log function in there and it would work, but I prefer to do the calculations ahead of time. And then we get results. And what are these results? Well, as usual, I'm not going to ask you to stare at this table of coefficients, because that's the worst way to understand what the model thinks. Instead, we should visualize the predictions. It's much better. So let's do that, and let's compare them to the bivariate models from before. The bivariate regressions are shown in the top row of this slide. And now, counterfactual predictions — something I introduced on Wednesday — for the multivariate model. These are cases where we hold the other predictor variable at the average across species and then vary the predictor variable of interest. So for neocortex percent, we imagine a bunch of impossible primates who all have the same body mass, right? And then we adjust their brains. So these species have never existed and will never exist. This is why it's counterfactual, but it tells you what the model thinks. That's how you visualize the slope. So now, this relationship has gotten a lot stronger. Why? Because body mass has an equally strong and opposite relationship with the outcome, and these two variables are correlated with one another. So they disguise one another's effects. To properly measure the association of either with the outcome, you need them both in the model. So this is a victory for multiple regression. This is the building up — we're going to bring it all down in a few slides. But this is a genuinely good thing that multivariate models do for us, because the world is complicated, and these sorts of things are routine in natural systems. All of this stuff co-evolves in species. And you can't do experiments on this. You can't rapidly grow new lemurs in petri dishes with any correlation of traits you like. These sorts of relationships are kind of immune to experimental manipulation. So there's no way to design yourself out of this problem. You need theory to nominate how these covariances might arise through evolutionary processes, and then you look at them jointly inside these models. Does this make sense? Yeah? Okay. I really like this dataset, by the way. I don't know if anybody else finds milk and body mass exciting. But milk is a really amazing adaptation. It's one of the reasons mammals have ruled the world, right? After the dinosaurs went extinct. It's this portable energy source that lets you raise vulnerable offspring in all kinds of habitats. It's a really great thing. So as an evolutionary biologist, I find it exciting and I want to convey some of that through these sort of boring sets of numbers. Milk. It seems kind of boring and institutional.
But it's actually an exciting evolutionary problem. So this is just to show you again how we make this plot. The code's in the book, and I encourage you, when you go through the chapters, to execute this code and get a feel for it. Here I've not done the shading; I've done the dashed lines instead, which is often a more visually pleasing view, especially when there are lots of data points you want to superimpose as well. Okay. Let's move on to another synthetic example. These masked associations — you can synthesize them as well, in a box in the book, which I repeat here to show you how to do it. I'm skipping over the code, which just instantiates the masked association as before (there's a sketch of it at the end of this section). You have two predictors. They both influence the outcome. They're correlated with one another. You get a masking effect. You want to put them both in the model. So this is a synthetic data set that illustrates the problem. It's demonstrated by the path diagram on the bottom. X positive is positively associated with the outcome; X negative is negatively associated with the outcome. This is sufficient to give you a masking effect. And this will happen in all kinds of natural systems. Okay. So I've been building up regression so far, but I keep promising you that there are bad things about it as well. It's very dangerous to just add predictors. The story so far is: oh well, I've got predictors, I'll just keep adding them and see what happens, and that'll help me uncover the truth. If there's a model that's got more stuff in it and has a higher R-squared — we'll talk about that next week — then that's the truth. Unfortunately, it's not like that. You can find really powerful associations, with slopes far from zero, where the posterior probability is all above zero or below zero, and it can be a lie. It can be a total lie. It can be totally spurious. And putting extra stuff in the model is not necessarily a guarantee of avoiding these problems. I want to help you by showing you some examples of that as well. And to anchor your emotions about this, maybe it's useful to think about regression as a kind of oracle. And we want to consult the oracle. It can see things that we can't. That's the Oracle of Delphi, by the way, on the right — a famous painting. Regression does magical things. The oracle of Delphi, if you will. The oracle of Gauss, whatever. What I find most amazing about it is that it has this way of automatically focusing its attention on the most informative cases, conditional on the model. And it will automatically ignore the ones that aren't informative. That's great, because I find that very difficult myself, just looking at data tables. This is a really great service that it provides. However, it's not a kind oracle. It's like all those demons and genies in classical folktales: the problem is that you have to phrase your question or wish very, very carefully, because if you do it wrong, they will mess with you and they will burn your kingdom to the ground. And that is what regression does. If you do not ask exactly the right question and understand the implications of your question in a precise way, the problem is it will answer it exactly. And that exact answer will burn your kingdom to the ground. So let me show you some examples of that. This is the summary of the caution. You don't want to just add everything. That's why the sink is there: this is "put in everything including the kitchen sink." This is nearly always a bad idea.
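Here's the masking simulation I skipped over a moment ago — essentially the book's box, again with lm() as a quick check:

    n <- 100
    rho <- 0.7                                     # correlation between the two predictors
    x_pos <- rnorm(n)                              # positive influence on y
    x_neg <- rnorm(n, rho*x_pos, sqrt(1 - rho^2))  # correlated with x_pos, negative influence on y
    y <- rnorm(n, x_pos - x_neg)
    d <- data.frame(y, x_pos, x_neg)
    summary(lm(y ~ x_pos, data = d))           # alone, each slope looks weak
    summary(lm(y ~ x_neg, data = d))
    summary(lm(y ~ x_pos + x_neg, data = d))   # together, both slopes show up clearly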
One of the first reasons, which I think is well appreciated, but I'll show you some examples of in the remainder of today, is something called multicollinearity: predictors that are highly correlated with one another. It makes the model hard to interpret. There's also something that I think is way less intuitive and less often communicated, which I also want to give you an example of today: these things called colliders. I beg your indulgence; I'll get around to explaining why they're called colliders. This is the standard term in the literature. If you Google "causal inference colliders," there will be lots and lots of papers on this. It's big in the medical literature. There are other reasons, which I won't say as much about today but which will come up again and again: with models with lots of predictors, you lose interpretability. They may be really good at predicting, even, but it's not clear that we learn anything as basic research scientists from them. I think this is a basic antagonism that's hard to escape. It may be true that there are great machine learning techniques that can make really good predictions, even out of sample. But we don't understand how they work, and so as scientists we don't learn anything about the system from that. My job — I think the public pays me to understand how the universe works. Maybe that's a bit elevated, but they pay me to understand where humans came from, literally. I think that's what it says in my contract. So it's not enough just to make predictions about which lemur has more milk energy. We have to understand things. That's the whole point, and this is the hardest thing about it. Interpretability matters. You can also lose precision, and next week, as I keep saying, it's all about this thing called overfitting, which I think is the most important problem in applied statistics. So let's examine the first of these issues, multicollinearity, with a synthetic example. Well, it's sort of synthetic — the phenomenon is real. The data are fake, but it's an example we could do if I brought a measuring stick in here, with the people in this room. It would just take too long to collect the data. But we could do it with your data, and it would work out. Guaranteed. I used to do this when I taught undergrad stats. I would have the students measure their legs and then we'd do this example. It just takes a while, because once you get young, excitable people with tape measures touching one another's legs, everything goes downhill. I hadn't thought about that before it happened. So anyway, that aside, I love this example. It always works. So let me try to deliver on that promise here. Three variables, which I assert are strongly associated with one another. I wouldn't say they're necessarily causal — the joint cause of these is something we haven't measured, which is common growth factors — but they're definitely associations. So I assert that if you know the length of someone's left or right leg — it doesn't matter which — you can predict their height with very good accuracy. It's a fantastic variable. In fact, legs are better than arms for this. However, it turns out that if you run a regression where you put both legs in the model, it looks like neither predicts height. And this is an interesting feature of regression that will serve as a caution for you in interpreting regression output, and help you understand what regression models mean. It's a reminder of what we did on Wednesday. So here's the synthetic data.
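Here's a sketch of the simulation and the model, following the book's version of this example (the priors are the book's; the exact numbers don't matter much):

    library(rethinking)
    N <- 100
    height <- rnorm(N, 10, 2)                 # total height of each fake person
    leg_prop <- runif(N, 0.4, 0.5)            # legs as a proportion of height
    leg_left <- leg_prop*height + rnorm(N, 0, 0.02)   # each leg gets its own
    leg_right <- leg_prop*height + rnorm(N, 0, 0.02)  # small error, so they differ slightly
    d <- data.frame(height, leg_left, leg_right)
    m_legs <- quap(
        alist(
            height ~ dnorm(mu, sigma),
            mu <- a + bl*leg_left + br*leg_right,
            a ~ dnorm(10, 100),
            bl ~ dnorm(2, 10),
            br ~ dnorm(2, 10),
            sigma ~ dexp(1)
        ), data = d)
    precis(m_legs)   # bl and br: huge, opposite-signed, wildly uncertain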
These are fake individuals where there's a proportional relationship between leg length and stature — or height, as it's being called here. But the legs aren't exactly the same length, because that's true of real people: I've salted in a little error. This was the thing with getting undergraduates to measure one another's legs — it takes a little while. They'd think it was measurement error. But no, for most of us our legs are not exactly the same length, and someone would have a crisis about that, and you'd have to reassure them that it's normal. And so on. Again, things I hadn't thought of before I started the lesson. So here's the Bayesian regression. I'm not going to talk about the code again — you guys are pros at this already, right? You're masters of writing out these awful math models. They're awful, but you really understand the assumptions now that you type them all out. They're right there. So ignore the table for a second. Let's just look at the dot-and-line chart at the bottom, which shows what's called the marginal posterior distribution of each parameter. Marginal because it averages over all the other parameters. So you're just seeing the side-view silhouette of the multi-dimensional hypersphere that is the posterior distribution. Yeah. And the intercept's not really interesting to us in this story, but look: bL and bR, the slope for the left leg and the slope for the right leg. Giant. One's in one direction, one is in the other, and there are huge uncertainty intervals on both. What's going on? You know that you can predict a person's height from their leg length. So why, if you fit a model with both in, isn't that better than a model with one? The answer: this is the oracle. The oracle is telling you the answer to exactly the question you asked, and the question you asked is not the one you really wanted to ask. Now I'm being an oracle, right? So let me unpack that for you. What multivariate linear models do is answer this question: what is the value of learning the left or right leg once you already know the other one? That is the question you actually asked when you programmed this model. That's what it means to put both the left leg and the right leg in a multiple regression: to ask, once I know your left leg, is there any additional value in learning your right leg? And the answer is no. Not really. Just a tiny amount. And that's simultaneously true of the other leg. So the model tells you, well, you know, I don't know — there's huge uncertainty about either one. It's answering a specific question, and what I want to show you is that what it can answer precisely is the sum of the two. The reason is — this is the posterior distribution of the slopes — you can see there's what we call a ridge. These two parameters are really strongly correlated with one another. So what the model is telling you is: well, if the right leg is important, the left leg isn't. If there's a positive effect of the right leg, then there's a negative effect of the left leg — it almost exactly balances, in fact — and vice versa the other way. The only thing that you can estimate in this model is the sum of these two slopes.
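You can check this directly from the posterior samples — a sketch reusing the m_legs model from above:

    post <- extract.samples(m_legs)
    plot(bl ~ br, post, col = col.alpha(rangi2, 0.1), pch = 16)  # the narrow ridge
    sum_blbr <- post$bl + post$br
    dens(sum_blbr)                    # the sum is tightly identified
    precis(data.frame(sum_blbr))      # even though neither slope alone is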
The sum of the two slopes is very precisely estimated, and that's why there's a very tight ridge: the data do determine the sum, but they don't determine either one. The reason is that there's an infinite number of combinations of any two numbers that will give you the same sum, and so all the model can do is say: well, you asked me to consider an infinite number of combinations of these slopes. I did that. Here are the highly plausible ones, and they form a thin line in the posterior distribution. And then you asked me to plot the profile of each, and I did that, and they look like this. But they form a narrow ridge of high correlation with one another, and that's the thing you have to notice. So when you see coefficient tables like this, with these really wide intervals, you might want to look at the bivariate relationships between the parameters and see if you've got some mutual information issue like this. So in the lower right here, what I'm showing you is the density of the sum of bL and bR. We just take the paired samples from the posterior distribution for both of them and then plot the density of that sum. This is really well identified. It's positive, and it's exactly right — I know because I faked the data, so I know what it's supposed to be — and it has a very tight credibility interval. The sum is identified, because that's all the model can do. Once you learn one leg, you don't need the other one, right? You just need one of them. That's the idea. So, algebraically — we won't do a whole lot of algebra in this course, but here it's really useful — what's happening is that the two X variables are essentially the same X variable. So let's just rewrite the model that way. This is the model you programmed into the computer: there's just one explanatory variable, x sub i, and you've given it two coefficients. So what this means algebraically is that the mean you've asked for is mu_i = alpha + (beta_1 + beta_2) * x_i. The sum of the coefficients is identified, but each one is not. It can't be, because there are an infinite number of combinations of beta_1 and beta_2 that will give you the same sum. The sum can be identified, and it has been. Where is it? It's that ridge. That's the sum. But the individual coefficients can't be identified, and that's what the model was asked for. This happens a lot in applied regression when you just start dumping things in, because you'll have highly correlated predictors which all stem from some common cause, and this arises. I first encountered this in ecology, because field ecologists would come back with data sets and ask me what to do with them, and they would have a bunch of stuff that was just driven by moisture. So they've got plots and they're measuring things, and there's a whole cluster of variables that arise just because it's moist in that spot or not. And any one of those variables has about the same information as any of the others. You just have to choose one. You put them all in the model, everything has giant confidence intervals — and of course it does. Moisture is one of the most important variables. Okay. I just said that again. This is just a summary. The model did what you asked. If you plot the leg total on the horizontal axis against height, you can see it's super predictive. The total length of your legs is a great predictor of your height. You want to think about it that way. Does this make sense? Okay. Back to milk energy, very briefly.
There's multicollinearity of this kind — not as severe, but of this kind — lurking in the milk energy data set. Let's look at two other variables in the data set, leaving body mass and neocortex aside: milk energy, and then components of the milk energy. Milk energy comes from different sources in milk. It can come from fat and lactose. Those are the main sources. Milk is mostly water, with fat and some other things — micronutrients that are very important too. But the energy, the kilocalories, come from fat and lactose, and different species balance them in different ways. What I want you to see here is that milk energy is definitely associated with both of those variables. It's most strongly associated with fat, which has higher energy density — no shock. This is why butter is so good. Butter, ice cream, all your favorite things in the world. Yeah. Lipids. And they burn bright. But you'll notice that across species, fat and lactose are really strongly correlated with one another, because they trade off. If you put in more of one, you have to take out some of the other. There's a limited amount of energy the mom has to build the milk with. So I'm not going to belabor the details of this example — it's in the book. If you ask about either of these, fat or lactose content, against the outcome, they're strongly related — one positively, one negatively — with milk energy in the bivariate sense. If you put them both in, they have much weaker relationships. It's not that it goes away totally, but if you put them both in the same model, your inferences about the influence of each are greatly moderated, for exactly the same reason as with leg length. This is left leg and right leg, except they go in opposite directions, because you can only put more fat in the milk by taking out lactose — at least, mom takes out lactose, not you — and vice versa. But it's structurally like the leg length example. So there's a whole section about this in the book, and I encourage you to take a look at it, but none of the modeling technology is new. Does this make sense? Yeah? Okay. Another thing to worry about: post-treatment bias. This is another reason not to just throw things into models. The language of post-treatment bias comes from an experimental metaphor, but this is dangerous not only for experiments. In fact, I think it's super dangerous in cases where we don't have experiments, because we don't even know what the treatment is. We just have to label something as the treatment. It's a variable of interest, but we don't actually manipulate things in natural systems in the same way. So let me walk you through the experimental version — but those of you who don't do experiments, you should be paying close attention. So here's the headline, as with all these examples: if you thoughtlessly add predictors, it's a bad idea. And post-treatment bias is a case where you control for a consequence of your treatment. What this does in a multiple regression framework is knock out the influence of the treatment, because of course the outcome is more closely associated with the consequence of the treatment — the way the treatment actually affects the outcome is through a mediator. And so if you control for the mediator, the treatment has no apparent effect. But of course the treatment did have an effect, because it created the mediator. So you don't want to control for consequences of the treatment, because that stops you from inferring the effect of the treatment. I'll say that again.
You don't want to control for consequences of the treatment, because it stops you from figuring out whether the treatment actually had an effect. Does that make sense? Now, you might want to do a mediation analysis to see how the treatment had an effect. But that involves multiple models. Just because you put in the mediator — the consequence of the treatment — and it knocks out the treatment, that is not evidence the treatment didn't work. The treatment worked. Does this make some sense? I have some examples coming, as always, so bear with me. Here's the example from the book. I'm not going to go through the code of it; it's section 5.3.3, on page 150. Imagine a greenhouse experiment. When I was doing stats consulting at UC Davis, this actually happened to me: someone had a greenhouse experiment and came to me and asked about this. They were doing this experiment and they were confused because the treatment didn't seem to work, and here's why. They were looking at the effects of anti-fungal treatments on plant growth. UC Davis does lots of agricultural research. It's big money — California grows, I don't know, some huge proportion of North America's food now, something crazy like that, no exaggeration. So it's a big deal, and I did a lot of consulting with people on these sorts of things. These greenhouse experiments are serious business. So there's a mediator that was measured. Two things were measured. The first thing is the initial height of the plants, before the treatment is applied. You might intuit why you'd want to do that, right? You want to subtract that out. Yeah. And then there's a fungicide treatment, and there's a control group that doesn't get the fungicide treatment. There were all kinds of control groups, actually — I'm leaving out some of the details. There are different ways; you can just spray water on them; all kinds of things happen. So there are lots of control groups, and then you measure which plants get fungus. What happens is, if you fit a model where you put in the fungus status, it looks like the fungicide has no effect. Why? Because it did have an effect — but the effect on growth is entirely mediated through whether the plant gets fungus or not. Whether the plant gets fungus or not is what actually matters for its growth, not whether it was sprayed. You see the distinction? There's a section in the book; if this is all confusing to you, take a look. I get questions like this, and I'm like, oh yeah, just take the mediator out. And then the lights go on — ahh — suddenly there's illumination and we understand things. That makes sense. It is worth fitting the model where you put the mediator in, because that is a way to test what the mediating effects are. It'll drop out the treatment, and that's useful to do. I think psychologists are more clever about this than biologists. It's something to do with training — different trainings emphasize different things. In my experience, psychologists are savvy about this.
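Here's a sketch of a greenhouse simulation along these lines — the structure follows the book's plant example, though the exact numbers are just reasonable choices:

    N <- 100
    h0 <- rnorm(N, 10, 2)                    # initial plant heights
    treatment <- rep(0:1, each = N/2)        # fungicide treatment indicator
    fungus <- rbinom(N, size = 1, prob = 0.5 - treatment*0.4)  # treatment reduces fungus
    h1 <- h0 + rnorm(N, 5 - 3*fungus)        # growth is hurt by fungus, not by spraying directly
    d <- data.frame(h0, h1, treatment, fungus)
    # conditioning on the mediator makes the treatment look useless:
    summary(lm(h1 ~ h0 + treatment + fungus, data = d))
    # drop the post-treatment variable and the treatment effect reappears:
    summary(lm(h1 ~ h0 + treatment, data = d))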
Now let's think about observational systems, where it's a lot trickier — a case where there's no actual treatment, because you haven't manipulated anything, and if you tried, it would be highly unethical and the police would stop you. This is all kinds of sociological or anthropological research where we're interested in some status of individuals. Some status could be race or gender or geographic origin or any number of other things, and one wants to know how this affects job placement, educational choices, and subsequent income. We're interested in explaining income inequality. This is a massive project in the social sciences. The thing about these sorts of "treatments" is that it's typically true in these data sets that once you account for education and career placement, the status — whatever you're interested in — explains none of the variation in income. Well, duh. The effect of that status is mediated through how it channels individuals into occupations and how it affects education. There may be no direct effect of the status, but of course it is having an effect. You don't want to conclude from models like this that your favorite status — pick your favorite status, whatever — has no effect on income just because once you account for job placement there's no correlation left. That is not the right inference to make, because it's a causal pathway: showing that there's a more proximate cause doesn't mean that there isn't a more ultimate cause. Does this make sense? This is a routine problem. Now, these are variables where we have strong — potentially wrong, but strong — theories about how they work. There's lots of research in observational systems where the variables are sort of murky, and we just put them all in a regression and hope something shakes out. In that case, what's the treatment? What's mediating what? The path diagram isn't obvious, and you could draw lots of arrows between boxes, potentially. It's very hard to do much in those systems. Okay. What is a collider? A collider is some variable, which I will call X — there'll be an example coming up, multiple examples coming up — some variable that's influenced by two other variables. So let's call the collider X, and let's call the other variables Y and Z. Your research interest is to understand the relationship between Z and Y, in either direction; it doesn't much matter, actually, for the example that's coming. What you should not do is condition on X, or on anything caused by X — sometimes called a descendant of X. You should not, because it will lead you to a bogus inference about the association between Z and Y. If there's something that's caused by two variables and you control for it, you're essentially looking at a subset: you're inducing a statistical selection bias on the cases that inform you about the relationship between the two other variables. I know that's confusing. Every time I teach it, I have to go back to my notes and remind myself why it makes sense. So let me show you some examples; that helps you understand. The first example is going to be a case where there's actual mechanical selection bias. You only get the segment of the population that is conditioned on X, conditioned on the collider. The collider gives you two subpopulations, you only get to run the regression on one of them, and you reach the wrong inference about Z and Y. But if you include X as a predictor, you get the same problem, because statistically it's almost the same as not having the data, in a sense — you're getting a subanalysis. It's not exactly the same, but almost. So here's a comical example that I love. Those of you who are in my department have heard me use this example already. Apologies that it's basketball, but everybody knows basketball now, right? It's international.
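Before the basketball numbers, here's the abstract version in a few lines of R — my own sketch, not from the book. Y and Z are independent by construction, and conditioning on the collider X manufactures an association between them:

    n <- 1000
    y <- rnorm(n)
    z <- rnorm(n)              # independent of y by construction
    x <- rnorm(n, y + z)       # the collider: caused by both y and z
    summary(lm(y ~ z))         # slope near zero, as simulated
    summary(lm(y ~ z + x))     # "controlling for" x: z now shows a strong negative slope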
You can go online to the NBA website, which has lots of free statistics you can download, and try to prove with a data analysis that height matters. It is quite difficult. So here's data from 473 NBA players in the 2016–2017 season. This is an analysis that was posted by Matthew Hahn, who's an evolutionary biologist, on Twitter. He's basically doing statistical trolling with this, but it's a really nice example. So here's the scatterplot on this graph: the average points per game for players against their height in feet. It's in feet because it's North America and we still use ancient imperial measurements, right? And there's no correlation there. In fact, there's a negative correlation in those data. It's very weak, but there's a negative correlation. This flies against everything we know about basketball and how it works. Being taller is an advantage, right? Now, if you know much about it, you've looked at the x-axis and you know where it starts: at almost six feet, right? So these people are kind of tall. I'm 5'11", to calibrate, right? And I'd be a terrible basketball player — well, for other reasons that have nothing to do with my height. But we've already gotten rid of most of the human species; they'd never be on this graph. This is already a field of giants, where the average height is getting up toward seven feet tall. What's actually going on here — as of course you've figured out, because there's a collider, otherwise we wouldn't be talking about it — is that there's a selection process. In order to get to be an NBA player, you have to be sufficiently good at, well, sinking balls. That's how it works. You've got to make field goals. So here's the causal diagram to think about to identify the collider. We're interested in whether height predicts field goals, but height also influences whether you make the draft. And there's some underlying variable, skill, which we don't have measured here, that is influencing both of these things. But these are the variables we have. So after conditioning, in a sense, on the draft — that is, we're only looking at the subpopulation that makes the draft — there's no correlation between height and field goals. But that doesn't mean there's none in the whole population. Does this make some sense? And this is conditioning on a collider. This is a case where nature has done the conditioning for you, in a sense — or the professional athletic association has — because they don't post data on the people that didn't make the draft. They're not on that website. Does this make sense? Yeah, this happens a lot. Pick your favorite example of how these things work. Other examples are a little less benign, and happen in research quite often, I think. Let's think back to the example that started this week, about divorce and age. You can also get this where we have the whole population but we condition on the collider and get the same kind of bogus inference. So let's ask a basic research question which is addressed in the literature: are older people less happy? That is, does age make you sad? And there's actually a big literature on this. This is a compelling question, no? And it has welfare consequences, right? It's a serious issue — secular changes in happiness with age. So imagine, for example, we're in a case where we already know that age is positively associated with being married. You might think, why? Well, because it's years of exposure.
The more years at risk — that's epidemiological language — the more years of exposure, the more likely someone is to get married. And so it's true in all aggregate data that older people are more likely to be married. And then there's a relationship between happiness and marriage. What the literature seems to think — this may or may not be true — is that it flows in this direction: that happy people are more likely to get married, rather than the other way around. You get that from longitudinal studies of individuals. Are they happy before or after? Which is causing which? Look at the time series, right? But this is all confounded by age, so it's hard. And again, you can't do experiments. You can't randomly assign people to get married or not. You just can't do it. Interpol will step in the way. So now we want to make an inference about this other path: is there a relationship between age and happiness? What you should not do is condition on marriage status. If you believed this was the causal diagram, you should not condition on marriage status. It seems like a smart thing to do, right? Throw in a control. And this is exactly what reviewers would tell you to do. They'd think you should control for marriage status. And your response should be this diagram, and to give them a citation about colliders. That's what your response should be. So, to put some meat on this, here's a synthetic example. Let's do a simulation where the path diagram on the previous slide is true. Here's a case where we assume there's no relationship between age and happiness at all. I just simulate a bunch of people and assign them happiness at birth, and it never changes. It's destiny, right? It's like horoscopes: you're an Aries, you're happy, and then you're stuck with that for the rest of your life. And then in each year of the simulation, two things happen. Everybody gets older. And the five happiest unmarried people over the age of 20 get married. And that's it. That's the whole simulation. Then I simulate 200 years. People die when they reach 100 years old, by the way — they go off to some colony someplace or something. And then we can assemble the data set and run regressions on it. So what happens when we control for marriage status, and there's no actual relationship between happiness and age in the simulation? The answer is: you get an invalid inference. You end up concluding that people get sadder with age. And it's because you're conditioning on the collider, and it induces a spurious association — basically, it's subsetting the population. So — I have two minutes; I can do this. Here's the path diagram you end up inferring, which is not true: this path is not actually negative. If you look at the dot-and-line chart at the top, there's a coefficient for standardized age, and it's negative, very certainly negative. Your model is really confident it's negative. It's nowhere near zero. It's also really sure that marriage is associated with happiness, which happens to be true in these data. That's correct — though the model doesn't know the direction, that happiness causes marriage rather than the other way around, there's definitely an association. The negative, spurious association with age, though, comes from controlling for marriage. Because it's a collider.
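Here's roughly what that simulation and model look like in code. The rethinking package ships a sim_happiness() helper that runs a version of this simulation (the book's version retires people at 65 rather than 100, and the model below follows the book's priors, so treat it as a sketch):

    library(rethinking)
    d <- sim_happiness(seed = 1977, N_years = 1000)
    d2 <- d[d$age > 17, ]                  # adults only
    d2$A <- (d2$age - 18) / (65 - 18)      # rescale age so 0 = 18 and 1 = 65
    d2$mid <- d2$married + 1               # marriage status index: 1 single, 2 married
    m_happy <- quap(
        alist(
            happiness ~ dnorm(mu, sigma),
            mu <- a[mid] + bA*A,           # conditioning on the collider
            a[mid] ~ dnorm(0, 1),
            bA ~ dnorm(0, 2),
            sigma ~ dexp(1)
        ), data = d2)
    precis(m_happy, depth = 2)   # bA comes out strongly negative, though the sim has no age effect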
So what's actually happening? It's hard to see on this graph because of the projector, but the married individuals are actually colored red. When you look at the slides — people watching this online can see what I'm saying and will wonder why the people in the room can't — it's because for some reason this projector doesn't do color. The red dots are married individuals and the black ones are unmarried. And once individuals pass 20 and start getting married, the points at the top turn red, because it's the happiest people who get married — the vertical axis is happiness. As age increases, less happy people get married too. So if you subset the population — say, take out all the married people — and you do a bivariate regression between age and happiness, it is negative. And that's what conditioning on the collider does. It says: let's take out all the differences between people that are due to their marriage status; now, in the remaining population, in the residuals, is there a relationship between these variables? And there is. But it's completely spurious, and this is why you don't want to condition on colliders: you make the wrong inference. Yeah? Okay. I hope I've given you enough — this is complicated, but I hope I've done enough to give you nightmares about it. This is my job: to give you terrors about the dangers of causal inference. So, yeah. Additional nightmares. Causal inference is hard, trademark. There are whole books about it. Here's probably the most famous contemporary one, by Judea Pearl, who's at UCLA. There are all kinds of other things that I haven't talked about, like residual confounding, which can arise, for one, because the variable that you have most precisely measured is the one that will have the strongest relationship with the outcome. So you might worry about that. If the real cause is hard to measure, then it will be very hard to establish that it works. Then you need theory: you want to leave other stuff out and put in what's actually theoretically relevant, not just anything. Of course, in reality, when you really intervene in a system, real treatment effects affect a bunch of variables at once. So you have to think about that very carefully for cause. And then, if there were someone from the Santa Fe Institute in this room, they would rightly say: but Richard, in complex systems, of course, the whole classical view of cause that you have on these slides is obviously bogus. And I would agree. Absolutely, I think it's true that in highly complex nonlinear systems you can't think about cause like we've been thinking about it today. It's a much more complicated problem, and I'll just leave that to give you terrors. But in the end, there's no secret weapon. And to encourage you: we have learned things about the universe, and we've done so despite the fact that causal inference is so hard. But it isn't made any better by believing that statistical methods solve the problems. They don't. You always need theories about the system to solve the issues. Mere associations won't do it. There's no time to talk about categorical variables, but the whole last part of chapter 5 is about how to use multiple regression to code categorical variables, through things called dummy variables. I leave this as your homework. It's not complicated, and most of you are familiar with it already. Apologies for not getting to it. So now I'm going to skip a bunch of really dull-looking slides about monkeys and get to the homework slide. Sorry — they were nice. All right. Yeah, so I know it's like tons of stuff. It's all in the book. I know it all looks cool. See this? Just read that. Just read it all.
It's all in the book. The last thing there was to show you that OLS can be interpreted as a Bayesian procedure — because of course it originally was. Gauss was one of the inventors of it, and he used a Bayesian argument. This came up, I think — did you ask this question? — I think this came up before, on Wednesday. So, homework. There's always homework. Please do the first three hard problems at the end of chapter 5. These are data analysis questions about urban foxes — which are actual foxes, not people, in the UK. Feral foxes that live in cities and forage, and you will do some multivariate regressions on them. Please — you guys are turning in great homework, it's very, very good, but sometimes it has no name. If you would put your name in the file, that would save me some cross-referencing with the source email. That would be great. Next week we are downstairs, because the institute's graduate school is going to use this room to do paleo lectures, I think it is. For next week, if you want to start reading ahead, we're going to do chapter 6, which is going between the whirlpool of underfitting — which is not learning enough from your data set — and the monster of overfitting, which is learning too much from it. Have a good weekend, and thanks for your questions.