and get started. Happy New Year, and welcome back, everybody, to Statistical Rethinking. I'm Richard, and this week we're going to do multiple regression and the foundations of a framework for doing causal inference. I think the best way into that is to jump straight into the first empirical example.

I went to college in Atlanta, Georgia, which is in the southern United States, the former Confederate States, for those of you who know American history. There are many things about the South that I like, and those are the parts I like to talk about. One of those things is Waffle House. Waffle House is exactly what it says: a house of waffles, and also hash browns and other greasy food. And it's always open. There are no opening or closing hours; it is literally always open. That's the whole principle of the business: when other things are closed, you can always come to Waffle House. Look for the yellow globe. Sometimes at freeway exits there are two Waffle Houses, one on each side, because it is so important to the commerce of the southern states.

Other things present in the South which are not as nice include hurricanes: tropical depressions that wander northward and intersect with the Waffle Houses at various times. From here in the comfort of Europe we look on in horror at these tropical storms, but having been near these things myself, I can tell you they're no joke. Waffle House, because it is associated with tropical storms, invests in disaster preparedness as a business. Their model is to be always open, and they really mean it. Even when there's a hurricane, they are often the only thing that stays open, and the lines outside a Waffle House just after a tropical storm are really something, because it will be the only place you can get waffles. They're so reliable, in fact, that the United States government has something called the Waffle House Index, which is an index of how bad a storm was: if the Waffle House is closed, that was a really bad storm, and then they bring in more trucks and relief and such. This is for real. This is Craig Fugate, who was until recently the Director of FEMA, the Federal Emergency Management Agency in the United States, and they really do use this Waffle House Index internally at FEMA, in the southern states. Obviously you can only do this in the southern states, because there are no Waffle Houses outside the South; the chain has spread among the southern states and reaches very high densities within the South.

There are other things, lots of things, that are characteristic of the South within the United States: hurricanes, Waffle Houses, and also divorce. Many of the southern states have the highest divorce rates, and this sets up a bunch of interesting, and I want to posit spurious, correlations between Waffle House and many, many other things. Basically anything that is idiosyncratic about the southern states will be correlated with Waffle House. So if you are naive and really eager to make some causal inference about Waffle House, say you are out to get them to close their business, you could find any bad thing about the South and show it is correlated with Waffle Houses in a regression. For example, divorce: Waffle House is ruining marriages! The fact that it is always open, these people going to Waffle House late at night, having waffles... I don't know. But in a regression it is quite robust, actually. I show here, on the horizontal axis, Waffle
Houses per million people (you have to do it per capita; there is a saturation effect) and, on the vertical, the divorce rate among the 50 United States, and you will see that there is a very strong correlation here. In fact, statistically it is quite hard to get rid of it. Nevertheless, I believe this is spurious. I do not believe there is a causal influence here. But nature is full of stuff like this, and usually it is not so ridiculous, so it is hard to know. Correlation is commonplace in nature; it is not rare. There is a whole website which specializes in fishing through longitudinal data sets to find spurious and ridiculous correlations. I encourage you to visit it, the Spurious Correlations website; the URL is on the bottom of this slide. Here is a great example: you can correlate the divorce rate in Maine over time with the per capita consumption of margarine. It is a very nice correlation, 0.99. Must be causal, right? No, of course not, because correlation is commonplace in nature. Lots of things will generate a high correlation between two variables even if they have no causal relationship with one another.

So what are our goals this week? Mechanically, I want to introduce you to multiple regression. Most of you know a lot about multiple regression already; you have probably published multiple regression analyses. But I am going to teach it to you all over again, as if you didn't know anything about it, with the goal of both building it up and breaking it down. The good parts of multiple regression: it can show us plausible true causes of things, it can remove spurious correlations, like the case of Waffle House being correlated with any number of things, and it can uncover masked associations that you wouldn't see otherwise in regressions with a single predictor variable. But it can also do bad things. Adding variables to models can do just as much harm as good. You can actually cause spurious correlations by adding variables to a multiple regression. I don't think this is often taught, so I want to hit it quite hard this week. And you can actually hide real associations as well by adding variables. So you need some broader structure to think about your decision to add variables; simply throwing everything into a multiple regression is a recipe for doom, and we do not want to do this.

Making decisions between the good and the bad is going to require some framework for doing causal inference, and that's what I'll try to give you as well this week. This means directed acyclic graphs, often abbreviated DAGs (I'll tell you what that means when we get there). Right, it's like 'dog' in a certain part of Ireland; that's how 'dog' is pronounced: DAG. And things like forks, pipes, and colliders, which are the components of these graphs, and how to understand them. The goal is to learn something called the backdoor criterion, a criterion by which you can figure out whether adding a variable is warranted by the causal inference you want to make.

OK, let's go back to divorce, not Waffle House. Let's leave Waffle House aside; we'll loop back to it maybe at the end of the week. But let's stick with the divorce rate example. Waffle House doesn't cause divorce, but something does. Why does the South have high divorce rates? There's actually a lot of effort put into figuring this out, because the divorce rates are so much higher in the southern states, a region which is also more religious than the rest of the country. So there's something here that makes us scratch
our heads about this. There are lots of things correlated with divorce rate. For example, marriage rate, another variable we can get for the same 50 states: the rate of marriage is also correlated with the rate of divorce. Could it be causal? Well, you can't get divorced if you haven't gotten married, right? So it could be true, but it could also just be a spurious correlation, and how do we go about figuring this out? There's no reason a higher marriage rate has to cause more divorce. A higher marriage rate could indicate a society that views marriage favorably, which might even mean lower divorce rates. It doesn't necessarily make any sense as a cause, so this could just be a spurious correlation. How are we going to figure this out?

There's another variable that is also correlated with divorce rate, but in the other direction, and that's the median age of marriage: states where people get married younger, like the southern United States, also have higher divorce rates. So which of these two correlations is plausibly causal? What I've shown you here are just two bivariate regressions, like the models we did last year. Remember, we did simple regressions on height; we had just one explanatory variable. The models on this slide are just like that, and you run them with the kind of code I showed you in chapter 4. What we want to do now is put both of these horizontal-axis variables in the same model, understand what that does, and see why it reveals that one of them is, almost certainly, an imposter.

So what is multiple regression for? It's for answering this basic kind of question: what is the value of knowing some predictor variable, once we know the other predictor variables? All these predictor variables are typically correlated to some extent with one another and with the outcome, but they have partial correlations which reveal the additional information above and beyond that correlation structure. Let me walk you through this in the context of the divorce rate example. We've got two questions in one model. The first: what is the value of knowing or learning the marriage rate of a state, once you already know the median age of marriage in that state? Do you get any additional predictive leverage from the second variable once you know the first? And then the other direction as well: what's the value of knowing median age of marriage once you already know the marriage rate? Does this make some sense? Once I show you the full examples, hopefully it'll come across.

So let's do our first DAG. Hopefully someone here knows who Brent is. DAG stands for directed acyclic graph. These are heuristic tools for drawing causal models. They're not mechanistic analytical models, but they're incredibly useful for disciplining your thinking; eventually you want some much more mechanistic model. They're called directed acyclic graphs. 'Directed' stands for the arrows: the edges have direction to them. They can be bi-directional, they don't have to point just one way, but they have to have some direction, and those directions indicate causal relationships, that one thing influences the other. They're 'acyclic', meaning you don't have loops in causation. Now, of course nature does have loops in causation, but those loops happen over time, in a time series, and you can represent that in a DAG; the DAG just gets really big, because you have variables at t1 and t2 and so on, and they get huge. Within the graph, you can't have any cycles. And they're called 'graphs' because they
are graphs. Graphs just mean nodes and edges: the nodes are variables, and the edges are causal relationships. These are different from statistical models. Statistical models don't have directionality. The associations in a statistical model, including a Bayesian network, which is what you've been fitting so far (you didn't know it, but congratulations, you're doing machine learning; everything's machine learning, and linear regression is a machine learning model; these things are all Bayesian networks), have no direction. Bayesian networks learn statistical associations and conditional associations, but they don't carry causal information about direction. What a graph like this does instead is posit that one thing influences another, and this makes a difference. You'll be learning about those differences this week.

So let's look at the context of the divorce rate example. We have this graph here with our three variables: A is the median age of marriage in a state, M is the marriage rate in that state, and D is the divorce rate. Here's what I posit is a plausible graph of the causal influences. Median age of marriage influences both the marriage rate and the divorce rate. How does it influence the marriage rate? Well, if people get married younger, more people are available to get married at any given time, so you'll get a higher marriage rate in states where people marry younger, just as a consequence of the fact that they marry younger and that there are more people alive at younger ages. In the southern states, at least, that's true; there are other states where it's not. And then median age of marriage influences divorce rate. There's a lot of research suggesting that the reason median age of marriage is a causal influence on divorce rate is that young people make worse decisions. Now, we don't know that this is the true story, but it's a widespread conjecture in the literature to explain this correlation: if you get married older, you make a better choice. Another explanation would be that when you're younger you're rapidly changing in your personality and your desires, so you can make a perfectly good match at that time but then grow apart, and that's no tragedy, it's just life. Regardless, we're going to posit that there's an arrow.

The question is this arrow from M to D, from marriage rate to divorce rate. Is that arrow there? Is there a causal influence of marriage rate on divorce rate as well? That's what we want to test with the multiple regression; we want to tell the difference. I should say something about paths before I go to the next slide. We're going to look at these kinds of graphs, these DAGs, and we're going to talk a lot about paths. What's a path?
A path is a route you can follow along the edges, from node to node, to get from one variable to another. When you're following a path, you can go against the arrows. Imagine yourself just walking along the lines: you don't care about the arrows. Causation does, but you don't; you're just a tourist walking along the DAG. So if you've got a path from A to D, there are two paths in this graph. There's the direct path, which is the direct causal path; we'd like to know that that's a direct effect. And then there's the indirect effect of median age of marriage, which goes A to M to D. You can see these two paths on the graph. In bigger graphs there are lots of paths between two variables, and that's life; that's why they pay us the big bucks to do research.

So we want to tell the difference between these two DAGs: the one on the left, where there is a direct path from M to D, and the one where that's deleted, because there's no real causal influence of marriage rate. Multiple regression can, in principle, tell us the difference between these two things. But bivariate regressions cannot. Just knowing that you have some association between marriage and divorce doesn't tell you the difference between these two DAGs. Why? Because A influences M and A influences D, and that generates a correlation between divorce and marriage rate even if marriage rate doesn't influence divorce at all. Look at the graph on the right here: A is a common cause of marriage rate and divorce, and so marriage rate and divorce end up correlated. This is like Waffle Houses and divorce. Something causes Waffle Houses and also causes divorce; being in the South, whatever exactly that is (I'll do this regression for you later), causes Waffle Houses and it also causes divorce, indirectly. It's a common cause, and the two end up correlated even though there's no causal relationship between them. One way to think about this is that A is a confound between M and D: if you're trying to infer the causal connection between marriage rate and divorce rate, then age of marriage is a confound. I'll have a precise definition of confound later, one that has to do with how these graphs work, but just bear with me; I don't want to overload things right now.

You ready for this? All good? A little bit of notation on the bottom here. We want to know the association between M and D conditional on A. The vertical bar, which I'm going to call a pipe (that's what it's called in typography), you should read as "conditional on": the association between M and D, conditional on knowing A. Once you know A, is there any extra association between the two left over? That's what multiple regression can tell us.
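A minimal sketch of this DAG in R, assuming the dagitty package (which the book also uses); the graph definition mirrors the slide, and the two queries ask for the paths from M to D and for what we'd need to condition on:

```r
# A sketch of the divorce DAG using the dagitty package
library(dagitty)

dag <- dagitty("dag{ A -> M; A -> D; M -> D }")

# Every path from M to D: the direct path and the backdoor path through A
paths(dag, from = "M", to = "D")

# What must we condition on to estimate the effect of M on D? Answer: A
adjustmentSets(dag, exposure = "M", outcome = "D")
```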
So what does a multiple regression model look like? It looks like an ordinary bivariate regression, but with extra terms in the linear model. You already know how to do these, really; you just have to put more stuff in the linear model. Let me walk you through it to remind you (it was last year, after all). A linear regression is a kind of Bayesian network where there's an outcome variable which is assigned a Gaussian probability for each observation, with some mean which is conditional on things we know about each case (we call those predictor variables) and some standard deviation, which is typically constant, though it doesn't have to be. Then there's our mean, mu sub i; the sub i means it's conditional on each case i. And then you have slopes times predictor variables for each case. Now we're going to have two: M sub i, the marriage rate in each state, with a slope to measure its partial association with the outcome after already knowing age of marriage; and, symmetrically, a slope for age of marriage times the median age of marriage in each state. Not too shocking.

OK, we've got to talk about priors. We're going to have to think harder about priors once we start doing these sorts of models, and I'm going to spend some more time doing prior predictive simulations with you. It helps a lot with linear regressions to standardize the variables, because it makes the priors easier to set, easier to make reasonable. What does standardization mean? It means you subtract the mean and divide by the standard deviation, so the variables are now z-scores. Psychologists know z-scores; your life is z-scores. In biology it depends on your tradition, whether you use z-scores or not. If you make all your variables z-scores, it makes things a little easier; not in all cases, but in this case it does. So we standardize divorce rate, marriage rate, and median age of marriage.

The consequence of this: remember the meaning of the parameter alpha, which needs a prior. It's the expected value of the outcome when all the predictors are zero. When you standardize the predictors, zero is their mean, so now alpha has the meaning of the expected value of the outcome when all the predictors are at their expected values. Now we can set it. What should it be? Zero, because the outcome is also standardized, and the mean of a standardized variable, compared across the whole sample, has to be zero; it's true by definition. The regression line has to pass through the expectation of all these things, so we have a very strong prior expectation that alpha should be zero, and we'll give it a Gaussian prior with a tight standard deviation. Even tighter than this would be good, as you'll see; the estimate will be zero, like mathematical destiny, as a consequence of the standardized variables.
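In code, the standardization step is one line per variable. A minimal sketch, assuming the WaffleDivorce data that ships with the rethinking package and the short names D, M, A used in the lecture:

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

# z-scores: subtract the mean, divide by the standard deviation
d$D <- standardize(d$Divorce)
d$M <- standardize(d$Marriage)
d$A <- standardize(d$MedianAgeMarriage)
```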
If you hadn't standardized, what would you do? What is alpha then? I'm not an expert on divorce rates; what's a typical rate? You don't know prior to seeing the data, and priors are expectations about the parameters before you see the sample. You want to set them based on background scientific knowledge, not by peeking: don't use the data twice.

Slopes. This is a little harder, and this is where I encourage you to do some prior predictive simulation. You don't want the slopes to permit impossibly strong relationships from the start; if you use flat priors here, that's going to produce some incredible nonsense. In the text I'll show you how to simulate anything you like; let me walk you through the code just a little bit. In chapter 4 we just used random number generating functions in R to simulate priors. There's another way you can do it: you can fit your model (here's the quap model, just a single regression between divorce rate and age of marriage, to keep it simple) and then use this function in the rethinking package called extract.prior. It samples from the prior instead of the posterior. All it does is use rnorm: it reads the formula and says, oh, you said bA has a Normal(0, 0.5) prior, and then it samples from that and gives you the samples. It's a convenience function. Then you can just pass the result to link. Remember link? We used it to generate predictions. We're generating predictions here too, but from the prior: this is the prior predictive simulation, what the model thinks before you give it the data.

You can think of it this way. Remember, the posterior is full of lines. How many lines? An infinite number: every line that can exist is inside the posterior distribution, each rated with a different plausibility. In the posterior it'll be incredibly truncated, down to what the sample justifies. But the prior, well, that's what you plugged in, so if you want to see whether what you plugged in is ridiculous or not, you need to simulate from it. For simple models you can get away with not doing this, because the sample will overwhelm even incredibly ridiculous priors, like flat ones. But with more complicated models, even large samples won't overwhelm the prior, so you've got to get used to doing this. We're going to practice while it's safe, because you want to be good at it when it's not safe.

So what does this end up looking like? If we put a Gaussian prior on the slope, centered at zero with a standard deviation of a half, these are the regression lines you sample; this is 50 regression lines sampled from the prior. Let me walk you through the plot. We're looking at standardized median age of marriage, so the 2 here means two standard deviations. That's the vast majority of the outcome space, by definition: almost all the possible observed median ages of marriage you're going to see. And the same for divorce rate: 2 means two z-scores out, two standard deviations. If your regression lines don't live in this space, then the prior is bad, because it's impossible. If your model thinks, prior to seeing the data, that median age of marriage can produce slopes more extreme than this, slopes that exit the observable range of divorce rates, then the prior is bad. I think that's a simple, scientific criterion for a good prior.
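Here's a sketch of that prior predictive simulation, along the lines of the code in the text: fit the simple D-on-A model with quap, sample from the prior with extract.prior, and push the samples through link.

```r
# Simple regression of divorce rate on age of marriage (the book's m5.1)
m5.1 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

prior <- extract.prior(m5.1)   # samples from the prior, not the posterior

# Prior regression lines over +/- 2 standard deviations of A
mu <- link(m5.1, post = prior, data = list(A = c(-2, 2)))
plot(NULL, xlim = c(-2, 2), ylim = c(-2, 2),
     xlab = "median age of marriage (std)", ylab = "divorce rate (std)")
for (i in 1:50) lines(c(-2, 2), mu[i, ], col = col.alpha("black", 0.4))
```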
Now, within this, we can quibble about how tight we want the prior to be. This prior allows really strong relationships: it allows median age of marriage to govern almost all the variation in divorce rate, which is probably not true. I think that's too generous, and when we get to the overfitting chapter I can make a more forceful argument for why this prior is not tight enough for safe work. But we'll move forward with it. You want to think of this as the flattest prior you can justify scientifically. A typical frequentist analysis considers impossible results perfectly plausible: crazy strong lines that go basically straight up on this graph will be sampled from a flat prior. Remember, a flat prior thinks that infinity is a perfectly reasonable slope, and it's not.

OK, here's our model, now armed with our priors; we'll be doing more prior predictive simulation as we go. Just to summarize: we've got the probability of the data up at the top, then the linear model on the second line, now with two terms. Linear means additive, remember, so really this mu makes a plane, a two-dimensional linear surface. The terms are additive: you keep adding terms together, each some coefficient, some parameter, times a predictor variable. Then we have priors. There are four parameters in this model now: alpha, two slopes, and sigma, which I give this exponential prior. I'm going to start using exponential priors more habitually instead of uniform priors, because they have nice properties, which we'll talk about when we get to chapter 10, I think; then I can unveil why I like them. One way to think about it: it's got the right constraint, since exponential distributions are always positive numbers. The other thing is that it's skeptical of really, really big values, and the value you put in there is the rate, whose inverse is the average value you expect for the standard deviation.

OK, here's the quap code. You've seen this before; no mysteries here. We run it and we get a table of coefficients, yay. One lesson of this course is that these tables are hard to interpret, but you're going to see them, so it's worth thinking about what goes on here.
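The quap code for the multiple regression just described, essentially the book's model m5.3, with both standardized predictors in the linear model:

```r
m5.3 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM * M + bA * A,   # two slopes now, one per predictor
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

precis(m5.3)   # the table of coefficients discussed next
```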
You're probably already gazing at the table looking for p-values, and they're not there. Huh. What do you want to look for? Well, you want to look at the mean: the posterior mean of the posterior distribution of each parameter. As I promised, alpha is zero. Right, it had to be; it just had to be zero. Then bM, the marriage rate slope: it's slightly negative, but the standard deviation is about twice the absolute value of the posterior mean itself. If you look at the 89% compatibility interval over there, it goes from minus 0.3 to plus 0.2. Maybe it has an effect, maybe it doesn't; it could be in either direction. According to this model, it just doesn't know what to think: there's no consistent relationship, in the multiple regression, between marriage rate and divorce rate. Age of marriage, however, is about minus 0.6. Its standard deviation is the same as the other one, but now the posterior mass is entirely below zero: there's a reliably negative association between median age of marriage and the divorce rate. What's the lesson? You probably already know: there probably is no direct causal impact of marriage rate on divorce rate. It was masquerading, because age of marriage is a confound between the two. Age of marriage is actually causal.

We can look at this in dot-and-line charts instead. This is the same information that's in the table that just came out; in fact, if you wrap plot() around the precis call, you get a plot like this (well, actually this one is coeftab, sorry, but the code to do it is in the book). Remember, model m5.1 is age of marriage only; look at the relevant row in each of the parameter batches. In the model with only age of marriage, the estimate is negative. In the model with only marriage rate, sorry, m5.2, the estimate is positive. But then in the multiple regression, model m5.3, the posterior distribution for marriage rate moves toward zero and gets wide, which is the knockout effect of putting in the common cause, median age of marriage. Does that make sense? This is a good thing multiple regression can do for us, and I believe this is actually the causal relationship here. So this is probably the graph we have, or, if there is an arrow from marriage rate to divorce rate, it's not a strong arrow; it's weak. But this is the connection between the two.

In words, how could you interpret this? Once you know the median age of marriage for a state, you learn almost nothing more by also learning the marriage rate in that state. That's consistent with the graph on this slide, where there is no direct causal influence from marriage rate to divorce rate. At the same time, once you know the marriage rate, there is still a lot of value in learning the median age of marriage; it works in the other direction. And that's consistent with the idea that age of marriage is a common cause of marriage rate and divorce rate. Keep in mind, if you didn't know the median age of marriage in a state, there would still be value in learning the marriage rate, because it gives you information. But that information comes through the other causal path, not from a direct causal relationship between the two variables, and our business here in inference is to figure out the difference between those things. If you just wanted to make a prediction and you didn't care about causal inference, marriage rate is useful; it helps you predict stuff. But it doesn't help you do interventions in the world, because if you wanted to change the divorce rate in a state and you manipulated the marriage rate, it would have no effect. That's not how it works.
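The dot-and-line comparison chart can be reproduced with coeftab. A sketch, assuming m5.2 is the bivariate divorce-on-marriage-rate model, fit the same way as m5.1:

```r
# The bivariate model with marriage rate only (the book's m5.2)
m5.2 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM * M,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

# One row per model for each parameter: watch bM collapse toward zero in m5.3
plot(coeftab(m5.1, m5.2, m5.3), par = c("bA", "bM"))
```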
Give people incentives to postpone marriage instead; although I should not actually make policy recommendations here. Actually, in East Germany, where I'm giving this lecture, there's lots of interesting stuff that happened to marriage rates after 1989. Maybe we'll talk about it in another lecture; I'll put up some graphs. There was a nice natural experiment in this.

But you have to be clear about the lesson here: merely predicting things is not the same as inferring causal relationships so that you can intervene. An intervention requires a true causal understanding of the system; prediction doesn't. I think this is the terror of science: you can make really good predictions without understanding anything. Remember the geocentric model. Statistical models are not sufficient to figure out causal relationships; you need something extra.

How do we visualize models like this? The answer, unfortunately, is lots of ways. Let me give you just a few examples real quick; the text is full of additional examples for particular models. The fact is that the most useful way to visualize the posterior distribution of a model usually depends on the model and the topic you're studying. There's no standard command you execute every time; you want to think about what you're trying to communicate to your readers. So let me give you a few examples. The first is predictor residual plots, which is actually not something I recommend doing very often, but it's really useful for understanding how regression works. So I'm going to do it once in this course and then never again: it's really good for understanding how regression functions, not necessarily so good for communicating results. The second is counterfactual plots. They're counterfactual because you imagine manipulating any one of the variables, leaving all the others unchanged, and you do stuff with the predictions to see how the model sees things. And then posterior prediction checks, which we've done before, but I'll show you how to do them in the multivariate context.

What is a predictor residual plot? The purpose of these plots is to show how the association of a variable with the outcome looks, having controlled for the other predictors: inside the machinery of this model, how does it see the relationships among things? We can calculate those intermediate states, even though you don't normally see them; they happen sort of magically inside the calculations. This is great for intuition; it's terrible for analysis. There's a tradition in biology, unfortunately, of analyzing residuals, especially in life history theory. There is never any statistical justification for running a regression on residuals, but you'll see it; this is what people did before multiple regression. You would do a bivariate regression, get the residuals between those two variables, and then take those residuals and put them in as predictors in another model. You should never, never do this. Why? Because it gives you the wrong answer; specifically, it gives you the wrong uncertainties. That tradition is still taught, but you should never, ever do it. What should you do instead? A multiple regression. Analyzing residuals is what multiple regression is designed to do internally; you don't analyze the residuals yourself. I know you'll see it in journals, but it's off-label use. What's the recipe?
You regress one predictor on another predictor; in this case that's going to be marriage rate on age of marriage. That gives you the association between the two, and then you can find the extra variation that's left over after accounting for that association: those are the residuals. Then you look at the pattern of relationship between those residuals and the outcome. This is useful for understanding how the model sees things, but you don't want to run your analysis this way, because you don't actually know the residuals. I'll loop back to this point at the end; let me push through the residuals first.

Here's our first residual plot. I'm going to fill this slide with four different plots; here's the first one, let me zoom in on it. We're looking at a regression of marriage rate (standardized) on age of marriage (standardized) for all the states, with a regression line passing through. The line gives the expected value of marriage rate conditional on age of marriage, and the distance of each state from that line is called a residual. Does that make sense? That line segment for each case, its length is the residual. I've highlighted some of the states with large residuals. Hawaii has a very high marriage rate: it's a vacation place, people go there to get married, it's a marriage-tourism state. That's why it has a high marriage rate. Other states are basically right on the regression line. So now we take the distance of each of those segments and set it aside, still attached to each state, and that's our list of residuals. Does that make sense? I know some of you have calculated this before and are totally bored; I apologize.

Now we take those residuals, and on the graph I've added to this slide, the horizontal axis is the marriage rate residuals: the variation in marriage rate that's left over after accounting for the average association with age of marriage. Then we look at the correlation between those residuals and divorce rate, and it's nothing, right? If you regress divorce rate on the marriage rate residuals (and again, you shouldn't do this as an analysis; this is what multiple regression does internally), the correlation is nothing. It's what you got from the multiple regression. It's just showing you how the model sees it from the inside; this is what the model does mechanically, except the model does it while accounting for all the posterior uncertainty. And this is where I wanted to make the point: you don't know the residuals. A residual is not a single value; it's a distribution. If you do the multiple regression all at once, like I showed you, it handles all of that: it handles the fact that you don't know each residual exactly, and you get the right estimates with the right uncertainties. If you do it in two stages, you don't, and that's the big sin. There are other sins too. You definitely should never run the analysis this way, but it's really good for intuition. Does this make sense?
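For intuition only, here's a sketch of the predictor residual computation just described: regress M on A, take the mean residuals, and plot them against D. Remember, never feed residuals into a second analysis; this is just to see what multiple regression does internally.

```r
# Regress marriage rate on age of marriage (along the lines of the book's m5.4)
m5.4 <- quap(
  alist(
    M ~ dnorm(mu, sigma),
    mu <- a + bAM * A,
    a ~ dnorm(0, 0.2),
    bAM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

mu <- link(m5.4)
mu_mean <- apply(mu, 2, mean)
mu_resid <- d$M - mu_mean   # variation in M left over after knowing A

# The residuals carry essentially no information about divorce rate
plot(mu_resid, d$D, xlab = "marriage rate residuals", ylab = "divorce rate (std)")
```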
Let me point out some interesting cases. Maine has a really, really high divorce rate, and it's outside the South; it's a super anomaly. I think it has the nation's highest divorce rate. Anyone here from Maine? I don't think so; in this audience that's probably not a likely event. Last I checked, which was a couple of weeks ago, there's still no explanation for why Maine's divorce rate is so high, because it's definitely not a southern state; it's completely the opposite.

OK, very quickly now, the flip side. You can also take age of marriage: pivot the graph from the upper left to make the graph in the upper right, with age of marriage regressed on marriage rate. There's an average line, and there's a new set of residuals; these numbers are different. Then we can put those on the horizontal axis down here, against divorce rate, and correlate them, and now there's a strongly negative relationship. So after you already know the marriage rate, there's considerable value in knowing age of marriage, but the reverse is not true. Does this make some sense? Was this helpful? I hope you see what's going on.

This is one of the things people mean when they talk about statistical control. Of course, the word 'control' comes from the design of experiments, where you actually set the values of variables you think are plausibly causal. In observational studies like this one, we don't do that: there's no ethical intervention by which you set the marriage rate of a state. I assert that it would be unethical to do that; other people are less convinced about these things. So we're stuck with observational studies for a large number of very important problems, and we still want to make causal inferences. Multiple regression does offer ways to do that, but only when you pair it with some clear idea of the causal relationships among the variables. Statistical control means: conditioning on the information in one variable, is there any valuable information left in the remaining variables? That's what statistical control means. But to interpret what happens when you statistically control, you need some causal framework, like a DAG or something else. We'll have examples later on where controlling for something actually creates a confound; conditioning can create a confound just as easily as it can remove one. In this case it removes a confound and gives us what I think is the right answer; later on it won't. Is there a quick question, Brett?
I had a quick comment, if you don't mind. I once worked with a biostatistician who said that in observational contexts you should say 'stratified', stratified, stratified.

That sounds nice, sure; I'd be happy to sign on to a letter that promoted that. My experience with statistical terminology is that statisticians are helpless, absolutely helpless. We've been trying for so long to get people not to call a p-value the probability that the null hypothesis is true, and it's been a total failure. Anyway, that's a rant I won't go off on. I often feel lost at sea with this terminology, so I just tell you what the terms mean and warn you about them. I think it's very hard to aggressively recommend new terminology, because you're swamped by the existing usage, and my responsibility is just to tell you what people mean when they say it and what it actually refers to.

Right, OK, so what was my point on this slide? Yes: these models are not magic, and you shouldn't get cocky, because there's lots of other stuff going on here. The last point I wanted to make is that this is the kind of study where there are lots of potential confounds that arise simply from the fact that we're using average data for each state rather than individual cases. This is a famous problem, and this kind of data set is actually quite pathological; it's very hard to be sure what's going on.

Very quickly: counterfactual plots. These are cases where you hold the other predictor variables constant, manipulate one of them, and see how the regression line changes. All the code for generating these is in the text; I encourage you to work through it and get an idea. The concept is fairly straightforward: you're seeing how the model sees the predicted relationships, assuming you could play God and set the values of the predictors to anything. Of course, in the real world we can't do that; manipulating one of these things will also change other variables. If we're right in our DAG here, for example, manipulating age of marriage will also manipulate marriage rate, but not the other way around. Counterfactual plots don't account for that. They just let you play God, set the predictors to any values you like, and see what the model does. That's really useful for understanding what the model thinks, but it's not causal inference; we'll loop back to causal inference in a later example.
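A sketch of the simple 'play God' counterfactual described here: sweep M across its range while pinning A at its mean (zero, after standardization), and plot what m5.3 predicts. The fuller version in the text also propagates effects along the DAG.

```r
M_seq <- seq(from = -2, to = 2, length.out = 30)

# Predictions with M manipulated and A held at its mean
mu <- link(m5.3, data = data.frame(M = M_seq, A = 0))
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)   # 89% compatibility intervals by default

plot(NULL, xlim = range(M_seq), ylim = c(-2, 2),
     xlab = "manipulated M (std)", ylab = "counterfactual D (std)")
lines(M_seq, mu_mean)
shade(mu_PI, M_seq)
```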
Finally, posterior prediction checks. The goal here is two-fold. The first is to figure out whether the approximation of the posterior worked. Sometimes it doesn't, and the posterior prediction check will tell you, because there will be a very poor match between the posterior predictions of the model and the raw data. So that's what we're going to do: compare the posterior predictions to the raw data and hold them up against one another. Either your computer failed, or you failed, or some combination of the two, and that's very useful to know, because things fail and it's not your fault: the universe is hostile to human life, we're all struggling to exist, fighting entropy; it's all fine. The other goal is that it can inspire you to look at the cases that don't fit well and figure out what you'd need in order to make more robust causal inferences about the system.

Let me show you, before I move on to the next example, what we do here. On the horizontal axis is the observed divorce rate in each state; on the vertical is the posterior distribution of the predicted divorce rate for each state, with the points being the posterior means and the line segments being the 89% intervals of the mean. The diagonal dashed line is unity: where prediction and observation are exactly equal. The model hits that for average states, because that's what regression does: it does really well at averages. But then there are states where it makes bad predictions, like Idaho. I mentioned Maine before, so I could pick on Maine again, but let's focus on Idaho. Idaho is a state that no one in Europe knows the location of; you just know it's out there. It's got mountains, it snows in the winter, and it has a very high divorce rate... no, actually, it has a low divorce rate, much lower than is predicted from its median age of marriage. There's a low median age of marriage in Idaho and a very low divorce rate, and that's why there's a mismatch with what the model thinks. Observed divorce is to the left of zero (a low divorce rate), while the model thinks Idaho should have a divorce rate more than one standard deviation above the mean. It's getting Idaho really wrong. Anybody who's lived in the US, especially the western US, knows the answer to this. What is it about Idaho? The answer is the Latter-day Saints: there is a large population of members of the Church of Jesus Christ of Latter-day Saints, locally known as the Mormons, in Idaho, and members of the Church have a very low divorce rate. That's what explains it. Utah is another state with a very high frequency of members, though Utah is less of an outlier. So, in any particular analysis, looking at the cases that are predicted badly can give you ideas and help you understand what's going on in the system.
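A sketch of this posterior prediction check: predicted versus observed divorce rate, with the dashed unity line and an interval segment per state.

```r
mu <- link(m5.3)              # predictions at the observed predictor values
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)

plot(d$D, mu_mean, xlab = "observed divorce (std)",
     ylab = "predicted divorce (std)")
abline(a = 0, b = 1, lty = 2)   # the dashed unity line
for (i in 1:nrow(d)) lines(rep(d$D[i], 2), mu_PI[, i])
```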
Alright, let me spend the remaining 15 minutes talking about another good thing regression can do. One good thing, which I just showed you, is that it can remove spurious correlations: statistical control, or stratification, which is a better term, can tell us the difference between a path that actually runs from one variable to another and a path that's just pretending to be there because of some common influence on the two variables. Another thing it can do: when there are two predictors that both influence an outcome, but in different directions, they can mask one another, and you need both in the model to see that either of them matters, or rather, to get the total causal effect of both. Here's a real example from a primate dataset. This tends to arise when you have two predictors, both with real effects on the outcome, acting in different directions, and correlated with one another. As a consequence, in nature, they hide one another's effects, and if you don't measure both of them, you'll be led to believe that neither matters as much as it actually does. I should also note, as a last thing here, that noise in predictors, something we'll deal with much later in the course, can also mask associations, but it's mechanistically a different effect, so what we do here won't handle everything. That phenomenon is called residual confounding: if you have a lot of measurement error, because your measurement apparatus is bad, you cannot see that there's actually a causal effect. That's probably pretty straightforward. In fact, the predictor that is measured most precisely will often show up in such studies as the truly causal one, even if it's not. I think there's a lot of this going on.

Alright, three primates here: one lemur and two monkeys. On the left we have Eulemur fulvus, which is really one of the more magnificent of the lemurs, in my opinion; we can argue about this at lunch, but I think it's a great one. Everybody likes the one that looks like a raccoon, but this is a better lemur. We're going to be interested, in this data, in the association between milk energy, how energetic the milk is that these primate species give to their young, and how brainy they are, by a particular measure: the percent of the brain that is neocortex. Why are we interested in this? Well, I'm an anthropologist, and anthropologists are interested in brain evolution because humans have big brains; it's conspicuous about our species. Narcissistic as that is, it's also a general topic of interest beyond primates and human evolution. Primates are mammals, and like all mammals they feed their dependent offspring with milk. Some mammals have really highly energetic milk. Seals, for example: seals basically ooze butter for their offspring, a good image for you. Why? Because they can't carry their offspring with them while they're foraging, so they flop up on the ice, ooze some butter, and go back in the water to get some fish. That's the seal's life; it's glamorous. Primates, in contrast, carry their offspring with them almost all the time. Most of them do; some really small ones don't necessarily, but most do. As a consequence, the energy density of primate milk tends to be lower, much closer to the average for mammals. Human milk is not particularly energetically rich. So we have Homo sapiens in the middle: about 0.7 kcal per gram in our milk, and 75% of our brain mass is neocortex, which is the wrinkly part. And then finally Cebus, by some measures the brainiest primate: highly energetic milk, more so than humans, and 68% neocortex by brain mass. So, with a bigger sample of primates, we're interested in the connection between these things, because the hypothesis is: if you have a bigger brain, you have to grow it in your offspring, so you have to give them more energy. Can we see a signal of selection on milk energy from braininess? Do humans put more energy in their milk than lemurs because they're brainier?

Here's a data set I received from my colleague Katie Hinde; I know some of you know Katie, and she kindly sent me all this data. We've got a sample of primate species, and for each we have the neocortex percent (the percent of brain mass that is neocortex), the kilocalories per gram in their milk, and the body mass of the mothers. What I want to show you is a pairs plot. There's a particularly strong correlation in these data: if you look at the intersection of log mass and neocortex percent, you'll see a diagonal upward trend, a positive correlation. (When you take the log of a variable, think of it as a magnitude, like the strength of an earthquake: not the raw value, but the exponent on it.) Log body mass is strongly correlated with neocortex percent: bigger primates are brainier primates, in the sense that they have proportionally more neocortex, not just bigger raw brains, although that's true too. You see no particularly strong correlation between either log mass or neocortex percent and the kilocalories per gram of milk; it's much cloudier there. Which is unfair; we'll do the real thing in a minute.
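To follow along: the milk data ships with the rethinking package. A sketch of the preparation and the pairs plot, using the short names K, N, M as in the book; species with missing values are dropped.

```r
data(milk)
d <- milk
d$K <- standardize(d$kcal.per.g)       # milk energy
d$N <- standardize(d$neocortex.perc)   # neocortex percent
d$M <- standardize(log(d$mass))        # log body mass: magnitudes, not raw values

dcc <- d[complete.cases(d$K, d$N, d$M), ]   # drop incomplete cases
pairs(~ K + N + M, data = dcc)              # the pairs plot discussed above
```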
So what's going on? If you just did bivariate regressions, not much would happen; we'll do that in a second. First, here's what we call the necessary sermon on priors, again. We need to do some prior predictive simulation here to figure out what's going on. Again, you can standardize these variables; that helps a lot, and in particular it helps us with alpha. What I show you on the left: I think of Normal(0, 1) as pretty flat, and conventional priors would be more like Normal(0, 100). You'll see priors like that all the time, and of course if you're not doing a Bayesian model, it's essentially flat, because you're saying that before you see the data, anything is possible. Then we sample regression lines from this prior, and it's crazy; it's like a Jackson Pollock painting. This is not a good prior. Now, again, this sample would overwhelm even the silliest prior here, but we're practicing while it's safe, because it won't be safe at some point in your life, and you want to have practiced ahead of time. All we need to do to get the regression lines to live in the outcome space is contract alpha, because we don't want it to move far from zero (we know it should be about zero), and bring the slope parameters down so they're tighter as well. So we're back to Normal(0, 0.5): if you standardize the predictor and the outcome, a slope prior of Normal(0, 0.5) will keep you in the outcome space for the most part. It's not a terrible prior.

OK, here are the bivariate regressions of each predictor on our outcome of interest, the energy density of milk in kilocalories per gram. There's a slightly positive relationship between neocortex percent and kilocalories per gram, but only very slight, and you'll see that the bowtie-shaped compatibility interval is pretty wide. And there's a slightly negative relationship between log body mass and kilocalories per gram. Now what happens when we put them both in the same model? Here's the multiple regression model; nothing too shocking. There are two terms in the linear model now, and two slope priors, basically the same. We run it, and now look at the marginal posteriors in the table below. For neocortex percent it's consistently positive, far above zero; all the posterior mass is above zero. And for log body mass, all the posterior mass is below zero. These are very strong, consistent relationships, now that both are in there.

Let me show you this graphically. At the top are the bivariate regressions, plotted against the data. At the bottom are the counterfactual plots, holding the other predictor constant while varying the one on the horizontal axis of each plot. You'll see that the multiple regression sees both of them as important predictors of the energy density of milk, but only if they're both present. This is the masking effect: one is positively related to the outcome, the other is negatively related, and they're correlated with one another. Bigger primates tend to have bigger brains, and bigger brains need more energy. And, well, it's not that bigger-bodied species need less; it's that they have longer developmental times, and the milk energy density goes down as a consequence (the milk gets more water). So these are antagonistic effects, and they're correlated across species. But they're not perfectly correlated; they don't carry exactly the same information, so in a multiple regression you can pull them apart.
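The multiple regression with both predictors, as quap code, essentially the book's model m5.7:

```r
m5.7 <- quap(
  alist(
    K ~ dnorm(mu, sigma),
    mu <- a + bN * N + bM * M,
    a ~ dnorm(0, 0.2),
    bN ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dcc)

precis(m5.7)   # bN clearly positive, bM clearly negative, once both are in
```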
This sort of effect can happen in a lot of data sets. Here's the simplest way it could arise; the code to simulate a fake data set that exhibits this masking relationship is sketched below. This is how I learn these things: if I can fake the effect, then I understand it. We don't publish our fake data, but we definitely work with it, because then we're sure we understand what's going on.

In DAG form, what you see here, at the middle of the top of this DAG, is an unobserved confound: some common cause U of body mass and neocortex percent, some life history optimization variable, ecological or whatever it is; pick your favorite thing from the primate evolutionary literature. Then there's a causal influence of each of those, body mass and neocortex percent, on energy density: one positive and one negative. This structure is sufficient to produce the pattern we see in the actual data. So this sort of stuff can happen a lot. If we could measure U, that's what we'd regress on, because that would tell us what's actually going on. But we can't. Instead we have the individual life history characters, we end up regressing on those, and we get very, very confused. That's how I explain the primate evolutionary literature: it's very, very confused. Does this make sense?
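Here's the kind of simulation meant above, a sketch consistent with the DAG: an unobserved U drives both N and M, and K responds positively to N and negatively to M. Regress K on either predictor alone and you'll see little; include both and the effects appear.

```r
# Fake data exhibiting the masking pattern: M <- U -> N, N -> K <- M
n <- 100
U <- rnorm(n)          # unobserved confound (some life-history variable)
N <- rnorm(n, U)       # neocortex percent, driven by U
M <- rnorm(n, U)       # body mass, also driven by U
K <- rnorm(n, N - M)   # milk energy: positive in N, negative in M
d_sim <- data.frame(K = K, N = N, M = M)
```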
I've got three minutes, so I'm right on time. I want to end by saying some useful things about categorical variables. Often we have predictor variables which are not really continuous; they represent discrete, unordered categories, things like the country you're in, or gender, or species. You want to include those on the right-hand side of a regression formula, but obviously they're not continuous. These are useful variables, because the mean can vary by category, but you can't enter them as continuous values. So what do we do? I'm sure most of you know what you'd do, but I want to say something a little different and unconventional here. There are two general approaches. The first is to use dummy variables; this is what most automated software will do: it takes your categorical variable and recodes it into what are called dummy variables. I'll show you what that means on the next slide. The other strategy, which I think is nearly always superior, is to create something called an index variable; I'll show you what that means too. I might run out of time here, but then we'll finish up next time.

So, what's a dummy variable? You take your categorical variable and code it into a series of 0/1 indicator variables; calling them indicator variables is a little nicer. They're called dummy variables not because they're dumb (that would have a 'b' in it, right?) but because they stand in for something: a dummy, a decoy variable. For example, in the Kalahari height data there's a column called male, coded 1 and 0. Zero means not-male, that's all it means, and 1 means male. Height varies by sex, at least in humans, so you could include this in a linear regression of height in that data set, and this is what the structure would look like. The linear model looks just like one with a continuous predictor, but because the variable is coded 0 and 1, all it effectively does is turn a parameter on and off, adjusting the mean. It effectively makes two intercepts, one for not-male and one for male. Female is the only other sex or gender in this data set, so not-male means female. So alpha is the intercept for females, and alpha plus beta-m is the intercept for males. Does that make sense?

The problem with dummy variables (we'll set this up now and resolve it next time) is that as you get more and more possible categories, you need a bunch of them, the linear model gets long and messy, and you have to pick priors for every one of those differences from the baseline category. With seasons, say, winter, spring, summer, and fall, you need three dummy variables: one category gets aliased to the intercept, so alpha is its expectation, and the others each get a parameter you add on. If you have a lot of categories, imagine it was country (I don't know how many countries there are in the world, but there are a lot), you need that many dummy variables minus one. It gets very annoying very fast, and hard to manage, and then you have to set priors for all those things. And a consequence is that you end up assuming one of the categories is less uncertain than all the others, because two parameters make the prediction for every category except one. Think about the height example: for females, alpha is the only prior that matters, but for males, alpha and beta both matter and get added together, so your model thinks, a priori, that you're less certain about the height of males than about the height of females. And that's not true: you'd like to assign them both the same prior, if you really didn't know anything about the ordering (which is not quite true, because you're human, but that would be the principled case). That's a hidden assumption of using dummy variables. So the better option is index variables, and I'm going to stop right here; this is where we'll pick up when you come back. I'll teach you about index variables, the strategy we use for most of the examples in the book. It's easier, absolutely easier, in almost every way. OK, thank you for your time, and I'll see you on Friday.