Welcome back everyone to Statistical Rethinking 2022. This is the fourth lecture, in which I'll extend linear models to capture more complex scientific relationships, and we'll talk a lot more about scientific goals in this lecture than we did in the previous one. To remind you where we were last time, I introduced this idea of linear models as a set of devices for constructing approximations of natural systems. So this is the Ptolemaic model of the solar system animating on your screen, in which two circles are able to reproduce the so-called retrograde or zigzag motion of Mars in the night sky. However, the model is not physically correct. The structure of the orbits of Earth and Mars looks nothing like this. Nevertheless, a system like this can produce arbitrarily accurate approximations of the paths of the planets in the night sky. The same system of circles on circles can construct any continuous path. So here's what's called a complex Fourier series, and you can make a musical note, because the outline is just a continuous path, like an orbit, and all of those circles on circles form a weird little golem that reconstructs this image. And if you have enough circles, you can draw anything, as long as it's a continuous path, an orbit, as it were. So musical notes, gears, and so on. Linear models are similar, in the sense that by putting a bunch of straight lines together, you can erect massive and complicated structures of arbitrary accuracy with respect to some data set. Now, of course, we have to be careful in how we do this, so that we're capturing the right structure and so that we understand that we're merely building an approximation out of little straight-line building blocks. But this makes linear models incredibly powerful, because they can do lots of extra-linear, or nonlinear, things. In the lecture today, I want to introduce you to two of those things. The first is the study of categories, that is, discrete, unordered types of things in the world. Linear models can handle that. Categories are often modeled with devices called dummy variables or indicator variables. In this lecture, I'm going to focus on another way of doing it, called index variables. They're all equivalent, really, but I'll explain why I prefer index variables. And then after the intermission, we're going to turn to making curves, and there are two ways I want to introduce to you to do that. The first is extremely common: polynomials. It's common, but it has a huge number of drawbacks and probably should only be used in rare cases. And then there are splines, which are much more flexible and can be extended to a bunch of related additive structures, so-called generalized additive models, which are very powerful. Okay. This lecture is going to focus much more on scientific goals, but there's lots of statistical machinery we're going to need to get us there. We're going to be drawing the owl, and part of drawing the owl is drawing the statistical model in a way that reflects our scientific goals. As our analyses get more complicated, you're going to see that we have to start thinking about the statistical model and its relationship to our scientific estimands in a much more careful way. So we're going to incorporate causal thinking into how we draw the statistical models themselves and how we process the results.
In order to do that, in the example I use today, we're going to have to talk about categories. And this comes up all the time in data analysis: how do we cope with causes of phenomena that are not continuous? There are different kinds of non-continuous causes. We're going to talk about categories, by which I mean discrete, unordered types, like the seashells on the right of this slide. There's no gradation between the different types. They're real types. And if you count them, then you have counts of each type. Those are categories. Typically, not always, but typically, what we want to do on the statistical side of analyzing categories is to stratify our analysis by each category. That means, for example, in a linear regression, we fit a separate line for each category. Okay, a little bit about the attitude I encourage you to have towards this lecture and the ones in the future. There's a lot of complexity in this material, as I keep saying, and at some point almost everybody feels a bit confused. I just want to reassure you that your sense of confusion is merely a result of your attention. I'll say that again: your sense of confusion is merely a result of your attention. There is no way to pay attention to complex material and not feel confused. So you should be proud of your confusion, and you will work your way out of it. You can back up. You can do things again. And as I mentioned in the previous lecture when I spoke about flow, you don't have to understand every piece of it in order to keep moving forward. I think this confusion battle is a really significant one in learning statistical machinery. I guarantee you that everybody gets confused by this material. I still get confused by it, and that's why I keep very detailed notes about things I've done in the past. We're in this together, and we'll keep swimming forward. We're going to come back to the Howell height and weight data, and we're going to introduce two more variables in this lecture. We're only going to focus on one of them for now: the sex of the individual, whether they're male or female. If you look in the Howell data, and we're still only focusing on adults for now, individuals 18 or older, you'll see that there's a column for age and a 0/1 indicator variable indicating whether the individual is male. And on the right I've plotted the relationships among height, weight, and sex, using color for whether the individual is a woman or a man.
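If you want to follow along in R, loading the data looks something like this. A minimal sketch: Howell1 is the data set used in the book, and age and male are the columns just described.

    library(rethinking)
    data(Howell1)                        # Kalahari (Dobe !Kung) demographic data
    d <- Howell1[ Howell1$age >= 18 , ]  # adults only, for now
    str(d)                               # height, weight, age, male (0/1 indicator)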
Now, what I want us to do is model the relationship among these three variables, or rather model the influences of height and sex on weight, in adults in this population in the Kalahari. In order to do something sensible statistically here, we're going to have to think scientifically first. There's a distinction to be drawn between how height, weight, and sex are causally related to one another and how they're statistically related to one another. Of course they're all associated: height, weight, and sex are statistically associated in quite strong ways in every human population. But their causal relationships are not obvious from those associations, and so we have to think a bit. What we're going to do is build up the heuristic causal model on the bottom of this slide. This is a DAG, a directed acyclic graph; I introduced this device in the first lecture of the course. But we're going to build it up in pieces, so hang on. The causes of these variables aren't in the data themselves. They're reflected in the data, of course. The data reflect the causal relationships among them, but you can't simply read the causal effects off the data. Let me give you a few examples, and this will let us walk forward and build up a little toy causal model of these variables. Consider just the relationship between height and weight, which we focused on in the previous lecture. The data plotted on the right are equally compatible with height influencing weight and with weight influencing height, and I've drawn the two DAGs to represent that on the left of this slide. An arrow going from H to W means that height influences weight, and an arrow going from W to H means that weight influences height. Both of those causal models can produce the same data you see on this slide. But of course we have some science. We know something about human biology, and it's reasonable to pick the top one here, that height influences weight. Why? Because if you think about an intervention on height, if you change an individual's adult height, you end up changing their weight, because there's less of them, so they have less mass. In the reverse direction, you can change an individual's weight without changing their height; there are lots of ways to do that. One way to think about this biologically: given the way humans grow, height influences weight, because as you make a person taller, there's more of them. Let's think similarly about the relationship between height and sex, a person's birth sex. Again, I've plotted the data on the right. These are the kernel density estimates of the heights of women and men in the sample. I've plotted women in red and men in blue, and unsurprisingly, like in all human populations, men are a little bit taller on average than women in the same local population. These data are equally compatible with the idea that height influences sex and that sex influences height. But you'll find the first one a bit silly, because you have a lot of scientific knowledge about the human animal, and you know it doesn't make any sense to say that adult height influences a person's birth sex. The causality is completely wacky. But it makes plenty of sense to say that someone's birth sex influences their height. Similarly, we have the same little paradox, if you will, with weight. There's a small mean difference in adult weight between men and women in the same local population, and we find that again in this sample. And again, these data are compatible with both DAGs in the lower left of the slide. But some scientific background information leads us to assert that, if there's any influence of one of these variables on the other, it's birth sex influencing height and weight, and not the reverse. So let's put all these pieces together into a DAG, a heuristic causal model that includes all three variables. You can read the arrows as statements about what influences each variable: height influences weight, sex influences both height and weight, and weight is influenced by both height and sex. Those are the relationships indicated by the arrows in this diagram. Of course, to do more work, we have to specify exactly how those influences work, the shapes of the functions. But you can do a lot with this heuristic representation. Right, so before you pick functions, you can ask: what's the mathematical content of a heuristic causal model like this? It's just what's on the right of this slide: some anonymous set of functions. Height is determined by some function for height, which is a function of sex, H = f_H(S). Similarly, weight is determined by some function for weight of both height and sex, W = f_W(H, S). When we build statistical models, or fully detailed generative causal models, we have to fill in these functions so they're no longer anonymous.
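To make those anonymous functions concrete, here is a minimal generative sketch in R. The functional forms and all the numbers are invented for illustration; they are not the lecture's code, just one way to fill in f_H and f_W:

    # simulate sex, then height as a function of sex,
    # then weight as a function of height and sex
    sim_HW <- function(N) {
      S <- sample(1:2, size=N, replace=TRUE)                  # 1 = female, 2 = male
      H <- rnorm(N, mean=ifelse(S==1, 150, 160), sd=5)        # f_H(S), hypothetical
      W <- rnorm(N, mean=0.5*H + ifelse(S==1, 0, 5), sd=5)    # f_W(H,S), hypothetical
      data.frame(S=S, H=H, W=W)
    }
    d_sim <- sim_HW(1000)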
But let's think conceptually for a second about the consequences of an innocent little graph like this. I'm going to start animating this thing, and what's going to happen is that along the causal paths, following the arrows, the effects of each variable are going to flow. What I want you to see is that causal effects are transmitted down the line. That is, they don't just stop when they hit a variable; they can keep moving on to further variables along what's going to be called a path, that is, some connecting set of arrows. Blue is going to represent sex and red is going to represent height. I'm starting the animation now. You can see sex influences height and weight. So weight gets a direct influence from sex, and this is accumulating up there in the upper right. But the blue that hits height, you could say it contaminates height, in a sense. There's a causal force of sex that works through height. Even though height has a direct effect on weight, because when you make an individual taller or shorter you change their weight, if sex has a significant causal impact on height, then part of the total causal effect of sex works through height. And this is already the sort of thing we need to think carefully about when we do statistical inference. So let's distinguish a set of different scientific causal questions. Each of them implies a different manipulation of this heuristic causal model. It's no longer the case that we just have one extremely simple causal model and one almost identical statistical model, like in the previous lecture; now we have to make distinctions. Different causal questions often require different statistical models. The first question we might have, for example, is: what's the causal effect of height on weight? Remember, a causal effect in statistics is not some mystical philosophical sort of thing. It merely means that you're able to predict the consequence of intervening on a variable. So the causal effect of height on weight means: what is the effect on weight of intervening on height? That is, when we change an individual's height, what is the distribution of effects on weight? This question is addressed by the top part of the DAG on the right, which I've highlighted in red. Another causal question we might have is: what's the causal effect of sex on weight? Now, this involves the whole graph. So I've highlighted the variables that are relevant, and I've left height black because it's not relevant to this query. That will sound a little confusing right now if you're new to this sort of thing, and that's fine; I'm going to layer in the explanation for why. But the key intuition was in the animated DAG a couple of slides back. Remember that part of the causal effect of sex operates through the height path. So we want to consider all the ways that sex influences weight to get the total causal effect of sex on weight. That is, to answer the question: what would be the consequence of changing someone's birth sex?
You've got to account for all the pathways by which birth sex influences adult weight. Finally, there's the direct causal effect of sex on weight. Is there a direct causal effect of sex on weight, and how can we estimate it? In this case, it turns out we need to include height in the analysis, and the reason is that we're going to use height, in a sense, to block the indirect causal effect. Again, if you're new to this, that's going to be confusing; I'm going to layer in the understanding as we go. But we have to explicitly consider the mediation effect, as it's called, of height on weight in order to isolate any direct effect of sex on weight. Okay, let's consider just two of these for the sake of this lecture. These are our estimands, and we're going to turn them into estimates. The first question, on the left of this slide: what's the causal effect of sex on weight? This is the total causal effect of sex on weight. And then the question on the right: what is the direct causal effect of sex on weight, and how can we isolate it? To answer these questions statistically, we're going to have to model sex as a categorical variable, so this gives us a good excuse to teach you how to work with categorical variables. Let's draw the owl; we're drawing categorical owls here. There are several ways to do this in statistics, and all of them are technically equivalent, although things that are technically equivalent are often not equivalent in practice. Probably the most common way, which you've seen before, is the use of things called dummy or indicator variables. These are a series of 0/1 variables that stand in for the categories. So if you have ten categories, typically you'll end up with nine of these dummy or indicator variables. This works fine, but it's a big bookkeeping nightmare, in my opinion. What I prefer are called index variables. They're equivalent, but in code it's much easier to use them. With index variables, and we'll have a detailed example in the next slides, you assign an index value to each of the unique categories. So if you have four categories, you use the index values one, two, three, and four. But regardless of how many categories you have, there are infinitely many integers, so you can always do it with an index variable. We're going to use index variables because if the number of categories changes, you can use the same statistical modeling code; nothing really changes. Whereas with dummy and indicator variables, you have to change the whole model to add the new dummy variables. It's also nearly always easier to specify sensible scientific priors if you use index variables than if you use dummy variables; I say more about that in the book. And when we get to multilevel modeling in the second half of the course, those models always use index variables, so if you learn index variables now, you won't have to deal with a new concept when you get to multilevel models. Let me give you a toy example before we come back to the real data set. Suppose our categories are these colors, typical printer ink colors: cyan, magenta, yellow, black. That gives us four categories. And what we do is create a variable in our data set, an index variable, that replaces each of these names, cyan, magenta, yellow, and black, with a unique integer: one, two, three, and four.
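In R, constructing such an index variable can be as simple as this. A toy sketch with made-up data; the variable names are hypothetical:

    colors <- c("cyan","magenta","yellow","black","magenta","cyan")
    # match() replaces each label with its position in the list of categories
    color_id <- match(colors, c("cyan","magenta","yellow","black"))
    color_id   # 1 2 3 4 2 1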
There's no order implied by the one, two, three, and four. They're just numerical substitutes. Their function, actually, is to be index positions in a vector, which is just a kind of list, of parameters. So now we're going to have a vector alpha of parameters, which we'll estimate, and this vector will estimate the influence of color. Inside this vector there are really four unique parameters, named alpha sub one, alpha sub two, alpha sub three, and alpha sub four. What the index values do is give us a very elegant way in the code to pull out the right parameter for each case in the data set, and I'll show you how this works. So, repeating the alpha vector at the top here: alpha sub one is the parameter for cyan, alpha sub two for magenta, alpha sub three for yellow, and alpha sub four for black. Then say we had a linear regression with some outcome, like how much a person likes the color. In the linear model for this, we simply put alpha, subscripted with color bracket i, where color is the name of our index variable. That means for the i-th case in the data set, you find the color index value, and if it's one, it'll pull out the first element of the alpha vector. I'm going to walk you through this in a detailed way when we get to the real data set, so if this isn't totally clear right now, you'll get another chance, or two. Okay, let's do it for the weight data now. We're going to build a model of weight as it is influenced by a person's sex. If we were just going to estimate average weight, it would be the model on the screen here: weight has some normal distribution, and its mean is a parameter alpha. That's often called an intercept. In this case there's no slope, but we would often still call alpha an intercept. What we do when we stratify by a categorical variable is construct a vector of these alphas and subscript them by the categorical variable of interest. So what I've done on the left of this slide is construct a variable S for sex, which is an index variable. There are two sexes, one and two: S equals one indicates female, and S equals two indicates male. And so S sub i, or S bracket i (it's a bracket because if you made the i a subscript on this slide, it would be really tiny; I think you can imagine what that would look like), tells you the sex of the i-th person in the data set. That lets us extract the correct alpha parameter for the i-th case. Let me animate this a little to give you an idea of what's going on. Consider i equals one, the first individual in the data set, whose S value is two, which means male. So S bracket i, that is S bracket one, equals two; you can see that up on the left. That means the code becomes alpha sub two, and that's the parameter that gets inserted into the model. That parameter will then functionally estimate the mean weight for males in the data set. Likewise, we can move to the second individual, i equals two, and S bracket two equals one; that is, the value of S in the second row of the data set equals one. And that pulls out alpha sub one, correspondingly.
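Here's what that lookup does mechanically in R, with toy numbers just to show the indexing:

    a <- c(10, 20, 30, 40)   # hypothetical parameter vector: a[1]..a[4]
    S <- c(2, 1, 2, 2, 1)    # index variable for five cases
    a[S]                     # 20 10 20 20 10 : the right alpha for each case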
You also need some priors for this. In the code, assigning priors here doesn't really look any different. In mathematical notation, you'll put a little sub j, or some other index, next to the alpha to indicate that there's a series of them, that there's more than one alpha parameter. But you can assign each of them the same prior, which corresponds to the idea that, prior to seeing the data, you have the same expectation about the distribution of each. You don't have to do that, but it's a very neutral way to approach estimating differences among categories. Okay. In code, the quap tool in the rethinking package makes this quite easy for you, because it automates all this lookup. You do have to construct the index variable yourself, though, and that's what I'm showing you on this slide. In the top part of the code box on the left, you'll see that in the dat list, the variable S is constructed by just taking the male variable in the data set and adding one to it. That gives us an index variable where one means female and two means male. Then in the quap model, all you have to do is bracket the relevant parameter with the S variable, and it does the rest: quap counts how many unique categories there are and makes a vector of the right length. Otherwise the code is the same. This is one of the reasons that if the number of categories changes, the code stays the same. You would just change the contents of the S variable, and quap will figure out the rest.
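The code being described looks something like this. It's a sketch reconstructed from the book's version of this model; the structure matches what's on the slide, but treat the exact prior values as assumptions:

    library(rethinking)
    data(Howell1)
    d <- Howell1[ Howell1$age >= 18 , ]
    dat <- list(
      W = d$weight,
      S = d$male + 1 )         # index variable: 1 = female, 2 = male
    m_SW <- quap(
      alist(
        W ~ dnorm(mu, sigma),
        mu <- a[S],            # one intercept per category
        a[S] ~ dnorm(60, 10),  # same prior for each alpha
        sigma ~ dunif(0, 10)
      ), data = dat )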
Okay, so you run that model, and you get the quadratic approximate posterior distribution, the joint posterior distribution for all of the parameters. Then you need to do something with it, so let's spend a little time processing and understanding what we've just gotten. Remember, what we're after is the total causal effect of an individual's birth sex on their adult weight. The first thing we might do is look at the posterior mean weight implied by this model. In this code, I've extracted samples from the posterior distribution, and then we use those to plot the densities here. All I've done is plot the posterior density for alpha sub one, the posterior mean weight for women in the sample, and the posterior density for alpha sub two, the posterior mean weight for men in the sample. These are the means; these are not the observed distributions of individuals' weights, which we'll do next. These are just the posterior means, plotted in the upper right of this slide. Unsurprisingly, the men are reliably heavier on average. But men are not reliably heavier as individuals; I'll show you that next. They're reliably heavier on average, but they're not reliably heavier. Next, we do the actual distributions, and to do this we need to simulate observations. So W1 here is a posterior predictive distribution of adult weights for sex one, that is, women in the sample. You'll see that simulation just uses rnorm; we've done this before. I simulate a thousand women, and for the mean I insert alpha sub one, and then we have sigma in there to give us the scatter around the mean. Same sort of code for the men, but with alpha sub two instead. And then I just draw the densities. Now, you'll see there's a lot of overlap. Of course you all know this: in any local population, men are on average heavier than women, but there are lots of women who are heavier than men, and lots of men who are lighter than women. There's a lot of overlap in these distributions. So think about this overlap. If you want to think about differences between categories, what you need to do is compute what's called a contrast. And so I say: always be contrasting. You can't interpret the overlap of distributions, because parameters are correlated with one another. If there's uncertainty in a parameter, that uncertainty may be highly correlated with uncertainty in some other parameter. Or it may be associated in some other way, because correlation is a very limited concept of association; it's only linear. So what you really need to do, the proper thing to do, and this is true in all statistical frameworks, it's not only a Bayesian thing, is compute something called the contrast, which is the distribution of the difference between the categories. It's never legitimate to simply compare the overlap in the parameters. For example, for the distributions plotted on the right of this slide, what we'd like to know is the distribution of the difference in predicted weight between men and women in this population. We can't get that from just plotting these densities over one another. We have to compute it, and this gets us our estimand: what would be the counterfactual causal effect of changing someone's sex at birth on their posterior predicted weight? Would they be heavier, or would they be lighter? There'll be a distribution of effects, and that's what we want, because that distribution of effects is the estimand we're after. So let's compute that contrast distribution and see what it looks like. Here's an extended example to bring home why you don't compare overlap. On the left of this slide, I've just made up two parameters. These are just simulated numbers, only conceptually related to the analysis we're doing. I've simulated two variables which have a strong positive correlation; you can see that. Now, I've plotted densities for these two variables: parameter one in blue, and parameter two in red. Parameter two tends to be a little bit smaller than parameter one, and they overlap a lot in their values. However, since they're so strongly positively correlated, when we take all the points on the left of this slide and compute the difference between parameter one, the x-axis value, and parameter two, the y-axis value, and then plot the distribution of that difference in black on the right, you'll see that it's very narrow. Nearly all of its mass is close to minus one, and that's because there's a strong positive correlation here. So the apparent visual overlap between parameter one and parameter two is not a good guide to their difference. If you want to know the difference between two distributions, you've got to compute the difference. You can't just use overlap.
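You can reproduce this kind of demonstration in a few lines. A minimal sketch; the correlation and means are invented for illustration:

    library(MASS)
    # two strongly correlated "parameters" with overlapping marginals
    Sigma <- matrix(c(1, 0.95, 0.95, 1), 2, 2)
    p <- mvrnorm(1e4, mu=c(0, 1), Sigma=Sigma)
    # the marginal densities overlap a lot...
    # ...but the contrast is narrow and clearly nonzero:
    contrast <- p[,1] - p[,2]
    mean(contrast); quantile(contrast, c(0.05, 0.95))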
This extends conceptually to a bunch of other bad habits. You can't directly compare confidence intervals to one another: if there are two different estimates and you want to understand their difference, you have to compute that difference and look at the confidence interval of that difference. I'll say that again. If there are two parameters of interest, you can't compare their confidence intervals to decide how different they are. You have to compute their difference and then look at the confidence interval of that difference. And this goes for p-values as well. You never compare p-values to one another. You need to construct the difference between the things of interest, and then, if you want a p-value, get the p-value of that. Okay. So how do we compute such a contrast? In the top output box on this slide, on the left, I've extracted the samples from the posterior distribution. They're in the object post, which is the convention I use in the book and in the whole class. If you look at the structure of this, you'll see we've got two parameters in the posterior distribution. There's sigma, the standard deviation around adult weights, with 10,000 samples of it. And then there are the alphas. alpha is a matrix: it has 10,000 rows, because there's a sample in each row, and two columns, because there's a column for each sex, for each category. So when we extract these things, we're going to use our little bracket notation to get the column of interest for each calculation. Now, in the calculation on the bottom left, I'm computing the contrast in mu, where mu is the average adult weight, between the sexes, by taking alpha sub two out using the a[,2] notation. Putting nothing before the comma means all of the samples. Those of you who use R are familiar with this. If you're not so used to R, let me say it again: this notation, a[,2], gives you all the rows, all the samples; that's why there's no number before the comma, which implies all of them; but only the second column. So you get all the samples for the male alpha, alpha sub two. Then we subtract all the samples for alpha sub one from that. And since this difference uses corresponding samples, the correlation between these two things is preserved. This is the only correct way to do it. Then we plot that contrast, and that's what I'm showing you at the bottom here. This is the estimate: the posterior mean weight contrast between men and women in this particular Kalahari population is a little below 7 on average, but there's uncertainty about it; it ranges between 5 and almost 9. That's kilograms, by the way; I should have said it's kilograms. We can also do the contrast on the full distributions, and there's a bit more value in this too, because you get to think about this overlap again and consider how often the theoretical causal intervention of changing someone's birth sex would not make them heavier, because there's lots of other stuff that affects a person's body weight. To do this, intuitively, we simulate weight distributions and then subtract one from the other. So now we compute the weight contrast: we subtract a set of a thousand simulated adult female weights from a thousand simulated adult male weights, and then we plot that distribution. I've plotted this in the lower right of the slide, and colored it so you can see that 82% of the area under this curve, colored in blue, covers cases where the man is heavier. But in 18% of the cases, the woman is heavier in the simulation. This is a consequence of the significant overlap in the distributions.
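Put together, the contrast calculations look roughly like this, continuing the m_SW sketch from above:

    post <- extract.samples(m_SW)    # 10,000 samples of a (a matrix) and sigma
    # contrast of posterior means: male minus female
    mu_contrast <- post$a[,2] - post$a[,1]
    dens(mu_contrast, lwd=3, xlab="posterior mean weight contrast (kg)")
    # contrast of full posterior predictive distributions
    W1 <- rnorm(1000, post$a[,1], post$sigma)  # simulated women
    W2 <- rnorm(1000, post$a[,2], post$sigma)  # simulated men
    W_contrast <- W2 - W1
    mean(W_contrast > 0)   # proportion of cases where the man is heavier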
Okay, where are we at? We had our two estimands, and we've filled in one of them so far: the total causal effect of birth sex on weight. We have it in the form of these two contrast distributions, and these measure the causal effect of birth sex on weight through both paths in the DAG. Height, which is in black here, wasn't even in the model; we haven't used it yet. But we've been able to estimate the total effect of birth sex on weight. Now let's worry about the direct effect and see what we have to do. We have to add height to the model. We're still going to use index variables, and now we're going to take our equation for a line, which is what I have up on the slide here. This is just a linear regression of weight on height, introduced in the previous lecture. Focus your eyes on the intercept and the slope, because what we're going to do is subscript both of those by sex. So it has the same structure as before, but now we have two vectors of parameters: an alpha vector with two elements and a beta vector with two elements. Each sex gets its own intercept and its own slope. This is, as I said before, stratifying by sex: we get a different regression relationship for each sex. All the lookup works the same way, except that now the linear model is simultaneously looking up the proper alpha value and the proper beta value. You'll always get alpha two when you get beta two, and alpha one when you get beta one, because of the way the vectors are set up. The code is not very different. The data setup is the same as before, and now you'll see that the slope on height is back in the model; again, we just put brackets on the slope, both in the declaration of the prior and in the equation for the mean, for mu.
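The model with height added looks roughly like this. This sketch centers height at its sample mean, Hbar, which is what the book does; the exact prior values are best-guess reconstructions:

    dat <- list(
      W = d$weight,
      H = d$height,
      Hbar = mean(d$height),
      S = d$male + 1 )
    m_SHW <- quap(
      alist(
        W ~ dnorm(mu, sigma),
        mu <- a[S] + b[S]*(H - Hbar),   # intercept and slope stratified by sex
        a[S] ~ dnorm(60, 10),
        b[S] ~ dlnorm(0, 1),            # log-normal keeps each slope positive
        sigma ~ dunif(0, 10)
      ), data = dat )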
After you've got the joint posterior distribution for all these parameters, there's stuff we'd like to do. We want to compute the posterior predictive distribution for women, compute it for men, and then subtract one from the other, which gives us the contrast between the two. You can probably see intuitively from the plot on the right, where I've plotted the posterior mean regression lines, for women in red and for men in blue, that they're not very different. But again, you don't just visually eyeball the lines to talk about their difference; let's compute it. What we're going to do is, at every height value on the horizontal axis, compute the difference in the posterior distributions of those expectations, not just the lines themselves but all the posterior uncertainty about the exact positions of the lines, and then plot that distribution. That will give us an idea, having stratified by sex, of whether there is any direct effect of sex on weight left after we've accounted for the effect that operates through height. That means looking at each height value and asking whether the sexes differ. I'll say that again: at each height value, are the sexes different in weight? You can look at those lines and you know the answer, but we've got to get there in a statistically justified way. So there's a lot of code on this slide, but it's not that complicated. Let me talk you through it from top to bottom. The first thing I do is set up a sequence of x values. Those are height values, and for each of them we're going to compute the difference between the posterior predicted adult weight of women and of men in this population. So that's what we do. mu_F is going to be the posterior mean predicted weight of females in the population, and I just use the link function from the rethinking package to do this. I pass in the x sequence, and I assign S to be 50 ones, that is, 50 women, because there are 50 height values. Then we can plot that. Then there's mu_M for males, the same code, but now we replace S with 50 values of 2, because we're using the second index value in each of the parameter vectors. Then we compute the mu contrast exactly as you'd expect, by subtracting the male distribution from the female distribution, and I plot it. The plotting code at the bottom of this box produces the output on the right here. What you're looking at: the horizontal axis is height, those are all the height values we're stratifying by, and the vertical axis is the posterior distribution of the weight contrast between women and men. Above the dashed line, which is at zero, women are heavier. You can see there's a slight tilt to this. There's not much difference, but there's a slight tilt, in the sense that for shorter individuals, the men tend to be heavier. Another way to say this: for a woman and a man of the same height, if they're short, the man will tend to be heavier, though as you can see it's not a very reliable relationship. And for taller individuals, for a tall man and woman of the same height, the woman tends to be heavier.
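In code, the height-stratified contrast is something like this (a sketch; the height range and sequence length follow the slide's description, but the exact endpoints are assumptions):

    xseq <- seq(from=130, to=190, len=50)   # height values to stratify by
    mu_F <- link(m_SHW,
      data=list(S=rep(1,50), H=xseq, Hbar=mean(d$height)))
    mu_M <- link(m_SHW,
      data=list(S=rep(2,50), H=xseq, Hbar=mean(d$height)))
    mu_contrast <- mu_F - mu_M   # above zero: women heavier at that height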
So what does this tell us? Nearly all of the causal effect of birth sex in this data set acts through height. There is a very substantial and reliable adult weight difference between men and women in this population, and men are reliably taller than women, although there's lots of overlap; but after we've stratified by height, we find that there's very little difference left. Still, there's something going on here. Just because the distribution overlaps zero, that's no scientifically justifiable reason to simply ignore it. Okay, let's update our estimands slide. On the right I've put the graph we just produced, and this is an estimate of the direct causal effect of sex on weight. It's modest and unreliable, but there's probably a little bit of something going on there. Before we get to the end of this section and take our intermission, I want to talk about an equivalent, alternative way to do the analysis we just did. The way we just proceeded, we used two statistical models, one for each of our estimands: we wanted to know the total causal effect of birth sex, and we wanted to know the direct causal effect of birth sex. We had one causal model, one DAG, but two statistical models. That's a very common way to proceed, and it's perfectly legitimate; I do it all the time. But there's an alternative and completely equivalent approach, where you use one statistical model that represents the entire causal system, all of its pieces, and then you use the joint posterior distribution from that joint causal model to compute each estimate. You end up doing the same amount of work: there's one statistical model, but then at the end you have to compute, or rather, as I'll show you, simulate, each estimate separately. So you've got a choice here. I'll summarize this at the end, but I'll say it now as well: you can proceed either by stating each estimand and then having a statistical model for each, or by fitting one global statistical model for the whole causal system, but then you have to simulate each estimate separately from the output. Let me walk you through this so you get an idea of how it works. Here is the way you'd specify the whole causal system in quap. What I want you to see is that you're just running the model for weight and the model for height simultaneously, and this is perfectly fine. It's two linear regressions, but they run simultaneously, and all of their parameters are going to be present in the same posterior distribution. Let's go slowly through here. I've used some color on the right-hand part of this slide to show you that there are sub-models in the DAG. The red causal arrows pointing into weight imply the statistical model for weight, and I've specified that as a linear regression of weight on the interaction between height and sex. If you're not familiar with interactions, don't worry; we're going to have a whole lecture about them later. And then the blue arrow is the simple linear regression of height on sex, where we stratify height by sex. We haven't done this model before, but I think you can probably guess from its structure what's going on. I've just invented a vector of parameters, little h, one for each sex, and we estimate those separately. If you run this model, you get a really big posterior distribution. I'm showing you the precis output at the top of this slide: you get two alphas, two h's, two betas, a sigma, and also a tau. What is tau? That's the standard deviation around height. And again, don't be tempted to eyeball this precis output and interpret differences. You've got to compute the contrast between the first and second elements in these vectors, or between the first and second mus in the case of the lines, to understand what's going on. So let me remind you: a causal effect is a consequence of an intervention; in statistics, it usually has no more philosophical depth than that. Now what we must do is simulate each of these interventions. And you know the model, so you can simulate from it. Let's do it in the really-draw-the-owl, all-the-steps version, and then on the next slide I'll show you how to automate it. Starting from the top of the code box on this slide: the first thing we do is extract the variables from the posterior distribution. Then I set up Hbar, the mean value of height in the population, as a convenience variable, because I'm going to use it in the calculations below. Then I set n, the number of simulated people I want for each variable; here it's 10,000. Now there's a big block of code that starts with the word with. In R, with is a way of creating scope. I say with post, meaning: with the posterior distribution, do all this stuff in the braces. That means I don't have to put the dollar-sign accessor after post to access its elements; I can use the parameter names nakedly, as it were, in the code, and R will find them in post. This makes code much cleaner and easier to debug, so I recommend using the with construct to create what's called local scope here. The first block, commented with the name "simulate W for S=1": we're going to simulate 10,000 adult women. But we have to simulate them in order, because we're imagining an intervention on sex. So we'll first run the simulations for women, and then we'll do the calculations for men, where the only difference is sex at birth.
But we know the graph, and we have to follow all the paths. So the first thing we have to do is simulate height. Why? Because weight is a function of both sex and height, so we need to simulate height first. We simulate 10,000 heights for counterfactual women, S equals 1, using rnorm, accessing the first column of the h vector in the posterior. Then we simulate their weights using those simulated heights; that's the W_S1 line. Then we do the same simulation for S equals 2. This is the intervention: the only difference in the code is that the ones have been replaced by twos. And then we compute the contrast at the bottom. I called this contrast W_do_S, because in the causal inference literature there's this thing called the do operator, the intervention operator, and W_do_S holds the distribution of effects that result from intervening on birth sex. On the right we can plot this distribution. It's written here as p(W | do(S)), that is, the consequence of changing sex at birth. And we get the same distributions we got before. I've replotted them, but these are the same densities we got the other way. I told you these were equivalent approaches. In one of them, you do the work up front: you build multiple statistical models, and then you get your separate estimates directly. In this approach, you do less work up front, because you make one statistical model, although it's bigger, but then you have to run simulations at the end to get things out of it. There's an automated way to do this in the rethinking package. The sim function will take a quap model, and you give it a data list of the variables you want interventions on and the values the interventions will cover. So here S is a vector holding the values one and two. Then you use the vars argument to say which variables should be simulated, in order, from the causal structure implied by the model. And then you can get W_do_S computed the automatic way, by taking the difference between these two simulations, and you'll get the same distributions. I encourage you to try both of these out, going through them step by step and making sure you understand what's going on, before you fall back on the automated approach. Because it's a real strength that if you program the model yourself, that is, if you draw the owl, then for any causal query, or any kind of prediction at all, even if it's not a causal one, if you understand the model, you can write simulations that correspond to its structure, and you can do anything. You don't need a captive statistician helping you.
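Here is a sketch of the whole pipeline just described: the full-system model, the manual intervention simulation, and the automated version. The height sub-model's priors are my best-guess reconstruction (treat them as assumptions), but the overall structure follows the lecture:

    # one statistical model for the whole causal system
    m_full <- quap(
      alist(
        # weight sub-model: W is caused by H and S
        W ~ dnorm(mu, sigma),
        mu <- a[S] + b[S]*(H - Hbar),
        a[S] ~ dnorm(60, 10),
        b[S] ~ dlnorm(0, 1),
        sigma ~ dunif(0, 10),
        # height sub-model: H is caused by S
        H ~ dnorm(nu, tau),
        nu <- h[S],
        h[S] ~ dnorm(160, 10),
        tau ~ dunif(0, 10)
      ), data = dat )

    # manual simulation of the intervention on S
    post <- extract.samples(m_full)
    Hbar <- dat$Hbar
    n <- 1e4
    with(post, {
      # simulate W for S=1 (women): height first, then weight
      H_S1 <- rnorm(n, h[,1], tau)
      W_S1 <- rnorm(n, a[,1] + b[,1]*(H_S1 - Hbar), sigma)
      # simulate W for S=2 (men): the only change is the index
      H_S2 <- rnorm(n, h[,2], tau)
      W_S2 <- rnorm(n, a[,2] + b[,2]*(H_S2 - Hbar), sigma)
      # the contrast: effect of intervening on birth sex
      W_do_S <<- W_S2 - W_S1
    })

    # automated alternative with rethinking's sim (usage as described in the lecture):
    # HWsim <- sim(m_full, data=list(S=c(1,2)), vars=c("H","W"))
    # W_do_S_auto <- HWsim$W[,2] - HWsim$W[,1]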
Okay, let me try to summarize this; then we'll take our intermission. Inference with linear models, beyond the simplest kinds of linear models, means thinking harder about the correspondence between the causal model and the statistical model that you're going to use. There are two approaches. They're equivalent, and in some contexts each is better than the other. In the first approach, step one, as always, is to state each estimand. Then you design a unique statistical model for each of those estimands. The third step is to compute each estimate from the posterior distribution, which usually means computing some contrast. So in this approach, there's one statistical model for each estimand. The alternative, which I call full luxury Bayes, is that you still state each estimand, and then you compute the joint posterior for the whole causal system. That is, the DAG, plus the functions that relate the variables to one another, implies a full generative model, and you can compute the joint posterior for it given the data. Then, from that joint posterior, you simulate each estimate as an intervention. So now there's one model, and one simulation for each estimand. Let me summarize categorical variables, and then we'll take a break. A categorical variable is a device, a kind of piece of a golem, that lets you answer more structurally interesting scientific questions. I recommend in this course that we use index coding to do it, and we will; I think all of the examples use index coding, because it has a lot of conveniences. When you use categorical variables this way, just keep in mind that you're computing contrasts. It's not the individual elements of the parameter vector that matter; quite often, it's their differences. You may be interested in some differences and not others, but once you have the posterior distribution, you can compute any of the contrasts you like at any time. I should say, we've done a lot of simulation in this section, and it comes at you fast, I know, and it gets confusing. Something to keep in mind: you'll notice that in all the examples, whenever I've summarized by taking a mean or an interval, I've always done that as a last step. The reason is that the summaries themselves are just for communication; they're not computationally meaningful. What you want to compute is the mean difference between two parameters; what you don't want is the difference of the means of the parameters. I'll say that again. What you want is the mean difference of two parameters: the parameters have a distribution, so their difference has a distribution, and if you're going to report that distribution to someone, you might report its mean; that's a good focal point. So, a mean difference in two parameters, a mean contrast. What you don't want to do is compute the means of the parameters first and then take their difference, because that will quite often give you the wrong answer. This is easy to remember if you remember: always summarize last. Any summary should be the last step in reporting, never up the chain in your calculations. Okay, with that, I think you've all earned a break. Take a short break, stretch, and come back whenever you're ready; I'll still be here. Welcome back. I'm going to finish up this lecture by introducing you to some ways to flexibly use linear models to capture nonlinear relationships. This is all in pursuit of better scientific models, in the sense that, of course, real phenomena in nature are not linear forever. They may be linear at local scales, but nonlinear effects are routine. For example, if we stick with the Howell data set that we've used in this lecture and the previous one, but now don't restrict it to adults, you see that the relationship between height and weight is definitely not linear, at least not on the natural scale. But we're still interested in this relationship, and if you're a human biologist, you're very interested in it: in understanding how it works, what causes disruptions, and so on. The good news is that linear models can fit this nonlinear shape quite readily. The bad news is that when we do this, it's geocentric, not mechanistic, which is to say we're constructing non-mechanistic approximations. If we use them with wisdom, that's fine, but we have to be aware that curve fitting comes with risks, and we should consider our different choices when we do it.
So I want to give you a quick introduction to some different ways that linear models can create nonlinear functions. The two most popular strategies are polynomials, which I'm going to try to warn you off of, though by no means do I think they're useless (again, used with wisdom, they're fine; it's just that people often use them without wisdom), and splines. Nearly always, in any case in which you would use a polynomial, a spline is better. Nearly always, not always; there are exceptions. And I want to give you a quick introduction to splines, because there's a very large applied world of splines and related methods called generalized additive models. So what's a polynomial linear model? In a polynomial linear model, you merely extend the equation for the mean with higher-order polynomial terms. What does that mean? Here's an example. On this slide I have the equation for mu in an ordinary linear regression: there's alpha, the familiar intercept, and then a term beta one times x sub i, the familiar slope and predictor. And then here's the new thing: a new parameter, beta sub two, which is multiplied by the square of the x-axis variable, so that mu_i = alpha + beta_1 * x_i + beta_2 * x_i^2. You can keep adding higher-order terms: a term for the cube, x to the fourth, x to the fifth, and so on. The more terms you add, the more flexible the function you can approximate. This can be very effective, and you can model essentially any arbitrary nonlinear continuous path this way. It's like an epicycle: a really geocentric type of model, and because it's geocentric, it carries some problems. Polynomial models often contain symmetries that are undesirable, not supported by the scientific background. And the uncertainty at the edges of the data can be explosive, which means that in practice you can't extrapolate outside the range of the sample with these kinds of curves. So let's take a look at the traditional animation I use to introduce these things. This is like the animation in the previous lecture, where I introduced Bayesian updating with linear regression; we're going to do it now with polynomials. On the left, I'm showing the posterior distribution for b1 and b2, the two slopes, if you will. b1 is the traditional slope; it multiplies the x-axis variable. b2 is the slope of the squared term, the coefficient on it. I start these off with Gaussian priors: means of 0 and standard deviations of 1. On the right, we've got some samples from these priors. What's in the prior distribution, for the moment, is a bunch of parabolas, some bendier than others. So I'm going to start animating, so you can see what the prior implies. I'll introduce one data point, and the posterior distribution gets updated, and then two. You can see that what the posterior distribution is learning is a range of parabolas that are compatible with the data. The gray region shows the high-density range of the parabolas, but you can see they swing around quite a lot, and in regions with little data there's tremendous uncertainty about their positions. I'll let that play through again. The consequence is that already at n equals 1 or n equals 2, the model believes in the curvature, because it's a parabola. It doesn't strictly have to curve, but it really wants to, and at the slightest evidence it's going to curve. That's a consequence of choosing this squared-term type of model. We can fit this to the height and weight data quite easily; the code for it is in the book, so I won't emphasize it here.
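For reference, fitting the quadratic model to the full Howell sample looks roughly like this. Following the book's approach, the predictor is standardized and the squared term precomputed; the exact prior values here are assumptions in the book's spirit:

    d <- Howell1                   # full sample, children included
    d$Hs <- (d$height - mean(d$height)) / sd(d$height)   # standardized height
    d$Hs2 <- d$Hs^2                # precomputed squared term
    m_poly <- quap(
      alist(
        W ~ dnorm(mu, sigma),
        mu <- a + b1*Hs + b2*Hs2,
        a ~ dnorm(30, 10),
        b1 ~ dnorm(0, 10),
        b2 ~ dnorm(0, 10),
        sigma ~ dunif(0, 20)
      ), data = list(W=d$weight, Hs=d$Hs, Hs2=d$Hs2) )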
And it fits okay. But I've extended the range on the horizontal axis to show you that it extrapolates beyond the range of the sample in quite ridiculous fashion. The forced curvature of the parabola means that on the left, as we go below the shortest individuals in the sample, around 50 centimeters, the model actually thinks that people get heavier as they get shorter. Obviously that's ridiculous; again, you have to exercise some wisdom about what's going on. Let me show you how extreme you can get with these things. You can improve the fit arbitrarily by adding higher-order terms. On the right here is a fourth-order polynomial: it contains terms for height squared, cubed, and to the fourth. This fits better than the quadratic one, but again, at the limits of the sample it does catastrophically silly things. Off the right end, individuals get disastrously lighter as they get taller, and on the left, individuals shrink into negative weights quite rapidly as they get shorter. As a bookmark for something we'll do much later: we're going to come back to this exact problem in the last week of the course and consider a more biologically, or geometrically, inspired model of it. But just to show you now, and you can play around with this in your free time if you like: if you fit an ordinary linear regression to the full sample and just take the logarithm of body weight first, so it's a regression of the logarithm of body weight on height, you get the fit you see on the right, which is a lot better. We'll return to this in the last week and explain why taking the logarithm results in such a good fit here; there's a very fundamental biological reason for it, related to the strange Vitruvian Man figure in the graph. But this is just a bookmark; I'll explain the whole thing near the end of the course. Okay, let's talk briefly about splines. I just want to introduce them; there's a long section at the end of Chapter 4 that mechanistically walks you through their construction, and I want to give you some conceptual help with it. Splines are a big family of functions for doing local smoothing, for constructing functions that are locally smooth. What does that mean? It means that the points in each region are the ones that determine the shape of the function in that region, and that's very different from polynomials. I'll say this again: local smoothing means that only the points within each region determine, or at least strongly influence, the shape of the function in that region. In polynomials, that's not true: the whole sample influences a small number of parameters that determine the global shape of the curve. With splines, that's not the case. There will be parameters localized to different positions on the x-axis, and those parameters are informed by the data near those locations. This makes splines much more customizable; you can estimate lots of interesting functions, and these things behave much better.
They carry with them the same warnings about having no mechanistic reality, though, and so these functions will also produce really absurd biological or physical predictions if you let them. So what is a spline, actually? We're going to focus on something called basis splines. The term spline comes from the real world: it's a term used in drafting and architecture for a device that lets you draw smooth, precise curves on designs. What we're looking at here is a spline. The thin, bent piece of wood at the top is the spline, and those brass-colored weights attached to it are heavy enough to hold it in any arbitrary position, as long as it's not bent so much that it snaps. This lets you draw exact curves on designs and get them right. Statistically, splines use similar ideas, and I'll walk that out for you in the next slides. But just cartoonishly, it's a good idea to get a sense of what a spline is in the prior, before it sees any data. That's what I'm showing you here: samples from the prior distribution of the spline used for the Japanese cherry blossom graph on the cover of my book. This is what you get: there are curves everywhere, and you'll see that there's no real assumed shape to them. They can zigzag up and down all over, and be high or low. Splines are incredibly flexible, unlike polynomial models. And then when we train them with data, you see this at the bottom; again, this is the graph from the cover of the book, the Japanese cherry blossom data, and the spline here shows the trend. The wiggling curves are independent samples from the posterior distribution at the bottom, and from the prior distribution at the top. Okay, so what's actually going on in these weird splines? We're going to focus on one of the simplest kinds, so-called B-splines. B stands for basis; I'll explain what that means. These are linear models. They really are literally linear regressions, but they're regressions on synthetic variables. I'll say that again: literally linear regressions, but regressions on synthetic variables. You replace the equation for mu with something like what's on this slide. You still have a traditional intercept alpha, and then a series, as long as you want it to be, of terms which are weight parameters, those little w parameters, times some capital-B variables. The B's are data, but data that you build: basis functions. The weights are like slopes; they determine the importance of the different variables for the predicted mean of each case i. The basis functions are synthetic variables that only have positive values in particular regions of the x-axis. I'll say that again: the basis functions are synthetic variables that you make, which only have positive values in particular narrow, local regions of the x-axis. What this means is that the B variables, the basis functions as they're called, turn on the weights in isolated parts of the x-axis, and this is what allows the spline to take a particular shape in each region of the curve, independent of the shape in other parts of the graph. There's a very detailed example of building all this up, with the code to do it, starting on page 114 in the book, so be sure to check that for the details; you can actually run these models and play with them.
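For those who want to run something now, here is a compact sketch in the style of the book's cherry blossom spline from Chapter 4, adapted to predict height from age in the full Howell sample; the knot count and priors are illustrative assumptions:

    library(splines)
    d <- Howell1
    num_knots <- 15
    knots <- quantile(d$age, probs=seq(0, 1, length.out=num_knots))
    # cubic B-spline basis: one column per basis function
    B <- bs(d$age, knots=knots[-c(1, num_knots)], degree=3, intercept=TRUE)
    m_spline <- quap(
      alist(
        H ~ dnorm(mu, sigma),
        mu <- a + B %*% w,      # weighted sum of the basis functions
        a ~ dnorm(100, 10),
        w ~ dnorm(0, 10),
        sigma ~ dexp(1)
      ), data = list(H=d$height, B=B),
      start = list(w=rep(0, ncol(B))) )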
But let me do a little cartoon version to try and augment that. What I'm showing you here is a very simple example of a basis spline, with the basis functions drawn in color: red is the first basis function, green is the second, blue is the third, and cyan is the fourth. There are only four basis functions in this particular spline, but you can have as many as you want. The important thing to understand is what they're doing: they're segmenting the x-axis. Going from left to right, what you want to see is that on the far left, the most important basis function is the red one, basis 1. That's the one determining the shape of the black curve, which is the spline, on the far left of the graph. Then, as we move to the right, the green curve gradually takes over; there's a transition, so that the green curve determines the shape of the spline, then the blue curve as we move further right, and then the cyan curve on the far right. Each basis function is only important in a particular region of the x-axis.

The weight parameters multiply each of these. In this example, the weight vector, the little w vector, with one weight for each basis function, has the values 1, minus 1, 1, and minus 1. The way you can think about this is that the values of the red basis function are all multiplied by 1, so they stay positive; they're positive values to begin with, and they stay positive. For the green it's the opposite: they're all multiplied by minus 1, so it drags the curve down in that part of the x-axis. Basis 3 is again multiplied by 1, so it pulls the spline back up, and basis 4 is multiplied by minus 1 again, which pulls the spline down on the far right.

So we can change the weights on each of these and independently move each part of the spline. Let's start with basis 1 and switch it so that it has a weight of minus 1. Now you can see it's flipped to the bottom of the y-axis and pulled the spline down, and the rest of the spline has interpolated smoothly. We can move the green up, so it now has a weight of 1, and the spline looks like a dome. We can move the blue down, to a weight of minus 1, and it drags that part of the spline down below 0. Finally, we can move the fourth basis function up, giving it a weight of 1, and we've essentially mirror-imaged the original spline we started with. Of course, the weights can take on any value; they don't have to be 1 or minus 1. They can be anything in between, or even big numbers. Each weight is just a multiplier that determines how far up or down the spline is in that region of the x-axis, so you can build really complicated shapes if you have enough basis functions.
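Here is a tiny sketch of that cartoon in R. The knot placement comes from default choices in splines::bs, so the exact curves won't match the slide, but the sign-flipping behavior is the same.

```r
# Four-basis cartoon: the spline is the vertical sum of weighted basis
# functions, so flipping one weight moves only its local region.
library(splines)

x <- seq(0, 1, length.out = 200)
B <- bs(x, df = 4, intercept = TRUE)   # four cubic basis functions

w1 <- c(1, -1, 1, -1)                  # the original weights
w2 <- c(-1, -1, 1, -1)                 # flip basis 1: only the left moves

spline1 <- as.vector(B %*% w1)
spline2 <- as.vector(B %*% w2)

plot(x, spline1, type = "l", ylim = c(-1, 1), ylab = "spline")
lines(x, spline2, lty = 2)
matlines(x, B, lty = 3)                # the basis functions themselves
```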
Let me give you a quick example, using the last variable in the Howell data set that we haven't looked at, which is age. We're plotting age in years against height in centimeters for the Kalahari population sample, and you can see that this is a highly non-linear relationship. Human growth over time is linear in tiny regions, but it's definitely not linear over the lifespan. We're going to fit a spline anyway, which is a biologically very silly thing to do, for two reasons: first, I want to give you an idea of what it looks like to fit a spline and how the spline learns from the data, and second, I want to emphasize that when you have some scientific knowledge, you can often do a whole lot better than some arbitrary curve fitting.

I have an animation here in which we fold in the data set a little at a time. We start with 10 individuals: we've trained a Bayesian spline on these 10 individuals, it starts animating, and then we fold in more data so you can see how the spline learns. When we start out, the spline flops around wildly at the ends, because there's no data out there. But as the data slowly fill in, the spline learns where the expected value is for each age, because that's what it's trying to do: for each narrow range of ages on the x-axis, it's trying to learn the shape of the function in that narrow range. And each part of the spline can fit independently of the other parts, because the basis functions are narrow in this case. Let this finish up the whole sample so you can get an idea. Notice some of the silly things it's doing: it's perfectly happy to accept that individuals can get radically shorter or taller at any age, because it doesn't know anything about the biology here. The spline doesn't know what height is, it doesn't know what age is; it's just curve fitting.

Let's repeat the whole thing, but now with the basis functions shown. This is the same analysis, but now I'm displaying on the animation the individual basis functions, which you can see here in black. There are a bunch of them now, but notice that each one just isolates a particular part of the x-axis, some narrow range of ages, and as you move from left to right in age, you slide from the domain of one basis function to the next. The weights, the parameters being learned through Bayesian updating, determine how high or low each basis function is relative to the intercept, the black line in the middle there. I'll start the animation, and you can watch the basis functions jump around. This is the same analysis, the same posterior distribution at each step, but with the sampling from the posterior you can see that it's the fluctuations in the basis functions that determine the fluctuations in the spline. At each x-axis value, you're really adding up all of the basis functions, summing the black curves vertically, to get the value of the blue curve. That's how basis splines work; a sketch of fitting this age spline follows below. Basis splines are great, and you can do lots of cool stuff with them in cases where you need to fit some arbitrary function. But keep in mind that biologically or physically they can learn some nonsense too. Like all statistical methods, all golems: use them with wisdom, because they have no wisdom themselves.
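For reference, here is a minimal sketch of that age spline in R, again assuming Howell1 and quap. The number of basis functions and the priors are my own guesses, not the lecture's animation code.

```r
# Sketch of the age-height spline: the same B-spline recipe, applied
# to the full Howell1 sample. df, priors, and names are illustrative.
library(rethinking)
library(splines)
data(Howell1)
d <- Howell1

# Ten cubic basis functions spanning the observed ages
B <- bs(d$age, df = 10, degree = 3, intercept = TRUE)

m_age <- quap(
    alist(
        H ~ dnorm(mu, sigma),
        mu <- a + B %*% w,
        a ~ dnorm(100, 20),
        w ~ dnorm(0, 20),
        sigma ~ dexp(1)
    ),
    data = list(H = d$height, B = B),
    start = list(w = rep(0, ncol(B)))
)

# Posterior mean curve: at each age, the weighted bases sum vertically
mu <- link(m_age)
o <- order(d$age)
plot(d$age, d$height, col = col.alpha("black", 0.3),
     xlab = "age (years)", ylab = "height (cm)")
lines(d$age[o], colMeans(mu)[o], lwd = 2)
```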
Okay, let me try to summarize, and then I have a few more things to say before the end of the lecture. Curves and splines are great, because the world is not only linear, and so to correctly learn some scientific or causal estimate, we often have to successfully model the non-linear associations between variables as well; it's not enough merely to assume everything is linear. Polynomials and splines are both geocentric. Splines are typically a lot better, so use them with caution, but by all means use them if you need to. And scientific information usually helps; you can learn over time to insert it into the model as well. Just a couple of examples: weight only increases with height, on average, so if you can tell the golem that, it will help it learn faster from the data. And height only increases with age, on average, and then levels off in the early 20s in humans.

Let me exit by talking about the age and height relationship in that light. If we have some scientific information, we can come up with a better strategy, and ideally the statistical model will be deeply inspired by some generative scientific model of the system. We're not going to build a full solution for the age-versus-height example here, but I just want you to think about it: you know a lot about human biology, because you're human. For example, humans have three distinct growth phases, each with a different rate. There's infancy, where we grow quite rapidly. Then there's childhood, where growth slows quite a lot; indeed, for the first few years of childhood, post-infancy, humans grow very little in stature, although the brain continues to grow quite rapidly in that period. Then, around the time the brain stops growing, or when its growth has essentially stopped and it's almost adult size, puberty kicks in and there's another growth spurt. As everybody knows, this comes earlier for girls than it does for boys, so there's a brief period where girls are, on average, taller than boys. Then, in the early 20s, humans stop growing, and there's typically a slight decline in stature with age after that, as the sands of time weigh down on us. This is biological information, and you can build models out of it. In the literature in which people study human growth, that's what they do: they don't use off-the-shelf linear models or polynomials or splines. They have more mechanistic models of human growth, which recognize these different stages and try to learn the rate of growth in each.

We're finishing up the second week now; that was the fourth lecture. We'll continue next week by talking more about causation and the links between scientific models and statistical models, and we're going to focus much more on confounds: cases where there are things about the data-generating process that lead us astray. So I'll see you there.