Hi everyone! Welcome back. We have three more lectures. This is lecture 18. Yeah, some excited people. Three more lectures: this is lecture 18, and then two next week. Today we're going to finish up the introduction to varying slopes, and then I'm going to show you how those same concepts let you build different kinds of models. To remind you where we were: we had specified this complicated random intercepts and random slopes model for the chimpanzees data, and then we re-specified it because we had a bunch of divergent transitions. And I want to show you how to do the non-centered re-parameterization when you have random slopes. There was this French guy named Cholesky, a French mathematician of Polish descent, who figured out this clever technique that is used all the time. Really, it's just ubiquitous in applied mathematics when you're working with vectors of numbers. It's used for lots of things, but what we're using it for here is that it's a way to take a bunch of uncorrelated values and give them any correlation you want. In this case, the correlation isn't one you choose; it's the one the data wants. We're still learning it from the data to get to the posterior distribution, but to get a non-centered parameterization, we need the prior not to be conditional on other parameters. That's what the non-centering means. To do that when you have a correlation matrix, you have to get fancy, and that's where Cholesky saves us. The details are in the chapter; there's a big box about it, and there's tons of material online about Cholesky factors and the Cholesky decomposition. But really, all you need right now is to understand that the reason this Cholesky stuff is in there is to achieve better mixing. That's what it's for. Then you can learn the code, copy and paste it into your own models, and it works great. The caution I'll give you before we get into the details is that for some data sets and some models, centering is better. So it's not like you always want the non-centered form. You need to be flexible. Sometimes you need more than one language. Sometimes French is good, sometimes English is good. Sometimes C++ is good, sometimes R is good. Things like that. So here's the model. Don't panic. We're going to walk through it, and I just want to show you the relevant bits. This is really just the same random slopes model; there's more code because we have to deal with the intermediate step of handling the Cholesky factors. Let me highlight in red all the bits that have to do with the actor effects. Remember, we had a matrix of random effects where each actor is a row and each column is one of the treatments. So there's a parameter for each actor in each treatment. We're estimating the behavior of each actor in each treatment, and we're doing it with shrinkage, with pooling, so that we don't overfit. There are a lot of parameters, but we're doing it with pooling; it's the adaptive regularization. That's what this alpha[actor, TID] is; TID is the treatment ID. You remember this from Monday? Monday was a very long time ago for me, so I assume it was for the rest of you as well. Lots of things have happened in the intervening days. I'm dizzy, right? I had a section meeting at the Max Planck Society. I blacked out; I don't remember anything. I shouldn't say these things. The microphone's on. Sorry, section leadership. It was a wonderful meeting.
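A minimal sketch of that Cholesky trick in base R, separate from the lecture code: take uncorrelated z-scores and give them a chosen correlation (in the model the correlation is learned from the data, not chosen).

```r
# Sketch of the Cholesky trick: uncorrelated z-scores -> correlated values.
set.seed(1)
Z <- matrix(rnorm(2 * 1e4), nrow = 2)      # two rows of uncorrelated z-scores
Rho <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)  # a target correlation matrix
L <- t(chol(Rho))                           # lower-triangular Cholesky factor: L %*% t(L) = Rho
sigma <- c(1, 2)                            # target standard deviations
X <- diag(sigma) %*% L %*% Z                # mix them together
cor(X[1, ], X[2, ])                         # approximately 0.7
```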
So the green is the corresponding, analogous block effects: each block is a row in this matrix beta. There was an experimental block, which was a day in this experiment, a day on which all the treatments were done. And then each column is again a treatment. So what's the average behavior in that treatment on that day? That's the way you can think about it. Sometimes there are day effects, right? Maybe the weather affects how hungry chimps are, and then they choose the pro-social option more. It could happen. It didn't in this experiment, but you've got to check for block effects. It's just science 101: worry about block effects. And the red comes down into this big block with the adaptive priors, and this is where we do the non-centering again. There's a big overthinking box in the chapter where I say more in detail about this code, and I also show you the underlying Stan code that it generates, because what we're writing here is almost exactly Stan code. Almost. My package is just giving you a little bit of help here. In particular, I explain what this transpars thing on the front is. You can ignore that; it just puts the declaration in a particular part of the Stan code. What you want to see is that we're creating the matrix of actor effects. It has as many rows as there are actors and four columns, and it's called alpha. It's defined with this weird thing called compose_noncentered, which takes three arguments: the vector of standard deviations sigma_actor, and this L_Rho_actor, where the L in front means it's the Cholesky factor. For some reason I've never discovered, Cholesky factors are usually written as a capital lambda, which looks like an L, right? If you're not Greek. If you're Greek, you can probably tell the difference. No, they look completely different. But we'll just say it looks like an L. So when you see this L on the front, that means it's the Cholesky factor of that correlation matrix. And again, this Cholesky factor is the clever trick that lets us mix things together. What compose_noncentered does is just the matrix multiplication that takes these three things, sigma_actor, L_Rho_actor, and then z_actor, which is just the matrix of z-scores for each actor; this is the non-centered part. It matrix-multiplies it all in the right way to get us back to the covariance structure that we want. Well, not quite the covariance matrix: sigma_actor and L_Rho_actor produce the Cholesky factor of the covariance matrix that we want, and then you multiply that by the z-score matrix, and you get all the alphas on the right scale. It's like magic, right? So now we've got the thing we can plug back into the linear model. This is just putting the scaling back in and putting the mean back in; that's what it does. The next line does the same thing, analogously, for the block effects; it looks the same. The compose_noncentered function just does the matrix multiplication, and if you want to know what the matrix formula is, again, look at the Stan code; I show it to you in the overthinking box. It's actually no longer than that line. It just has a couple of multiplications and a transpose in it. If you're familiar with linear algebra, you'll like the raw form; it's actually no more complicated than that. Okay. Last bit, down at the bottom. Now the posterior distribution doesn't contain a correlation matrix.
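Here's a rough base R sketch of the multiplication I believe compose_noncentered is doing (the overthinking box has the exact generated Stan code); sigma_actor, L_Rho_actor, and z_actor here are just placeholder values, not the fitted model.

```r
# Rough sketch of the non-centered construction:
# alpha = t( diag(sigma_actor) %*% L_Rho_actor %*% z_actor )
n_actors <- 7
sigma_actor <- c(1.4, 0.9, 1.3, 1.5)              # one SD per treatment (placeholders)
L_Rho_actor <- t(chol(diag(4)))                   # Cholesky factor of a 4x4 correlation matrix
z_actor <- matrix(rnorm(4 * n_actors), nrow = 4)  # standardized, uncorrelated effects
alpha <- t(diag(sigma_actor) %*% L_Rho_actor %*% z_actor)  # actors in rows, treatments in columns
dim(alpha)                                        # 7 x 4
```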
It contains a Cholesky correlation factor. So there, you have to declare this as cholesky_factor_corr, a four-by-four lower-triangular matrix, and it has this distribution. It's still an LKJ distribution, but now we say that it's on the Cholesky factor. It means the same thing. Eta of two means that hill, right? That gentle hill. Weakly regularizing priors. So I appreciate that suddenly the word Cholesky has been sprinkled through the code. And you can think of it that way and get away with it pretty much. Remember: you do the non-centering, oh, I have to go look at that example where Cholesky is sprinkled through the code, and then copy the relevant bits. That'll get you by. And then in time, you'll come to understand in more detail how this works, and then you can use it flexibly in lots of different model types. But it's important to recognize what's going on here so that you're not confused by the additional lines. They're just non-centering the priors so that the chains run better. That's all that's going on. Does this make sense? Yeah, enough to keep going. I know you have to go home and draw the owl. That's always how it is. Okay. What you get from this is really amazing. The chains run way, way better as a consequence. What I'm showing you here are the numbers of effective samples for the two versions of the model. Remember, these are the same model, same data; they produce the same inferences. It's just that one of them does it way more efficiently than the other. It runs faster, and you don't need to run the chains as long to get the same quality of inference. Each point in this graph is a parameter in the posterior distribution. There are a lot of them, right? It's like 80, give or take a couple. The horizontal axis is the number of effective samples for each parameter from the centered model, the default one that looks like the way we talk about multi-level models. The vertical axis is the corresponding number of effective samples for the non-centered, Cholesky version of the model. The diagonal there is the line of equality. What I want you to see is that all the points are above the equality line; sometimes you get double, more than double, the number of effective samples that you get from the centered version. This makes a huge difference. In really big models, sometimes this is the only way to get them to run right. And I want to be clear: this is not something particular to Hamiltonian Monte Carlo. It's just that Hamiltonian Monte Carlo warns you when you need it. For other sampling strategies, like Gibbs samplers, this matters too. There's a big literature on this. In fact, this effect was first discovered with Gibbs samplers, and there was a literature on it; the terrible term non-centering comes from that literature on Gibbs samplers. It's just that with Gibbs samplers, you don't get divergent transitions to warn you about it. Inference, finally. What do we learn from this model? Why are we doing this? Everybody here is an applied statistician, right? We care about the inferences. You go through all this stuff fighting with the machine; you have to just get things to work. The payoff for it is sweet, sweet inference. What are we learning about the chimpanzees now? What I'm showing you here are just the standard deviations from the varying effects. We're going to look at those. Of course, there are like 70 individual varying effects in there.
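If you want to make that comparison plot yourself, a sketch like the following works; m_centered and m_noncentered are hypothetical names for the two fits, and you may need to restrict the comparison to the parameters the two versions share.

```r
# Sketch: compare effective sample sizes from the centered and non-centered fits.
# m_centered and m_noncentered are hypothetical names for the two ulam model objects.
library(rethinking)
neff_c  <- precis(m_centered,    depth = 3)$n_eff
neff_nc <- precis(m_noncentered, depth = 3)$n_eff
plot(neff_c, neff_nc, xlab = "effective samples (centered)",
     ylab = "effective samples (non-centered)")
abline(a = 0, b = 1, lty = 2)   # equality line: points above it favor non-centering
```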
We can plot those up later as posterior predictions. For now, let's just think about what we learned by inspecting the standard deviations. What do these standard deviations do? They govern the strength of pooling. Remember, what happens with these adaptive priors is that the amount of variation among the units is learned, and that's what these parameters are doing: they're storing that learning. That was good English. They embody that learning, that information. The first four are the actor effects, counting one, two, three, four, corresponding to the treatment numbers, where treatment one was the pro-social option on the right and no partner present. Treatment two is the pro-social option on the left, no partner present. Treatment three is the pro-social option on the right, partner present. Treatment four: on the left, partner present. There's a graph about this in a couple of slides to come. You'll see that there's variation, a big difference in some of them. There's a lot less variation in treatment two, and I'll show you what that does in a little bit. The thing you'll notice is that the block variances are much smaller. Why? Because the blocks are pretty much all the same. That's what the model has learned. This generates more pooling, more shrinkage, for the blocks than it does for the actors. Among the actors there's a lot of variation, and you know why. Why are the actors different? Their handedness differences are really massive in this data set. This is why I like teaching with this data set: you get all this individual, unique-snowflake character of the chimpanzees in it. Makes sense? Let's look at these; they're easier to look at than the raw numbers, as always. I like to make plots. Now we're looking at the correlation effects instead. You can convert that Cholesky thing into an ordinary correlation matrix; I show you how to do this in the notes. You just multiply it by its transpose. That's actually what's great about Cholesky factors: the Cholesky factor is defined as a lower-triangular matrix, and if you matrix-multiply it by its own transpose, you get the correlation matrix back. I know it sounds like magic, but there's a function that just takes the factor and multiplies it by its transpose. That's all you need to do to get the correlation matrix back, and then you're back to something you can think about. What am I showing you here? I've plotted the posterior distribution of each cell in the correlation matrix for the actor effects. This is a four-by-four correlation matrix. The diagonal is all ones, because it's a correlation matrix: the correlation of any variable with itself is always one. You'll see this. All the things sitting on one over there are Rho_actor[1,1], Rho_actor[2,2], [3,3], and [4,4]. Those are the ones on the far right that show up as points. Makes sense? Comforting, yeah? You understand that part. Then the other cells are the correlations between the different treatments. What I want you to see is that they're all positive. Now, some of these are replicated, right? Because [1,2] and [2,1] are the same thing; they're up there twice. This is just the way it comes out. But your brain can handle that. It's the same parameter; it's a symmetric matrix. The lesson I want you to see is that these are all very positive. Why?
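A quick base R illustration of that conversion, with a made-up correlation matrix rather than the chimpanzee posterior: multiplying a Cholesky factor by its transpose recovers the correlation matrix.

```r
# Illustration only (made-up correlation matrix, not the chimpanzee posterior):
Rho <- matrix(c(1.0, 0.5, 0.3,
                0.5, 1.0, 0.4,
                0.3, 0.4, 1.0), nrow = 3)
L_Rho <- t(chol(Rho))     # lower-triangular Cholesky factor
L_Rho %*% t(L_Rho)        # multiplies back to Rho
```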
What does that mean? This is handedness. At the individual chimpanzee level, the responses to the different treatments are highly correlated. The reason is handedness effects, which we saw before when we did this model a different way and estimated a parameter for handedness. We didn't do that in this model, because I wanted to drill down to this slide. I was planning this slide. Now you see that the handedness arises as a correlation among the treatments at the actor level. Satisfying? Yeah? Feel good? It's like those drawings where it's a bunny or it's a duck. Does anybody know this? Natalia knows this. Thank you, Natalia. Somebody knows this, yeah? If you don't know it, I'll shut up, because it must sound really weird, but there are lots of drawings where you can see them in multiple ways, right? If you get good, you rapidly move back and forth between them. A lot of the information in data sets is like this: you can capture it in different parameterizations, and you're capturing exactly the same information. It tells the same story and makes the same predictions, but it's going to look different when you inspect it. So the handedness effects are appearing as correlation effects now, in the way we've done this. But if you made parameters for preference of left and right, these correlations wouldn't show up like this. Yeah, it has to do with how you do it. The homework that I'm going to have you start this weekend has an effect like this too, where there's a particular way you set up the data set, you get this correlation in the effects, and then I'm going to ask you to figure out why. So you're welcome. All right, does this make sense? Yeah? This is Lefty, right? Lefty is generating a lot of this. Let's look at the posterior predictions. Posterior predictions now: what does this mean? We take the posterior distribution and we push it back through the model, and we have it retrodict the sample. Why do we do this? We want to make sure the model worked. It should be able to capture the texture of the sample it was trained on, but you don't expect it to be identical, because that would just mean you're overfitting. And this is also a place where, since I'm still training you folks, I want you to see the shrinkage, and we can talk about how the shrinkage arises. So let me explain this first. We're looking at each actor going across in groups of four points, actor one on the left, actor seven on the right. In each actor's cluster of points, we've got the four treatments in order: right no partner, left no partner, right partner, left partner. And I connected the two points which are on the same side of the table with a line segment, so that the tilt of the line is telling you what happens when you add the partner. Yeah? And you notice the answer is: not much, right? But you already knew this; we already got the punch line there. There is an effect, though, of moving the food to the left side of the table: all the chimpanzees except actors two, six, and seven pick it more, right? They're attracted to more food. Even though they don't get more food, they pull the lever where there's more food present. This is a common thing if you do experiments with chimpanzees; you're used to it. And they love grapes; they'll do anything for a grape. And actor two doesn't respond that way because actor two doesn't respond, right?
Precious number two loves the left lever, just always the left lever. Not a single trial in which actor two did anything but pull the left lever. But notice, looking at the blue points of the raw data, where we just averaged the raw data for that actor in that treatment, notice that the model doesn't retrodict always-pull-left for actor two. Why? Because there's shrinkage. You were ready for that, right? This is regularization. But the regularization varies by treatment now. And so actor two's predictions are pulled towards the mean of the population a little bit, and notice they're pulled most in treatment two. Why? Look at the variances that are estimated for the treatments at the bottom of the slide. The variance for treatment two was lower across all of the other chimpanzees, and so you get more shrinkage in that treatment. It's not a huge effect, but it explains why you get more shrinkage there: it's because sigma_actor for treatment two is about one, whereas sigma_actor for all the other treatments is healthily above one. Yeah, so you get more shrinkage. Does it make sense? Okay. Actors six and seven show their own unique and idiosyncratic patterns in these things as well. Okay. I think this is a good point to pause, before we move on to different model types, which is what I want to do for the rest of the lecture, to give you what I like to call horoscopic advice about multi-level models. Why horoscopic? This is a thing we'll talk about on the last day as well; the last chapter of my book is called Horoscopes. This is a metaphor I use because a big part of my job is people coming into my office with data sets and asking me to tell them what to do with them. 15% of my job, let's say, give or take. And quite often the only kind of advice I can give is rather horoscopic, because I don't know enough about the scientific context. If you just bring me a table and say, what sort of model should I use, I'm just going to start asking you more questions. There's just no way to responsibly, as a statistician, tell you what sort of model you should use until I know stuff other than the data. The design of a proper analysis depends upon information outside the data set. It just does. Your DAG comes from outside the data set. The meaning of the variables, all the stuff we call metadata these days, because we had to make up a term that made it sound more scientific; the meaning of the variables, the metadata, all those things are outside the table. So the table's not enough. Nevertheless, it's nice to have some kind of defaults, a place to start. I've made this argument before that defaults really matter. So horoscopic advice is the kind of advice I'd be willing to give if I had only the bare minimum of facts, like a horoscope. If I only knew your birthday and I was going to predict your life course, what would that be like? It'd be terrible, like a horoscope in the newspaper. Apologies to anybody who really likes horoscopes; they're greatly entertaining. But statisticians, stats books, and courses often have to give horoscopic advice. Those defaults are helpful, but you should feel free to deviate from them when you know things beyond just the numbers in the table. So here's my advice about multi-level models. Above all, think about the causal model first, please. You need something other than the grid of data. Then, after you've got an idea of the causal model and the models you're going to need to fit to evaluate it, it's really helpful to start with what I call the empty model.
That is, you identify the clusters of interest, like chimpanzee or flock or department, or district in the Bangladesh data, any of those things, and then you put them in as varying intercepts. What does this do? It's like a shrinkage ANOVA. It's a way to find out where the action is in the data set. Where's the variation at? Is there a lot of variation among these units, or not very much? It's a great place to start, and then you get the machine humming and purring before you start adding in predictors, right, to evaluate your causal hypothesis. For the predictors themselves, the default behavior is always to standardize, unless you have something better to do. And often you will, right? Like, they might be ordered categories; then you don't standardize them, you use that crazy Dirichlet thing that I taught you. But standardizing is good default behavior. Regularize, always. And how do you evaluate how much to regularize? You simulate from the prior. You're used to this now. Get some sense of it. Then you start adding in predictors, and you vary their slopes if there's cluster variation, right? The presence or absence of varying effects comes from the design of the thing: if there are repeat measures within a unit, then you cluster on it and you do shrinkage. That's where it comes from. It's fine to drop varying effects when the sigmas are really small; they're lots of extra code and bother. You could also just leave them in, because they'll aggressively shrink. So it's fine one way or the other; there's not a correct behavior there. And when interpreting, you should consider what you really want to do. If you're really interested in explaining the sample, then your posterior predictions will focus on those units: the same chimpanzees, for example. If you really care about Lefty as an individual, then you focus on Lefty's prediction. But chances are you probably don't care about Lefty. Sorry, Lefty. More precisely, then, it's about the population and what the variation in the population is like. But you should be really clear on this, and both of these questions are perfectly legitimate, depending upon who you are and your question. So I have colleagues from back in California who do political science, and they are focused on North American politics, and there are 50 states in the United States, and we're probably not going to get a new one anytime soon, right? Washington D.C. and Puerto Rico keep trying, but not with this Congress. It will never happen. So when they fit a model to those 50 states for past presidential elections, they actually want to know about those units. The estimates for those particular states matter a lot for predicting how those states behave in the next election. You care about Lefty, very intensely; Rhode Island in this case, or Vermont, or whatever state you're focusing on. I don't want to make enemies by naming states. But for a lot of us, like myself in my research, the individuals, of course we care about those individuals intensely, but they're exchangeable in the sense that we're trying to make inferences about a population and to generalize from those individuals. We're not going to make predictions about those particular individuals again in the future, right? Does that make sense? I think that's like the chimpanzees in the data set I've shown you. Okay. And of course your knowledge trumps all; you can always deviate from the horoscope. Feel free to do it. Okay.
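As a sketch of what such an "empty" varying-intercepts model can look like, here is my own minimal example in rethinking's ulam, using the Bangladesh districts mentioned above (not the homework solution, and the priors are just the usual weakly regularizing defaults):

```r
# Minimal "empty" varying-intercepts sketch: districts as clusters, no predictors yet.
library(rethinking)
data(bangladesh)
dat <- list(
  C = bangladesh$use.contraception,
  D = as.integer(as.factor(bangladesh$district))  # re-index districts to be contiguous
)
m_empty <- ulam(
  alist(
    C ~ dbinom(1, p),
    logit(p) <- a[D],
    a[D] ~ dnorm(a_bar, sigma),   # adaptive prior: partial pooling across districts
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  ), data = dat, chains = 4, cores = 4
)
precis(m_empty)   # a_bar and sigma: where the action is across districts
```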
In the remaining time, let me try to give you a quick conceptual introduction to two families of covariance models that go beyond simple random slopes. All the mechanistic details for doing these models are in the book. I'm going to move quickly through it; I'm going to, you know, draw the ovals for the owl, to continue this metaphor, and then when you go home and run the code, you'll draw the rest of the owl. But it's important to get the conceptual idea in here, because all the cool things we can do with covariance matrices show up in lots of model types. This is a tool for when you're interested in covariance among different variables; it's a way to start thinking about things. So this is just my quick guide to saying that lots of different types of statistical models are really multi-level models that focus on covariance relationships. What I'm going to show you today is, first, something called instrumental variables, which is common in some traditions in economics but is actually from biology; this is something Sewall Wright invented. I'm going to show you how this is a covariance model. Some of you will have heard of Mendelian randomization; Mendelian randomization is instrumental variables. This is Sewall Wright's ghost rising and doing important work. The next example I want to show you is a social relations model. There are some people in my department who work on this, so this is for you. Network models are also models that depend upon modeling covariance structures: covariation in behavior among units, the nodes in networks. Factor-analytic models, which I won't show you an example of, are also covariance models, and so are, for example, item response models, which again some people in my department use. The animal model that's used to study the heritability of phenotypes is a covariance model as well: a big covariance matrix that is scientifically specified by the rules of meiosis, which tell you the expected covariance among relatives in a trait, like, say, height in mothers and daughters. It's called the animal model and it's used a ton. It's just a big covariance matrix, so you could fit it now, and you'd probably want to non-center it. Phylogenetic regressions are big covariance models where the covariance comes from the phylogenetic distances among the species, and I think I'll show you an example of that on Monday next week, with the primate data set that we've used before, and spatial autocorrelation as well; on Monday I'll show you an example of this. Spatial distances are another kind of distance you can specify a covariance matrix from, and you can estimate how the covariation in effects falls off as units get farther from one another in space. Why? Because there are unmeasured influences on them that are associated with space. You can't measure those things, but you can measure space, and then you can estimate the impact of those unmeasured effects. That's called spatial autocorrelation. And the penguins are here just because it's my favorite photo in the world and I wanted to use it somewhere in the course. You're welcome. It's the indifferent penguin. Okay, so instrumental variables. This is a cool technique which goes beyond the backdoor adjustments that I taught you about earlier in the course. Remember, we're trying to do causal inference, and we're trying to be principled about it. If you're trying to figure out which things to condition on, the wrong answer is all of them.
Because sometimes conditioning is bad news and creates confounds. Adding variables can create confounds just as easily as it can remove them. So I taught you this thing called the backdoor criterion that you can use, given a causal graph, to figure out which things you should condition on. Sometimes the backdoor criterion will tell you that you can't remove the confounding. That doesn't mean all hope is lost. There are other cool things you can do by analyzing causal graphs, ways you can still de-confound even when the backdoor criterion tells you you can't shut all the backdoors. Here's an example, the most famous one. As I said, this was invented by Sewall Wright so he could study, basically, guinea pig genetics. It's called instrumental variables. So imagine this classic case where we're interested in the effect of some variable, here called E for education, on some other variable, here called W for wages. Most people, not the people in this room, all of you are noble scientists, but most people get educated so they can make more money, right? I can assert that without evidence; maybe it's wrong, but I think it's true. That's what people say. And many people in education are passionately interested in measuring the effect of education on wages. This is a big deal; whole public debates about funding universities hinge upon these sorts of returns to education. So we would like to know the arrow going from left to right, from E to W. The problem, of course, is that there are a huge number of confounds, here innocently labeled U, which influence both of these things. And we can't possibly measure them all, or maybe even imagine them all. What would those things be? They could just be personality characteristics, like how lazy you are. If you're a lazy person, and again I'll assert this without evidence, you might consume less education and you might also make less money. But that has nothing to do with the effect of education on wages; it's just a confound. Yeah. And there are going to be a lot of confounds like this in any sort of observational system. So what do you do? The backdoor criterion tells us we cannot shut this backdoor, because we cannot measure U. We can't condition on U because we don't have it measured. The backdoor criterion says: go away. But there's hope, if you can get something called an instrument. I've added it here as Q on this graph. Staying with this example, then, there's some hope. What is an instrument? An instrument is some variable which affects the exposure of interest but does not affect the outcome. So if you look at the causal graph at the bottom of the slide, Q enters into E, but it does not affect W. What is Q here? In this example, the framing being education's effect on wages, Q is the part of the year you were born in; Q is for quarter: first quarter, second quarter, third quarter, fourth quarter of the year. This is a famous example from the economics literature. It's an empirical fact, in North America at least, that people born earlier in the year consume less education in their lives than people born later in the year. Just the timing of your birth within the year; it's not the year you're born in, it's what part of it. So if you're born in January or February, and this is a fact in North America at least (I couldn't find data for Europe), you consume less education in your life. Why? Two things seem to be causing this. First, because of the random way that age is assigned socially.
Age is a social variable. It's influenced by biology, but it's mainly a social variable. The age you're assigned is arbitrarily cut off by this thing called January, but your biological age is not fixed by the calendar year; some crazy Roman person set that calendar at some point. So you can be biologically almost a year older than somebody the same age as you, socially. And it's your social age that determines when you start school. So people born earlier in the year, at least in the way the North American system works in most states, are actually biologically older when they start school, and therefore they will end up having completed less school in their lives than somebody else. Yeah. So that's one effect: they start school later. The other effect is that in most parts of North America, you can voluntarily drop out of school when you turn 16, if you show proof of employment, for example, and lots of people do this. Actually, I had a very close friend in high school who did this. He turned 16 and he was like, man, I'm out. He was done. And that might have been the right decision, I don't know. But as a consequence, if you turn 16, and again your birthday determines this, if you turn 16 earlier than somebody else in your same grade, you will have consumed less school than they will have, even if they do the same thing and drop out at 16, right? Because for them it might be at the end of the school year, so they will have consumed an entire year of school and you only like two months of it. Yeah. Lots of people do this, in North America at least. And these two effects combined mean that people who were born in the first quarter of the year can have much less total schooling in their lives than people who were born later. So now we've got a kind of natural experiment we can use: there's this thing assigned by the hand of God, so to speak, the part of the year you're born in, and it interacts with the social system so that it's like an experimental manipulation. Nobody actually manipulated it experimentally, but it adjusts the amount of education you have independent of the confounds. Because I assert, and we have to argue about whether this is true, that the confounds are probably not associated with January. People born in January are probably not lazier on average. But again, we have to check this; it's an assumption. I assert that it's probably true. So this would then be a valid instrument, Q. So why does this help? What Q does is it makes E into a collider. Do you see that? Now we have a collider path from Q to E to U to W; there's a collider on this path. We condition on E, which we're going to do, because we're conditioning on education, and we know Q, so we get information about U, because this is the way colliders work. It's like the light switch. Remember: if the conditions for the light to be on are that the switch is on and there's electricity, conditioning on E is like saying, I know the light is on, and conditioning on Q is like saying, I know the switch is on. Now you know U: you know the electricity is there. How does that work in this case? There's no light switch here. What is actually going on? I'll show you the example on the next slide. Let me talk about the correlation first. The statistical point is that U generates a correlation between E and W, and Q tells us something about the deviations in E separate from that correlation.
So then we learn about the strength of the correlation that U generates across the cases. That's what's happening statistically. I like to think about it as a collider and the finding-out effect for colliders. So think of it this way. Imagine that, on average, people born in the first quarter of the year consume 10 years of education in their lives, averaging over the whole sample, and people born later in the year consume more. That's the instrument effect; that's the path from Q to E. Okay? Then we look at a particular person in the data set and we see that person was born in the first quarter but consumed more than the average amount of education. We have just gotten information about the confounders. We have learned that they are not as lazy as some other people in the same category as them, or whatever the confounder effects are. Whatever the U effects are that are generating correlations between education and wages, you find out about them when you learn their specific education, because we know Q. The connection through Q is like an experimental manipulation. Some people are nodding and some people are like, what are you talking about? That's okay. We'll loop back around to this. Here's another way to think about it. Often these are called natural experiments, at least in the biology literature. In economics they're called instruments, and everybody knows what that means; in biology I don't think I've ever seen the word instrument. Sometimes we call these natural experiments. If we could experimentally manipulate education, we would do that, but we can't; that's not what's going on. If we could experimentally manipulate it, just absolutely determine the number of years of education people get, we'd satisfy the backdoor criterion and we could make causal inferences. The whole problem is we can't do that. It's not ethical to do that, in the first place, nor is it practical. But what we have instead is Q, which is like a natural experiment. Nature itself has manipulated the number of years of education people receive, independent of their other characteristics, because their birthday is random with respect to those characteristics. And so now it's like a weak experiment. It's like we partly closed the backdoor by the manipulation; it's like an experiment that doesn't always take. Natural experiments are like this. It's like Q is suggesting to individuals that they should do more or less education. If you're born in January, it's like nature suggesting to you that you should drop out of school early. Being born at the end of the year is like nature suggesting to you that you should take more schooling. Not everybody follows those suggestions, and the extent to which people follow the experiment's suggestion gives you information about confounds. It's a cool effect. I want to assert that, well, this isn't me asserting it: if you follow up on this literature, what you'll see is that many, many experiments people run are actually like this, because we can only suggest treatments to people. Epidemiological studies are often like this. You give people medication and they're in the control group or the treatment group, but then they walk out the door, and maybe they take their pills, maybe they don't. So this is called intent to treat, and you have to estimate it. You have to care about the extent to which people actually take their pills.
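One way to make the DAG logic above concrete is with the dagitty package. This is my own small check, not part of the lecture, assuming the Q -> E -> W graph with an unobserved U as drawn on the slide:

```r
# My own check of the instrument logic with dagitty (not lecture code).
library(dagitty)
iv_dag <- dagitty("dag {
  Q -> E -> W
  U -> E
  U -> W
  U [unobserved]
}")
adjustmentSets(iv_dag, exposure = "E", outcome = "W")        # nothing: no backdoor adjustment exists
instrumentalVariables(iv_dag, exposure = "E", outcome = "W") # identifies Q as an instrument
```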
Some of you will know there's this really famous case in the literature, one of the early large-scale antiretroviral trials for HIV, where the people taking the medication knew that there was a control arm, but they all wanted the medicine. And so they pooled their medicine, crushed it up, and redistributed the powder. The epidemiologists found out about this and they were upset. But luckily the drugs worked, and so everybody got better. It's a famous case; I forget what the trial was called. It's such an awesome story, the collusion among patients. Turns out this is not rare. Patients are smart, active individuals. You don't do experiments on people so much as they do experiments on you, in my experience. So you worry about these kinds of effects. This instrumental variable analysis is just the correct way to analyze any experiment where the treatment cannot actually be forced, but only suggested. Lots of psychological studies are like this too: you have to know whether the individuals were paying attention to the stimulus. You say the stimulus was the treatment, but did they look at it? Did they actually notice it? So I'm going to make some of you worry about the same things; it's my job. I'll make you share in my grief. But no, we're going to do better; we're always doing the best job we can. So this is the intent-to-treat interpretation of instrumental variables as well. Okay. Generatively, what are we going to do? Again, you should go home and walk through this simulation in the book; I'm necessarily going to move through it quickly. I just want to give you the idea. The underlying model is: we take this DAG and we simulate from it. Here's the generative way to think about it. First, there's a model for wages. Wages are a function of education and the confound. That's what you see here: we're going to generate wages as a standardized, centered variable with a normal distribution. There's some intercept, there's some effect of education measured by this coefficient beta_EW, and then there's the confound, U_i. It isn't measured, but in the simulation it's there. Then there's an education model. Education has two arrows entering into it, Q and U. So it's influenced by Q, with a coefficient for that, and by U. I've made the effect of U just one in both equations, to make the example easy, but it could be different in the two components; the confound could affect one positively and the other negatively. There's a simulation in the text where I do that, where I flip the signs and show you what happens. And then we have to generate quarter of birth. I assert that there's a 25% chance you're born in the first quarter of the year. That's probably not exactly true, right? Humans have seasonal births, but just for the sake of the example, there we go. And then I just generate the confound as a z-score. We simulate this. Here's the code; you'll go home and you'll draw the owl. The part I want to point to is the red part here: I'm making the effect of education on wages zero, for the sake of this example, to show you the power of the confound and how it's removed. So let's assert, and I don't believe this is true, I'm an educator, but let's assert that education has no effect on wages. It could be true. I mean, it could all just be social networks, right? So people say: you go to college to meet other people who will make money. So people say. I should stop talking. Education is great. Do it forever.
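Here is a sketch of that generative simulation, my own paraphrase of the setup just described; the chapter's code may differ in details like sample size and variable names.

```r
# Sketch of the instrumental-variable simulation described above.
# The effect of education on wages is set to 0: no true effect in this world.
library(rethinking)   # for standardize()
set.seed(73)
N <- 500
U_sim <- rnorm(N)                               # unmeasured confound
Q_sim <- sample(1:4, size = N, replace = TRUE)  # quarter of birth, uniform for simplicity
E_sim <- rnorm(N, U_sim + Q_sim)                # education: affected by U and by Q
W_sim <- rnorm(N, U_sim + 0 * E_sim)            # wages: affected by U, not by E
dat_sim <- list(
  W = standardize(W_sim),
  E = standardize(E_sim),
  Q = standardize(Q_sim)
)
```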
I never stopped. I never dropped out of school. I'm still in school. So here's the model where we don't use the instrument; we just run the naive regression, conditioning wages on education. And I want you to see, in the table at the bottom, that we estimate a very strong and reliable effect of education on wages because of the confound. This is purely the confound. This is the terror of causal inference, right, and the reason we want to think about DAGs: you can't interpret these things naively. Now let's add the instrument. One way to do this is to just write down the model I already showed you, but then you'd have to estimate all the U's, which is actually possible; you can make each little U a parameter. I know this sounds like madness, but you could do it. If you want to see how, I'll show you sometime. But there's a way better way to do it, so that you don't need hundreds of parameters; there are 500 individuals in this sample, so you'd need 500 U parameters, which would be the confound effects. Instead, you can just use a multivariate normal as your outcome distribution. So now, as the likelihood, the top part of the model, we're going to say W and E are drawn from a common distribution with some correlation. Where does that correlation come from? It comes from the confound. And then we condition on education, but only in the right place. So let me show you how this works. We've got this covariance model at the top. The first part of the path we're going to worry about is the model that connects the instrument to education; I've circled it here. There's a linear model for the mean of education, the expected amount of education, which is some intercept plus your quarter of birth times the effect of the instrument, which we're going to estimate: how strong the instrument is at manipulating, at nudging. That's the first path. Does that make sense? And simultaneously, we're going to run the other model, the effect of education on wages. It has a linear model which says the expected wage is some intercept plus some effect of education. And then at the top, we deal with the confound by having a covariance matrix that we estimate at the same time as we fit these two linear regressions. It has a correlation parameter inside it, just like the correlation parameters you've seen before. And that correlation parameter gives you information about the confound: how strong is it? All this unmeasured stuff that generates common, correlated deviations in education and wages, after we've conditioned on all the other things, arises from that confound effect. So the correlation embodies this path, the one I've drawn with the two curved arrows here, going through U. Does that make sense? This is the same generative model, but effectively what we've done is marginalize over all those little U values that we would otherwise have to estimate, because we don't care about them; we just care about the correlation structure in them. So we make a covariance matrix for it. Again, you'll go home and look at the code, but it's pretty short. I just want to show you that now there's a multi_normal as the top line of your model. It's the probability of the data, and you've got this correlated structure between pairs of W's and E's after you've conditioned on Q. And when you run this, of course it works, because this is an example in stats class, right? Things always work here.
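For reference, the statistical model just described looks roughly like this in ulam. This is a sketch from memory of the chapter; check the text for the exact code.

```r
# Sketch of the instrumental-variable model with a multivariate normal outcome.
# dat_sim is the simulated list from the sketch above.
m_iv <- ulam(
  alist(
    c(W, E) ~ multi_normal(c(muW, muE), Rho, Sigma),  # joint outcome with residual correlation
    muW <- aW + bEW * E,     # wages model: effect of education
    muE <- aE + bQE * Q,     # education model: effect of the instrument
    c(aW, aE) ~ normal(0, 0.2),
    c(bEW, bQE) ~ normal(0, 0.5),
    Rho ~ lkj_corr(2),       # residual correlation: the confound shows up here
    Sigma ~ exponential(1)
  ), data = dat_sim, chains = 4, cores = 4
)
precis(m_iv, depth = 3)
```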
Then you go home, and I know: not always. The first things to look at are the regression effects. bQE is the strength of the instrument. It's positive; we assumed it was, and there's no confound there, so you can estimate, in this case at least, the impact of the instrument on education. And now look at our unconfounded estimate of the effect of education: no effect of education. No good for you. Just drop out. No; stay in school, kids. But this is the right inference given the simulation, which is what we wanted: it's de-confounded. And notice that the correlation, Rho[1,2], is the correlation between education and wages after having conditioned on the instrument, right? So this is the effect of the confound. The confound here is generating positive correlations between education and wages, which tells you something, gives you clues about what it could be, so that maybe you could go out and measure it. There's an example in the chapter where I re-simulate this so that the confound generates a negative correlation between education and wages. That would give you different hypotheses about what it could be, right? Different mechanisms would generate a negative correlation. But you can pick both up from exactly the same stats model; it doesn't have to be a positive correlation. Okay. Before I move on, I want to say: instruments are not magic. You've got to find a good one, and you can't test from the data alone whether you have a good instrument. You just can't. It's a DAG. You have to believe the DAG, and you're going to argue about the DAG. And I think the literature is full of actually ridiculous instruments. I have an example in the text, from a real paper that I pick on, that I put in there, and I ask you to read it. Sometimes instruments are valid but incredibly weak, because the natural experiment makes very, very gentle nudges. It's like you suggest a treatment to somebody and then they don't take any of the pills. And then the instrument is so weak that it doesn't save you, and you still can't do causal inference. If the instrument is super weak, that is, if the path going from the instrument to the exposure isn't very strong, you're sunk. It's just not a very good instrument. I think this is the problem with a lot of Mendelian randomization studies: individual SNPs don't actually do much to your phenotype. That's a big problem. Okay. And I just want to introduce you to the fact that there are yet other kinds of things you can discover in causal graphs. There's this famous thing called the front-door criterion, where you use mediation hypotheses to make valid causal inferences in the face of unmeasured confounds. I don't have an example of that for you, but I'm going to add one to the text. I promise. Okay. Last bit here: one of my favorite types of models. This is an example that comes from a close collaborator of mine. Often in the social sciences and in organismal biology, we're interested in dyadic relationships among units. Actually, people who do cell biology have models like this too, right? And this is complicated, because there's this big field of interactions, and pulling apart the special dyadic relationships between nodes can be complicated given that. You can't just average across all the mass behavior. But that behavior is described by special kinds of covariances in the behavior of the units. And so it's a kind of covariance model.
These sorts of models are often called social relations models, or at least that's one way to do this. Let me give you an example. As I said, this comes from a collaborator of mine, Jeremy Koster. These are his field data, and they're in the rethinking package as KosterLeckie. The data are 25 households from rural Nicaragua, and the outcome is gifts, usually gifts of meat, from one household to another. I'm an anthropologist; this is what we do. We follow where the bits of the armadillo go, right? Which household did they go to? And you can track them all, all the bits of the armadillo. It's all eaten; you figure out where it goes. And there's lots of reciprocity in these networks, but measuring it is tricky, because there are also generalized effects going on. So with these 25 households, we end up with 300 dyads. That's just all the combinations, right? I put the R code up here so you can prove it to yourself: if you want all the combinations of the labels one to 25 taken two at a time, there are 300 of them, and the combn function in R will do that for you. So this data set has 300 rows, each row a dyad, and we're looking at the gifts that flow in both directions within each dyad. I've plotted them for you here as raw data: gifts from A to B in each dyad on the horizontal, from B to A on the vertical, as a scatter plot. There's a modest correlation of 0.24. This is not the way to measure reciprocity, I assert. But there are lots of papers that measure reciprocity this way, both in the primate literature and in the human literature. We've got to do better, and that's what this is about. Koster and Leckie did a lot better, and I'm going to show you the way they did it. Thinking about this generatively, here's the problem with just reading off that 0.24 correlation: it's produced both by generalized effects and by household-specific, dyadic effects. There are also lots of predictors in this data set, like kinship and distance between the households and lots of other things. We're going to ignore all that for the moment and just describe the raw covariation, okay? But all that stuff, if you go and read the Koster and Leckie paper, you'll see the impact of things like kinship, which is very strong. It is a matrilineal group; matrilineal kinship is a really incredible predictor of gifts. So we think about the count y_AB of gifts from household A to household B in a dyad. There's some average giving rate in the population, alpha, on the log scale. And there are also generalized generosity and receiving rates for each household, and we need to account for those at the same time as we account for reciprocity. Some households are really generous; they give to a lot of other households. We try to capture this with a giving offset, this g_A: a varying effect for each household, which is how generous it is in general. And it's true in these data, as you'll see: some households are really generous and give to lots of other households, and other households are incredibly stingy and give to almost nobody. There are also generalized receiving effects, and we model these with a varying effect r for each B, because some households receive a lot, independent of how much they give. These can be independent effects, but for each household we want to measure both. And these generalized effects are contaminating that correlation graph that we saw before.
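Before going on, here's a quick check of the dyad structure and that naive correlation, using the packaged data; this is just the bookkeeping referred to above, not the model.

```r
# Quick check of the dyad structure (the data ship with rethinking as KosterLeckie).
library(rethinking)
data(KosterLeckie)          # loads kl_dyads (one row per dyad) and kl_households
nrow(t(combn(25, 2)))       # 300: all unordered pairs of 25 households
nrow(kl_dyads)              # also 300, one row per dyad
cor(kl_dyads$giftsAB, kl_dyads$giftsBA)   # the modest raw correlation discussed above
plot(kl_dyads$giftsAB, kl_dyads$giftsBA)
```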
Yeah, because there will be some households that receive a lot but don't give anything, and vice versa. And then there are the dyad effects. Considering only the specific households A and B: how much does A tend to give to B? And how does that correlate, which is what we're going to estimate, with how much B gives to A? That's the reciprocity measure at the dyadic level that we want to get. So the g and r effects here are not reciprocal; they have nothing to do with dyads. They're just A with regard to everybody and B with regard to everybody, where A and B are arbitrary labels that change across the rows in the data set. And then the d's are the specific dyadic effects. Everything in this model here is a parameter, right? Well, there's data in there too: the labels A and B are data, and there's an index for the household. But the rest are all just varying effects. How do we fit this crazy model? We need covariances. Yay. So we're going to have two covariance matrices in this model. One is for the generalized effects. Think of it this way: each household i has two parameters describing it, g and r, the generalized giving offset, which is how much more than alpha it tends to give, and the generalized receiving offset, which is how much more than alpha it tends to receive from any donor. And these could be correlated. In fact, they turn out to be correlated; we'll see why that's the case. So we set this up as a varying effect. It's a two-by-two covariance matrix, exactly like the ones you've seen before. There's a single correlation parameter in there, the variation in the giving offsets, and the variation in the receiving offsets. Yeah. Good. So nothing new here, really; you just have to wrap your mind around the somewhat unusual data structure we're applying it to. The next thing is the dyad effects. For each dyad ij, where i is a household and j is a household, there are two varying effects to estimate, and those are the donation offsets in both directions. So if there's a dyad that includes, you know, our labels A and B, A could tend to give a lot to B and B very little to A. If that's true across a bunch of households, then these pairs of d parameters will have a low correlation; there's no particular pattern to them. If instead, when one is large, the other tends to be large, then the correlation will be positive. It could also be negative, because households end up in dependency relationships. This happens, not in this data set, but in other data sets it can, where the dyadic relationships are ones in which there's a support household that gives a lot to one particular other household. This happens in some residence systems, where parents and grandparents give a lot of gifts to particular households where they have descendants, but the descendants aren't giving back, because they're not obligated to. Something like that. Or in Europe it can go the other way: your parents age, and you give a lot to them, and they give less to you. It depends on the kinship system and how it works. So we capture these effects by making a covariance matrix. But this covariance matrix is new and special: it only has one sigma inside it. Why? Because it's symmetric. The labels A and B are arbitrary, so the variation in one of these d's has to be the same as the variation in the other; they're the same type of parameter, right?
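A tiny sketch of that construction, my own illustration with made-up numbers rather than the chapter's code: one sigma and one correlation, copied into a symmetric two-by-two covariance matrix.

```r
# Building the symmetric dyad covariance matrix from one sigma and one rho.
# (Made-up values for illustration; in the model these are estimated.)
sigma_d <- 1.1
rho_d   <- 0.8
S_dyad <- matrix(c(sigma_d^2,                 rho_d * sigma_d * sigma_d,
                   rho_d * sigma_d * sigma_d, sigma_d^2), nrow = 2)
S_dyad   # same variance on the diagonal, because the A/B labels are arbitrary
```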
So they have the same variance, because A and B are assigned at random. It's like, if you watch Olympic judo, somebody wears the blue shorts and somebody wears the red shorts, but that's random. The variance in winning between the red shorts and the blue shorts has to be the same, because it's randomized. So the variance for the A households giving to B and for the B households giving to A has to be the same in these data. Sorry, did the judo example help? Olympic judo is great, actually; it's my favorite part. But the shorts are randomized, right? I think. You don't have to bring your own shorts; they give you shorts. So this covariance matrix only has two parameters in it: it has one correlation, but only one sigma, and we make the whole covariance matrix from that. Does it make sense? In the text, I show you how to do that with code. It's not hard; you just copy a sigma twice into a matrix. It's actually pretty easy to do, but the details are boring. Oh, this is the slide I flashed ahead to when I showed you that: there's just one sigma, sigma_D. So what happens when you run this model? First, let's think about the generalized effects, the g-r effects, as I call them. We've got this Rho correlation matrix for the g-r effects and then the two standard deviations for it; I put them up here. The thing I want you to focus on is the correlation. There's Rho_gr[1,2] and [2,1]; they're the same parameter. It's minus 0.4 on average; that's the posterior mean, and almost all the mass is below zero. There's a negative correlation. What does that look like on the graph? Here I plot them for you, on the outcome scale, the number-of-gifts scale, not as offsets. Each point is one of the 25 households: generalized giving for that household against generalized receiving. You see the negative correlation; it slopes down to the right. What this means is that households that give a lot, those are the ones on the far right of this graph, tend to receive less. Why? Because they're rich, doing really well, and they either give to a lot of households or they're begged from a lot. And then you have poor households on the left, which give less because they don't have as much, but they receive more need-based transfers from other households. In general, not just dyadically; this is having conditioned on the dyad effects in the same model. So this is just generalized giving and receiving, and you see the kind of need-based structure of gift-giving among these households. If you want to draw these fun little ellipses, the code's in the text; these are the 50% posterior compatibility ellipses. We've got parameters on both axes now, so the uncertainty is some cloud, and I've drawn little ellipses to give you an idea of how that goes. If you want to draw fancy ellipses yourself, the code's in the text. It's easy: there's this package called ellipse, and you give it a covariance matrix and it draws an ellipse. That's all it is. Okay. Now let's think about the dyads. For the dyad effects, again, let's look at the correlation. There's a covariance matrix for the dyads with one correlation parameter in it; you see it here as Rho_D[1,2]. It's 0.88. It's really, really high. So, having conditioned on the generalized giving and receiving, the dyadic reciprocity is extremely high.
If you plot it, it just looks crazy. Some of these deviations are small. I think a lot of this has to do with zeros; zeros are very balanced in this data set. Households that never give to one another, having taken out the generalized effects: there's lots of reciprocity in the zeros. And then there's also quite a lot of reciprocity in special kin relationships in this data set as well. But having taken out the general effects, there's tons of reciprocity in these data. I think this kind of analysis is dying to be done with primate grooming networks; there are primatologists in the audience. I don't think it's been done very well. Joan Silk has a cool paper where she does a version of a model like this, actually, and that's the only case I've seen of somebody who's done it. But there are lots of avenues for doing these things. Okay. You've got to go home and draw the owl. This model runs fast, even though it seems like it's crazy. You're estimating 600 dyad parameters in this model, right, because there are two d's for every dyad and there are 300 dyads. No problem. It goes pretty fast. It's great. By the way, yes, you have to use Cholesky. Monsieur Cholesky is here to help you. Commandant Cholesky, sorry, is here to help you. And it samples really smoothly: zero divergent transitions, I promise. Okay. I have written a homework set that I'm very pleased with, I have to say. I did it on the train yesterday. You're going to go back to the Bangladesh data, the analysis of it that you're about to turn in today, that first analysis. You're going to do random slopes with that same data set, and then you're going to start adding predictor variables. There are two interesting predictor variables in the data set: the woman's age and the number of kids she already has. And both of them have big effects on contraceptive use. And I want you to, yes, draw a DAG. You knew it was coming. Then use that DAG to analyze the data set and the causal influence of age and number of kids on contraception, in a big random-effect structure. Fun, yeah? Thank you. Somebody nodded. I appreciate it. Okay, have a good weekend, and when I see you on Monday, we will continue with covariance even more aggressively. Thank you.