If you've come this far in the course, then you've already drawn a lot of Bayesian owls. This is just an analogy, of course; I'm not asking you to draw any real owls. But the analyses we've produced in this course involve a lot of steps, just like real drawings of owls. There's a foundation laid down that eventually vanishes underneath the later detail. And it's also true that there's no separating the technical aspects of the drawing, the almost algorithmic steps by which general design is replaced by detail, from the artistic inspiration, that is, the imagination of what this will all look like in the end, the estimate we're trying to get in the first place. In these later lectures in the second half of the course, I think you can feel this blending very strongly: it's extremely technical material, but it's also true that the drawings are much more satisfying once they're done, because they're effective, mature, scientific projects that describe whole samples and get us what we're after. In this lecture, I want to talk about another way to expand the varying effects strategy, and in this case we're going to think about lots of technical and scientific things. I want to start by reminding you of the scientific goal, though: what the owl is supposed to look like at the end, what we're trying to do in the first place. I want to do this because varying effects are very technical in their details. To express them, program them, and interpret them correctly requires a lot of technical skill. You can acquire that skill through instruction and practice, but it's important that you know why you're doing it. So let's focus on that a bit. To remind you, the varying effects strategy is to use the unmeasured features of the clusters in your data. These clusters are things like departments or individuals or stories. These clusters have unmeasured features, and these unmeasured features leave an imprint on the data. We can measure these unmeasured features through the use of repeat observations on each cluster, and we can make better estimates through the use of partial pooling at the same time. There are two ways to think about why this is such a great technology to have as part of our scientific analysis skills. The first is the purely predictive perspective. The unmeasured features in the data are an important source of cluster-level variation. That is, things like departments or countries or individuals introduce variation in the data through things that we haven't measured, but there's a way statistically to take that variation out, so that we can see the predictive importance of other aspects more clearly, and this will allow us to make better predictions for future samples. Simultaneously, we get regularized estimates through the use of partial pooling, and that will give us better predictions out of sample. The second is the purely causal perspective. The causal perspective, as I keep saying, is a certain kind of predictive perspective: it's about being able to predict the consequences of interventions. From this perspective, it's important that we statistically adjust for competing causes, because that gives us better, more precise estimates of the causes of interest. And it's even more important that we do something about unobserved confounds, and clusters sometimes are unobserved confounds. Let me give you some examples.
Think back to the first half of the course, when I introduced confounding with the example of marriage rate, age at marriage, and divorce. In that example, it turned out that age at marriage appears to be the strong driver of divorce rate in North America, not the marriage rate M. But of course there are lots of potential confounds in systems like this: regional cultural differences between parts of the United States that cause differences in age at marriage, and possibly also differences in divorce rate. So the association between age at marriage and divorce that we found may not be an accurate measure of the causal force of age at marriage on divorce, because of these regional issues. If we were to revisit that analysis, we might want to consider regional, geographical, and historical cultural relationships between the different states that may also account for associations between age at marriage and divorce. Another example: when I introduced collider bias, I used the example of the educational achievement of grandparents, parents, and children, and I introduced an explicit confound of neighborhood. That is, parents and their children share exposures from their neighborhood that grandparents don't share with them, and this could generate more similarity between parents and their children in their educational outcomes, or their incomes, or any number of other things not shared with grandparents. If we could use the neighborhoods' identities and have repeat observations on each neighborhood, then through a varying effects strategy it's possible to get some statistical control on such confounds, even though we don't know what it is about neighborhoods that influences parents and their children and makes them more similar. Another example: remember the trolley problems. This was one of the most complicated DAGs I've used so far in this course, but the relevant bits I want to draw your attention to are the participation nodes at the very bottom. In this example I've made it a bit simpler: only education E drives participation P in the sample, in the survey. Remember, this was a voluntary online survey. By itself, that's not a confound. If the sample is selected on education, you just need to post-stratify to some other educational distribution, but no confound is necessarily introduced, because no backdoor paths or colliders are created in the DAG on the screen just by conditioning on participation P. However, look at the top at the individual nodes U that I've drawn. Those are the individual identities, and we have many repeat observations, I think it's 30 for every individual in the sample. What if there are aspects of the individuals we haven't measured, like their personalities or their cultural or religious backgrounds, that influence both their responses to the trolley problems, represented by R at the center of this graph, and their participation in the sample? Imagine, for example, that individuals who have a particularly strong moral psychology are more likely to participate in a study that is about moral psychology, and will have different responses from a typical person in the population. In that case, conditioning on participation does create a confound through the individual features that we haven't measured. But if we can use repeat observations on individuals through a varying effects strategy, we can get some statistical control on such a confound. One more example.
A long-standing question in political science and international relations is the extent to which different forms of government are more or less likely to lead countries to go to war. In this DAG, I'm asking you to consider some particular pair of countries, a dyad, numbers one and two. We have their governmental forms at different periods of time, G1 and G2, and then in the middle is the outcome, W for war: whether they're at peace or at war. We're interested in the relationships between the forms of government and the chances that pairs of countries go to war with one another. There's a long-standing hypothesis that democracies don't go to war with one another, that democracies only go to war with non-democracies. There are a number of potential confounds represented here by the X's, other explanatory variables, and the X's could be geography or any number of other things that may also lead countries to go to war. In cases like this, there are likely to be lots of unmeasured confounds associated with the political boundaries of the countries, here represented just by these nodes for nation in each case. It isn't that the nation itself is causal; rather, there are historical and cultural features associated with nation states that also influence their governments and the chances they go to war, and that are therefore statistical confounds in the orthodox sense. If we have repeat observations on each nation, however, there's some hope that we can use a varying effects strategy to get some statistical control on those confounds. Okay, so that's the causal perspective on varying effects as a useful tool, one of the really essential tools in our causal inference toolbox. There's a big advantage here over an alternative and very similar approach that's often used for the same reasons, called the fixed effects approach. The fixed effects approach is, from a statistical perspective, essentially varying effects in which we set the standard deviation of those varying effects to infinity. We fix it at infinity; we don't try to learn it from the sample. What this does is result in no pooling across the clusters, so it treats every department or every cafe or every nation as if it contains no information about the other nations in the sample. There's no pooling, no partial pooling. The fixed effects approach can be very effective, but it has some serious drawbacks. In particular, it does not allow you to include other cluster-level causes: aspects of each cluster which are time-invariant across the repeat observations, but which may also be important objects of study. For example, in this example on the right that I'm repeating, on whether democracies go to war with one another, suppose there are aspects of the X's, like geography, which are invariant over the time series of interest, because geography doesn't change very rapidly. With a fixed effects approach, you can't include such predictors, because the fixed effects soak up everything that is constant about each nation. With the varying effects approach, you can include it all. There are no guarantees, of course, but don't panic. As always, the right thing to do is make a generative model first and draw the Bayesian owl: figure out whether, with the data at hand and the design you plan, it's possible to get the inference you want. Okay. In the previous lecture, I introduced a distinction in varying effects models between clusters and features.
Clusters are entities in the sample, things like tanks of tadpoles, stories in the trolley problems, individuals, or departments. Each cluster has repeat observations in the data. And then there are features. Features are things about the clusters which we want to measure, or which may be causes of something we want to measure. So survival rates could vary by tank, treatment effects could vary by story, responses could vary by individual, and admission rates and biases could vary by department. In the previous lecture, I focused on the clusters: how do we add more clusters to a model? Because often, in a realistic scientific problem, there are a number of different clusters at different levels of hierarchy in the population that are present in the same sample. But now we're going to look at features, and I'll show you how to add features as well, in combination with more clusters. What results is this: adding features means adding more parameters, but not more population priors, simply more dimensions in each population prior. So let's talk about what that means. We're committed to having one prior distribution for each cluster type, because we're thinking there's one statistical population. It doesn't have to be a real empirical population, just a statistical population of clusters that we're sampling from. They can be treatments, they can be individuals, they can be tanks, they can be departments. It'll help if we walk our way up in complexity here. When there's only one feature for the cluster of interest, the population prior we assign is just a one-dimensional distribution, and this is what we did with the tadpoles in the very first varying effects example. This alpha-j is a normal distribution with some mean alpha-bar and some standard deviation sigma. It's a one-dimensional distribution because when we sample from it once, we get one number, just one alpha-j, just one log-odds of mortality for a tank of tadpoles, for example. What happens when we have two features per cluster? Then we have a two-dimensional population distribution, and now when we sample from that distribution we get two numbers. In the example I've put here, we get an alpha and a beta. I haven't said what these alphas and betas are, but they're just two features of each cluster in the population. The most convenient way to do this is to use a multivariate normal, which is just more than one normal distribution joined together, with some covariance among the different dimensions. A multivariate normal is parameterized by a list of means, a vector of means, here alpha-bar and beta-bar for the two-dimensional case, and then some covariance matrix, often written as capital Sigma. The multivariate normal is incredibly useful and computationally very flexible. We can extend it to very large numbers of dimensions, such that for some number n of features, where n can be 10, 20, 30, we use an n-dimensional distribution, and every time we sample from this n-dimensional multivariate normal, we get n numbers. Those n numbers are a list of features that have some pattern of covariation among them, some correlation structure, within each entity, within each cluster, within each department, within each tank, within each story, within each individual. This strategy works very well, conceptually and computationally, with a little bit of practice.
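To make that concrete, here is a minimal sketch in R of what sampling two features per cluster from such a population prior looks like. It uses MASS::mvrnorm, and all the numbers here (the means, standard deviations, and correlation) are made-up values for illustration only:

```r
library(MASS)  # for mvrnorm

set.seed(7)
abar <- -1.0 ; bbar <- 0.0       # made-up population means for alpha and beta
sigma_a <- 1.0 ; sigma_b <- 0.5  # made-up standard deviations
rho <- 0.4                       # made-up correlation between the two features

Mu <- c(abar, bbar)
# build the 2x2 covariance matrix from the standard deviations and correlation
Sigma <- matrix(c(sigma_a^2,               rho * sigma_a * sigma_b,
                  rho * sigma_a * sigma_b, sigma_b^2), nrow = 2)

# each row is one cluster: a pair (alpha, beta) drawn together
clusters <- mvrnorm(5, mu = Mu, Sigma = Sigma)
clusters
```

Each draw returns one alpha and one beta together, so clusters with unusually large alphas tend to come with larger betas. That joint structure is exactly what the model will try to learn from data.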
The hard part for these models is learning the associations, not the dimensionality. Scaling up in dimensions is easy; learning the associations is hard. Let's think about learning the associations conceptually first, and then we'll turn to the code. So put the code out of your mind and let's just think, in terms of pictures, about what comes out of these models. What I'm showing on the slide here is a population distribution: on the horizontal axis, the average log-odds of admission for applications from men, and on the vertical axis, the average difference between applications from women and men, for the UC Berkeley admissions data. This is a hypothetical population, just a prior right now, a two-dimensional Gaussian distribution, and you're looking down on it from above. It's like a hill: taller in the middle and flatter as we move out into the outer rings. There's no correlation structure in this; it's just round all around, not tilted in any way. So, a priori, there's no association between a department's chance of admitting applications from men and any bias; the vertical axis here represents that bias, some difference in the admission rates between men and women. Now let's think about a series of departments that we'll get data from, using the varying effects model. This is like the coffee golem example from a couple of lectures back, but now with graduate school applications as our example. Think about the first department. Before we've seen any data from it, we use the population distribution to inform our prior expectations of it. What I've done on this graph is draw the posterior distribution for each department in red, and then a blue cross to indicate the mean from the population distribution, just the mean for the horizontal axis and the vertical axis. We're going to animate this so that you can understand how these models learn. We're going to look at five different departments simultaneously. Don't worry, we're going to update them one at a time, and I want to show you, as with the coffee golem example, how this kind of model learns, a model with memory. The first thing we do is look at the whole pile of applications from men for department one, and what this does is update the posterior distribution for department one along the horizontal axis. In gray on each of these graphs, I'm showing you the posterior distribution from the previous step. In other words, I'm showing you the prior in each place in gray. I'll say that again: I'm showing the prior for each department, and for the population, in gray. This makes it easier to see what has changed when we inspect applications from any particular department. So the first thing to notice for department one: most of the movement from the prior has been along the horizontal axis. There's been a contraction of variation on the horizontal axis, because we've gotten a lot of information about the admission rate for applications from men. We haven't learned anything yet in particular about the vertical axis, about the difference in admission rates between women and men; we haven't looked at that stack of applications yet. We're going to do that next. But before we do, look at the other departments, and you'll see that they've moved as well. Even though we have no data for them yet, we haven't looked at any applications from the other departments, they've moved, because the population distribution has moved.
We've just looked at, I think, a hundred applications from department one, only from men, and this has shifted the population distribution in the upper left to the right, and that has had an effect on the prior for all of the departments. This is standard varying effects behavior that we've seen in the two previous lectures. Now we look at the stack of applications from women for department one, and we update with that as well. It turns out that in this particular department women are advantaged, so it's above the mean. What I want you to see is this: of course department one's posterior is now strongly contracted in both directions, but look at the population of departments in the upper left, the blue rings. You'll see that it's starting to tilt to the right, and that indicates some correlation at the department level between the values on the horizontal axis and the vertical axis. That is to say, departments that have large values of A, the parameter on the horizontal axis, will also tend to have larger values of B. This is one department so far, so the model is not very confident, and you'll see that the other departments have likewise tilted a bit. Let's look at department two now. In department two, we look at just the applications from men, and it happens by chance to be right on the population average. We haven't seen the applications from women yet, but the best guess is that since this department is average for A, it will be average for B, and that's why it's right there in the middle of the cross. Now we update, and it turns out it's also average for B, and we get this little contraction. Let's look at department three. We look at the applications from men for department three, and it's a little below average. I say that because it's to the left of the vertical blue line. Remember, the vertical blue line shows the population average across departments for the horizontal-axis variable A, which is the log-odds of admitting a male applicant. Now we update for women, and it's a little bit below as well. The golem has now seen both a department with a higher-than-average A and a higher-than-average B, and a department with a lower-than-average A and a lower-than-average B, and so if you look at the population distribution in the upper left, the correlation has gotten more confident. It's tilting more. The model now thinks there's some correlation across departments between A and B. Let's see if that pans out, and what it does to the learning and the anticipation of the model. So now department four. We have a value that's at approximately minus one on the log-odds scale, and notice that even though the model hasn't seen any applications from women for department four yet, it expects the distribution to be less than average: the center of the red distribution is below the horizontal blue line. That comes from the correlation in the upper left, the correlation across departments between A and B. And now we update, and the model's anticipation is confirmed: the departments with lower A also have lower Bs.
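That anticipation is just the conditional mean of a bivariate normal: once a correlation has been learned, seeing A shifts the expectation for B. Here's a minimal sketch in R of the mechanism, with made-up numbers:

```r
# conditional mean of B given A in a bivariate normal:
#   E[B | A = a] = bbar + rho * (sigma_b / sigma_a) * (a - abar)
abar <- -0.5 ; bbar <- 0.2       # made-up population means
sigma_a <- 1.0 ; sigma_b <- 0.5  # made-up standard deviations
rho <- 0.8                       # made-up learned correlation

a_obs <- -1.5  # a department well below average on A
bbar + rho * (sigma_b / sigma_a) * (a_obs - abar)  # -0.2: expected B, below bbar
```

With rho at zero, seeing A would tell the model nothing about B; the stronger the learned correlation, the stronger the anticipation.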
Now the model is quite confident there's some correlation. Let's see what happens with department five. Before we look at department five's data, you can see that the model already expects a correlation: there's a tilt to this Gaussian hill. We look at department five's applications from men, and it's far below average, the lowest we've seen yet. And before we even look at the applications from women, the model expects the B value to be below the population mean, below the horizontal blue line. Then we look, and we find it that way. This is what Bayesian updating looks like in a multi-dimensional varying effects model: the model is not only learning and partially pooling more than one feature at the same time, the features here being the horizontal and vertical axes of these graphs, that is, the log-odds of admission for applications from men and the difference in log-odds of admission between applications from women and men, it's also estimating the correlation between them. What this lets it do is pass information from one feature to the next. To back up one step: before the model saw the applications from women in department five, it was already able to accurately predict that the value of B for this department would be below the mean. That's the predictive and causal inference value of practicing varying effects in multiple dimensions while simultaneously learning the correlations between the features. It'll help a lot if we look at this in the context of a real data analysis example, so that we can see how to really draw the owl, how to actually program it. To make this a little more approachable, because I appreciate that this is complicated material, let's revisit the prosocial chimpanzees example from the previous lecture. Just to very quickly remind you, this is an experiment on chimpanzees: six experimental blocks, around 500 trials total, and four treatments. We're interested in the causal effect of these treatments on the chimpanzees' behavior. In the previous lecture, I showed you that most of the variation in these data comes from the handedness preferences of the individual chimpanzees, and very little comes from the treatments, and that's what we were able to estimate with the simplest possible varying effects approach. Now let's think about taking this actor variation and allowing the actors' responses not just to be some average, like they are in this graph, but also to vary by treatment. That is, each actor might have a handedness preference, but it might not express that handedness preference in all treatments equally. The data are here to answer that question, so let's see how to statistically program it. Just to quickly remind you, I'll show you again the DAG for this experiment, a very simplified structural diagram. We don't anticipate any confounding, but we do have competing causes, like block and actor, that may moderate the treatment effects, and that's something we'd like to estimate as well. So this is going to be a Bernoulli model, where the outcome P is whether the individual chimpanzee pulled the left lever, and we have a generalized linear model here on the log-odds scale with four terms in it. Let me zoom in on this a bit so I can explain it. This is not unfamiliar material to you, but it's good to be clear at the start. The first term here, alpha-bar, is the mean for each actor; you see I've subscripted it by A[i], the actor on the i-th observation, the i-th lever pull.
This is going to be a varying effect; it will be partially pooled across actors. This is what we did in the previous lecture. Then, next to it, we're going to have the possibility that each actor has a unique offset for each treatment, and this is a way to measure the extent to which the treatments affect the handedness preference in different ways: they could turn it off, they could exaggerate it, and so on. Then this is the mean for each block; different blocks could have different mean rates of pulling the left lever, for reasons unknown. We didn't include a term like this before, but for symmetry I'm going to add it here, so that it looks like the actor effects. And then we have block offsets for each treatment; that is, the treatments could have been more or less effective in the different experimental blocks. This is all very similar to what we did in the previous lecture, but now we're going to have separate variances for each of the treatments, which we didn't have before. So how do we specify the multivariate normal prior, the multidimensional prior? There are four treatments, so when we think about the alpha-sub-j's, which are these offsets for each treatment, alpha-sub-j should be a list of four numbers, because there are four treatments, where j is an actor from one to seven. So this is a four-dimensional multivariate normal, and you'll see, the way I've written it here, there's a mean that is just four zeroes, because we're centering this on the alpha-bar values. Then I'm parameterizing it with a correlation matrix, represented there by the capital R, R-sub-A for actor, and a vector of standard deviations, S-sub-A. That is, there's a separate variance for each treatment within the actors, and the correlation matrix R gives us a way to learn the correlation structure within actors, across the treatments. So, for example, if some particular actor tends to pull the lever more in treatment one, they may pull it more in treatment two as well. For the blocks it looks very similar in expression, but now it's for the blocks from one to six, six experimental blocks, and for each one we need to draw from a four-dimensional normal distribution to get the effect of each block, one through six, in each of the four treatments. Again we want to model the correlations, through R-sub-B, and allow each treatment to have its own amount of variation, its own standard deviation, in the vector S-sub-B. Okay, here's the rest of the model. We also need the conventional one-dimensional varying effect priors for the alpha-bar and beta-bar vectors, which I show you here. This is just like the stuff we did in the previous lecture, just the simple one-dimensional ones, nothing new there. Then the priors at the bottom: all of the scale parameters in this model, whether they're in the vectors S-sub-A and S-sub-B, or in the scale variables for alpha-bar and beta-bar, which I've represented here with tau just so I could use a different letter (hopefully that's less confusing), I'm going to give an exponential distribution with a rate of one. And at the very bottom we have the two correlation matrices, R-sub-A and R-sub-B, and we're assigning them this prior distribution, LKJcorr. The LKJ correlation distribution is a prior distribution for correlation matrices. Correlation matrices are highly structured things, so you can't just give them any old distribution you want,
and you certainly can't just assign every element of the matrix its own distribution, because there are constraints on the relationships among the different elements of a correlation matrix. I say a lot more about this distribution in the book. I'm not going to say more about it in this lecture, just for the sake of time, so you really should look at the book and see what it does. It's a very nice family of distributions for sampling random correlation matrices and doing regularization on correlation matrices. Okay, I appreciate that this is a quite terrifying and monstrous thing that we have conjured. This is like one of the old ones. Can we survive a model like this? Yes. It's long, and it has lots of moving parts and parameters, but fundamentally it's no different than the first model we fit in the course. It's still just an expression of the relationships among the variables and parameters that allows us to count all the ways the sample could arise, given the assumptions. As long as we're careful and we build it up in steps, we maintain control, and we will not be devoured. So let's take a trip through the code. I want to show you the symmetry between the code on the left and the statistical model on the right, so you can see how the code reflects the assumptions of the model. First we have the Bernoulli outcome and the generalized linear model; I hope there are no surprises there. Now the code for the adaptive priors. To set up the multivariate normal priors, you use the multi_normal distribution, and the way we're conceptualizing this is that A is a matrix, but I've represented it instead as a list of vectors. I'll say that again: A is a matrix, but I'm representing it here as a list of vectors. The idea is that there are four treatments, and for each actor A we need to sample values from the multivariate normal for each of those treatments. So for each actor we have a vector of four numbers, and that's what this code means when you read vector of length 4, colon, A: for each actor, there's a vector of length 4 that has this distribution, a multi_normal with means of zero, a correlation matrix Rho_A (those capital R's on the previous slides are the Greek letter rho), and a vector of length 4 of standard deviations, sigma_A. The same goes for B, for the betas: its own correlation matrix and its own vector of standard deviations. Then the A-bar and B-bar you've seen before; these are ordinary one-dimensional partial pooling priors, but I'm using tau as a new symbol so that you can quickly tell them apart and not get confused. Finally, we have all the fixed priors, the non-adaptive priors, the ones that don't perform partial pooling, that don't have parameters inside them: for the taus, the sigmas, and then the correlation matrices, and you see those lkj_corr distributions there again.
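The slide code itself isn't reproduced in this transcript, so here is a sketch of how this centered model can be written with ulam from the rethinking package. The data list and variable names (dat, P, A, B, T) are my assumptions, not a verbatim reproduction, and exact distribution aliases vary slightly across package versions:

```r
library(rethinking)

# a sketch of the centered multivariate model, assuming a data list dat with
# P (pulled left lever), A (actor), B (block), and T (treatment)
m_centered <- ulam(
    alist(
        P ~ bernoulli(p),
        logit(p) <- abar[A] + a[A, T] + bbar[B] + b[B, T],
        # adaptive priors: one 4-dimensional multivariate normal draw per cluster
        vector[4]:a[A] ~ multi_normal(0, Rho_A, sigma_A),
        vector[4]:b[B] ~ multi_normal(0, Rho_B, sigma_B),
        # ordinary one-dimensional partial pooling for the actor and block means
        abar[A] ~ normal(0, tau_A),
        bbar[B] ~ normal(0, tau_B),
        # fixed priors
        c(tau_A, tau_B) ~ exponential(1),
        sigma_A ~ exponential(1),
        sigma_B ~ exponential(1),
        Rho_A ~ lkj_corr(4),
        Rho_B ~ lkj_corr(4)
    ), data = dat, chains = 4, cores = 4 )
```

The vector[4]:a[A] declaration is the "list of vectors" idea: for each actor, one draw of four correlated numbers, one per treatment.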
Okay, you can run this model, and what you'll see is that the chains don't run perfectly; they have some problems. What you're looking at here is the output of a utility function that I use for myself. It's part of the rethinking package, but it's not documented, because it's just a thing I use in my own work all the time; I'm sharing it with you here to show you the output, which is not particularly pretty. The function is called dashboard, and if you pass it a fitted Stan model, it produces this display. In the upper left is where we want to look: a plot of the number of effective samples for each parameter in the model, each point a parameter, against the R-hat for that parameter. What you see is that we have some parameters on the left which are not looking good. They have R-hats a good bit above 1, up to 1.04, and a number of them have very low numbers of effective samples. The vertical red line on the left represents 10% of the post-warmup samples drawn, and that's a good ballpark for big trouble in a Markov chain. It's just a ballpark, not a test; use it as a visual reference. You'll also see we have 37 divergent transitions. So check yourself, but divergent transitions aren't always doom. Remember, they're just rejected proposals, and other Markov chain Monte Carlo algorithms reject a lot of proposals. We get alarmed about them with Hamiltonian Monte Carlo because often it won't reject any, so when it does start rejecting things, that's a hint that something's up, and maybe we're getting biased samples: there are parts of the posterior distribution that we're not exploring well. Then you can see that the trank plot of the log-probability doesn't look good either: for long stretches, one of the chains there, the yellow one, is above or below the others. And the effective sample size is not great either. We're going to fix that after the break. Take a break, and before you come back, I really strongly encourage you to quickly review the first half of this lecture again. There's a lot of conceptual stuff that's new, and a lot of technical stuff that's new. I promise you, you don't have to understand it all to keep moving forward, you really don't, but you should understand a bit of it before you press on. And when you come back, I will be here. Okay, welcome back. To fix the model that was having so much trouble before the break, we need to revisit the lesson from the previous lecture on centered versus non-centered distributions. To refresh your memory, I had introduced the example of the devil's funnel. This is a simple two-dimensional distribution we'd like to sample from, but it has a region of really horrible curvature, and as the virtual skateboarder wanders into that narrow funnel, the simulation has a lot of trouble following that steep curvature. There are a number of things we can do to fix this. We can use smaller step sizes, for example, but the most effective thing is to reparameterize the model. That means to factor the parameter v out of the distribution for the other parameter, so we have a nice Gaussian bowl again, as you see on the right, and we get back to the target distribution that we really want by deterministically expressing x, in this example, in terms of the other two parameters. We can do this trick for all kinds of varying effects models; varying effects models often have distributions of this shape, with terrible funnels in them. To refresh your memory, we saw the centered model in the previous lecture. This is for the chimpanzees, where we have the distribution for the actors: alpha-j is normal with a mean alpha-bar and some standard deviation sigma, and the same for the blocks. It's centered because it has a parameter inside the distribution: sigma-B is inside the normal. Then the non-centered version factors all of the parameters out of the priors, replacing the priors on alpha and beta with priors on z-sub-alpha and z-sub-beta. These are just normal(0, 1) priors, because they're z-scores; that's why I call them z. Then we recompute the alpha and beta, to be mathematically the same as in the centered version, by multiplying the z-scores by the standard deviations for each cluster type, whether it's actors or blocks, and
that's what the logit line on the right looks like. Okay, good enough. Oh, I should have shown you this: yes, the z's reappear in the linear model. We can do this strategy with multi-dimensional prior distributions as well, though conceptually it's a bit odd. To remind you, here's the model we would like to fit today for the chimpanzees, the one we looked at prior to the break and sampled from, with some trouble. How can we factor the R and the S, that is, the correlation matrices and the vectors of standard deviations, out of these distributions? How do you factor a matrix out of a distribution, and then put it back into a linear model? The logit line is just a sum of terms; how can I put a matrix in there? That is the question we're going to answer over the next series of slides, and I'm going to take my time. But before I get to the meat of it, let me emphasize again the basic issue about this business: there's really no separating arriving at the scientific estimate, that is, achieving your scientific goals, from technical aspects like expressing the model correctly and getting the machine to run right. If you're going to draw the owl, you have to do it in the right steps, and in the beginning it doesn't look like an owl, but that doesn't mean you've done anything wrong. There are certainly aspects of drawing the owl which are tedious and boring and not as satisfying as looking at the finished product, and, well, that's what science is like too. So bear with me; I'm going to walk you through the steps of drawing the owl. And you really should use varying effects models with multiple dimensions, because they're now a standard part of the scientific toolkit in many scientific fields, I'd say most quantitative fields. These are ordinary models. If you don't use them yourself, your colleagues are using them, and if you want to understand your colleagues' work, you need to understand these models. And if you're going to use these models, you're going to have to grasp that, even though from a mathematical perspective it's the same model, you have to draw it differently on paper. Okay. At the top of this slide, I've repeated the head part of the model: the outcome, P-i, is a Bernoulli-distributed variable with some probability p-i, and we model that probability with the log-odds as a sum of terms. Now let's think about the alphas in that sum of terms, and what they are. Alpha is now a matrix. In fact, alpha in this model is a 7-by-4 matrix, because there are 7 actors and 4 treatments. I'll say that again: alpha in this model, as it appears in the logit-p line at the top, is a 7-by-4 matrix. In each line i, we're only pulling one cell out of that matrix, because we're only looking at one actor and one treatment, but it's a 7-by-4 matrix because there are 7 actors and 4 treatments. It turns out we can re-express this matrix in terms of other parameters, so we can do the non-centering, and here's the equation to do it. Now, this thing is monstrous, but let me walk you through it piece by piece, so you understand at least a little of what's going on. The computer's going to do this for you, so you don't have to be comfortable doing all the multiplication by hand by any means, but it's nice to know what's going on inside the machine. The first thing inside this monstrous expression is a diagonal matrix of standard deviations. We take the vector of standard deviations; there's one standard deviation for every treatment. That is: how much do actors vary in treatment one? That's S-sub-A1.
How much do actors vary in treatment two? That's S-sub-A2. In treatment three? S-sub-A3. And in treatment four? S-sub-A4. We take this vector and string it out along the diagonal of a 4-by-4 matrix, and we fill all the other cells with zeros. That's diag. This is a useful thing in lots of matrix algebra, and those of you who are familiar with matrix algebra have seen it before. For those of you who aren't so familiar with matrix algebra, that's fine; you don't need matrix algebra to understand this lesson. Matrix algebra is just a set of shortcuts for doing ordinary algebra, so you're not missing anything; it's just a quick way to do lots of calculations. The next thing is that we multiply this diagonal matrix by something called the Cholesky factor of the correlation matrix across treatments. This is something we compute from the correlation matrix across treatments, which is the thing we're trying to learn: the correlation structure among the features of this cluster type, that is, actors. We'll talk more about it in a few slides; for now, just let it be a bit mystical. It's a matrix, though. And then finally we have a matrix of z-scores, the little normal(0, 1) distributed values. It's a matrix whose margins are treatments by actors, the opposite order from alpha, and that's just to make the matrix multiplication work out well. Finally, there's this little T in the upper right of the whole thing. After those three matrices in the parentheses have been multiplied together, the result is a matrix where the rows are treatments and the columns are actors, and so we transpose it, flipping the rows and columns, so that alpha is a matrix where the rows are actors and the columns are treatments. This thing is not pleasant to look at, but it's expressing a set of basic transformations, just like the ones we did in the previous lecture to re-center alpha: we take the standard deviations, and in this case also the correlations, multiply them by z-scores, and get things back on the original scale. All the matrix algebra in this expression is just to make it more compact, not less compact, and so it's actually less confusing this way, believe it or not. In this lecture, since this is not a matrix algebra course, I'm just going to abbreviate this whole thing using this notation here: let diag be an operator that takes a vector and makes a diagonal matrix, and then this is the thing we're going to insert into our statistical models. This is just like multiplying z-scores by a standard deviation; it's the multi-dimensional equivalent of multiplying z-scores to get back on the centered scale of the parameters. We need to talk about this Cholesky factor a bit more, though. Cholesky was a man, André-Louis Cholesky. I'm not quite sure how he would have said the last name; it's pretty obviously a Polish name, and yet he was French and served in the French army, so I think more than one pronunciation is acceptable. The story here is that Cholesky was a commandant, an artillery commander, during the First World War. Oh, I'm sorry, there's a bunch of French on the screen. I took French in high school and my French is terrible, but I'm prepared to offer you my translation of the first sentence of the paper. The artillery commander Cholesky, of the geographical service of the army, killed during the Great War (he was killed in 1918, and you can see in
the lower right of this slide that that was the last year of World War I), devised, during research on the compensation of geodesic networks, a very ingenious process for solving the equations known as normal, obtained by application of the method of least squares to linear equations fewer in number than the unknowns. Maybe you recognize words in there but aren't sure exactly what it means; that's fine. What commander Cholesky figured out during his service as an artillery commander was a very general solution to systems of linear equations that can be used to do lots of clever stuff in many branches of applied mathematics. We're going to use it here to do our non-centered distributions and then put them back on the centered scale, so that we can sample efficiently from Gaussian bowls but still get the parameters we want, the ones that are not standardized. This is what it looks like in the simplest possible form. In this code block, at the top, I simply define some two-dimensional Gaussian distribution with a correlation of 0.6 between the two dimensions. So the idea is that there are two dimensions: the standard deviation of the first is 2, that's what sigma 1 means, the standard deviation of the second is a half, and they have some correlation, rho, of 0.6. That's a two-dimensional Gaussian distribution. Now we're going to do a Monte Carlo simulation of that Gaussian distribution by sampling from independent standardized normal distributions, and that's what I do in the middle of the code block. These two lines are just z-scores: rnorm of N, and if you give rnorm only the first argument, it assumes the mean is 0 and the standard deviation is 1. So we now have two vectors of random z-scores that have no relationship to one another; z1 and z2 are totally independent. Now we're going to use Cholesky's technique to blend the correlation back in and get the 2D Gaussian defined at the top of the code block. We define the target distribution using a1 and a2. We compute a1 in the familiar way, just multiplying the z-scores by sigma 1; you've seen this in the previous lecture, and it just rescales the z-scores so that they have the proper variance. Then here's the Cholesky part, that weird expression at the bottom (sorry, I should have highlighted it): a2 uses the Cholesky factor, this thing we'll talk about more in this lecture, where it turns out that for the bivariate case the proper Cholesky factor is this odd-looking expression that we can write using addition, multiplication, and a square root. It allows us to mix together the values of z1 and z2 for each case and get the proper correlation. And so you see in the output box on the right of this slide, after you run the code, that the correlation between z1 and z2 is effectively zero, because they're independent, but the correlation between a1 and a2 is about 0.6, as it should be, and the standard deviations of a1 and a2 are also correct. This is the Cholesky magic.
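The code block itself isn't reproduced in this transcript, so here is a minimal reconstruction in R of what's being described, under the stated values sigma1 = 2, sigma2 = 0.5, and rho = 0.6 (the variable names are my own):

```r
set.seed(42)
N <- 1e4
sigma1 <- 2 ; sigma2 <- 0.5 ; rho <- 0.6  # the target 2D Gaussian from the slide

# two independent vectors of z-scores (rnorm defaults: mean 0, sd 1)
z1 <- rnorm(N)
z2 <- rnorm(N)

# rescale z1 so it has the proper standard deviation
a1 <- z1 * sigma1
# the bivariate Cholesky trick: blend z1 into z2 to induce the correlation
a2 <- (rho * z1 + sqrt(1 - rho^2) * z2) * sigma2

cor(z1, z2)      # approximately 0: the z-scores are independent
cor(a1, a2)      # approximately 0.6, as it should be
sd(a1) ; sd(a2)  # approximately 2 and 0.5
```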
Okay, you're not going to have to do any of that yourself; the computer will happily do it all. It'll compute the Cholesky factors for any number of dimensions, and it'll do all the multiplication to get them back on the centered scale as well. The computer does these odd diag-matrix multiplication lines that you see on the screen now, which compute alpha and beta from their corresponding non-centered pieces. Okay, the next thing we need in this model is a bit simpler than all that: it's just to express the prior distributions for the z-scores. We have these z matrices now, for treatments by actors and treatments by blocks, and we give every element of these matrices a standardized normal prior distribution, with no correlations among the elements; the correlation matrix is going to handle that. And we're going to make the means non-centered as well, so we give them their own z distributions. This is just like the previous lecture, nothing new here; these are just one-dimensional partial pooling priors, and we get alpha-bar and beta-bar back, on the scale of the mean effects, just by multiplying the z-score for each actor by the proper standard deviation for actors, which is tau-sub-A, and then for blocks, tau-sub-B. And then finally the priors, as we expressed them before. This is the same as the model before the break; there's nothing new here. We can still express the standard deviation priors the same way, and the correlation matrix priors the same way. The only difference is that now the golem is going to have to take these correlation matrices and compute the proper Cholesky factors, the L's; every L-sub-A is computed from R-sub-A. But the machine does that for you; it's just code, and it doesn't add any complexity to the expression of the code. So let's look at the code. This is long. This really is staring into the eyes of an old one, but let's take it one piece at a time. You will maintain control; you will not be consumed, I promise you. First, the adaptive priors: these are the lines that convert the non-centered parameters, the sigmas, the Cholesky factors, and the z-scores, onto the centered scale, so we can use them in the linear model. A-bar and A, and B-bar and B, are all centered parameters that come from centered distributions, so they have to be computed from the non-centered distributions, and the two lines I've highlighted here do that. On the right, I've already explained all the matrix multiplication; on the left, in ulam, there's a convenience function for doing this calculation, called compose_noncentered. All it does is the matrix multiplication and transposition that is expressed on the right, literally all it does, and this gives us our alpha matrix and our beta matrix, which we can use directly in the generalized linear model. Next, the boring standardized normal prior distributions that get plugged into those compose_noncentered expressions just above; then the one-dimensional partial pooling distributions and conversions for A-bar and B-bar; then the fixed priors, no surprises there, I don't think. The only thing to note in the code, perhaps, is that here I've defined the priors directly on the Cholesky factors. So instead of having Rho_A, I have L_Rho_A, and instead of lkj_corr we have lkj_corr_cholesky. This is just a shortcut to reduce computational load: we do all the calculations using the Cholesky factors directly, but each is really still just something derived from a correlation matrix, and this is still the LKJ correlation prior, just as before. Finally, for convenience: you can't really interpret the Cholesky factors of a correlation matrix, so you need to convert them back to a correlation matrix in order to understand the posterior correlations among the features of each cluster. So I put a little bit of code at the very bottom to take the Cholesky factors and save them as correlation matrices, so that they show up in the posterior distribution, in the samples you extract after you run this model. Okay.
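As a sanity check on all this machinery, here is a short simulation, a sketch with made-up standard deviations, showing that the diag(S) times Cholesky-factor times Z construction really does reproduce the intended correlation matrix and standard deviations (rlkjcorr is the rethinking package's LKJ sampler):

```r
library(rethinking)  # for rlkjcorr

set.seed(11)
S   <- c(1, 0.5, 2, 0.3)            # made-up standard deviations, one per treatment
Rho <- rlkjcorr(1, K = 4, eta = 2)  # one random 4x4 correlation matrix from LKJ(2)
L   <- t(chol(Rho))                 # lower-triangular Cholesky factor of Rho

Z <- matrix(rnorm(4 * 1e4), nrow = 4)  # 4 x N matrix of independent z-scores
alpha <- t(diag(S) %*% L %*% Z)        # N x 4: each row is one simulated cluster

round(cor(alpha) - Rho, 2)     # near zero: correlations recovered
round(apply(alpha, 2, sd), 1)  # near S: standard deviations recovered
```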
This model samples a lot better than the previous one: no divergent transitions, the chains mix much better, we get higher numbers of effective samples, and the R-hats are good. On the left, I'm showing you an interesting feature. There's nothing wrong with these trank plots, except that for some of the elements of the Cholesky factor parameters (remember, a Cholesky factor is a matrix, just like the correlation matrix it's derived from) the values are invariant, because a Cholesky factor is a triangular matrix: either the upper or the lower triangle is filled with zeros. That's what you see here. Some of the elements of any Cholesky factor, when you run one of these models, are going to be invariant, and so the number of effective samples is going to be NaN, not a number, because you can't compute effective samples for something that doesn't vary. This is not a problem, and that's why I show it to you, so you know nothing's gone wrong. These elements are returned in the output, but they're not really free parameters that have been estimated; they're just products of the Cholesky calculation. The important lesson is on the right. I showed you something like this in the previous lecture, when I compared the centered model to the non-centered model, and I'm doing that again now for this example with many more dimensions, the fully multi-dimensional chimpanzee model. Again you see that the non-centered model, on the vertical axis, nearly always (except for two parameters, shown in red) has a larger effective sample size for each parameter than the centered model on the horizontal axis. That's true of this example; there are other examples where the opposite is true. So again, it's not that the non-centered version is always better; sometimes the centered version is better. But it's much easier to express the centered version, and that's why I'm teaching the non-centered version so explicitly: you would never teach the non-centered version first, because it's such a non-obvious way to code the model. If you start off with some intuition about which is better, that's great, but being able to convert between the two for each type of cluster, because sometimes one cluster needs a centered parameterization and some other cluster in the same model needs a non-centered parameterization, that comfort in switching between them is a real, essential part of being able to draw these owls. It'll make you much more powerful and much more sure of your results, and that in turn will give your colleagues much more confidence in your results. Okay, after all this work, it would be nice to see what the results are, wouldn't it? So let's take a look. What I'm plotting on this slide are the posterior distributions of all of these varying effects: on the left, the actor-treatment effects in blue, and on the right, the block-treatment effects in red. For the actor-treatment effects, the way to interpret this is that each cluster is an actor, labeled one through seven for each of the seven chimpanzees, and within each cluster the treatments are ordered from left to right, one, two, three, four. What you can see is the same pattern as in the previous lecture: the actors vary a lot in their handedness preferences. There's number two, who always pulls the left lever in every treatment, plastered up against the top of the vertical axis. And the other individuals seem to respond a bit to the treatment when the pro-social option is on the left, that is, in treatments
two and four, the second and the last. But most of the variation, as you can see just by looking, is between actors, not between treatments. Then on the right, for the block-treatment effects, it's the same kind of display: each cluster is a block now, labeled one through six, and again the treatments are ordered from left to right, one, two, three, four. It's a different pattern now, in that the blocks aren't very different; the blocks are basically all the same, and most of the variation here is actually within each block. That is, the treatments do differ, especially within block six. We can verify this by looking at the standard deviation estimates from the model. Remember, it's a varying effects model, so if we want to make assessments about these variances, and whether the variation is more within or between the clusters, we just inspect the posterior distributions of the variance components, the sigmas. In this case, I've augmented the graph to show them at the bottom. On the bottom left, we have the posterior distributions of the variance components for the actors. In blue, we have the density we saw last time: the posterior distribution of the standard deviation among actor means, the averages among actors. This is an assessment of how much actors vary in how they act on average. Then the purple densities, four of them layered over one another, are the by-treatment effects, and you'll see that those are much smaller. This reflects what you see in the graph just above: most of the variation is between actors, not within actors across treatments. The actors aren't responding much to the treatments. On the right, we have a reversal of this pattern. Well, it's not a very strong reversal, but nevertheless it is a reversal. The red density in the lower right is the one we saw before: there's not a lot of variation among blocks, so the posterior distribution of the standard deviation across blocks, that is, how different blocks are on average, is quite small. Then in purple, again four densities layered over one another, are the by-treatment effects; each of these assesses how much variation there is across treatments within each block. This is a bit larger. Now, there's not a lot of variation either within or across blocks; the total variation explained by block is small in this experiment. But to the extent that there is variation by block, it's within block, across treatments, not between blocks. If you wanted to hold out some hope that the chimpanzees were starting to behave in this experiment in a way similar to how human kids behave, then you could pick it up from block six, where the treatment effects appear to be stronger. However, they're just responding more strongly to the pro-social option being on the left in block six; they're still not responding to the presence or absence of a partner. Okay, let me sum up. This has been an introduction to correlated varying effects, and it is just an introduction; to really get a grasp on this material, you have to practice with these models, use them, make some mistakes, and recover. You have to draw the owl. So if you're feeling like there are parts of this you haven't understood yet, that's normal. That's how it's supposed to be; it only means you're paying attention. As always, if you're feeling confused, it's because you're paying attention, and I thank you for that. So let me remind you, in outline form, of what I tried to get across, and then you can review the lecture
again, perhaps at double speed, and pick up the bits that you need. The first point: varying effects models tend to work better if we use priors that also learn the correlation structure, not just the variation across the clusters. Remember, the varying effects strategy is that if we learn the prior's variance from the data, then we make better estimates. The same idea extends, in multiple dimensions, to also learning the correlations: if we learn the correlations among the feature values within each cluster, then we can make even better estimates. This allows us to have partial pooling across the features, and that's what I showed you in the department example in the first part of this lecture, where the model effectively anticipates, before it's seen the applications from women, where the average will be, and it anticipates it from the population correlation between the two parameter types. The second thing we get from priors that learn correlation structure is that they allow us to exploit those correlations for predictions. When we want to make predictions of various kinds, our predictions can be better, because not only do we know there is variation in the population, but that variation has some structure, and that structure can be very important for predicting new clusters, the ones we haven't observed in the data. This is equally important for causal inference, which is, as I keep saying, a special kind of prediction, where we try to predict the consequences of an intervention, or, thinking about it the other way around, to explain the results of one. These correlations can be important there as well, because remember, varying effects are a kind of unobserved confound, and getting good estimates of them can help us get some control on those confounds. Okay, one note before I end. You could do these models without the correlation structure. You can back up in this lecture, take the correlation matrices out, and express all of the priors in the previous model in terms of one-dimensional Gaussian distributions with their own standard deviations, that is, each treatment would have its own standard deviation. That model would run fine, it would make very similar predictions, and the varying effects in the posterior would tend to have a similar correlation structure, not exactly the same, but similar, to the posterior distribution from the model that actually learns the correlations. Another way to say this, if I can make it simple, is that priors are just priors. These models have priors with correlation structure in them, and that allows the model to learn the correlation structure across the clusters. But a model without learning of correlation structure can still have correlated varying effects in the posterior distribution, because just because the prior has no correlations, that does not mean the posterior won't. I'll say that again: just because the prior doesn't have correlations, that doesn't mean the posterior won't. An absence of correlation in a prior is not a claim that correlation is impossible, and, as always, the posterior distribution is free to go where the data takes it. That's true for all Bayesian models, unless your priors constrain the outcome space, of course. So there are going to be many applied contexts in which all the complexity in this lecture doesn't help you very much and is unnecessary; you can get away with
treating all your varying effects as one-dimensional, ignoring correlation matrices and all of these eldritch monstrosities that I've conjured in the last hour or so. And I can sympathize with the desire to do that. However, science is a profession, and we have an ethical obligation to do our best. If you know how to do better, you should be willing to do a little extra work to do better, so that you can defend your results by saying that you did the best thing you knew how to do, not that you thought the best thing would be too hard and so you compromised. Okay, that has been lecture 14, and this is week 7, of Statistical Rethinking 2022. Next week we're going to do yet more multilevel models, but I promise we're going to turn more toward applications now. I'm going to look at custom kinds of multilevel models that are useful for doing things like social network analysis, and then we're also going to look at a funny sort of multilevel model known as a Gaussian process, which is an extremely useful and very broad family of methods. I'll see you there.