Science is a profession, and like all professions, it takes many, many years to become a professional. The skills that are required take many years of practice, even when you receive expert instruction. To those who haven't received that instruction and haven't practiced, the sorts of things that professionals can do seem like magic, and they seem effortless. But of course, professionals know that it never ceases to be difficult to do these things. It's just that they've practiced a lot, and they work in ways that make the practice safe and successful. Varying effects models are like, well, blowing glass in a way. They produce beautiful things, wonderful results. They're really essential scientific instruments, but using them well requires some practice, and also some safe working habits. So this lecture will focus on that. But let me begin by reminding you what varying effects are. In the previous lecture last week, I introduced you to varying effects through this tadpole survival example. I'm repeating it here; I hope you remember this figure. And what I showed you, I hope I showed you, is that if we take a sort of ordinary model in which we estimate survival across a range of different units in the data, we always have a choice of what the prior should be for the parameters that estimate the features of each unit, and in particular, how wide the prior should be. In this example, that's the sigma parameter that determines the standard deviation of the normal distribution assigned to the log odds of survival for each tank. For very small sigma values, like on the left of the graph on the bottom part of this slide, we call this the complete pooling kind of model. If we just fix sigma at a very small value, that's like assuming that all of the different units in the data, all of the tadpole tanks in this case, are the same.
And so if we see data for any one of them, we update our estimates for all of the tanks using that same information, because we've assumed they're all the same, and so information about any one tank tells us about all of the tanks. The population is represented by one number essentially, the overall mean. This is the complete pooling approach. In the middle, as we increase sigma, the scale variable, we get partial pooling, where there's some mix between information particular to each tank, each unit in the data, and information that comes from the whole collection of tanks, which is the complete pooling kind of information. And then if we increase sigma far enough off the right-hand side of this graph, we reach, at infinity, but actually long before that, the no pooling approach, where the model has such a wide prior that it effectively forgets about every unit in the data, every tadpole tank, when it moves to the next one: the amnesia kind of model. And what I tried to show you is that partial pooling is a really good intermediate solution because it produces better predictions. It seems to automatically locate the values of the sigma parameter, the standard deviation, that perform well in cross-validation. And that's why we use multi-level models, these models that learn the shape of the prior from the sample: because they make better estimates, for better predictions, and better predictions of causal effects as well. These are great tools, great technologies, great glass-blowing devices, but they bring with them practical difficulties, and I want to spend this lecture talking about those practical difficulties. I'm not really introducing you to new model types in this lecture, so you can relax a bit. Instead, we're going to practice what we met last week, and we'll practice it in ways that make it safer and easier to understand. So I want to argue that, in general, varying effects models are a really good default kind of regression.
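The shrinkage idea behind partial pooling can be sketched numerically. This is a toy Gaussian version, not the binomial tadpole model from the lecture: each tank's estimate is a precision-weighted average of its own mean and the grand mean, and the population sigma controls the weighting. All the numbers below are made up for illustration.

```python
import numpy as np

def pooled_estimate(tank_mean, grand_mean, n_obs, sigma2_within, sigma2_tanks):
    """Precision-weighted shrinkage of one tank's mean toward the grand mean.

    sigma2_tanks plays the role of sigma^2 in the lecture: near zero it
    forces complete pooling; very large values give (almost) no pooling.
    """
    w = (n_obs / sigma2_within) / (n_obs / sigma2_within + 1.0 / sigma2_tanks)
    return w * tank_mean + (1 - w) * grand_mean

# One tank whose raw mean sits far from the grand mean:
tank_mean, grand_mean = 2.0, 0.0
near_complete = pooled_estimate(tank_mean, grand_mean, 10, 1.0, 1e-6)  # tiny sigma
partial       = pooled_estimate(tank_mean, grand_mean, 10, 1.0, 0.5)   # intermediate
near_none     = pooled_estimate(tank_mean, grand_mean, 10, 1.0, 1e6)   # huge sigma
```

With a tiny population sigma the tank estimate collapses to the grand mean; with a huge one it stays at the raw tank mean; in between it is pulled partway, which is exactly the regularization the cross-validation exercise rewards.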
There are many, many more research circumstances in which they're appropriate than circumstances in which they are not. And that's because we like regularized estimates. However, there are a lot of puzzling things that most people need to be shown, and most people need some time to practice, before they can really responsibly and professionally use multi-level models. The first of these is: how do we use more than one cluster type at the same time? A cluster type is something like a department or a tadpole tank. In the trolley problem data from the start of last week, we had two cluster types, stories and participants, and we'd like them both in the model. How do we do this? The second is how we calculate predictions. Calculating predictions gets more complicated with varying effects models because there's more than one model in the model. There's the model of the observations and there's the model of the groups, and when we generate predictions, we have to decide how we use the separate models, because we don't always use them both. And third, there are a number of technical issues with multi-level models and sampling parameter values from their posterior distributions, which are new, and I want to introduce you to those today as well, and introduce you simultaneously to solutions to all of these things. Okay, a little bit of terminology. We're going to talk about clusters when we talk about multi-level models, and we're going to talk about features. Clusters, I'll also sometimes say units, are kinds of discrete groupings in the data. Features, in contrast, are aspects of the model that vary by cluster, by group. These aspects are often parameters, but sometimes they're not. We've already seen examples of these things, and you can map clusters to their features. In the tadpole example, we had tanks, and the features of the tanks that varied were their survival rates.
There were stories in the trolley problem example, and the features that varied are the treatment effects, which can vary by story. We have individuals in a number of examples, whether it's the graduate school admissions example or the trolley problem example; there are individual applicants and participants, and these individuals vary in properties that affect what they do in the model, their average responses. And then departments can vary in admission rate and also in their bias. So there can be multiple features for each cluster that they vary in. Today, we're going to be focusing on clusters and not features: how to add clusters to models, and then how to manage the additional complexity that arises from that. What this means is adding more index variables, because every cluster type needs its own index variable, and also adding more population priors. I'm going to show you how to do that in this lecture. In a future lecture, I'm going to show you how to add features, and this means adding more parameters that pertain to each cluster. Another way to think about this is that you get more dimensions in each population prior. We're not going to get more population priors; there's only one population prior per cluster type. Instead, each population prior grows in dimension. That may not be clear right now. It doesn't need to be; you can wait until the next lecture. Okay, let me introduce an empirical example to motivate this. It's always more interesting that way. Here's an experiment that I discuss in multiple chapters in the book. It's introduced in the generalized linear model chapter and then it comes up again a lot in later chapters, so if you're following along in the book, you already know this example. I'm going to explain it as if you don't. What you're looking at here is a camera's-eye view of an experiment with chimpanzees, and we have our protagonist behind this glass window over here, who's the focal actor in the experiment.
And that glass window is a barrier, but there are two holes cut into it, one on the left and one on the right. The chimpanzee can't crawl through them, chimpanzees are big, but he or she can put a hand through a hole to reach some levers. There's a lever on the left and a lever on the right. These are captive chimpanzees who are used to doing research, and they know all about levers; they're trained on this task. In this particular task, there are some trays on the table, and some of these trays have food in them. You can see the tray on the left there has a food object, a little yellow food object, and the tray on the right is empty. And the chimpanzees can see this; they're extremely clever. The chimpanzees know that when they pull a lever on a particular side, it will cause the accordion-like device in the middle of the table, you can see it at the bottom of this picture, to extend, and it'll push the tray closer to the window, and then the chimpanzee can grab it and get the food item. So what you're looking at here is the control or training setup, just to train the chimpanzee on how the device works, and the chimpanzees make no mistakes in pulling the right lever to get the food. They learn this extremely fast. Now here's the real experiment. On the other side of the table, now we're looking at the camera's-eye view in the other direction, there's another window, but there are no levers. In some of the conditions there's another chimpanzee there, or the window can be empty. Let's call this the partner. And again, there's food laid out in some of the dishes, and in this particular experiment, what's going to happen in every case is that one side of the table, at random, will have two food items and the other will have only one, and in particular, only one for the partner.
So now the question is: if the actor pulls the left lever, and we're going to call this the pro-social option because there are two food items on the left in this example, then both the partner and the actor get food. However, if the actor pulls the right lever, then only the actor gets food. And so the question is, which chimpanzees will tend to pull the pro-social option? It's pro-social because it costs them nothing; the actor always gets food in this experiment. It's just that if they pull the left lever here, the partner also gets food. I should say that human kids nearly always pull the pro-social option in an experiment like this, from a very early age. So we're interested in whether chimpanzees behave as pro-socially as human children do in a very similar kind of task. Here's the data set we're going to use. It's in the rethinking package, data(chimpanzees). This is 504 trials of the sort I just showed you, with seven different actors, that is, seven different individual chimpanzees, and six different experimental blocks, which you can think of as days on which trials were run. We care about the blocks because experimental blocks often induce biases, and treatment effects can vary by block. So treatments are randomized within blocks, and there are four different treatments. First, the pro-social option can be on the right side of the table from the actor's perspective, with no partner at the other end. Second, the pro-social option is on the left, and there is still no partner. And then both right and left again, but with the partner. We're interested in all four of these treatments because it's not enough that the chimpanzee pulls the pro-social option; the chimpanzee could be attracted to the fact that there's more food on that side of the table, even though they understand they're only getting one piece of food.
What we want to see is whether they pull the pro-social option more often when the partner is on the other side, which would be consistent at least with the idea that they understand they're helping the partner at no cost to themselves. Let's draw a DAG. This is an easy DAG because this is a controlled experiment: the treatments are assigned and randomized and counterbalanced and all those good things. The outcome, P in the middle of this DAG, is just pulled-left. Then there are the treatments, which are combinations of condition and side, the four treatments I've listed on the left of the slide, and which influence, hypothetically, whether a chimpanzee pulls the left-hand option. So it's coded this way. This isn't the only way to code the experiment, but I've coded it this way because it'll turn out to be quite revealing. You can think of it as: we want to see if, when the pro-social option is on the left, the probability that the left lever is pulled is higher, and then also in combination with whether the partner is present. The second influence is the block. There may be unmeasured block effects, which moderate the treatments. And then there are the actors, and there are features of the actors, in particular their handedness. Chimpanzees have a preferred hand, right or left, just like humans do, and so you need to take this into account as well. This is another Bernoulli model, because the outcome is a 0/1 variable, whether they pulled left. So it's a familiar model structure for you by now, I hope, and we use the conventional logit link. For the linear model here, I'm going to use an interaction between treatment and block, and then add to that an actor effect. Let me explain why I'm doing it this way. We want both block and actor variation, variation across them, to influence the results.
We could just add block as its own effect, just like actor here, but that isn't wholly satisfying, because the idea is that the blocks moderate the treatment effects in some way, and that means we want to look at interactions between them. So I've created a matrix, beta, in which the rows are the treatments, 1, 2, 3, 4, and the columns are the blocks. For each treatment, we get an estimate for each block, and we can look at the variation there. If the treatments don't vary across blocks, that's good, and if the treatments do vary across blocks, that tells us something about how the treatment effects are working. And then the actor effects, these alpha parameters, are going to encode handedness on the log-odds scale; you'll see. Okay, let's complete the model with some priors. It'll be good to label all the pieces of this model, so let's start with the first two familiar pieces and add some labels on the right of the slide. The top line is just the probability of the left lever being pulled: it's the probability of the observable variable P, the probability of the data. The second line is the definition of the log odds of the left lever being pulled, just our generalized linear model. Then we need the population prior, the prior for the actor effects, for the alphas. This is structurally identical to the tadpole example that we did before. Each alpha j, where j is the ID number of some actor in the experiment, has a normal distribution with a mean alpha-bar, which is kind of the mean handedness, if you will, in the chimpanzee population, and some standard deviation, sigma sub A. I've written sub A here because we're going to have more than one sigma in this model, so just hang on. This is a prior for the distribution of handedness in the population, and these are the actor effects. Then we have an analogous prior for the treatment-block effects. I'm assigning this prior to every element of the matrix beta.
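The indexing in that linear model can be sketched like this. Every parameter value below is a made-up placeholder, just to show how the treatment-by-block matrix and the actor vector combine on the log-odds scale (a Python sketch; the lecture's own code is R):

```python
import numpy as np

def inv_logit(x):
    """Map log-odds to probability."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameter values, purely for illustration:
alpha = np.array([-1.0, 4.0, -1.2, -0.5, -0.3, 1.0, 1.5])  # 7 actors, log-odds of pulling left
beta = np.zeros((4, 6))   # 4 treatments x 6 blocks, treatment-block effects
beta[1, 2] = 0.4          # e.g. treatment 2 in block 3 (0-indexed here)

# One trial: actor 2, treatment 2, block 3 (indices 1, 1, 2 in 0-indexed Python)
p_left = inv_logit(beta[1, 2] + alpha[1])
```

A strongly left-handed actor (a large positive alpha) dominates the small treatment-block effect here, which previews what the fitted model will show.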
So for a cell in beta, j comma k, that is a treatment j and a block k, it has a normal prior with mean zero and a standard deviation sigma sub B, for block. Why is the mean zero here? Well, you could put a beta-bar in there if you want, but that would be redundant, because you can only have one mean in the logit model up there; it would be overparameterizing the model. That wouldn't blow up the model, the model would still work, it would just sample less efficiently. I encourage you to try that experiment if you don't believe me, but there's no need to do it. In a sense, the zero mean is defining the treatment effects as relative to the handedness, or you could think of it the opposite way if you like; it's the same additive model. But you only need one intercept, one alpha-bar, per generalized linear model. And then we need a prior on the sigmas, and I'm going to give them both the same prior distribution, this exponential with a rate of one. These sigmas are often called variance components. You'll hear that a lot and see it in a lot of books, and we'll talk a little bit about that later in this lecture. Okay, one of the things I've snuck in here, which some of you will find odd, is that I'm doing partial pooling on treatment effects, which are almost always treated as fixed effects, a term I haven't used; I've avoided using it. A fixed effect could mean different things, but conventionally, in biology for example, it means an effect that has no partial pooling; that is, it has a fixed prior without any parameters in it itself. But I'm using partial pooling here, so let me spend just a moment justifying that, because it may seem unconventional, and sometimes you'll even read that this is a bad choice. I really disagree; it's very reasonable. Treatments have all the same properties as other things that we do partial pooling on. First of all, the treatments are not completely different from one another.
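Before the justification continues, it may help to collect the pieces of the model just described in one place. The prior on alpha-bar is not stated explicitly above; Normal(0, 1.5) is a typical logit-scale choice, so treat that line as my assumption:

```latex
\begin{aligned}
P_i &\sim \mathrm{Bernoulli}(p_i) \\
\mathrm{logit}(p_i) &= \beta_{T_i,\,B_i} + \alpha_{A_i} \\
\alpha_j &\sim \mathrm{Normal}(\bar{\alpha},\ \sigma_A) && \text{actor (handedness) effects} \\
\beta_{j,k} &\sim \mathrm{Normal}(0,\ \sigma_B) && \text{treatment-by-block effects} \\
\bar{\alpha} &\sim \mathrm{Normal}(0,\ 1.5) && \text{(assumed typical prior)} \\
\sigma_A,\ \sigma_B &\sim \mathrm{Exponential}(1)
\end{aligned}
```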
If you encountered the data from this experiment in order of the treatments, if I sorted the data set so that you saw all the data for treatment one first, then treatment two, then treatment three, then treatment four, and you fit the model only to treatment one first, your estimate for treatment one would give you an expectation about treatment two. Of course it would, because the treatments don't completely alter the chimpanzee behavior; they only influence it. And so if you learn the effect of one treatment, you'll get an idea of the size of the effects that are possible in other treatments. The second intuition I can provide is that there are many possible treatments, and any particular experiment only uses a few of them. If you're doing psychology, think of it this way: there are many possible stimuli, and you've probably only selected a few, without examining the effectiveness of lots of other stimuli. So there's a population of stimuli you could use. In this experiment, there's a population of different presentations of the task, and all we're learning about is how chimpanzees respond to this particular one. And so treatments vary due to features that we're not modeling, and again, that's like the other kinds of things we've done partial pooling on: you can conceive of a population from which we have only sampled some number of objects. And of course, the best reason to use partial pooling, as always, is because it works: it results in better estimates, because it regularizes, because it uses the data to choose the prior, to learn the prior.
And you have to choose a prior regardless; there's nothing magical about using partial pooling, because you always have to choose a prior. As I showed you with the cross-validation example at the start of this lecture, repeated from last week's lecture, you could choose a fixed prior that gives you the same regularization properties as the partial pooling kind of model, the varying effects model, but you'd have to be impossibly clever, or do the cross-validation exercise. Better just to use the partial pooling estimator and get better estimates. And this works for treatments as well as it works for things like tanks that are not experimentally controlled. So let me try to summarize it this way, at the bottom of the slide. If you have a set of parameters in a model that are assigned the same prior, whether or not they're treatments, you don't want to choose a prior that biases the treatments to be different from the beginning, so we often assign them the same prior, and it's usually going to be better, not always, but usually, to learn the prior from the sample. That's because you'll get better regularization: you'll get better estimates, less overfitting, and you'll avoid underfitting as well. Okay, that's my sermon on modeling treatments as varying effects. Here's the model, and the ulam code here is very similar to the mathematical model definition. I don't think there are any new tricks in this model notation. Again, you'll see the matrix form for beta, for B there; I've done this before in previous examples. The code that defines the treatment variable may be a bit odd, but if you play around with it, you can confirm for yourself that it defines the treatments 1, 2, 3, 4 as I defined them on the previous slides. Okay, you run this model, and it contains a lot of parameters. I'm just showing you the precis table on the right; you really don't want to stare too hard at this, it'll stare back.
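That treatment coding can be sketched as follows. In the chimpanzees data the relevant 0/1 variables are prosoc_left (pro-social option on the left) and condition (partner present), and, as I recall the convention from the book's code (so verify against your own copy), the index is built like this:

```python
def treatment_index(prosoc_left, condition):
    """1 = right/no partner, 2 = left/no partner,
    3 = right/partner, 4 = left/partner."""
    return 1 + prosoc_left + 2 * condition

# All four combinations, in the order: no-partner pair, then partner pair
codes = [treatment_index(pl, c) for c in (0, 1) for pl in (0, 1)]
```

That yields the treatments 1, 2, 3, 4 in exactly the order described on the slides: RN, LN, RP, LP.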
For the moment, I do want you to notice that some of the parameters don't sample very efficiently. In particular, at the very bottom, sigma B has a quite low effective sample size, and its Rhat is 1.02. If you look at the trank plot, you could look at the trace plot too, but I think the trank plot is more revealing in this case, you'll see that it's uneven, that is, the ranks are much more widely dispersed on the left than on the right, and this is a tell-tale sign that the chains are not exploring the posterior distribution very efficiently. I'm going to summarize the effects from this model in subsequent slides. Later on in this lecture, I fix this model code so that it samples more effectively, and the results I'm about to show you are from that fixed version of the model. I'm sorry to leave you in suspense about how to do that, but I think it's better that we summarize what we learn from this model first and then get to the technical issues. So what I'm showing you here, on the left, are the posterior distributions of the actor effects. Remember, there are seven chimpanzees in this experiment, 504 trials overall, and I have them arrayed on the horizontal axis on the left. The vertical axis is the probability each of them pulls the left lever. These are the alpha estimates for each individual, and the intervals are 89% posterior intervals. You'll see there's a lot of variation. Below the dashed line in the middle are the right-handed individuals; I don't know if they're actually right-handed, but they behave as if they're slightly right-handed, they prefer the right lever. And then we have our three lefties, especially number two. Number two actually always pulled the left lever, in every trial that number two participated in, and so the model is quite confident of the handedness preference of actor number two. Six and seven also prefer the left lever. The variation in handedness here appears to drive the results; that's why there's all this residual variation picked up by the varying effects on actor, the alpha parameters. On the right, you see the treatment effects split up by block. Along the horizontal axis I have the four treatments: RN means pro-social option on the right with no partner, LN means left and no partner, RP means right and partner, and LP means left and partner. The dots are the posterior means across blocks, and there are six blocks in the experiment, so there are six points for each treatment. You'll see that there's not a lot of variation; there's some variation across blocks, but not a lot. And the treatments don't deviate from the zero line in the middle very much, meaning the treatments don't have much of an effect in this experiment. The chimpanzees do slightly tend to pull left more when the pro-social option is on the left: you'll see that the two treatments that start with L, the second and the fourth, are a little bit higher on average than the other two, but it's not much. Handedness is driving nearly all of the variation in this experiment. Let's look at the variance components, as I called them before. We can look at the posterior distributions of the sigma parameters, and I've put those at the bottom of this slide. On the left, in blue, is the actor variation, and there's a lot of it. Remember, these sigmas are on the log-odds scale, just like the other parameters, and so two is a lot of variation on the log-odds scale. Remember, when we assigned a prior to an intercept in a binomial or logistic model, a standard deviation of 1.5 would cover the whole range from probability zero to one. So actors cover the whole range just with their handedness preferences here. And on the right, in red at the bottom, you see the analogous posterior distribution for treatment-and-block variation. You'll see it's much smaller. So this is the way the model picks this up and represents it: in the shape of the population of these effects.
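You can check this intuition about scale directly: draw log-odds effects with a standard deviation around 2, like the actor variation, and with a much smaller one, like the block variation, then push them through the inverse logit. The sigmas below are made-up values matching the rough magnitudes just described:

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

# sigma ~ 2: actor-sized variation on the log-odds scale
actors = inv_logit(rng.normal(0.0, 2.0, 10_000))
# sigma ~ 0.2: much smaller treatment/block-sized variation
blocks = inv_logit(rng.normal(0.0, 0.2, 10_000))

spread_actors = actors.max() - actors.min()   # spans nearly the full 0..1 range
spread_blocks = blocks.max() - blocks.min()   # stays clustered near 0.5
```

A standard deviation of 2 on the log-odds scale covers essentially the whole probability scale, which is why the actor variation can dominate the treatment effects.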
Something to note about these variance components is that they don't add together simply to give you the total variation in the data. That's only true for linear models. In simple linear regression models, the variance components are additive, but linear regressions are the only regression models where that works. In general, the variance components inside a GLM aren't simply additive, although of course they are on the latent scale. But it's the outcome scale that matters, and the total variation in the outcome variable is not simply a sum of the variance components, because it's transformed by the link function, which squishes parts of the distribution on the latent scale; there are ceiling and floor effects. The consequence of this, and what you want to keep in mind, is that variation in one component, like treatment or block, will moderate variation in the others, like handedness, and vice versa. This is just like any GLM, and this is not the first time I've mentioned it: in a generalized linear model, all of the predictors moderate all the others, even if you don't explicitly model them as interactions, and that's because of ceiling and floor effects. As I've said, you can only kill the salamander once. If any one thing is dangerous enough to kill a salamander, it doesn't matter how many other dangerous things there are; it only takes one of them. And in that sense, the lethality of a predator depends upon things like ambient temperature. Okay, all of this also means that when we compute predictions from varying effects models, we have to do all the same stuff that we dealt with before, and that is marginalize over some target population, some sample that we have in mind for the causal effects. This is because of these moderation effects, as I've mentioned before. But that's okay, we know how to do this, we just use simulations. With varying effects, though, you have to first make a choice about how you're going to structure your simulation, and that's because there are two models inside the model. So imagine, for example, that you were going to make predictions of some intervention for the same groups that were in your sample, for the same chimpanzees, or the same tadpole tanks, or the same UC Berkeley departments. In that case, you could use the posterior distributions of the varying effects estimates themselves, the ones that correspond to each unit in your data, because you're imagining intervening on the same sample, the same population. But often that isn't what we have in mind, especially in the case of an experiment, where we want to generalize beyond the units in the experiment itself. That's nearly always the case, and then what we want to do instead is use the posterior distribution of the population of groups that we've estimated. That means the sigmas become much more important, and the particular alpha j's we've estimated are discarded; we do not use them. We ignore the varying effects estimates when we predict for new groups, and that includes predicting interventions or causal effects, and we do that to marginalize over the population distribution. Now, a word of caution about this: we've assumed that the population distribution is normal, is Gaussian, and I've said before that this is a prior distribution, it isn't a claim about the actual shape of the effects in nature. That's still true. So when you do these simulations, you might want to reconsider that again and think about what you believe the shape of the distribution is, and whether you're happy with it being Gaussian in these simulations, because it may not be symmetrical. You may learn, after you fit the model to your sample, that the variation around the population mean is not symmetrical, or that it has thick tails, and in those cases it would be a very good idea to revisit the Gaussian assumption on the varying effects before you simulate interventions. Okay, let me give you an example, and then we'll take a break. Here's an example of simulating
intervention for new groups of reed frogs and I hope you care about these reed frogs they're adorable and we want them to survive so let's imagine an intervention we use the data from the experiment and we train a model on it I show the model on the right and this is a model that considers both the variation across tanks those are the A parameters in the varying effect model on the right this is just like the previous lecture and then I have a B matrix a beta matrix there that considers all the interactions between the presence of predators P and the size of the tadpoles G smaller large so there's four parameters in the B matrix and then we estimate we train the model on the sample and now I'm going to compute an intervention for you imagining a target population that's in which 50% of the groups are exposed to predators and 25% of the groups have large tadpoles and we're going to imagine an intervention where on this we go to this target population and we say nutritionally we're going to augment some of the groups such that the result is that 75% of the groups are large now not just 25% what is the causal effect of this according to the experiment what's the distribution of expected causal effects across groups and we're going to do this through simulation here's the simulation code I'm going to step through this for you so you understand how to draw the owl remember the owl and we want to draw it the very top of the code is just extracting the samples from the chains putting in an object post then I define how many groups I'm going to simulate 1,000 and there are 2,000 samples total from all the chains and I'm going to use them all then I make an empty matrix S1 which is just going to hold all these groups and all the simulations for each sample then I loop over the groups S and the first thing I do inside the loop is sample from the population distribution of tanks and that's what I'm doing there so you see it's our norm and I'm using the posterior distribution 
of sigma in this line to simulate this and then next I sample a value of the predator presence 1 or 2 and the size of the individuals G and I'm using the probabilities 0.5, 0.5 for predators because that's what the example assumes 50% predation at the group level and then it's a 25% chance that the group is large and then finally we just use our GLM we do the inverse logic on these parameters and data that we've simulated and then we generate a binomial result random binomial result for each of the probabilities that we've simulated for a group of size 35 and they don't have to be large groups like this, they could be smaller but I want to show you that the density will affect how many survive of course even though I haven't modeled the effects of density in this it's going to affect the counts and that will affect the variation larger groups can have more variation to more total variation on the outcome scale so for this which I call the status quo simulation because we're simulating what the target population is supposed to look like now the model expects this distribution of survivorship on the right all the way from zero there'll be a very small number of groups where almost all the tadpoles die but in most case most of them survive but there's a very wide range of results we do a very similar simulation for the intervention and now what we change is the distribution of sizes it's just the G line that changes here and you can see that we made them bigger now we flipped the 75 and the 25 in that line and then we get a different frequency distribution of results you'll see where there's this second mode on the left there peaks around 15 and that's a result of the fact that large groups, large tadpoles suffer more from predation than small ones do according to the model and if we look at the difference between these things we want to take the difference because different groups experience different results of the intervention and so we want to do this comparison 
group by group, and look at the resulting distribution of differences after matching the groups. That's what we've done here, and you can see that on average the mean is a little bit below zero, so the average causal effect here is quite small: it just slightly hurts the population to make them bigger. But you'll see that the distribution is not symmetric, and so there's some risk involved here that we may also want to consider, because the variance goes up. Okay. Using that kind of simulation technique, you can simulate, well, anything you want from any model, as long as you understand the generative model. And the group variation in these things, the varying effects, of course moderates the causal effects. One of the things that's happening here is that there's a lot of group variation according to the model, and that pushes the log odds of survival up against the ceiling quite a lot. It creates additional variation, which makes the other causes, like predation or tadpole size, matter less. That's what I mean by moderation. When you average over the group variation by simulating it, you can account for that effect in your predicted interventions. So you have to know the generative model, and that's one of the reasons that, from the beginning of this course, I've emphasized that when we draw the owl we think about the generative approach first, whether it's just a heuristic thing like a DAG or something more detailed like a full synthetic-data simulation, like we did a lot in the first half of the course. We need those same techniques to simulate the causal effects that come from complicated models like this. Okay, let's take a break, you've earned it. When you come back, we're going to talk about some more technical issues with these varying effects models. But before you do that, I really encourage you to look over the preceding slides, take some notes, and make sure you're ready to continue. And when you are, I will be
here. Welcome back. Let's pick up with some interesting technical issues with varying effects models. What I'm showing you on the screen here is a simple joint distribution between two priors; there's no data involved. One variable, v, has a normal distribution with a mean of zero and a standard deviation of one half, and another, x, has a normal distribution with mean zero and a standard deviation equal to the exponentiated value of v. And the Hamiltonian Monte Carlo simulation on the left is running around the skate park, sampling from this distribution quite efficiently. What I want you to see is that the shape of x creates this kind of funnel at the bottom. That is, for small values of v, x is a very narrow normal distribution, and that's why the distribution contracts into a kind of funnel towards the bottom. But this doesn't cause a problem for Hamiltonian Monte Carlo; it handles the curvature just fine. But what happens when we make the standard deviation of v larger? Something very unfortunate happens: the funnel gets very, very deep and very narrow. So I'm showing you in this animation, as we increase the standard deviation of the variable v, that is the vertical-axis variable, that the high-density region of this joint distribution gets sucked down into the funnel. This is an example that in the book I call the devil's funnel. It's an example I learned from Radford Neal, and it's a simple example, perhaps the simplest, that illustrates one of the challenges in sampling from posterior distributions of varying effects models. Why? Because varying effects models have joint distributions that look like the one on this slide. That is, they have distributions which are conditional on other distributions, and in particular their scale variables, their standard deviations, are parameters which depend on other distributions. And that creates this sucking effect that you see on the slide, if those scale variables get large enough.
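This funnel-shaped joint prior is easy to simulate directly, which makes its geometry easy to inspect. Here's a small Python sketch (a translation for illustration, not the lecture's R code) showing how the spread of x collapses as v moves down the funnel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from the funnel prior:
#   v ~ Normal(0, 3),  x ~ Normal(0, exp(v))
# Direct simulation is easy; it is HMC that struggles with this shape.
v = rng.normal(0.0, 3.0, size=100_000)
x = rng.normal(0.0, np.exp(v))  # scale varies sample-by-sample

# The funnel: the spread of x depends dramatically on v.
wide = np.std(x[v > 2.0])      # upper region: very wide
narrow = np.std(x[v < -2.0])   # the neck: extremely narrow
print(wide, narrow)            # wide is orders of magnitude larger
```

Direct simulation works here only because this toy distribution has no data attached; with a real posterior we have no such shortcut, which is why the sampler has to cope with the funnel itself.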
So now let's run HMC, Hamiltonian Monte Carlo, on this version of the funnel, and you'll see some weird paths arise. Each of these red dots is a divergent transition, that is, a rejected proposal from the Markov chain. And to remind you, rejected proposals don't mean that anything is necessarily broken. It's just that the whole reason we're using Hamiltonian Monte Carlo is that we want all the proposals, or as many as possible, to be accepted. The problem with other versions of Markov chain Monte Carlo is that they make very inefficient proposals; many or most of them are rejected, and that wastes a lot of computing time. Hamiltonian Monte Carlo thinks much harder about which proposals to offer, it runs a physics simulation, but that allows it to make good proposals. But here's a very simple joint distribution where Hamiltonian Monte Carlo has a lot of trouble. You'll see, as it wanders down into the funnel, and it will, because that's where the probability mass is, it has a lot of trouble sampling, and it ends up breaking. What's going on here? Well, the basic problem is that the distribution is very curved in that area, very narrow and curved, and that's what generates these divergent transitions, which are the result of broken simulations, where the approximation we're using for the curvature is not accurate enough. The algorithm luckily detects that, but this makes proposals inefficient. Hamiltonian Monte Carlo, you'll recall, uses a bunch of piecewise linear approximations of the actual curved posterior distribution, and the length of those linear segments is called the step size. The step size is constant, and if the curvature varies a lot in different regions of the posterior distribution, as it does in the example here, it's very narrow in that funnel at the bottom and very flat, with low curvature, in the upper part of the plot, then the same step size is not going to be optimal everywhere, or even good everywhere. When the simulation gets into a region of very high curvature, like the funnel, it simply cannot follow the surface; it can't turn fast enough. So there are a couple of things we can do. The first is to use a smaller step size, which creates a more accurate simulation, and the second is to reparameterize. I want to talk about each of these. Setting a smaller step size is something that an engine like Stan is going to do for you automatically. It's going to try to figure out the best step size for the posterior distribution during the warm-up phase, and so if it ends up selecting a small step size, it will be able to explore narrow funnels like the one here. You'll see that we're not getting nearly as many divergent transitions. The unfortunate side effect, though, is that it's going to explore the posterior distribution much more slowly. There's no free lunch; there's always a trade-off. Smaller step sizes are inefficient in their own way, even though they avoid the divergent transitions. So is there something else we can do, or in addition, I should say, because you can do both of these things together? The answer is, luckily, yes. In many cases there's a clever trick which is much, much better than simply setting a very small step size, and that is to reparameterize. So let me explain what that means. Here's the joint distribution that I've used in the preceding slides for the Hamiltonian animations, just two variables: the first with a normal distribution with a standard deviation of three, and the second with a normal distribution whose standard deviation is the exponentiated value of the first. This is called a centered parameterization, because one or more of the distributional statements has a parameter inside of it: the distribution for x is a normal distribution, and it's a function of v. In many cases we can take the parameters out of those distributions and specify an equivalent joint distribution, but it's going to look different, and this is called the non-centered version
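The centered and non-centered forms define the same joint distribution, and you can check that numerically. Here is a small Python sketch (again a translation for illustration, not the lecture's R code) comparing the conditional spread of x near a fixed value of v under both parameterizations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Centered form:  v ~ Normal(0, 3),  x ~ Normal(0, exp(v))
v = rng.normal(0.0, 3.0, size=n)
x_centered = rng.normal(0.0, np.exp(v))

# Non-centered form:  z ~ Normal(0, 1),  x = z * exp(v)
z = rng.normal(0.0, 1.0, size=n)
x_noncentered = z * np.exp(v)

# Same joint distribution, different parameterization: compare the
# conditional spread of x in a thin slice of v values near v = 1.
sl = (v > 0.9) & (v < 1.1)
s1 = np.std(x_centered[sl])
s2 = np.std(x_noncentered[sl])
print(s1, s2)  # both close to exp(1), about 2.7
```

The sampler, though, sees very different geometry in the two cases: in the second it only ever has to explore the joint distribution of v and z, which is a well-behaved Gaussian, with x reconstructed deterministically afterwards.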
and that's what I'm showing you on the right. You see what I've done: I've created a new variable, z. This is a new random variable, and it has a normal distribution with mean zero and standard deviation one; it's just a z-score distribution, a standardized normal. And now I've defined x not as a random variable but as a deterministic function of the values of z and v, that is, z times the exponentiated value of v. And this is equivalent: the distribution of x as defined on the left and the distribution of x as defined on the right are identical, even though they look quite different. We can do this because we can always divide a normal distribution by its standard deviation, making it standardized, and then put it back on the non-standardized scale by multiplying by the original standard deviation. That's all we've done here: z-scores of x, and then I put x back on its natural scale by multiplying those z-scores by the standard deviation, which is the exponentiated value of v. So these are equivalent, but it turns out in this case that the distribution on the right, even though it's the same distribution, is going to sample much more efficiently, because we are going to draw values from the joint distribution of v and z, rather than the joint distribution of v and x, and that's going to be much nicer, because it's going to be a Gaussian bowl, and the Gaussian bowl is the best place to skate. So here I'm showing you, for two chains simultaneously, just to show off a bit, Hamiltonian Monte Carlo sampling from a two-dimensional Gaussian bowl. This is what it was born to do, this is its vibe. It's very, very good at this: a very low number of rejections, almost every proposal accepted, high efficiency, good exploration of the space. We get this through the reparameterization trick: moving from a centered version, where some variables' distributions are functions of other variables, to factoring all of the parameters out of the distributions and making some formulas to re-establish the
original distribution, and that's called the non-centering trick. Let me show you this trick in the context of running actual Stan chains; that might help you understand it a little better. I'm just taking this funnel example, remember, I called this the devil's funnel in the textbook, and expressing it as a Stan model. This is in chapter 13: we have model 13.7 and 13.7nc, for non-centered. These are the simplest Stan models you could write; they're just prior distributions. There's no data here, but that's fine, we can use Stan to sample from priors. You can run these two models; they're not going to take long. What I want you to see about the second one, though, is that little gq line, the third line in the formula list of the non-centered version, where I'm reconstructing x through the deterministic function of z times the standard deviation, the exponentiated value of v. If you run these models, the first one gives you a number of divergent transitions. I got 112 out of 2,000 on this particular run; you should try it for yourself, and that number will vary depending upon where it explores. You can see from the diagnostics in the precis output that these chains are not in good shape. They're not returning reliable estimates; they're not exploring the posterior distribution efficiently. If we ran them a very long time, we would probably get a good picture of it, but there's a better trick, and that is simply to use the other model, the non-centered version, which you can see from the precis output on the bottom half of the slide is very efficient: the diagnostics are good, and we're getting the right answer. If you look at the trace plots, you can see the problem. This is the trace plot for the centered version of the devil's funnel. These are not healthy trace plots; I'm sure you can see it at a glance. They're not exploring the same space. And then here's the trace plot for the non-centered version. These are the hairy caterpillars you want. And I'm also showing
you the trace for the reconstructed x, and you'll see those spikes. That's fine; that's what the distribution of x looks like, with those spikes, because of the funnel. You'll see that the effective sample sizes, the n_eff values, are good. Okay, that's a made-up example, though; it's not a real data analysis. The devil's funnel example is there as a little memory nugget, if I can invent a term, to help you see the quintessential form of the reparameterization of a varying effect. Now let's take that quintessential form and stick it in a real data analysis example. I'm going to do that now for the reed frogs that we met in a previous lecture. To remind you, we had fit varying effects across the tanks of tadpoles; I repeat the model on the slide here. These alpha parameters, alpha sub tank, or alpha sub j as it's expressed in the prior, are the log odds of survival in each tank. And this is a centered version of the model. I didn't call it that at the time, but it is, and you can see that because there are parameters, alpha bar and sigma, inside the prior for alpha j. That's what makes it centered: it's got parameters that determine its location and its shape. Here's the equivalent non-centered version of this model. These models are completely equivalent from a mathematical perspective, but once you run them in Hamiltonian Monte Carlo they are not, because they imply different posterior geometries; you need to skate around different skate parks with different shapes. And that's the non-centering trick that I showed previously. So what I've done, and I'll walk you through it piece by piece: first we replace the alpha sub t's in the logit line with this weird term, which is the mean, alpha bar, plus the z-score, z sub tank, there's now a z-score that's specific to each tank, times the standard deviation. We do this to get back to the original alpha j's; that's just the definition of alpha j: it's the mean plus the z-score
times the standard deviation. And then notice I've added z sub j is normal(0, 1) to the model on the right; that's the distribution we need, to sample the z-scores, and it has nice curvature because it doesn't have sigma inside of it. Okay, that's not the example we want to work with, though. Let's revisit the chimpanzees from the first half of this lecture and think about that model again. That model was centered; I repeat it here. The non-centered version, analogously, now has two sets of z-scores, two distributions of z-scores, and we're going to use them both. This is a complicated model, but it's the same as the first. Let me just show you the corresponding pieces of the two models. The handedness parameters for each chimpanzee, the alphas, appear on the right, in the non-centered model, again as this sum of a mean, alpha bar, plus the z-scores times the standard deviation that corresponds to that cluster type, that is, the individual chimpanzees. And I've also added, of course, the standard normal(0, 1) prior for the z sub alpha-comma-j's. And then the treatment-block effects, remember I had done an interaction of treatment with block to look for block effects, I've replaced in the non-centered version, again, with this product of the z-scores from the matrix, now we have a matrix of z-scores for treatment and block, times the standard deviation of those values, which we're going to learn from the data. We could have had a fixed prior here, but we do partial pooling on the treatments instead, because we'd rather learn the prior from the data than choose it arbitrarily. These models are equivalent, but they run quite differently in the machine. Here's the ulam code for both of these models. The model at the top, on the left, is the code you saw in the first half of the lecture; the model on the bottom is the non-centered version, and it's messier, yes it is. You'll see the z-scores in the logit line, and you'll see that
instead of A and B being defined in the priors, we have the z-scores for them defined in the priors. And then at the very bottom of the model I've used this gq trick again, gq stands for generated quantities, to simply compute the alpha and beta values, so that they're saved in the posterior and we can interpret them much more easily later. Who wants to interpret z-scores? You'd have to do all this multiplication after the fact otherwise, which you could always do: you could ignore the gq lines, run the model, and then post-process the posterior distribution, computing the alphas and betas using these definitions. But it's more convenient just to add little gq lines like these at the bottom. Let me walk you through this, say it all again, and highlight the bits in the math model on the right that correspond to the code. We've added a-bar now to the linear model component; it's not in the prior anymore, and it's added to the z-scores for the actors times their standard deviation. And the same for the treatment-block effects: we've got z-scores times their standard deviation. There's no mean, recall, for the treatment and block effects, because we only need one mean, we can only have one mean in a generalized linear model, and alpha bar takes care of that. The gq lines, as I mentioned before, are just a convenience; they're not required. Okay, let's look at those effects. I'm showing you here the trank plots, the trace rank plots, for the new model, and these are much better: much more even, much more switching, and better effective sample sizes for the parameters. On the right I'm showing you a direct comparison. The horizontal axis is the effective sample sizes of parameters for the centered version, and the vertical axis is the effective sample sizes of the same parameters for the non-centered version. Each point is a parameter, and where they're colored blue, they're above the diagonal, which means the
non-centered version has a larger effective sample size. But neither trick always wins: there are some combinations of models, parameters, and data for which centering is better, and others for which non-centering is better. It depends upon whether the prior or the data are more important in determining the shape of the posterior, well, that's one of the things that matters, and that affects whether you get these narrow regions that are hard to sample from. Sometimes you need to switch from non-centered to centered for some cluster types, and from centered to non-centered for others, so you need to be flexible. The lesson here is not that non-centered parameterizations are always better; it's that reparameterizing between the options is a technique that you want. It's a very good glass-blowing technique, to use the metaphor from the start of the lecture. With some practice you can do it very quickly. It looks complicated at first, all that z times sigma stuff, but eventually it becomes second nature, and you'll be doing it all the time with your varying effects models. They'll run much faster, you won't have to wait as long, and the results will be more reliable, and that will make you more comfortable, and your colleagues more comfortable as well. So, to wrap up: technologies like varying effects models are really indispensable now; they've become a standard part of the scientific tool kit. But they take some time to master. You have to be patient, you need instruction, and most of all you need practice. And as you practice, you want to practice safely and work responsibly, and I've tried to teach you these little techniques as part of that mission. So what we covered today is, first, more than one cluster type and how to specify that; second, how to calculate predictions from varying effects models, because there are more choices to make now, due to the fact that
there are multiple models inside the model; and third, how to sample chains efficiently for varying effects models. We get much more complicated shapes of the posterior distribution for these models, as a direct consequence of the fact that the prior distributions have other distributions embedded in them, they are functions of distributions now, and that can result in things like the devil's funnel. But we have techniques for solving this problem, so we can move forward and work on the scientific models we want to work on, rather than let purely statistical considerations decide which models we use. Okay, that was lecture 13, which covers mostly material in chapter 13 of the book. In the next lecture this week, we're going to look at varying slopes, which is the other aspect of expanding multi-level models, where we consider more than one feature of each cluster in the model. I'll see you there.
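As a final concrete sketch of the non-centering trick from this lecture, here is the identity alpha_j = alpha-bar + z_j * sigma checked numerically in Python. The population values used (alpha-bar of 1.5 and sigma of 1.6) are made up for illustration, not the fitted values from the tadpole model:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical population parameters (illustration only, not fitted values):
abar, sigma = 1.5, 1.6
n_tanks = 100_000

# Centered draw:      a_j ~ Normal(abar, sigma)
a_centered = rng.normal(abar, sigma, size=n_tanks)

# Non-centered draw:  z_j ~ Normal(0, 1),  a_j = abar + z_j * sigma
z = rng.normal(0.0, 1.0, size=n_tanks)
a_noncentered = abar + z * sigma

# Same distribution either way:
print(np.mean(a_centered), np.mean(a_noncentered))  # both near 1.5
print(np.std(a_centered), np.std(a_noncentered))    # both near 1.6
```

The sampler only ever touches the standardized z's in the non-centered form, which is what keeps sigma out of the geometry it has to explore.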