Welcome back. This is the second of the lectures that introduce multi-level modeling. Creatively, I've named it multi-level models, too. You can imagine in your head that it also says Electric Boogaloo. Anybody remember that movie? No? Okay. Shame on you. I think I should give you an update on where we are, partly because I had to do this this morning to figure out what we're going to get done. Let me frame this a bit by saying you've learned a tremendous amount. You have. It probably burns, the amount of learning that you have accomplished. That burning sensation means the medicine is working. And you have a tremendous amount of statistical power at your fingertips now. You can fit quite complicated models. And with the introduction of multi-level models, the possibilities are going to open up really rapidly. What I want to do today, in finishing the introduction to multi-level modeling, is teach you some of the more technical things about these models so you can use them more responsibly and make very good use of them. And we're going to need these tools next week when I crack it open even more, and we'll have some more elaborate but routinely useful kinds of models, like varying slopes, where we extend the adaptive prior approach of varying intercepts to many clusters of parameters at the same time, all co-varying with one another at some hyper level in the model. And this will also let us do things like multivariate outcome models, which we'll use to fit instrumental variables. This is a return to causal inference, something more than the backdoor criterion that lets you get causal inferences. Even in cases where you can't close all the backdoors, if everything else is just right and you believe your DAG, you can still get a causal inference using something called an instrumental variable. There's also a frontdoor criterion, which I think I'll mention, but I won't have an example of it, just for reasons of time.
And I want to introduce a basic kind of network model called the social relations model, which is essentially a custom varying effects sort of model, where the covariance matrix is scientifically informed. We'll see what that means when we get there. But this is all about the responsible conduct and the possibilities that open up from understanding varying effects. And then week 10 is the end, and I'll spend the first lecture telling you about Gaussian processes, which are, to my mind, an extension of the varying effects approach to an infinite number of categories. I know that sounds scary, but actually, as some of you know, adding infinities makes things easier in math, not harder. It's the opposite of the real world. Infinity solves problems in mathematics. This will be very, very useful for routine kinds of variables like distances, phylogenetic distances, spatial distances, where there's covariance among units, but they're not categorically discrete; there's some continuous distance that we need to use to relate them. Similarities. Age is another kind of thing like this, right? And then on Friday, measurement error and missing data, topics that are common to everything that we do in the sciences. Whether you do experiments or observational studies, these things happen. So, into the material for today. This is the slide we ended on on Monday. We're going to pick up again with the pro-social chimpanzee data set. A very quick summary to remind you: this is an experiment with a small number of individual chimpanzees, many repeat observations on each, in four different experimental treatments, which combine which side of a table, left or right, the pro-social food option is on. Pro-social is our interpretation; the question is whether they interpret it that way. The pro-social option means both individuals get food; the other option, the non-pro-social option, means only the experimental individual gets food.
The other kind of treatment is whether there's a partner present. In this fantastic piece of art on the slide, there is a partner present; if the other end of the table were empty, there wouldn't be. And remember, we're interested in whether chimpanzees pull the pro-social option more often when there's a partner present. And the answer was no, right? At least in this experiment. I want to use this data set again, since you're already familiar with it, so there'll be less friction. I want to use it to show you how to build more complicated types of varying effects models, where we've got more than one kind of cluster at the same time. This is a very common type of data structure, called cross-classification. What's cross-classified here are the individual chimpanzees, which are called actors in this data set, and the experimental block. An experimental block is basically a day, right? So there are block effects, because maybe they were all grumpy because the weather was bad, or they were extra hungry, so they all pulled the pro-social option more in some particular block. It happens, right? Block effects happen. The terror of science is block effects. It's one of the reasons you want to randomize the order you feed your DNA samples through the machine, right? So there aren't block effects, time effects, correlated with your treatments. You've got to de-correlate the blocks and the treatments. A very common thing. In this data set, everything's balanced. You've got all the chimpanzees in all the blocks. It's amazing they pulled this off. Chimpanzees will do anything for a grape, right? So this is something you know. And so we're going to be interested in using both of these types of clusters. That is, actors are a type of cluster because you've got multiple observations on each individual chimpanzee, so you can estimate parameters that are specific to that chimpanzee, like handedness, right? You've also got repeat observations inside blocks.
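The cross-classified structure described here, where each observation's log-odds gets an actor offset, a block offset, and a treatment effect, can be sketched as a small generative simulation. This is an illustrative sketch in Python rather than the course's R, and all the population values below (actor sd 2, block sd 0.2, and so on) are invented, only loosely echoing the pattern the lecture reports:

```python
import numpy as np

rng = np.random.default_rng(13)

n_actors, n_blocks, n_treatments = 7, 6, 4

# Hypothetical population values: actors vary a lot (handedness),
# blocks vary very little, treatments vary modestly.
a = rng.normal(0.0, 2.0, size=n_actors)      # actor varying intercepts
g = rng.normal(0.0, 0.2, size=n_blocks)      # block varying intercepts
b = rng.normal(0.0, 0.5, size=n_treatments)  # treatment effects

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fully crossed design: every actor appears in every block and treatment.
actor, block, treat = np.meshgrid(
    np.arange(n_actors), np.arange(n_blocks), np.arange(n_treatments),
    indexing="ij")
actor, block, treat = actor.ravel(), block.ravel(), treat.ravel()

logit_p = a[actor] + g[block] + b[treat]  # the linear model on the slide
p = inv_logit(logit_p)
pulled_left = rng.binomial(1, p)          # simulated lever pulls
```

Because the design is fully crossed, each actor index co-occurs with each block index, which is exactly what lets the model separate actor variation from block variation.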
There are blocks, and a block is a day, and there are a bunch of observations within each block, and so the blocks can differ on average. That's how we measure the block effects, which are noise that interferes with inference, right? These kinds of models are called cross-classified, but as you'll see, you can just design them as a varying intercept model. There's nothing about the data structure that really distinguishes cross-classified from, say, something that's nested, except your interpretation of it. The model looks the same. What does it actually look like? Okay, so here's the multi-level chimpanzee model with both actor and block intercepts. Draw your attention to the linear model part of this slide. You'll see that there's an alpha for each actor, a gamma for each block, and then a beta for each treatment. So each particular observation is getting an offset from each of these effects. The betas for each treatment are ordinary fixed effects. It's not an adaptive prior. We're regularizing, but it's not adaptive. There are four treatments. The alphas are the varying intercepts on actor. These are just like the tank effects in the tadpole example from Monday. You remember that. Yeah? It's the same sort of thing. This is the handedness, right? This is the handedness of the individual, because the outcome is pulling the left lever. There are seven actors. This is an adaptive prior. As you see, there's an alpha bar, and there's a sigma alpha, which is the variation. So the model is going to learn this prior from the data, just like with tanks. And then we've got another adaptive prior in here. This one is the gammas. There's a gamma for each block. There are six blocks in the data set. This is an adaptive prior because there's a sigma in it. This prior is conditional on a parameter, and that's how we recognize that it's adaptive: there's a parameter inside of it. There is no gamma bar. Why?
You could put it there, but it would be redundant, because then you'd be adding alpha bar and gamma bar inside the linear model, essentially. So you'd have two things, like the left leg and the right leg, that you couldn't distinguish. Remember that from some chapter way in the past. It doesn't destroy the model; it's just a redundancy that's unnecessary. It'll still run, but it'll run less efficiently, and it won't help you at all. So we can just put a zero there. It's still an adaptive prior, because the width of this prior is being learned from the data through this sigma sub gamma, which is a different standard deviation from the others. And then we've got hyper priors at the bottom for each of these, including the bits that apply to the block varying effects. Does this make sense? How the definition works? You can extend the strategy to have as many different cluster types as you like, in principle. And since you're using Stan as your Markov chain engine, "in principle" extends a very long way. You can have tens of thousands of parameters in these things if you like, and you can often sample them very efficiently. What does the code look like? I'm going to walk through this slowly, so hang on and I'll highlight bits of it. This is the ulam code version of the model on the previous slide. It's just a logistic regression with a bunch of stuff, right? That's how to think about it. We've got a logistic regression up top. We've got a linear model with a vector of actor parameters, a vector of block parameters, and treatment parameters. And then there's the section in the middle with the adaptive priors, the varying effects. So, in the linear model, there's this a for each actor, and we've got a vector of those. Each actor parameter has this common prior, which itself has two parameters inside of it. It's conditional. That's what makes it adaptive. It's conditional on other parameters.
And then down at the bottom, in the hyper priors, you see those two parameters that are learned from the data, that give the shape to the priors for the actors. This is what creates shrinkage. You're going to get shrinkage here, just like with the tanks. Yeah? And then the same thing for blocks, now shown in green. We've got block up in the linear model. We've got the block adaptive prior. And then there's the sigma down at the bottom for blocks as well. Make sense? Easy, right? You just keep going. You have to manage it, but the strategy grows naturally from there. Okay. You should run this model at home. Remember, you always go home and draw the owl. Finish drawing the owl. You won't encounter any major problems with this. If you summarize and plot the precis output, on the left I'm showing you the b parameters, which are the treatment parameters. There's no new story here. Remember, chimpanzees on average are attracted to the pro-social option, but not more when the partner is present. When there's more food on one side of the table, they tend to pull that lever more, but they seem to be almost entirely insensitive, if at all sensitive, to the presence of a conspecific. The alpha parameters that come just next, these are the handedness ones. And you recognize sweet, sweet individual number two there, right? Lefty. Lefty has got a really big varying intercept. Yeah. A value of four on the horizontal axis means basically always pulling left. Why is there such a wide marginal posterior distribution for Lefty? Someone have an idea? What's that? Yeah, because all of them mean always. Yes, that's the most succinct version of it. He basically always pulls the left lever, right? And so the data can't tell you the value of that parameter, except that it's really big. And that's what happens. If you didn't have priors, infinity would be compatible with the data, right? But we know infinity is not a value we wish to consider, and that's what the prior says here.
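This saturation is easy to check numerically: on the log-odds scale, once an intercept is large, the inverse-logit flattens out, so quite different parameter values imply nearly identical behavior. A quick Python check, not the lecture's code:

```python
import numpy as np

def inv_logit(x):
    # convert log-odds to probability
    return 1.0 / (1.0 + np.exp(-x))

# Intercepts of 4, 5, and 6 all mean "almost always pulls left":
probs = {a: inv_logit(a) for a in (4.0, 5.0, 6.0)}
print(probs)
# roughly {4.0: 0.982, 5.0: 0.993, 6.0: 0.998} -- the likelihood barely
# distinguishes these values, so only the prior rules out infinity.
```

The three probabilities differ by less than two percentage points, which is why the marginal posterior stretches out: the data cannot pin the value down, only bound it from below.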
Yeah, does that make sense? You can't tell: four and five and six all make essentially the same predictions, because the likelihood is insufficient to identify the parameter. It's just how reality is. It's hostile to human life. So then the g parameters that come below are the block effects, six block effects. You'll notice these are very small. They huddle around zero. This means there's not much variation among blocks compared to the variation among actors. You can see this, yeah, you can see what's going on. Down at the very bottom is our alpha bar. Just slightly left-handed in this sample. Yeah. And then the two sigmas, and you'll see the two sigmas tell the same story that you can see with your eyes. Right, your ocular scientific instruments. You assess the variation among the intercepts for actor and the intercepts for block, and you can see that the actors vary more than the blocks do, and the sigmas are picking that up, right? So the posterior for sigma among actors is around two. That's a lot of variation, right? This is not describing exactly your sample. It's trying to learn about a population that these chimpanzees came from. That's one way to think about it. So if you were going to take this experiment to a new colony, and you wanted a prediction of the variation among the individual actors, for example to decide how many repeat measures per individual you need, then this sigma would be useful for designing the experiment. We'll do some more with that at the end of the lecture today. And then sigma g is practically zero. It's not exactly zero, but it's very, very close to zero, because there's not much variation among blocks. And then I plotted the densities for those two sigma variables in the right-hand part of this slide. Sigma sub g, for the blocks, is shown in blue. Sigma sub a, for the actors, is shown in black. And you can see how different these are. Does this make sense? So what's the consequence of this?
There's a lot more shrinkage among blocks. If one of the blocks were extreme, it would get shrunk towards zero really aggressively. For the actors, there will be very little shrinkage, because Lefty, bless Lefty, proves that chimpanzees are individuals, right? They have personality. Okay. So it's natural to ask, then: should we even have varying intercepts on block? And the answer is, it doesn't matter. It really doesn't matter. You can put them in the model; it won't hurt anything. You can leave them out; you'll get the same inference. And the reason is that there's basically no variation among blocks. It's a nice feature of varying intercepts, and varying effects in general, that if there's not much variation among the members of a cluster type, it isn't harmful to add the varying effects, because they get shrunk very aggressively in that case. So what I'm trying to show you here is the same model, but with the block effects taken out entirely. You'll see that they're missing. There's nothing about block in this model, but we've still got varying intercepts on the individuals. And we can compare these two models using WAIC or LOO; you'll get the same numbers. You'll see that they're very, very similar models. Their difference in WAIC is only two, and the standard error is very close to that as well. These are effectively the same models in terms of out-of-sample prediction. There's really nothing to get excited about in the difference. And you'll notice that the difference in effective parameter counts is quite small. Model 13.4 had seven additional parameters, because there were six blocks and one sigma, but you'll notice that it's not seven more effective parameters. Its pWAIC is only about 11, whereas in the model that omits all those blocks it's about 9. So it's like a two effective parameter difference, you know, rounding (I like to aggressively round), even though it has seven more actual parameters.
But those parameters were aggressively regularized, because the model didn't discover much variation among those units in the data. This is like a fail-safe device with varying effects. Yeah? And lots of machine learning works the same way. It doesn't use exactly this structure to do the adaptive regularization, but deep learning systems also have ways to consider many, many features and aggressively regularize the ones that don't matter. That's a nice thing. Regression trees are another non-Bayesian way. Well, there are Bayesian regression trees, so I shouldn't say that. Regression trees are another strategy. There are lots of regularization strategies that will do this for features. Does this make some sense? Yeah? Does it make enough sense that you can go home and draw the owl? That's really all I can hope for at this point. Yeah, you've got to go home and run the chain and draw the owl, right, and play around with it and look at the predictions. Okay. Let's add some more "random effects." Why not? This is also a chance for me to complain about that term again, right? You see this a lot: the term random effects, which is a synonym for varying effects. The word random has this tendency to lead people to interpret the effects in a stronger way than I think is warranted. These are just statistical devices that are used to regularize inference. The source of the clusters is irrelevant to whether we add varying effects, or random effects, or not. But you will read in textbooks advice to the contrary. So we're considering the issue here where we could add varying intercepts on treatment. There are four treatments. We have repeat measures of each, right? Every chimpanzee participated in every treatment. We've got tons of data per treatment. We want to regularize those, and we have in this model, but we have not adaptively regularized them. Why not? Some people will say, well, you can only put random effects, or varying effects, on things which weren't fixed by the experimenter.
The treatments are fixed by the experimenter, so we call them fixed effects, and then you don't use adaptive priors. That's nonsense. It's just nonsense. If you care about more accurate inferences, you use adaptive priors. That's all there is to it. So we can do this. It's fine. It doesn't matter. Don't succumb to the mind projection fallacy, the idea that how you set up the data determines how you should learn from it. Right. And to do that, all we have to do is add a sigma beta to the prior for the treatment betas. I think the prior standard deviation was 0.5 before. I won't go back to the previous slide, but I think I made it 0.5. If we just give it a parameter instead, we can learn it. This is what the model looks like. I'm circling there the new adaptive priors for the treatments. There are four treatments, four beta parameters. We get one more sigma down at the bottom. We've got lots of sigmas now, right? Stack them all up. And we run this model. It's no problem. And I'm just comparing this to the previous model. Model 13.6 is the one shown on the slide. 13.4 was the first one we did today, where you've got varying intercepts for block and actor, but fixed, regularized treatment estimates. And they're the same, basically. They're slightly different. The new model does regularize the more extreme one, beta 2, a little bit, but they're basically the same estimates. Why? Because there's tons of data per treatment. Right. So even though you're learning the sigma, it doesn't change the estimates very much. It trims the posterior uncertainty a little bit, but it doesn't change your effective inference about the treatments at all. There's no harm done here. Okay. One of the things that will happen when you go home and draw the owl is you will get these warnings that you have probably already seen occasionally. Yeah? Have you seen these? There are these mysterious things called divergent transitions. Have you seen them? Yeah? They're great.
They're kind of spooky. You get them, and you have no idea what they mean. Divergent transitions are your friend, in the sense that they're telling you about something that is numerically inefficient about the Markov chain. If you have a very small number of them, it's hardly ever a serious issue. And it's easy to get rid of them. There's a whole section in chapter 13 where I talk you through how to do this, and I want to spend some time today, probably the next 15 minutes, talking about these and how we fix them, because it teaches you something really important about multi-level models. And that is that we can write the same model different ways in the code, and even though they're mathematically equivalent, the Markov chain will see them as quite different. Often this is just a required thing: if you want to run these models and sample from them effectively, you need to be able to switch between the different ways of writing the same model. I know this sounds like madness, but even though the algebra is equivalent, the geometry is going to look really different to the Markov chain. That's the intuition of it. So, here's what you'll see. This is my R environment, by the way. It's glamorous. I know all the cool people use RStudio. I have nothing against RStudio, but there is some value to using the terminal; it doesn't crash as much as RStudio does. Jeff's not here today, but if Jeff were here, he would have disagreed vigorously about this. So, you'll get these warnings, three divergent transitions after warmup, and it gives you advice to increase adapt_delta, in this case above 0.95, which is the default for ulam. Sometimes that will work; adapt_delta is a parameter, and I'll tell you in a moment what it does. But sometimes that won't save you, and there's just nothing you can do except reparameterize your model.
So, before we get to what it means to reparameterize your model, let's start with what a divergent transition is. Forget about statistics for a moment and imagine you're on a frictionless roller coaster. Right, terrifying thought. Real roller coasters have friction; where the wheels touch the track there's a lot of friction, actually. But imagine it's frictionless. So, as this roller coaster moves from point A to B to C to D, there are two forms of energy which are being exchanged, shifting back and forth. There's a total amount of energy in this system, coming from the motion of the roller coaster, and it's frictionless, so it'll never stop moving. The energy is in two buckets, and even though the total amount of energy in the system is constant, it has to be: that's all energy is in physics, right? It's this thing that balances the ability to do work, and it's got a bunch of pockets you can put it in, but it has to be conserved. It's just the thing that balances equations. It sounds mysterious, energy. It's just equation balancing.
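To make the energy bookkeeping concrete, here's a toy sketch (Python, not the lecture's code) of the leapfrog scheme Hamiltonian Monte Carlo uses, applied to a simple quadratic bowl. With a small step size, the total energy H = p²/2 + q²/2 is nearly conserved; with too large a step, the integrator goes unstable and the energy explodes, which is exactly the signature flagged as a divergent transition:

```python
import numpy as np

def leapfrog(q, p, step, n_steps):
    """Integrate Hamiltonian dynamics for H = p^2/2 + q^2/2,
    whose 'potential' gradient is simply q."""
    for _ in range(n_steps):
        p = p - 0.5 * step * q  # half-step momentum update
        q = q + step * p        # full-step position update
        p = p - 0.5 * step * q  # half-step momentum update
    return q, p

def energy(q, p):
    return 0.5 * p**2 + 0.5 * q**2

q0, p0 = 1.0, 0.0
H0 = energy(q0, p0)

# Small step: energy drift stays tiny -- the coaster stays on the track.
qa, pa = leapfrog(q0, p0, step=0.1, n_steps=100)
drift_small = abs(energy(qa, pa) - H0)

# Step too large (above 2, this system is unstable): energy explodes,
# i.e., a divergent transition.
qb, pb = leapfrog(q0, p0, step=2.5, n_steps=100)
drift_large = abs(energy(qb, pb) - H0)

print(drift_small, drift_large)
```

The assumed quadratic target is the simplest case; the point is only that comparing energy at the start and end of a trajectory is a cheap, built-in diagnostic.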
As the roller coaster went down from point A to point B, it lost potential energy, gravitational potential energy, because it was high, and that's converted to kinetic energy, to motion, as it goes towards B. Then as it moves up from B to C, as pictured right now, the conversion goes the other way: that kinetic energy is converted back into potential energy as you get to the top of the hill. And then the conversion happens again. This is what roller coasters do, right? They just keep converting kinetic energy into potential energy and back and forth, but the sum of those two things is constant in a frictionless system. In a system with friction, energy is still conserved, but it dissipates: you get heat energy. Our Markov chains don't have heat energy, which is why I want you to imagine this frictionless roller coaster. So what is a divergent transition, then? A divergent transition is when your roller coaster pops off the track. And we can detect this inside the Markov chain by monitoring the energy of the system. It's a physics simulation, and it has energy; in fact, in Hamiltonian dynamics, the Hamiltonian is the total energy of the system, and so it's conveniently calculated right inside the Markov chain. If you look back at the chapter where I introduced this, Hamiltonian Monte Carlo, you'll see it there: I calculate the total energy inside that loop, and we use that to monitor the chain. If the energy is not conserved in your chain, something's wrong. It means your numerical approximation is bad, and you popped off the track. Why would that happen? Because we don't actually have a smooth, continuous simulation. We have those little steps, those little linear jumps: we get a gradient, then we take a little finite step, and then we do it again. So we're getting a piecewise linear approximation of a smooth surface, and if the pieces are small enough and the roller coaster track doesn't bend too much, that's fine. But when the roller coaster track bends really violently, or your step sizes are big, then you can pop right off the track, and you want to monitor for that. So these divergent transitions are when that happens. Going from one position in the Markov chain to another, that's a transition, those smooth paths that were drawn in the animations, and divergent means it pops off the track, pops off the true surface. You don't actually pop off of anything inside the Hamiltonian simulation, because it's a big manifold, but, you know, you go off of where you were supposed to be, and you recognize this because you can monitor the energy. So this is the summary slide to tell you all this, and there's a whole section in chapter 13. I want to show you some pictures to give you a little bit of intuition about it. On the right here, I'm showing you a posterior distribution which tends towards divergent transitions. This happens a lot, almost any time you have a parameter that's conditional on other parameters. Before I move to the next slide and explain this thing, I want to say the bold statement at the bottom of this slide: sampling strategies other than Hamiltonian Monte Carlo have problems with exactly the same kinds of surfaces, but they have no way to monitor it, because they've got no energy. So Gibbs sampling and all of those Metropolis-Hastings based strategies won't tell you when this is going on, but they definitely experience the same problems. They don't have all these extra parameters, like momentum, that Hamiltonian Monte Carlo has, which seem like a bother at first but turn out to provide diagnostics. One of the reasons to use Hamiltonian Monte Carlo is because, well, it just works better than the other things, especially in high dimension. But another reason to use it, even in low dimension, is because it freaks out when things are wrong, and the other strategies don't; you have to work much harder to get diagnostic information from them. So if you're
paranoid like me, you like things that throw up warnings all the time. Yeah, you love warnings. Don't turn off warnings; that's a bad idea, a very, very bad idea. If your smoke detector goes off when you cook in the kitchen, well then open a window; don't turn off the smoke detector. Same thing with Markov chains: open a window, don't turn off the warnings. So, a little bit about divergent transitions. Let me show you a routine kind of situation, one that arises all the time in varying effects models, where you get divergent transitions. It arises through this effect that's often called, in the literature, the funnel. This funnel is turned on its side; if the left-hand part were at the bottom, it would look more like a funnel. What is going on here? I'm showing you a very simple posterior distribution with two parameters, v and x. v has a Normal(0, 3) prior, and x is conditional on v, where e to the v is the standard deviation. So x is conditional on v, and when this is true, you can get very interesting shapes with very, very strong curvature. What you're seeing is v on the horizontal, and now, since the width of x depends upon v, as v gets small, x contracts, and you get this very narrow valley. So that part there, near the zero point on x, is this very narrow valley with very high ridges: the funnel. And your Markov chain has to get down in there and take some samples, because it's part of the posterior distribution, but the curvature is really tight there, so your roller coaster has a tendency to pop off. Out in the big smooth plain, where v is large, you can take big steps and move around expansively, and everything's good; it's completely different out there. And the big problem here is that there's no single step size which can efficiently explore both of those regions of this posterior distribution. In the valley, you want a really small step size, otherwise you pop off the track, and you want a bigger one out in the open area so you effectively explore it.
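Under the densities just described, v ~ Normal(0, 3) and x | v ~ Normal(0, exp(v)), you can compute how violently the curvature changes from the neck of the funnel to its mouth. A small Python check (the specific v values here are just for illustration):

```python
import numpy as np

# Funnel: v ~ Normal(0, 3), x | v ~ Normal(0, exp(v)).
# The log-density of x given v is -x^2 / (2*exp(2v)) + const,
# so its second derivative in x is -exp(-2v).

def curvature(v):
    return np.exp(-2.0 * v)  # magnitude of d^2/dx^2 log p(x | v)

neck = curvature(-3.0)   # deep in the funnel's neck
mouth = curvature(3.0)   # out on the wide plain

print(neck, mouth, neck / mouth)
# A stable step size scales like 1/sqrt(curvature), so a step that
# works in the mouth is roughly e^6, about 400 times, too big for
# the neck.
```

Plausible values of v span a curvature range of more than five orders of magnitude, which is why no single step size serves both regions.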
Yeah? So these kinds of posterior distributions are much more challenging for all kinds of samplers, and Hamiltonian Monte Carlo is just nice in that it throws warnings about these things. So think about two kinds of behavior. On the left, I've set the step size large, and I've colored the divergent transitions in red. I'm just monitoring the energy, and when the energy at the start of the transition is different from the energy at the end, that's a divergence. Now, the Markov chain will reject these samples; Hamiltonian Monte Carlo can reject samples, and the rejection criterion is the difference between the starting energy and the ending energy, because a difference means you've got a divergence. So this doesn't necessarily corrupt your chain, but it does mean that it's way, way less efficient than it could be, and it could induce subtle biases in the posterior distribution, because the places where you do the rejections are particular regions of the posterior, and you won't be sampling there. Now, in infinite time you'll be fine, but you don't have infinite time. Again, you've got to finish your PhD, so I'm here to help you figure that out. The other option is that we can make the step size small, but then you could just spend all your time missing the funnel completely, as in the plot on the right. I think I took 500 samples in each of these. Now, Stan will do better than this; this demonstration has no adaptivity going on, I've just made some examples for you. Stan will sample quite well from this particular example. But as you increase the dimensionality, stacking up the x's as in a varying effects model (in a varying effects model, you're not going to have just one x that's conditional on v; you could have hundreds of x's that are conditional on v), this gets pathological real fast. So we want to do something about it. What do we do?
You've got two strategies. In the first scenario, on the right, the divergences go away, but you could spend a long time before the chain finds the funnel and takes samples from it. Yeah, it also makes the chain run slower. However, that's a fine trade-off, I think; valid inference is worth waiting for. The second option, which is often way more effective, is to reparameterize. And I should say, you only undertake either of these steps after you've looked hard at your model and made sure it's correct, because in my experience, when the chain isn't efficient on the first go, it's because I did something stupid, like leaving out a prior completely, so that a parameter has no prior at all. That's my own personal most common mistake. There's this folk theorem, I think I showed you earlier in the class: the folk theorem of applied statistics is that when you're having trouble estimating the model, it's usually because the model is wrong. So the folk theorem holds, and the first thing you should do is look at the model and make sure it's all there, that you didn't do something ridiculous. After that, you can attempt either of these steps. So what's this reparameterize thing? It's weird when you first see it, so I want to spend some time explaining it. Okay, so it's a fact, a wonderful fact, that most any statistical model can be expressed several different ways in the formal mathematics. If you write them on the page, they're all perfectly transformable; they're the same, algebraically identical, models. Let me show you some examples. Here's a really simple model: alpha is some parameter that's conditional on mu and sigma. You see these things in your dreams all the time. Alpha is conditional on mu and sigma. We can re-express this little model multiple different ways. We'll be focusing on the distribution of alpha; alpha has some probability distribution. Here's another way to write it. We can say that alpha is deterministically the sum of mu, the same mu as in the first line, plus some new parameter beta, and we assign beta the distribution Normal(0, sigma). Now alpha, at the bottom of this slide, has the same distribution as alpha at the top. You see this? We've just re-centered it. We've taken mu out of the prior for alpha, and we added it back on later. You can always do this with a normal variable; you just translate it to wherever you want it to be. Make sense? Now you're thinking, sure, you could do that, but why would you do that? I'll show you why, but right now this is just about what you could do. We're just in modal-verb land; we're doing could, and we'll do should later. And we can go one step further: we can get sigma out of the distribution as well. Now we say alpha is equal to mu plus z times sigma, and we assign z a Normal(0, 1) prior. z is a z-score, a "zee" score, or a "zed" score if you come from a barbaric place. No, sorry, this is a fight between Canadians and Americans, whether you say zee or zed. Some people have opinions about this; I don't. Alpha is mu plus z times sigma, and z is Normal(0, 1). When we multiply those individual z's by sigma, they're back on the scale that we want, because sigma gives their width, the width of the target distribution, and then we add mu back in, and it's translated to the position we want. And alpha, in all three of these expressions, has the same distribution. Does that make sense? Yeah, good times. And now, why would you do this? You could do this, yes, but it's like that Jeff Goldblum thing: the scientists were so busy asking whether they could that they didn't stop to ask whether they should. You remember that? We should do this, sometimes, absolutely. And the reason is that even though it's mathematically the same, your roller coaster sees these distributions differently, because the shape is different; the geometry that it has to cruise around on is different. This is a cool effect and a very handy way to make your Markov chains more efficient. So on the left we have the kind of default form that I showed you, often called the centered form. This is not great terminology, but as always
I've got to teach you the terminology that's in the papers. So how do you understand that it's centered? There are parameters inside the distribution, so there are parameters that are conditional on other parameters; the distribution has been located by sticking parameters inside it. Someone invented this term; it doesn't entirely make sense, and you just have to learn it that way. Centered means that there are parameters inside the distribution of the parameters. Non-centered is the radical version, where we try to take all of the conditioning out of the distributions and stick it inside some linear model. So the right-hand version here is the non-centered version, where we have the z-score with the Normal(0, 1) prior. Make sense? That's just a definitional issue.

Now let's look at the geometry of these two equivalent distributions; we get the same information from each. On the left we've got the one I showed you before, the funnel. I'm showing you some Hamiltonian trajectories going down into the funnel, where we get divergent transitions: the roller coaster goes down in there and flips over. And on the right, the same thing, but in the non-centered form: now we've got a Gaussian bucket again, and we can just cruise around this Gaussian bucket forever, doing nice oscillating paths. It's beautiful. They're the same distribution, and you can convert between them, but the one on the right is way easier to cruise around in on a roller coaster; one of these is much more violent than the other. So when we reparameterize, when we do our varying effects the way the version on the right does, you typically get many more effective samples from your Markov chain than you do on the left.

So again, with my hand-written Hamiltonian Monte Carlo code, not especially well tuned, those red points are divergent transitions that are rejected. They're left out of the Markov chain, and whenever that happens you start back over where you started. They're scattered around, but a lot of them are down in the funnel. And the ones that are outside the funnel, by the way, are ones that start in the funnel: you start in the area of high curvature, you get out of it, and you still have a divergence. Then on the right, with the same settings, there are no divergences at all, a 100% acceptance rate, nice Hamiltonian Monte Carlo. So this is a big difference, and it's really nice.

Let me show you what this looks like in a real model, and how to do it. This is the chimpanzee model again, with actor and block, just as before. This is a centered model, because it has parameters inside the adaptive priors, and it sampled with divergent iterations. What I'm going to show you is that if we non-center it, make it into a non-centered parameterization, it'll be more efficient. Let's focus just on the linear model first. The first thing to wrap your mind around is that the linear model up top is now going to be written with these z-scores and sigmas inside of it. Each centered parameter, like an actor intercept, becomes z-sub-actor times the sigma among actors, which rescales it, and then alpha-bar is just out in front; we've taken that out of the prior too, and we've put it inside the linear model. So it's the same, but it looks way worse, doesn't it? It looks like a much harder model now. This is the trade-off; you can't have it all. And then the same for block. I made these things called x for block, because I didn't want to have z twice, so now it's x-scores for block times sigma-gamma, the standard deviation among blocks. The treatments are still fixed effects, so we leave them as they were. Yeah? Makes sense? And then the rest of the junk. The new bit down here is where we used to have the alpha and gamma priors that were adaptive; we now have z ~ Normal(0, 1) and x ~ Normal(0, 1). These are still vectors of parameters, these are varying effects, but they're standardized varying effects, and we have to de-standardize them to make the correct predictions. That de-standardization happens
inside the linear model. Fun? Yeah. I teach you this because there's really no avoiding it. This is the most important thing to understand about making varying effects models run right, and lots of things are basically varying effects models, because in complicated models you often have distributions that are conditional on other parameters. That happens all the time; factor models, all kinds of things have this phenomenon going on inside them. Good, can I move on to the next slide? I have no reason to rush here; we can stay on this slide the rest of the time if it's necessary. Okay, I know, you're like, no, please move on. So in code, here's what it looks like. There's this really long linear model now. Yeah, that's okay, it's all the same. And now we've got z here, dnorm(0, 1), and x for block, dnorm(0, 1). The rest is the same; it's that easy. Now, what does this get us? You run this thing, and again, you should really go home and run this and see for yourself. It'll also sample faster; you can compare the timings. This chain will run faster as well, because the physics simulation runs better. But what I'm doing on this slide is comparing the number of effective samples per parameter for both models. These are structurally equivalent models: same data, analytically the same posterior distribution. And they are; if you look at the precis output, they're basically the same. They're the same posterior distribution, for sure; both of these have worked. But n_eff sub c is a vector of the number of effective samples per parameter for the centered model, which is the first one we did, and n_eff sub nc is for the non-centered one. And I'm just showing you in this table, per parameter, all the b's, all the a's, all the g's, a-bar and the two sigmas, those numbers, and they're always bigger for the non-centered model, sometimes double. So what does this mean?
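Before answering that: the algebraic fact underneath all of this, that the centered and non-centered forms imply the same distribution, is easy to check numerically. Here's a minimal sketch in Python/NumPy; this is my own illustration with arbitrary values of mu and sigma, not code from the course:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 1.5, 0.5
n = 1_000_000

# Centered form: draw alpha directly from Normal(mu, sigma).
alpha_centered = rng.normal(mu, sigma, size=n)

# Non-centered form: draw a z-score from Normal(0, 1),
# then scale by sigma and shift by mu.
z = rng.normal(0.0, 1.0, size=n)
alpha_noncentered = mu + z * sigma

# Same distribution: means and standard deviations agree
# up to Monte Carlo error.
print(alpha_centered.mean(), alpha_noncentered.mean())  # both ~1.5
print(alpha_centered.std(), alpha_noncentered.std())    # both ~0.5
```

The sampler, however, sees very different geometry in the two forms, and that difference in geometry is where the gain in effective samples comes from.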
It means you don't need to run the chain as long. And with a really big data set, this is a huge savings. If a model takes two days to sample, I definitely worry about these things before I let it run for two or three days. You don't need as many samples, so you can run the chain for less time, and I don't have to fight my colleagues for cores as much. Does this make sense? This is often just the best thing you can try, and as we get to examples in the next week, I'll often show you this trick. We're going to repeat this trick next week as well, in lots of things, and I'll show you how to do it with the phylogeny: you can non-center the whole analysis, and lots of other stuff, to make it work.

The last thing that we really need to talk about this week is how you do posterior predictions with multilevel models. And the answer is: you've got to make some choices. Now that there are different levels at which to interpret the model, because it's multilevel, you get some agency back, and you get to decide how you want to think about the generalization of this model. There are different aspects of the posterior distribution which may be relevant to you, depending upon your scientific purpose. And the most direct way to consider the choices you can make is: when you generalize your inferences from this sample, are you interested in new units, like new chimpanzees, or are you interested in the same chimpanzees? If you're interested in new chimpanzees, say at another colony (I think the ones I showed you were in Texas; those are Texas chimpanzees), and you want to make predictions for Leipzig chimpanzees, you wouldn't get to use the alphas. Those alpha parameters are irrelevant to you. They're in the posterior, but you can't use them to make predictions for a Leipzig chimpanzee, or rather, you shouldn't. Lefty, remember Lefty? You're not going to take Lefty's alpha and just assign it to some random chimpanzee here. However, the other parameters are definitely relevant, because they tell you about a statistical population. You've learned about the variation possible in a colony of chimpanzees, which is substantial, and that would give you a prior for designing an experiment here.

So let me give you an example, using the chimpanzee model, of how you might project different kinds of predictions out from this thing. If you're interested in the same chimpanzees, it works very much like every chapter we've come to before: you just use those parameters to generate predictions. But in other cases you've got some choices. So when we do new clusters, as I say at the bottom of this slide, you've got a bunch of different ways you can generate the predictions, and different visualization options. Let's consider, as I say here at the bottom of the slide: if we have the same actors, you can just use link and sim as before. It works just the same as it always has; you push samples back through the model and you get predictions for those chimpanzees. This is like a posterior predictive check, to make sure the model worked. If you've got new actors, this is a kind of counterfactual thing: imagine sampling new actors from the population. We're going to literally do that inside the model: we're going to sample from the adaptive prior in the posterior distribution and get new chimpanzees out. There are different ways to draw these, and I'm going to show you three. You can think about an average actor in a new sample; I'll show you what that looks like. Then there's something else, confusingly called marginal of actor. What does that mean?
An average actor is literally the average actor. So we take the statistically average chimpanzee, which is definitely a construction, maybe doesn't exist, but the statistically average chimpanzee, and we ask what the model says they would do. That would be a chimpanzee at the population mean, at alpha-bar. Marginal of actor is different: let's sample a bunch of simulated chimpanzees and average over their variation. That's what "marginal of" means: you marginalize over the distribution. And that's different, as you'll see, because it includes the population variation. And then my preference, honestly, is usually just to show a sample of actors from the posterior, as a bunch of lines; you know I love these line plots. The posterior distribution is full of chimpanzees now, and we can plot them. It's just full of chimpanzees, an infinite number of chimpanzees. So, one at a time: what does the average actor look like? This means an actor with the population mean intercept, alpha equals alpha-bar, for the average actor. And how do we do this?
Well, you know the model, because you're taking my class and I have cruelly made you manually write the log posterior for every model we've done in this class, so you're able to do generic predictions from it, leaving out any bit you like or plugging in anything you want. And so we can do this quite easily with a trick here. All we have to do is replace the intercept samples, all of them, with zeros. Now all the actors have the average intercept, and then we can use link. So the way I've done this is: we extract the samples, put in this big matrix of zeros, and run link. All this code is in the book, and you can draw the owl at home. You can also just directly simulate it. So what do we get from this? We get a graph like this. You can focus on the code later when you go home; focus on this graph. Here's the average actor. It looks like a Charlie Brown shirt, the way it zigzags, and there's uncertainty about it, but that's the average actor. What about marginal of actor? Now we've got a sample of a bunch of chimpanzees, and we have to average over them. So we're going to generate a prediction interval for the whole population of them, as they vary: some of those individuals are left-handed and some are right-handed, and that makes it wider. The code's in the book, and you'll want to go home and play with it. It's the same Charlie Brown zigzag, but it's much, much wider. Yeah? Why? Because the chimpanzees vary a lot; they're really different from one another. Does that make sense?
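The distinction between the two plots is just the difference between pushing the population mean through the link function and averaging over simulated individuals. A minimal sketch in Python/NumPy, with made-up values for a_bar and sigma_a rather than the fitted posterior:

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
a_bar, sigma_a = 0.6, 2.0  # hypothetical population mean and spread
n = 100_000

# Average actor: an actor sitting exactly at the population mean.
p_average = inv_logit(a_bar)

# Marginal of actor: simulate many actors from the population,
# push each through the link, then average over their variation.
alphas = rng.normal(a_bar, sigma_a, size=n)
p_marginal = inv_logit(alphas).mean()

print(p_average)   # ~0.65
print(p_marginal)  # smaller: pulled toward 0.5 by the population variation
```

The two numbers differ because the inverse-logit is nonlinear, so the mean of the transformed chimpanzees is not the transform of the mean chimpanzee, and the marginal prediction inherits the population's spread.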
And finally, the last thing to look at: instead of drawing this as an interval, you can just sample a bunch of chimpanzees and draw them. Now these individual trends are chimpanzees. You have 50 simulated chimpanzees. It looks like more than 50; it could be 50; let's say it's 50. And you'll see that one of the things that happens here is that, because there's a posterior distribution for all these parameters, and there's this big, wide population distribution, occasionally you get individuals at the other extreme, down at zero. We didn't see any of those in our sample, but the model hasn't excluded them as possibilities. You can get extremely right-handed chimpanzees, the mirror image of Lefty, and those are the ones crowded down at the bottom, or all the way at the top. But you still see the treatment effect echoing through this; that's something you get. I often like plots like the one on the right-hand side of this slide, because you see more, at least I see more, than you do if you draw the compatibility interval, that big smooth space. You get to see that there are a bunch of individuals, and you get to see, for example, that as an individual approaches either boundary, the treatments have less effect: you hit the ceiling or the floor, and you get less effect. So, makes sense? Good?
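That ceiling-and-floor behavior is a general fact about the logit link: the same shift on the log-odds scale moves the probability less as an individual's baseline approaches 0 or 1. A minimal sketch in Python/NumPy; the treatment effect size and the baseline intercepts here are made up for illustration:

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.5  # hypothetical treatment effect on the log-odds scale

# Individuals with different baseline intercepts: a middling one,
# a strongly left-handed one, a strongly right-handed one.
for alpha in (0.0, 4.0, -4.0):
    # Effect of the treatment on the probability scale for this individual.
    effect = inv_logit(alpha + beta) - inv_logit(alpha)
    print(alpha, round(effect, 3))
```

The middling individual shows the largest change in probability; the individuals near the ceiling and the floor barely move, even though beta is identical for all of them.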
Okay. You go home and you draw the owl, and you'll see how this works. And when you have your own data and model, you'll make some decision about how you want to explore it and draw it; it always depends upon the scientific context. Okay, homework. I have prepared a homework assignment. I still have yet to finish it all myself, but it is written, and I will post it this afternoon. What you're going to do, I call it frogs and contraception. It's not that the frogs have contraception; there are multiple data sets you're going to work with. You'll work with the reed frog data some more. This is a basic varying intercepts model, but you're going to add predictors to the model. I'm going to go ahead and give away the plot and say that the point of this is that I want you to really focus on what happens to the variation among tanks as you add predictors to the model. So you're going to fit multiple models, some with more predictors than others, and then you're going to look at the sigma among tanks across the different models, and there will be a pattern, hint, and that's what I want you to focus on, so that you learn what's going on with these things. The second data set is a historical contraception data set. It's a really nice data set, and I know the posters are from India, but this is all I could find with some quick googling; not the same place, I know. With this there will also be a varying intercepts model, and you're going to explore shrinkage in much the same way you explored it with the tadpoles. But now this is real data, of incredible consequence for planning, and I want you to explain the shrinkage pattern that arises in the data set. Okay, with that, as always, thank you for your indulgence. When you come back on Monday, we're going to extend directly from here to add varying slopes. We're going to let everything vary, and I call this adventures in covariance, because it opens up the possibility for a huge number of exciting things, including phylogenetic models and network models and other kinds of stuff. Thank you, have a good weekend. It'll be very warm here in Leipzig, and I'll see you on Monday.