Welcome back everybody. We're going to finish up with the introduction to multilevel models today. To remind you where we stopped last time, I had just gone through fitting our first varying intercepts model on the reedfrog survival data and gotten to what I hope is a useful punch line about why varying effects behave the way they do. The shrinkage phenomenon that's created by pooling is useful to us because it is regularizing. It's part of this overfitting trade-off that we discussed much, much earlier in the course. And so I want to help you understand a continuum of models, where the model we fit and where it settles down in the estimation represents a trade-off between those two monsters I described before. You've got the whirlpool of underfitting, where your model can be too simple, meaning it doesn't learn enough from the data, and the hydra of overfitting, where you learn too much from the sample because the sample is special. You want to make predictions about the future. Generalization is usually what we want, right? Usually, not always, but usually. So let's revisit this model and think about it that way first. To remind you: at the top, this is our varying intercepts model. It's just a binomial regression with an intercept for each of the 48 tanks. But the common prior for these tanks has parameters inside it, and this is what creates pooling. Information is exchanged among tanks through those two parameters: they are all jointly constrained by the shape of that prior. Now the funny thing, of course, is that alpha and sigma are learned from the tanks themselves. So there's this joint constraint going on. And this is what Bayes is good for. You just have to set up the assumptions and it figures out how to do that learning. That's what's nice about it. And then we try to understand it through the behavior of the machine. So in the model comparison here, which I think I finished with on Wednesday, if we compare the varying intercepts model, that's model 12.2, to the so-called fixed effects model, where there's an intercept for each tank but there's no pooling, then what you find is that the effective number of parameters for the varying intercepts model is lower, even though it has two more parameters than the fixed effects model. So this is a violation of something that I lied to you about back in chapter six. I told you in chapter six that if you add parameters, you always get a better fit to sample. And that's false. It's true in classical statistical models where there's only one level, so to speak. But the relationships between parameters can be very complex in general, and models can do all kinds of weird things. So in general, it's not true that adding parameters makes you overfit more. And here's an example. Now I want to explain to you why in this case you don't end up overfitting more, you end up overfitting less. It's because of what the parameters mean in particular. So model 12.2 has 50 parameters. I tried to give you the count, right? You've got 48 tanks plus an alpha and a sigma. That's 50 parameters, so there are 50 dimensions in the posterior distribution. It's a 50-dimensional hyperspace. Yeah? Good times, right? It's not Gaussian anymore, I guarantee you, because sigma does not have a Gaussian posterior distribution. Take a look at it. Well, actually, in this case, it's pretty Gaussian. I'll show it to you in a few slides, but it's not exactly Gaussian.
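Just for reference, here's roughly what that model looks like in code, using the rethinking package. This is a minimal sketch: the surv and density columns come from the reedfrogs data, and the prior values are approximately the chapter's, so treat the details as illustrative rather than exact.

```r
library(rethinking)
data(reedfrogs)
d <- reedfrogs
d$tank <- 1:nrow(d)  # one index value per tank (48 tanks)

# varying intercepts on tank, roughly the chapter's model 12.2
m12.2 <- map2stan(
    alist(
        surv ~ dbinom(density, p),       # binomial survival out of the initial density
        logit(p) <- a_tank[tank],        # one intercept per tank
        a_tank[tank] ~ dnorm(a, sigma),  # common adaptive prior: this is what creates pooling
        a ~ dnorm(0, 1),                 # hyperprior for the population mean
        sigma ~ dcauchy(0, 1)            # half-Cauchy hyperprior for the between-tank sd
    ),
    data = d, iter = 4000, chains = 4
)
precis(m12.2, depth = 2)  # 48 tank intercepts plus a and sigma: 50 parameters
```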
But it's a 50-dimensional space, and yet the flexibility, the estimated flexibility, of the model is in the ballpark of a model with only 38 parameters. So why is that the case? Let's think about this model as really defining a spectrum of an infinite number of models with different fixed values of sigma. So imagine you weren't going to estimate sigma from the data, although that's the whole point of a varying intercepts model, to estimate sigma from the data. But imagine you weren't. Imagine you were just going to plug in a value there. Now, there's an infinite number of values of sigma you could put in on this spectrum. On the far left, sigma could be zero, right? We take the limit as sigma goes to zero, and then you get a model where all of the alpha sub tank parameters collapse together into one parameter. They have to take on the same value, because the assumption is that all the tanks are identical when sigma equals zero. There's no variation among tanks. You with me? Yeah? So this is what's called the complete pooling model. This would be the model you would fit if you just didn't have a subscript under alpha there. Then you're just estimating the average survival across all the tanks, as if you had dumped them all into one swimming pool, right? We dump all the tadpoles into a swimming pool and count up the dead tadpoles. This is a grim example, sorry. That's the complete pooling model, where sigma equals zero. You with me? It's called complete pooling because there's total exchange of information among the tanks. They're treated like they're the same thing, because the model believes that there's no variation among them. They're all representations of exactly the same process, with the same probability of survival in every tank. Yeah? But then everything above zero moves away from that, with more and more variation among tanks, and as a consequence you get less pooling, right? But it takes you a very long time to get to zero pooling. At the other extreme is infinity, which, as I was trying to explain to my son, is a number. In fact, there are an infinite number of infinities. And he was like, no, that's too much. Not only is infinity a number, but there are an infinite number of them, and each of them is infinitely larger than the last. And he's like, no daddy, no. I'm sorry, Cantor, Cantor, preach the gospel of Cantor. But so if we set sigma to infinity, or rather take the limit as sigma goes to infinity, what do you get instead? You get a model with no pooling. You get the fixed effects model that we fit before, a model where there are different parameters for every tank, but they don't exchange information, right? So both of these extremes are cases where there's no information exchange in a sense, or well, on the far left it's total information exchange, but you've got no distinct estimates, right? It all just blends together. And on the far right of this dimension, you've got the case where there's no exchange of information and only the data for each tank is used to estimate the parameter for that tank. Yeah, you with me? But when you fit these models, it won't surprise you to hear that you never get a zero or an infinity; you get something in between. You get some intermediate result, and the location on this dimension is learned from the data. The amount of variation among tanks is what sigma is supposed to quantify, right? The variation between tanks.
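To make the two ends of that spectrum concrete, here's a hedged sketch of the two limiting models written in the same style. The complete pooling model has a single shared intercept; the no-pooling version here uses an independent intercept per tank with a wide fixed prior standing in for sigma going to infinity. The prior values are illustrative, not the chapter's exact ones.

```r
# sigma -> 0: complete pooling, one intercept shared by all tanks
m_complete <- map2stan(
    alist(
        surv ~ dbinom(density, p),
        logit(p) <- a,        # the same a everywhere: total exchange of information
        a ~ dnorm(0, 1.5)
    ),
    data = d, chains = 4
)

# sigma -> infinity: no pooling, an independent intercept per tank
m_nopool <- map2stan(
    alist(
        surv ~ dbinom(density, p),
        logit(p) <- a_tank[tank],
        a_tank[tank] ~ dnorm(0, 10)  # wide fixed prior: tanks share essentially no information
    ),
    data = d, chains = 4
)
```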
So sigma is the standard deviation between tanks. And if that's a very small number, then there will be a lot of pooling, a lot of shrinkage, right? There will be a lot of regularization, because the model is extremely skeptical of any tank that looks like an outlier. It'll think that must be sampling variation, and it'll shrink it very strongly back towards the grand mean. If instead sigma is a very large number, then the model thinks, well, look, I just visited a new tank here. This is like the robot from last time, except now it's a tadpole robot rather than a coffee robot. It's not very skeptical of anything it might encounter in a tank, because there's so much variation among tanks that anything goes. It could be anything at all. But it's got to learn both of these things simultaneously from the data. And that's the goal. So what happens in this particular empirical case, with this data set, is what I've drawn up here: the prior and posterior distributions for sigma on this dimension. And obviously I can't plot out to infinity; I think this axis goes to 10 or something like that. The prior is the dashed line there. That's our half-Cauchy, which is super vague: a really long tail that goes out a super long way. The posterior distribution has a mode at about 1.5 and a mean at 1.6 or so. What does that mean? That's very strong regularization. It's equivalent to, well, not exactly equivalent to, because there's uncertainty, right? It's not exactly 1.5. But you can imagine replacing the sigma in this model with a fixed 1.5, and you'd get just about the same estimates for the tank intercepts. Does that make sense? So it's like if you had put in a very strongly regularizing fixed prior that was Normal(0, 1.5), you'd get very similar estimates. Does that make sense? But you didn't have to guess what the regularization should be. You learned it from the data instead, or your computer learned it from the data instead. But you're the master here, right? You commanded it to do so; you're in control of the golem. Does this make sense? Yeah, so this is the wonderful thing: this is the adaptive regularization of varying effects models. And this here is the simplest sort of model we can concoct. It just has one kind of intercept, clustering on tanks, but the same principle holds no matter how many different types of clusters there are. And you can do this with slopes. That's what we're going to do next week. We're going to do exactly the same thing with beta coefficients, with slopes, as we do with intercepts. There's nothing special about this story that only applies to intercepts. It applies to any kind of parameter. You can do pooling on them. You learn the regularization from the data so you don't have to guess it. Why do we do this? Because we're terrified of overfitting. Absolutely terrified, right? Everybody overfits sometimes. It's okay. So what does this distribution look like, the one we learned from the data? So what I'm plotting here is, I don't know, 20 or 50, however many lines this is. What you see on the left of this slide are samples from the posterior distribution of the regularization function, the normal distribution of tanks. So we take correlated samples of alpha and sigma from the posterior distribution. Two dimensional, right? Marginalizing over the other parameters.
This is a two-dimensional distribution in the posterior for the Gaussian distribution of tanks. And that's what I'm plotting here. So this is, I don't know, 20 or 50, it looks more like 50: take 50 samples and then plot each of the normal distributions implied by that combination of alpha and sigma. You with me? I do this so you can see the scatter. The robot here is not totally sure what it should be. But these are all pretty similar. The mean is somewhere above zero. What does that mean? In most tanks, most tadpoles survive, right? Because this is on the log-odds scale. Log-odds of zero means 50-50, right? You will get very good at thinking in log-odds. It will become natural. You will dream in log-odds. It will happen. Math is easy on the log-odds scale, right? You can do things arithmetically. And four means essentially everybody survives, and minus three means almost everybody dies. Yeah? So this is a distribution now. Think of it as a statistical population of tanks, and the amount of pooling comes from it. So if there's a particular tank that's way out in the tail of this, it gets shrunk towards the mean in the estimate, because the model is skeptical of such a tank. It doesn't expect very many tanks to behave that way. So you get more shrinkage. Does that make sense? This is a distribution that defines the shrinkage, defines the skepticism. Yeah? You with me? So then on the right, I just take this and convert it to the outcome scale, so you can see what this implies about observed frequencies. What does the model think about the future? If we repeated this experiment, what would its expectations be? So this is the posterior predictive distribution of survival in a population of tanks, and this is what it looks like. You get this mass piled up against one, because you'll sample a bunch of tanks up here where most of the tadpoles survive, and then there's sampling variation, right, which would pile things up even more, because you'd get a bunch of ones in some cases. That's not shown here; this is just probability of survival. But there's a long tail of very sad tanks, right, where a lot of tadpoles died. This is what it expects. Yeah? You with me? Yeah? John has a very contemplative look. Is that approval or is that, yeah, makes sense? Okay. This is super weird, so I'm sympathetic if you want me to pause for a second and re-explain something or try a different metaphor. I told you this was going to happen early in the course. We've reached the point where the posterior distribution contains functions now, right, because the parameters in it define distributions. So we have a posterior distribution of Gaussian distributions. And that's what we're showing on the left. With me? And of course English lacks the grammar for this, so it's very difficult to say something like 'a posterior distribution of distributions' and know what's modifying what. But you're with me, right? This is one of those cases, we were discussing this just yesterday, where we could use the dative case occasionally in English. It would be very helpful to know what the indirect object is. So you can look at the actual estimates and see the shrinkage phenomenon in any particular empirical case. But the argument I'm making with you guys is that varying effects are better because they give you better predictions out of sample.
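Before we get to that, here's roughly how the two panels I just described can be drawn from the fitted model: sample pairs of alpha and sigma from the posterior, plot the Gaussian population of tank log-odds each pair implies, then push simulated tanks through the inverse-logit to get the implied distribution of survival probabilities. A hedged sketch with rethinking's helper functions; the number of curves and the axis limits are arbitrary.

```r
post <- extract.samples(m12.2)

# left panel: many Gaussian populations of tank log-odds,
# one curve per posterior sample of (a, sigma)
plot(NULL, xlim = c(-3, 4), ylim = c(0, 0.35),
     xlab = "log-odds of survival", ylab = "density")
for (i in 1:100)
    curve(dnorm(x, post$a[i], post$sigma[i]), add = TRUE,
          col = col.alpha("black", 0.2))

# right panel: implied survival probabilities for new, unobserved tanks
sim_tanks <- rnorm(8000, post$a, post$sigma)  # imaginary tanks drawn from each sampled population
dens(logistic(sim_tanks), xlab = "probability of survival")
```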
So let me show you that. In order to show you that, I have to fake something, right? I have to make some imaginary tadpoles and kill them. So that's what I'm going to do. And all the simulation code to do this is in the book, so take a look in chapter 12; I show you how to do it. Literally all we're doing is simulating tadpole mortality. That's all we're doing, and the code is simple. This is something that's really useful to learn to do when you do your own analyses. Those of you in my department know that I always encourage people to do this, to figure out how their analysis works before they run it on their own data. In fact, preferably before you collect the data, you should decide how you're going to analyze it. And the best way to do that is to simulate some data. This forces, I don't know what it does for other people, but for me, it forces me to actually think about the project, right? It's very easy to skip over a bunch of details about how the data will actually be processed. But when you do it like this, you really have to engage with all the assumptions. This synthetic data approach, the way it was taught to me in graduate school, is the way to think through the problem. Here we're going to use it to demonstrate the small-world advantages, right? In the big world, the large world, all bets are off. But in the small world, the clear advantage of varying effects is that they give you better predictions. Your out-of-sample prediction error will be lower because of the adaptive regularization. So let me try to show you what's going on here. We simulate a bunch of, now we call them ponds instead of tanks, with different densities of tadpoles: five, ten, twenty-five, and thirty-five tadpoles in different ponds. Fifteen at each density, so sixty ponds in total. We end up with a synthetic dataset where each row is a pond, right? Numbered creatively, one through sixty. And n is the initial density of tadpoles, the number of live tadpoles in our imaginary pond. And then I sampled from a Gaussian distribution of log-odds of survival for each pond. So I'm pre-specifying the simulation, some alpha and sigma. That is the quote-unquote true population of mortality effects, or survival effects in this case, in the different ponds. And so there's a true intercept for each pond, and that's what this column is. Yeah, down to, you know, silly precision. Truly silly precision. Then we get an actual number surviving. That's what s is, which just comes from coin flipping. The cruelest coin in the world. Each tadpole steps up, the coin is flipped, and it gets zapped or not, right? This is what happens inside the computer. And then we get various ways of estimating from the resulting simulated data. We force the statistical model to use only s and n to figure out the mortality probabilities. And the question is, what procedure for doing this gives us the closest match to the truth? Right, that's the question. Are you with me? Okay, so we have the no pooling estimate, the partial pooling estimate, and then the truth. And the question is which is closer, and you can probably guess. I'm not going to force you to look at columns of silly numbers, but let's plot it instead.
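For reference, the simulation itself is only a few lines, roughly as in the chapter. A sketch; the particular true values of alpha and sigma here are the ones I recall from the chapter, so treat them as assumptions.

```r
a <- 1.4      # assumed true mean log-odds of survival across ponds
sigma <- 1.5  # assumed true standard deviation among ponds
nponds <- 60
ni <- as.integer(rep(c(5, 10, 25, 35), each = 15))  # 15 ponds at each density

a_pond <- rnorm(nponds, mean = a, sd = sigma)       # true log-odds of survival for each pond
dsim <- data.frame(pond = 1:nponds, ni = ni, true_a = a_pond)

# the cruelest coin in the world: number surviving in each pond
dsim$si <- rbinom(nponds, size = dsim$ni, prob = logistic(dsim$true_a))

# no-pooling estimate: the raw proportion surviving in each pond
dsim$p_nopool <- dsim$si / dsim$ni
```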
Okay, so what you're looking at here is this same thought experiment. On the horizontal axis is the index of the different ponds, 1 to 60. Starting on the left, we plot the tiny ponds where there are only five tadpoles to start. On the vertical axis I've plotted the absolute prediction error for the different kinds of estimates. And the filled circles here are the raw proportions. So this is the fixed effect estimate: if we only take the data from that pond and just calculate the log-odds of survival from it, that gives us the fixed effect estimate, right? That's the no pooling model. Yeah? Or sigma is infinity, effectively. If you take the limit of the model as sigma goes to infinity, you get the blue dots. You can think of this as raw empiricism, right? If you say, I'm a maximally objective scientist, I'm only going to use the data for each pond to understand that pond. To do anything else would be biased, right? Statisticians used to say this in the first part of the 20th century, right? Anything else would be malpractice. They were wrong, right? You can tell I'm setting them up here. But the maximally quote-unquote objective thing to do is to only use the data from each pond to understand that pond, and then you get the closed circles, the raw empirical estimates. Yeah? The open circles are the multilevel estimates, the partial pooling estimates. And there's a scatter, right? You can probably see that on average the open circles are below the closed circles for each pond. That means they're doing better, because big numbers are bad here; it's error on the vertical axis. And so you can probably see that from the scatter. Yeah? To help you, I've plotted these horizontal lines, which are the average error in this category, for this size of pond. The solid line corresponds to the filled circles and the dashed line to the open ones. And so the average error for the partial pooling estimates is lower. And this is what we expect, because they're regularized better. They're being misled less by the sampling variation in each pond. That's the skepticism that helps you learn, right? Or you can think about it this way: the fixed effects are overfit to each pond. Why? Because they were maximally objective, using only the data from that pond. This is sometimes called the James-Stein phenomenon, after the statistician Stein, who I think was at Berkeley or Stanford, one of those places in the Bay Area. Stein was a frequentist statistician who derived a very famous theorem, and an estimator that goes with it, the James-Stein estimator, which illustrates this phenomenon. It surprised a lot of people, and it's very famous in applied statistics for this reason. Sometimes the James-Stein estimator is explained as: you can take seemingly unrelated variables and mix them and get a better estimate of the total error of the system. And it just seems mind-boggling, but it arises from this exact phenomenon, because there's overfitting for each one. Okay, next category. Oh, sorry, I have labels, I forgot. The next categories behave similarly, but what I want you to see is that as we march across, the total error is declining, right? And why is that? Because there are more tadpoles per pond as we march across. So the sample size per pond goes up, and your risk of overfitting goes down, because you've got a lot more data per pond to make valid inferences from. And the partial pooling still behaves the same way. So this is only one simulation, and if you do this across simulations, it'll bounce around. In a typical simulation, the varying effects estimates are still better, but the difference between the two collapses as you get more and more data per pond, right? So it doesn't matter as much.
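And for completeness, here's a hedged sketch of how the partial pooling estimates and the errors behind that plot can be computed from the simulated ponds, again roughly following the chapter's code.

```r
# partial pooling: fit the varying intercepts model to the simulated ponds
m_sim <- map2stan(
    alist(
        si ~ dbinom(ni, p),
        logit(p) <- a_pond[pond],
        a_pond[pond] ~ dnorm(a, sigma),
        a ~ dnorm(0, 1),
        sigma ~ dcauchy(0, 1)
    ),
    data = dsim, iter = 1e4, warmup = 1000, chains = 1
)

# posterior-mean survival probability per pond (partial pooling estimate);
# the first 60 coefficients are the pond intercepts, in the order of the alist
est_a_pond <- as.numeric(coef(m_sim)[1:60])
dsim$p_partpool <- logistic(est_a_pond)
dsim$p_true <- logistic(dsim$true_a)

# absolute error of each estimator against the truth
nopool_error   <- abs(dsim$p_nopool   - dsim$p_true)
partpool_error <- abs(dsim$p_partpool - dsim$p_true)
c(mean(nopool_error), mean(partpool_error))  # partial pooling is typically the smaller one
```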
So partial pooling helps the most when your sample size is small, just like your overfitting risk is greatest when your sample size is small, per unit, per whatever unit you're trying to estimate some parameter for. Does this make sense? Yeah, yeah. Now remember, this is the small world, right? But this is why we do it. It's a way of learning how skeptical we should be about individual units in the data, learning from the variation in the data itself how to do it. Okay, so for the remainder of our time today, let me give you a more complicated example of a varying intercept model. The models that I just showed you have only one type of cluster, tank or pond, and we put an intercept in there for each. There were no predictor variables. It's routinely true that there are a bunch of different things you could, and probably should, cluster on in your data at the same time. And this is one of the reasons that Markov chains on the desktop are so popular now: they let you fit such models correctly. You can also fit them in other software without Markov chains, but you'd probably do it incorrectly. I won't name any packages, but you should very often be skeptical that other software can do it. But the full Bayes solution works. You just have to wait a couple of minutes for your model; go get a cup of coffee or tea, whatever you prefer. And so let me tell you about another dataset example. This is an example that I didn't lecture on before, but it's from a previous chapter. So those of you in the room in front of me here have read this, because you're reading the book, right? Yeah, some nods and other expressions. Those of you watching in the comfort of your own boudoir may not have read chapter 10 and may not know this example. So let me give you a quick introduction to the dataset, only the relevant bits, and then I'll show you what's called a cross-classified model, which is an extremely common thing. These data come from a comparative psychology experiment on the prosocial tendencies of chimpanzees, and the experiment had multiple groups, or so-called colonies, of chimpanzees in it. I'm only giving you the data from one of those. But the structure of the experiment is: there's this table, and you have to put yourself in the shoes of a chimpanzee participant. You're sitting at one end of the table, and you can reach out and grab one of two levers. These levers are attached to a spring mechanism in the middle of the table, so that when you pull a lever it expands and pushes two trays out to the opposite ends of the table. On these trays there may or may not be food, grapes probably, I forget exactly what was used, but chimpanzees really like grapes. They'll do almost anything for a grape. And so the first part of the experimental treatment is that on one side, either the left or the right, it's counterbalanced across trials, there's food in both trays. That's called the prosocial option, pictured here on the right-hand side of the table, and it means that if you, the actor, the chimpanzee with agency, pull the right-hand lever, both you and a chimpanzee at the other end will get food. If instead you pull the left one, there's only food on one of the trays, and only you get food.
So this is an experiment to see whether chimpanzees, at no cost to themselves, right, it's equally costly to pull either lever, would prefer to help the other individual while they help themselves. By the way, if you do this experiment with human kids, they nearly always pull the prosocial option. Not surprising, right? There are child psychologists in the room. Human kids are wonderful until they get older. I'm sorry, I should stop talking at some point, right? But children are wonderful. So we want to know what happens in this experiment, and I should say that experiments like this don't tell us whether chimpanzees care or not about conspecifics; they just tell us whether or not they care about conspecifics in a context in which humans do, right? It's a very subtle thing, I just want to say that, about what exactly we learn from these things. But this is a context in which human participants see this as a very moral decision; it's heavily moralized. As you'll see, chimpanzees don't care. That doesn't mean they don't care about conspecifics, I think there's a lot of evidence that they do; it's just that in this context they don't seem to make a distinction, yeah? Okay, so the major treatment is that sometimes there's no chimpanzee on the other end. You're just in the room by yourself. So the outcome is which lever you pull, but then there's a treatment interaction: does it make a difference whether there's another individual at the other end of the table or not? Because they could just be attracted to more food, right? They could just be pulling the right-hand side because there's more food there. They're like, ooh, more food. And then they're like, damn, I only got one of them, right? It's like that experiment where the chimpanzee is supposed to point to the pile of food it wants the other individual to get. Some of you in the room know this one, right? And they just keep pointing to the big pile. It's very compulsive. And they know that pile is going to go away. As soon as they point at it, they go, oh. Right? It's like the marshmallow test. Anyway, we laugh because we can sympathize, right? It's not so different. So anyway, let me try to summarize this experiment. It's a fun experiment, and there are lots of experiments in this design neighborhood, right? Two conditions, partner and alone. With a partner, you've got that attractive chimpanzee at the other end. And when you're alone, that chimpanzee's not there and it's just you and the levers. There are two options on the table, prosocial and asocial, and they're counterbalanced as to which side they're on. And so then the outcome is: does the participant, the chimp with agency, pull the left or the right lever? There are six experimental blocks, or sessions, which I think in this case are mornings on which the experiments were done, if I remember the original design correctly. We care about this as experimentalists because there are block effects, right? What if the weather was bad some morning, right? Things like block effects are the terror of doing experiments. There are time correlations in experiments, right? This is a big deal in genetics. I don't know if we've got any geneticists in the room today, but if you've got treatments in a genetic study, you want to randomize the order you feed samples into your machine across blocks.
You don't want to run one treatment all together in one block, because there are time effects on the way your sequencer works, right? So you can confound block and treatment if you don't randomize the order you feed them into the machine. And there are a bunch of published experiments which have to be ignored for this reason. So we worry about block effects. It affects every part of science. There are seven actors, or individuals; they're called actors in the data set. And so we're going to focus on blocks and actors as the things we want to cluster on. There are repeat observations in both, but they're not nested, because every chimpanzee is in every block. And every block is in every chimpanzee? I don't quite know how to say that, but you know what I mean, right? They're cross-classified. So if you focus on a block, you've got every chimpanzee behaving inside of it. If you focus on any particular actor, you've got all the blocks. That's a better way to say it. This is where English's lack of grammar is causing problems. So we want to predict the outcome as a function of condition and where the prosocial option is. There's an interaction; I'm not going to focus on that. That was the focus of chapter 10, where these data were introduced. It's just a binomial regression, a logistic regression. And the motivating question is: do chimps prefer the left lever when the partner is present and the prosocial option is on the left? And the answer's no, but we'll get to that at the end. They do have handedness preferences, though, which are very strong. Some are left-handed and some are right-handed. Most of them are right-handed, just like people. But some of them are left-handed, and boy are they left-handed. And so there are very strong actor effects. And the question is whether there are also equally strong block effects. We want to estimate both of these things. So this is a context, oh, this is what I just told you, so this is a context in which we need to cluster on multiple types of things at the same time. I want to show you how to set that model up. Yeah, so we've got pulls of levers within chimpanzees. We've also got pulls within blocks. Each chimp is in each block. And this is not a nested structure, but rather what people call cross-classified. I should say here that in some software packages, when you fit multilevel models, this distinction between nested and cross-classified is a delicate thing, and there are syntaxes for specifying it. But they're the same model. And so the model I'm going to show you works for nested too. The nesting or cross-classification just comes from how you set up the ID numbers, right? It's whether each individual is unique within each block, because they're actually different individuals, or not. That's nested. It just has to do with the data governance of your data set. And so the syntax in a package like lme4, where you specify nesting with a slash, just recodes the IDs so that they're nested. That's all it does. But the model is the same. This may be shocking, but it's true. So the model I'm going to show you is fine for what's typically called nested, too. It just comes down to how you set up the index values for the clusters. There's another kind of nested model, which I'm not going to show you today, where you actually have even more levels, right? Where you take the alpha and sigma inside the regularizing prior and give those regularizing priors. You can do this.
Yeah, so if you have, for example, districts within countries, and you don't want to assume that every country has the same variance across districts, then you want to pool your sigmas at the higher level. You can make that model, but I'm not showing you that. That's what I call a truly nested model. We fit some of those in my department sometimes. So here's the multilevel chimpanzees model where there are just varying intercepts on actor. I show you this first because it's perfectly analogous to the tadpole data, except that there are predictors now at the end. We can ignore those; that's just our interaction between condition and the prosocial option. So this was all in chapter 10. And then the action is in alpha sub actor. For each actor there's going to be a unique intercept. What I want to draw your attention to here is that now I've taken the alpha that used to be in the prior and I've smuggled it out into the linear model. You see that? That's perfectly legal. You can do that whenever you want to, and sometimes it helps the model fit better. Welcome to software. We will talk about this next week again, because it's going to come up again next week. This is called the non-centered parameterization of a multilevel model. It often fits better than the centered version. Why is it perfectly legal to do? Because normal distributions are additive. You can always translate their location by subtracting out the mean, and now the mean is zero. And so you just have to put the mean back in some place. So you put it back in the linear model. You can always do this; it's a property of Gaussian distributions. This is how we standardize Gaussian distributions: we just subtract the mean, and now the mean is zero. I see some of you thinking about this and, I don't know, not happy with it. So imagine you had a Gaussian distribution with a mean of five and a standard deviation of one. We want to standardize that thing. So we calculate the mean, it's five, and we subtract five from every value. Now it's a Gaussian distribution where the mean is zero. When you put it back on the absolute scale, you've got to add that mean back in. So now we've smuggled alpha out of the normal down there. Now the mean is a fixed zero. Alpha still exists, though; now it's in the linear model. We get the same predictions. Yeah, I know. Like I said, next week we'll have an even more fun example of this, because you can do this with matrices as well. Yeah, good times are coming next week. No, it'll be great, you'll love it. Monsieur Cholesky will make an appearance, my favorite French-Polish mathematician, and help us solve problems. So for now, I just want you to realize that this is the same varying intercepts model. It's exactly the same. And you'll often see it specified this way. In fact, this is how packages like lme4 specify the model. The varying effects will typically be offsets inside the calculations. They will be offsets from some grand mean, so they'll be zero-centered. And that's what happens here: the alpha sub actor parameters are now offsets from alpha, the grand mean in the linear model. They'll be zero-centered. So to get them on the same scale as in the tadpole model, you just have to add alpha to each, and then it's the same. Yeah. So just think of them as offsets. So now, if you're going to specify this in code, it's exactly as before; you just bracket on actor, right? And now there's a zero in the normal distribution of the prior.
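Here's a hedged sketch of the actor-only model written both ways, the centered and the non-centered parameterization, using the chimpanzees data from the rethinking package. The prior values are roughly the chapter's, so treat them as approximate; either alist can be passed to map2stan and gives the same predictions.

```r
# centered form: the grand mean a lives inside the adaptive prior
centered <- alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a_actor[actor] + (bp + bpC * condition) * prosoc_left,
    a_actor[actor] ~ dnorm(a, sigma_actor),  # prior centered on the grand mean a
    a ~ dnorm(0, 10),
    bp ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10),
    sigma_actor ~ dcauchy(0, 1)
)

# non-centered form: a is smuggled out into the linear model,
# and the actor intercepts become zero-centered offsets
noncentered <- alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a + a_actor[actor] + (bp + bpC * condition) * prosoc_left,
    a_actor[actor] ~ dnorm(0, sigma_actor),  # offsets from the grand mean
    a ~ dnorm(0, 10),
    bp ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10),
    sigma_actor ~ dcauchy(0, 1)
)
```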
But this is the same pooling model, and you get pooling exactly the same way. You with me? Before we look at any estimates, let's add another cluster. We also have block, and the strategy works exactly the same way. You just duplicate some structure here, but now you put the word block in there. So we have varying intercepts on actor: alpha sub actor is normally distributed with mean zero and some sigma sub actor now. And now we're going to have varying intercepts on block as well. We have alpha sub block, which has a normal distribution with mean zero and its own standard deviation. It may be different from the standard deviation for actors. Yeah, and they both go into the linear model, and they're both offsets from alpha. There's only one alpha here. If you put in more than one grand mean, you will have non-identifiable parameters, right? Because if you had a grand mean for the actors and a grand mean for the blocks, only their sum could be identified, not each component, because they just get added together in the linear model. We talked about this way back, last year literally. Yeah. But the model is very straightforward otherwise. And so then you just have to make sure you set up all the priors. So you can get a very long list of priors in your model, but the structure is all the same. Good. Yeah. There are really no new concepts here. It looks a lot more complicated, but it's the same concept as before. Okay, this is what the code looks like. I think the index on block is called block num, because block is a reserved word inside Stan. I think it's a function or something like that, and so you can't use it. Give it a try, maybe they've changed that, but it may not compile if you just call it block, because it's a reserved word and you'll get an error. So you have to change it to something else, and I think that's why the num is there. And just to show you: some of this text is green and some is red. Those of you in the room can't see that, but when you look at this in the comfort of your boudoir, you will see that it's true. All the green text is the actor effects and all the red text is the block effects. Yeah. So you can see that when you add varying intercepts on block, you've got to add the regularizing distribution, and then you have to add what are called the hyperpriors: a prior for sigma block, down at the bottom. You don't need another prior for a mean, because there's only one grand mean. Good. Don't let me go too fast; I'm going as slow as you need me to go. You guys are doing fine because you read the chapter already.
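In code, the cross-classified model being described looks roughly like this. A hedged sketch: the actor pieces correspond to the green text and the block pieces to the red text, block is renamed block_id here to dodge the reserved-word issue, and the prior values are approximately the chapter's.

```r
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$recipient <- NULL    # drop the column with NAs (no recipient on partner-absent trials)
d$block_id <- d$block  # rename: 'block' can't be used as a variable name in the Stan code

m_actor_block <- map2stan(
    alist(
        pulled_left ~ dbinom(1, p),
        logit(p) <- a + a_actor[actor] + a_block[block_id] +
                    (bp + bpC * condition) * prosoc_left,
        a_actor[actor] ~ dnorm(0, sigma_actor),     # varying intercepts on actor
        a_block[block_id] ~ dnorm(0, sigma_block),  # varying intercepts on block
        a ~ dnorm(0, 10),
        bp ~ dnorm(0, 10),
        bpC ~ dnorm(0, 10),
        sigma_actor ~ dcauchy(0, 1),                # hyperprior: variation among actors
        sigma_block ~ dcauchy(0, 1)                 # hyperprior: variation among blocks
    ),
    data = d, warmup = 1000, iter = 6000, chains = 4
)
precis(m_actor_block, depth = 2)  # 7 actor offsets, 6 block offsets, a, bp, bpC, 2 sigmas = 18
```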
Okay, so here's what happens. On the right I'm showing you what's called a forest plot, or dot plot, something like that, of the marginal posterior distributions of each dimension in the posterior. All the actor effects are up top. There are seven actors, so we get seven intercepts. Each of these is an offset; they're zero-centered offsets from the grand mean alpha. And you'll notice that they're scattered around. There's one individual who has a very high value above zero. What does that mean? That individual is left-handed. Larger values here mean more likely to pull the left lever; the outcome here is pulling the left lever. That individual always pulled the left lever. There was not a single trial in which actor number two did not pull the left lever. Very committed. We have to give lefty credit there. Always pulled the left lever. So we can say that actor number two was insensitive to the experimental treatments; they were very attached to that left lever. Notice, however, that the log-odds is pushed up against four. But there's no way, just based on the likelihood, to know how strong the preference is. The only thing stopping that estimate from running off to infinity is the prior. That's the only thing stopping it. If you always pull the left lever, then there's no way to say how strong your preference is; it's just really strong. If this were a maximum likelihood fit, you'd get a very bad fit. It would be degenerate, because it would want to run off to infinity out there. This is where you need priors again. In a generalized linear model you always need priors to reduce nonsense. The other individuals tend to be right-handed, so they're below zero on average. But they're responding to something: they're responding to the presence of food. And then we look at the six block effects. You can probably see right away that the offsets are much smaller on average. There's a lot less variation across blocks than there is across individuals. There's no evidence of really strong experiment effects, although in block one more individuals pulled the right lever, for whatever reason; it could just be random. And in block six more pulled the left lever; again, random. But none of the effects are very large. And then the actual effects of scientific interest come below, but we'll ignore those, because that was the lesson of chapter 10. There's no action there, basically; there's not much going on, aside from the actor effects, which are actually quite powerful. Handedness has a big effect in this experiment. And then we get sigma actor and sigma block, the standard deviations across actors and blocks in the statistical populations. Right? It's not about these actors specifically. It gives you expectations about what would happen if we put some more chimpanzees into this, if all the information we had to make guesses about future chimpanzees was these chimpanzees. Here's the standard deviation we might expect in that sample. And it expects quite a lot, actually: a standard deviation of two on the log-odds scale. That's a lot of variation. So that means you could expect anything. You should expect basically the whole spread of lever-pulling tendencies, all over the map, because handedness varies a lot and it has a huge effect in this experiment. If there were a more powerful treatment, from the perspective of the chimpanzees, it would probably swamp out the handedness, right? And then you'd get a very different effect. And there are definitely published experiments like that. Sigma block, on the other hand, is creeping up against zero. That's easier to see in the plot in the lower left, where the marginal posteriors are drawn as densities. You can see the actor distribution with a mean right around two, right? Long tail; that's a lot of variation across actors. But for blocks, there's some variation across blocks; the model doesn't think that blocks are all the same, because they're not. Of course there are block effects. But they're small, very, very small, in this experiment with these chimpanzees. Let's keep that in mind. You guys see what's going on here? So you can do this head-to-head comparison of variances for varying intercepts in simple models like this. But I want to caution you: you can't do this with slopes, because slopes get multiplied by something, so the variation they induce in the outcome depends upon the scale of the thing you're multiplying them by. It gets more complicated. But with intercepts, this is safe.
This is the only place where it's safe to do the head-to-head comparison of the amount of variation. How do you do it in the more complicated cases? Well, you push predictions out. You look at the variation across predictions. You look at the scale that matters. Remember absolute error? Yeah? Okay. Think in terms of absolute error. Respect the error. Okay. So, cross-classified chimpanzees. Right. Let's talk about model comparison for a second. Let me back up a step and say that often people come to me and say that they want to use model comparison to figure out the random effect structure of their model. They want to compare models that do or do not have random effects on different components. I'm not a fan of that. I think you can do it; it's not necessarily bad, but I think it's unnecessary. Let me show you an example of why, and then I'll say a little bit more about it. What we're going to compare here is the model we just fit, that is, the model where actor and block are both varying, to a model where only actor is varying and block is not there at all. We can count up the number of parameters. The model with actor and block has 18 parameters; the one with only actor has 11. The effective number of parameters between the two is very, very similar, much more similar than the actual parameter count. And you know why, right? We added block and we only gained effectively two more parameters, but how many did we actually add? Seven. We actually added seven. It's on the slide; I counted it in my head, but it's behind me on the slide. Seven parameters, because there are six blocks and then there's a sigma you have to add. But we only gain effectively two. Why? Because sigma gets estimated to be a very small number, and so those six block intercepts are very strongly regularized towards zero, shrunk really aggressively towards zero. And so your overfitting risk is basically non-existent. So here's the lesson: if you cluster on something that doesn't matter, the adaptive regularization stops you from overfitting. Right? I mean, for your own sanity you might just leave block out; there's nothing wrong with that. But there's no problem putting it in either. Stan can handle it. It's no problem to do that. With the adaptive regularization, the model is learning where the action is in the data. And since it learns that there isn't much action on block, you don't overfit on block; it effectively leaves it out. So you can use either of these; these models are basically equivalent. And the nice thing about the one with both effects is that you can actually show what block is like, what the variation is like, but you're not overfitting. Right? And so as a consequence, these models end up with indistinguishable WAIC scores. They're the same. They make the same predictions, almost exactly, right? But the top one gives you an estimate of the block effect, which is something you might want, and might want to report. You with me? Yeah? So when WAIC varies by one point, you should never get excited. Right? That's what I would say. Another way to think about it: if you look at the difference in WAIC between the two, it's 1.2, and the standard error of that difference is 2, right? So don't get excited. The error estimate on that difference swamps the actual difference.
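The comparison itself is just one call, assuming the two fits from the sketches above: the cross-classified m_actor_block and an actor-only fit of the non-centered alist from earlier. A hedged sketch:

```r
# actor-only model for comparison (the non-centered alist from before, no block pieces)
m_actor <- map2stan(noncentered, data = d, warmup = 1000, iter = 6000, chains = 4)

# compare by WAIC: expect nearly identical scores and similar effective parameter
# counts, because sigma_block is estimated to be tiny and the block offsets
# are shrunk hard toward zero
compare(m_actor, m_actor_block)
```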
That's the nice thing about WAIC compared to AIC: you get a standard error estimate, and that'll moderate your enthusiasm. Curb your enthusiasm. I'm supposed to say that, right? Okay. Does this make sense? So where should your varying effect structure come from? It should come from the structure of the data. If you've got repeat observations, cluster on them. That's the rule. Yeah? And usually when I get a data set and I'm going to start with it, I know the maximal model I'm going to fit, because it's the model that comes from the scientific question. There's a series of predictor variables that I know I want to put in there, but the first model I fit is the skeletal model that only has varying effects, varying intercepts: varying intercepts for every kind of repeat observation inside the data set. And what you get from that is a nonlinear analysis of variance at multiple levels. The varying intercept estimates, those sigmas, tell you where the action is in the data, at which kind of cluster. And that's incredibly useful. And then as you add predictors, those variances, those sigma parameters, get smaller, because the predictors soak up variation. Am I making sense? If I'm not making sense, there's a homework problem which is about this. You're going to take the tadpole data and you're going to start adding predictors. Predictors, meaning these predators here. You're going to add damselfly larvae to the model. And as tadpoles get gobbled up, you will see that sigma gets gobbled up too, right? Because the variation left for the varying intercepts to explain goes down as you put in predictors that matter. Am I making sense? So I've got a homework problem designed to help you understand this. But I think the varying effect structure should come from the structure of the data. It comes from the structure of the experiment or the observational study. If you've got repeat observations, you want to use pooling estimates. Yeah? And if you're doing the full Bayes estimates, you have to push pretty hard, at least with Stan, until you hit a machine limitation. With other packages, you can hit it pretty fast. But you can go silly with random effects with Stan. And then if you have a problem, you know where my office is, right? Okay, so let me close by talking about posterior predictions. I keep saying in this course that for even reasonably complicated models, anything more complicated than an ordinary least squares regression, it's very difficult to understand the model by looking at parameters. It's like, remember Lord Kelvin's tide prediction engine? Think about the tide prediction engine. Now, these models aren't all that complicated. You will get so used to them that you will think of them as simple. Absolutely true. And nevertheless, they're complicated in a new way that goes beyond Lord Kelvin's tide prediction engine. That is, there are different kinds of predictions you can make from them now, because they have multiple levels. So usually we're not interested in making predictions for the same clusters from the sample we had. Think about the tadpole data. We're not going to make predictions for the same tanks, because those were buckets in a lab or whatever they were. Those things are gone. They have been cleaned and bleached and stored in a basement someplace, or something like that. Or if they were natural ponds, you know, they dry up seasonally and then they reappear. They're never going to exist again. You're interested in the population, and the predictions you want to make come from the second level of the model.
They come from the regularizing prior, not from the top-level parameters. So the varying intercepts are disposable to you in that case. And I think this is often the case in the sciences, especially in what I call the population sciences, the social sciences and population biology: the entities in the sample we got are temporary, ephemeral things to us. We're interested in making generalizations to some new subpopulation. And so the way we think about prediction has to change. When you make those sorts of predictions, you don't get to use the varying intercepts from your model. You have to use the population. So let me show you some examples of this procedure. And there's code to make the plots I'm going to show you in the book, as always. So to outline it for you: if you were going to make predictions for the same clusters, you proceed as usual. You get to use the varying intercepts from those clusters to make predictions. And this does happen a lot. So if your cluster is a country, then you want to use the varying intercept for that country. Yeah? Imagine varying intercepts for punctuality, engaging in some rich European stereotyping. And you're going to use some European countries, I won't name any, and some of them will have very high values on the punctuality random intercepts. All right? You want to use that to make predictions. Yeah? People know the rich stereotypes about these things. Punctuality is not a virtue, by the way. I'm an anthropologist; it's just a norm. Right? If you're living someplace where it's not normative to be punctual, being punctual is bad. It's rude. Yeah? No kidding, right? It's incredibly rude to be punctual in a place where punctuality is not a norm. I got used to this. Every place I've done fieldwork was like that. I'm a, you know, citizen of the Commonwealth. Punctuality and queuing are deep in my blood. Like, I want to form lines and be on time all the time. And in every place I've done fieldwork, neither of those things is a norm. And it's incredibly rude to try to enforce those things in places where they aren't norms. So sorry, this is just to make it clear that there are stereotypes, but none of them are, you know, positive or negative necessarily. But you have a random intercept for punctuality, and if you want to make a prediction for citizens from that country, you want to use that random intercept, right? But there will be many cases, like the chimpanzee experiment, where we're not interested in making predictions about those individuals. The varying intercepts for their handedness preferences are things that help us understand the treatment effects, right? They get in the way, they create experimental noise, and we need to control for those tendencies. But the exact estimates aren't of interest to us. If we were going to make predictions for a new group of participants, new chimpanzees, we would care about the distribution of handedness tendencies, because that tells us how much noise we expect, how much interference we expect from those tendencies. Okay, so when we get to new clusters, there's this question of how we make predictions. And the easy thing to say is what I say on this slide: we should average over the distribution of varying effects. But there are different ways to do that. So let me show you some of them. The new clusters here are going to be new chimpanzees, or a population of chimpanzees.
So if it's the same clusters, the same actors, you know exactly how to do this; that's the top part of this slide. If it's new actors, we're making counterfactual predictions. Now we want to sample an imaginary chimpanzee from the Gaussian distribution that's in the posterior, right? Because there's actually a distribution of distributions inside the posterior, and we sample from that, and we get a new plausible chimpanzee with some handedness tendency, a new actor. And then we can make predictions. But there are different ways to do this. You might want to plot predictions for, say, a typical chimpanzee, in which case you're just sampling from alpha, right? There's a kind of typical individual in that distribution, and we want to plot predictions for an average chimpanzee, right? The most typical actor. And that's sensible; that could make sense. But it's a different kind of prediction than something that is, let's say, marginal of actor, where you're averaging over all the variation in chimpanzees. And it's also different from showing a sample of actors from the posterior, because each actor acts differently, so if you just average over them, sometimes you can't see what's going on. I want to show you all three of these now, so you can see how they're different. Each of them is useful in a different way. So the average actor means we sample from alpha here, our grand mean, the mean of the statistical population of chimpanzees, and then we replace the varying intercepts in the predictions. In the link function, you can actually just replace things; you can replace a parameter with any values you want. The code for how to calculate this is in the chapter. But since you know the model, right, you always know how to make predictions. That's why I bothered you guys so much in this course about actually writing the model out every time. You know the linear model, you can plug values into it, you can generate predictions for anything you please. That's all this is; it's just doing that. So we build this matrix of actor zeros here, plug it in, and then the actor offsets are all zero. And now we get predictions that just use alpha. So we have a kind of typical chimpanzee, and we get predictions from that. And that's what this looks like here when we plot it up. On the horizontal axis here, we have the different conditions in the experiment. There are four combinations of prosocial on the left or right, and partner present or absent. So there are four possible combinations. And there is a zigzag, because individuals are attracted to two food items. So they do tend to pull the prosocial option, but they don't pull it more when there's another individual present. That's what creates the Charlie Brown shirt pattern. Is that a reference anybody understands anymore? Yeah? Okay, a few people. Good. The proportion of left pulls is on the vertical axis. The gray region is showing you the uncertainty. So a typical individual, an average chimpanzee from the statistical population of chimpanzees, responds to the treatments in this fashion, with that kind of uncertainty. There's definitely zigzagging, but there's no effect of partner.
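Here's roughly how both kinds of predictions can be computed, the average actor and the marginal-of-actor versions, using link's replace argument as I recall it from the chapter. Treat this as a sketch: the d.pred data frame, the matrix sizes, and the use of the actor-only fit m_actor from the earlier sketches are all assumptions.

```r
# the four treatment combinations we want predictions for
d.pred <- data.frame(
    prosoc_left = c(0, 1, 0, 1),  # prosocial option on right, left, right, left
    condition   = c(0, 0, 1, 1),  # partner absent, absent, present, present
    actor       = rep(2, 4)       # placeholder index; the offsets get replaced below
)

# average actor: replace every actor offset with zero, so predictions use only alpha
a_actor_zeros <- matrix(0, 1000, 7)
p_avg <- link(m_actor, n = 1000, data = d.pred,
              replace = list(a_actor = a_actor_zeros))

# marginal of actor: replace the offsets with fresh draws from the estimated population
post <- extract.samples(m_actor)
a_actor_sims <- matrix(rnorm(7000, 0, post$sigma_actor), 1000, 7)
p_marg <- link(m_actor, n = 1000, data = d.pred,
               replace = list(a_actor = a_actor_sims))

apply(p_avg, 2, mean)   # mean prediction per treatment for a typical chimpanzee
apply(p_marg, 2, mean)  # mean prediction averaging over chimpanzee variation
```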
Now let's think about 'marginal of actor.' Marginal in statistics means average. That's all it means. It's not a demeaning thing; it doesn't mean worse somehow, even though the word sometimes means that. Margin just means the edge of a table, where you typically put summary statistics. That's why it's called marginal: the averages or sums typically go in the margins of old actuarial tables. It's horrible, I'm sorry. It's the burden of history; I can't do anything about it. But you're going to see this all the time. It's even a verb: you can marginalize, right? So 'marginal of actor' means averaging over the variation in actors. Now we have to extract a bunch of statistically plausible actors, simulate from them, and then plot that distribution. And this is what we get; the marginal-of-actor graph is in the middle there. And you'll notice what's happened: there's a huge amount of variation, because now we're averaging over all the handedness preferences, which creates a bunch of scatter over the whole outcome space. Right? Because remember lefty? Yeah. Lefty's in here now. We've sampled some lefties. Not very many, but we've sampled a few individuals who have really strong left-handedness tendencies, and that creates this huge spread over basically the whole outcome space. Yeah? All of these individuals, though, are still responding to the presence of two food items, and you can't see that with the way this is plotted. So now here's the third and final way. Instead of just showing the gray region of the marginalized distribution, let's sample 50 simulated actors and plot each of their mean posterior predictions as a zigzag, or not a zigzag, it doesn't have to be one, right? And show them all in the plot. That's what I've done on the far right. These are 50 simulated chimpanzees from the statistical population of chimpanzees. You can see the huge variation in handedness, right, spread across this. But you can also see that they're responding in the zigzag fashion across the whole region. That's what's nice about this plot: you can still see the effect, that they're attracted to two food items. Whichever side of the table has two food items, they like that side a little bit more. Yeah? But handedness is the big story in these data, across the whole thing. Makes sense? So this is just how it is. In multilevel models, to decide what a prediction means, you have to be a scientist again. So congratulations. This is a case where the statistician cannot tell you the right thing to do, and can only offer what I call horoscopic advice. I can show you the different possibilities, but then it comes down to what the question is, and what's a good way to visualize the implications of the model. It's just not as straightforward as it is in a simple single-level model. Okay, I have an example in the slides that I will skip over, but it's in the chapter: using varying intercepts to model what's called over-dispersion. I encourage you to read it. I'm not going to lecture on it, but I think it's a routine thing that is super useful to do. Even when you've got only one observation per cluster, sometimes you want to put varying intercepts on the clusters. There's no repeat observation, but there's still variation coming from the clusters' different tendencies, and you can estimate that variation. I know it sounds like madness, but it's not. It works. So take a look at that section. Okay, homework for next week. I want you to do the first two medium problems at the end of chapter 12. I think they're actually hard enough. They're both questions that use the reedfrog data to help you understand varying intercepts and the meaning of sigma.
And then the first hard problem, which is a human demography problem about contraceptive adoption in historical Bangladesh. This will set you up for next week, to understand chapter 13, where we're going to add varying slopes. We're going to use all the same tools, but for any parameter you like. We're going to do pooling on slopes, and it works great. And this will also lead us into what I call other wonders. We can pool on non-discrete clusters, so you can have continuous predictors like age, and we can do pooling on those things. Because there's still imbalance of sampling across age groups, and so you still want to regularize correctly. But now there's no discreteness to age, right? You just expect individuals of similar ages to be more similar. And I want to show you how to do that, because it's another really common thing to do, and we're going to use something that's actually borrowed from machine learning. It's called a Gaussian process. It's a really common machine learning technique, and I'm going to show you how to specify it in a Bayesian model, because it does good work for us. All right, with that, have a good weekend, and I'll see you guys on Wednesday, right? Wednesday morning. Okay, thanks.