So today we're going to continue with multilevel models, and I'm going to teach you about varying slopes. This is the part of the material that I have immodestly titled Adventures in Covariance, because now we have to take those populations of varying intercepts and make them multi-dimensional. And what's going to happen here is we end up with distributions of parameters, and we're going to model matrices. We're going to have posterior distributions of covariance matrices. This is a little bit of a mental hurdle, but it does come with practice. You've already been working with multivariate distributions — the posterior distributions — and we used to assert that the posterior was multivariate normal. Now we will actually have multivariate priors in our models. That's the major project today, and that'll let us get through varying slopes, which are a powerful extension of the pooling concept to regression coefficients as well as intercepts, so that the effects of treatments can vary across units in the data, as they nearly always do in reality. And then I hope to get started today — maybe not finish, but get started — on an extension of the multilevel concept to continuous categories. I'll explain that when we get there, but the mechanism that makes this possible is something called Gaussian process regression, which some of you have probably heard about. It's in vogue in certain areas of biology and the sciences in general right now, and really useful. So I want to give you an introduction to that as a way to model spatial autocorrelation and lots of other things as well. It's a natural extension away from discrete unordered categories, like tadpole ponds, to continuous categories that may have different amounts of similarity. Okay. [Question from the class about centering within clusters.] This is subtle. This is a great question. Can I punt on it? It's very subtle actually, and I think I'd need like 15 minutes and some slides to do it justice. Let me see if I can put together a slide or two on it for next week. It's subtle because, in a multilevel context, when you center only within a cluster you change the meaning of the data. It's a different model. It's not necessarily a bad one, but it changes what's going on. So I'll make a little note to myself. It's a great question. Okay. So, very quickly, the last thing to say about ordinary varying intercepts. I jumped over the classic overdispersed count models, beta-binomial and gamma-Poisson (otherwise known as negative binomial), for no good reason last week, because I said we would deal with that same problem — overdispersion — with multilevel models, which are generally much more extensible than those beta-binomial and gamma-Poisson models. So I'm going to do just three slides very quickly on this. There's no new technology here, but this is a way to get at overdispersion of counts. The classic problem is that empirical count distributions, even after you condition on all the predictors, typically have variation in excess of what the likelihood expects there to be. Remember, with count distributions the variance scales with the mean, both for the binomial and the Poisson. You don't have a separate independent variance parameter like you do in the Gaussian. The Gaussian is a weird special limiting case. In all the other distributions, the variance and the mean scale together; the same parameters control both of them.
And so you can predict the variance knowing the mean for models as simple as the Poisson and the binomial. And what we find is that even after we condition on everything, there's excess variation. This is easy to understand: there's stuff we haven't measured that's creating diversity across the cases. That's why there's overdispersion, nearly always. I have yet to see an ecological data set — I keep challenging people — that does not have overdispersion. I'm still looking for one. So if you know of one, please, I would love to include it as a homework problem or something. But I have yet to see one. It's easy to understand how you can get overdispersion: that's just heterogeneity. You can get underdispersion through actual correlated effects, autocorrelation. You can get it, but I haven't seen it in a real data set. So what varying intercepts do is let you model that heterogeneity. They give you distinct intercepts for each case. And then, since we're in a multilevel context, we simply estimate the population of those intercepts. We get a variance parameter for how much overdispersion there is. So now the sigma in the varying-intercepts distribution is the measure of the overdispersion, and you can use it to simulate the variation you'd expect, just like where we ended on Tuesday when I talked about simulating to new clusters. It's the same mechanism. All the code for this example I'm about to show you is in the book; of course, I won't show all the code here. So let's revisit the Oceanic tools data. You remember this from last week? Two weeks ago? It was two weeks ago. Time flies for me — for you guys too, right? You're suffering more than I am, but I'm suffering too, and I live in a timeless world of math. So the Oceanic tools data, remember: there are 10 cases here, 10 different Oceanic societies. We're interested in the association between the magnitude of the population size and the complexity of the toolkits. And there's definitely overdispersion in these data. If you model the tools as a pure Poisson, conditioning on population size, you find there's a lot of excess variation in the data, way beyond what the Poisson distribution expects. The Poisson distribution has a crazy small variance — the variance equals the mean — so it's actually really tight around the mean, and we nearly always see more than that. So how can we cope with this? Well, add some varying intercepts for each — as I say here, island, but it's really society — each of the 10. Each case now gets its own varying intercept. And yes, that's okay. That's no problem at all, and we can estimate it. You might think of this as having a parameter for the residual; we estimate the variance in the residuals, and that measures the overdispersion. The model ends up predicting way better because it models the overdispersion: it doesn't expect the constrained variance, and it also estimates the additional variance we should expect from new observations. The technology here that I've introduced is just in the red boxes, and it's the standard varying-intercept technology. There's nothing different about it. But the purpose here is different: it's to cope with overdispersion. Does this make some sense? The code to produce these counterfactual plots is in the book; there's nothing new about it.
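For reference, here's a minimal sketch of what that varying-intercepts model looks like in map2stan, close to the version in chapter 12; it assumes the Kline data from the rethinking package, with one intercept per society:

```r
library(rethinking)

data(Kline)
d <- Kline
d$logpop  <- log(d$population)   # log population size
d$society <- 1:10                # index for each society

m_tools <- map2stan(
    alist(
        total_tools ~ dpois(mu),
        log(mu) <- a + a_society[society] + bp*logpop,
        a_society[society] ~ dnorm(0, sigma_society),  # one varying intercept per society
        a ~ dnorm(0, 10),
        bp ~ dnorm(0, 1),
        sigma_society ~ dcauchy(0, 1)                  # this sigma measures the overdispersion
    ),
    data=d, iter=4000, chains=3 )
```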
In the upper left, real quick, this is the old model we did before, just with log population in it, with no varying intercepts. So it doesn't expect any overdispersion. What I'm showing you are the 50%, 80%, and 95% posterior intervals for the mean, around the mean, which is the black curve. And then the actual data — you can see there's a lot of dispersion around it. I'm not showing you the Poisson distribution of predictions, but it's pretty tight around the mean as well. Then on the upper right, I'm showing you counterfactual predictions, simulating out, marginalizing over the variation in the posterior estimate of sigma. Again, the code to do this is in the book; it's just like the chimpanzees example we ended on on Tuesday. It expects a lot more variation now, because there's actually quite a lot of remaining variation after the log-population effect. The varying intercepts pick that up, and they give you parameters to describe how much of that variation there is, and that's the value of it. If you had done this with a gamma-Poisson model, which would be the old technology for doing this, you'd get very similar posterior predictions. But what you wouldn't be able to do there is get the particular estimates for each society, because you don't get those parameters in those models; you just integrate over all that uncertainty. They have a very similar purpose. It's just that this framework is a lot more extensible. And then very quickly, at the bottom here, I'm doing a quick model comparison to show you that, unsurprisingly, the varying-intercept model fits a whole lot better. It gets all of the WAIC weight. And you can see in the graphed version of the comparison, even accounting for the pretty big standard errors here, it's a lot better than the other, because there's a lot of overdispersion in the counts. So again, I've never seen an ecological data set with a Poisson outcome where the overdispersed model doesn't win, just hands down. And why should you care? Well, because not accounting for overdispersion can be a masking effect. It can hide the effects of predictors that are in there. It's just like anything else: variation in the data you haven't accounted for can be obscuring the effects of predictors that are there. So accounting for the overdispersion is really useful, even though it causes some additional grief. It's worth worrying about. You don't have to do it in a multilevel framework — a negative binomial, that is a gamma-Poisson, model would work here just as well — but these models are a lot more extensible, because you can start adding varying slopes too, and all the other stuff we'll talk about today. So there's the motivation for this. Clear? There's a short section at the end of the chapter — I guess it's chapter 12 — about this. I urge you to take a look at it. No new technology, just a new intent and way of thinking about it. Okay, we're going to spend the rest of the time today talking about varying slopes. Let me give you the quick phenomenological intro to them, and then we'll take the long, slow tour of building them up and constructing them. So varying intercepts allow the means to vary by cluster. In the top graph on the right-hand side of this slide, you see varying intercepts: every unit is represented here by a different regression line.
But the slopes are all the same, because the slopes are still constant across the clusters. The varying intercept parameters give them different levels; that's all they let them do. But obviously slopes could vary by cluster too, and I'll give you lots of examples on the next slide about why that might be true — and probably nearly always is. Then you get a picture more like the bottom right of this slide, where intercepts and slopes both vary across units. In fact, remember, you learned early on in this course that intercepts and slopes are nearly always correlated. This is especially true with nonlinear outcomes, in GLMs, because you've got ceiling and floor effects. If an intercept pushes the baseline rate of something up really high towards the ceiling, then that unit can't respond as much to the treatment, and its estimate of the slope will be closer to zero. It's just a necessary consequence of the way ceiling and floor effects work. So these things become correlated through nature — through the biology or the sociology of the phenomenon and the way the behavior gets generated. And we'd like to take account of that and measure it, so we're not tricked by these things. So, conceptually: if you've got some parameter and it's assumed to be constant across a bunch of separable units in your data with exchangeable indices, you can always split that parameter up into a big vector of parameters, assign a distribution to them, and adaptively estimate that distribution — get a regularizing prior. That's the multilevel strategy. So the easy way to construct varying slopes is just to treat slopes exactly like varying intercepts: assign a Gaussian prior to them, learn its mean and standard deviation from the data, get adaptive pooling, and you'll get shrinkage for them and better estimates of the slopes than you would if you treated them as completely separate, or if you assumed they were constant across all units. Constant across all the units is maximum underfitting, right? And completely separate across all the units is maximum overfitting. We'd like to be just right — like the porridge in that fairy tale, whichever one it is, I don't remember. The adaptive approach gets us closer to that, at least, than the two extremes will. So, to put some biological and sociological meat on this: what's a real phenomenon in which slopes vary? Well, the one that I'm into now is this personalized medicine movement, which is a big thing now, and it looks like the federal government's going to put a lot of money behind research on it. This is something a lot of physicians have known about for a long time: people respond very differently to the same drug. Aspirin doesn't induce the same pain relief in everybody. Ibuprofen doesn't work for everybody. So if you have a good physician, they work with you to figure out which medications actually work for you. This is famous for a lot of the psychoactive therapeutic drugs that came in vogue in the late 80s and got big in the 90s, like Prozac. Prozac helps some people a lot and a lot of people not at all. That's one of those classic things. It became popular because it has very few side effects, so you could prescribe it at basically low risk — that's the idea. But there are lots of people, clinically, for whom it doesn't do anything at all.
And then the big, scandalous story, going back decades, is that pharmaceuticals used to be tested almost entirely on men — white men — and then shipped around the world. And white men are not the species. And even if it's not about sex, it's just body mass. The average guy is bigger and has a lot more blood volume, and we're all getting these standard-sized pills and taking the same amount. So you read your ibuprofen bottle and it says adults take two. But adult body mass — and blood volume, which is the thing that matters here, because it's about dilution — varies a lot. Maybe not by orders of magnitude in this room, but almost. I mean, I don't have a lot of blood, but some of you have twice as much as I do, right? Because you're normal people. And that's a big effect. I have a colleague, Joan Silk — some of you know Joan. Joan is also tiny; she's like five feet tall or something like that. She's a spunky primatologist. She's worked in equatorial Africa most of her career and never had malaria, because when she takes a standard anti-malarial prophylaxis, it's effectively three times the dose it would be for me, because her blood volume is lower. Meanwhile, the rest of us take the prophylaxis and we still get malaria; we just don't die. That's the benefit. But she actually fights it off, because she's getting a massive dose. So the personalized medicine movement is about measuring the variability in response to these treatments. And this applies to surgeries and tons of different things. There's a tremendous amount of variation in the therapeutic value of medical interventions. These are varying slope effects. Even in a case where, on average, a medication or a surgery does nothing for the population, it could still help a lot of people. So now the average effect is not what's of interest, but rather the variation, and what explains it. Does that make some sense? So varying slopes help you pick up things even when the average beta coefficient is zero: the treatment could still have mattered a lot for some of the units, and you want to measure that. This is famous in educational programs — that's my next example. Back-to-school programs help some people and don't work for others, and it depends on language preparedness and background and other issues. In particular, they don't help boys nearly as much as they help girls. I was talking about testosterone being a poison just before class started, and that might have something to do with it. But it's a big deal. So if the testing pool is balanced for sex, and the program mainly doesn't help boys but does help girls a lot, the average effect might be zero, and you may draw exactly the wrong inference about the value of back-to-school programs. Does that make some sense? Okay. So not every unit, in an abstract sense, is going to have the same relationship with a predictor. Variation matters, whether we're trying to make inferences about what did happen or predict what will happen if we intervene. And, of course, we're going to get better estimates of the slopes here too, because we're going to use pooling. Pooling helps here for exactly the same reasons it helps with intercepts; there's really no difference. So let me give you an extended example where we build up, by construction, a varying slopes model.
And this is going to take a little bit of time, because once you introduce varying slopes, you also want to tie them together in the population with the intercepts — because, remember, intercepts and slopes are nearly always highly correlated. One informs the other. So let me give you another contrived simulation case. Contrived cases are good because they're simple: you know all the moving parts, and you can prove to yourself that the thing works. This will be a contrived simulation in which we construct a very basic varying slopes model, so you can understand all the pieces of it. Really all we're doing is taking that adaptive prior from varying intercepts on Tuesday and making it two-dimensional, in the basic case. But it could be higher dimensional than that — it could have a hundred dimensions, if you've got a bunch of predictor variables and you want to let them all have varying slopes. There's no problem. Stan can handle it. I know, because I've made it do it, and it can take it. So let's build up the basic case. Remember back on Thursday of last week, when I introduced multilevel models, I talked about coffee, and how Clive Wearing can remember the motor memory to make a cup of coffee, but, in the rest of his brain, he's never had coffee before. So every cup of coffee is the first cup of coffee he's ever had. He's like, wow, this is great, what is this? And then he puts it down and walks away, and a minute later he's like, oh, what is this liquid that I can drink? And I asked you to imagine — or I don't remember if I did then, but now I will — imagine we have a robot that's programmed to visit cafes, order coffee, and record things about it. We'll take a simple case, like how long it waits to get a cup of coffee. So think about the Paris and Berlin cafes again. What I asserted in the basic varying-intercepts example was that pooling data across the cafes lets you learn faster about the true average wait times at all the cafes, and the order you visit them in is irrelevant: all of the information should move between them, and this is what varying intercepts models do. Now we're going to complicate this measurement. Imagine we have a robot — it's like a fancy Roomba, right, with a little coffee holder on it, if you know what a Roomba is. They're great, actually; Google it. It's going to go around the cafes, order cups of coffee, and record how long they take. It visits each cafe in the morning and the afternoon, and the wait time varies depending on the time of day. So now we've got a predictor variable: what time of day it is. And the slope parameter in this context is the difference in wait time between the afternoon and the morning. What I'm going to assume here is that, across almost all cafes in the population, the average wait time in the morning is longer, because more people are getting coffee in the morning than in the afternoon, when cafes are less crowded. But the intercept — let's say the intercept is the average wait time in the morning — and the slope, which is the difference between the afternoon and the morning, are correlated in this population. Because the busier, more popular a cafe is, the busier it is in the morning, say. So the Roomba has to wait a long time to get a cup of coffee in the morning, and then there can be a bigger drop in the waiting time in the afternoon.
But for a cafe which basically nobody ever goes to, you get your coffee nearly instantaneously in the morning and the afternoon either way. So these are the two extreme examples I put up on the slide. On the right-hand side there's Cafe A, which is popular — it's where all the hipsters go. The wait time in the morning is like seven minutes on average, and the wait times in the afternoon drop down to around four or five. There's a pretty big drop, a couple of minutes, because fewer hipsters are around in the afternoon — they're sleeping, right? There's probably a paper about that somewhere. Cafe B is unpopular; this is the cafe I go to. You basically wait two minutes all the time. On average, if there is a difference, it is faster in the afternoon — you can get your coffee basically instantaneously, because no one's in this horrible place. So it's still true that, on average, slopes are negative: you wait less time in the afternoon. But the difference is smaller now, because there's a floor effect, right? The physics of the phenomenon constrain the slope — the change — depending on what the intercept is. And this creates a negative correlation between intercepts and slopes in the population of cafes. Does that make sense? These things happen for good physical, mechanistic reasons in all kinds of systems. And the reason we'd like to model that correlation is because it enhances pooling. Now we actually get transfer of information across parameter types. We can learn how long you're going to wait at a cafe in the morning, and even before you've ordered a cup of coffee there in the afternoon, you can make a better prediction about how long you're going to wait in the afternoon, because you know something about the structure of the population. Does that make sense? So we'd like to program our little Roomba to model the correlation as well, and then pool across intercepts and slopes. You can imagine, for example, that there's some cafe it never visits in the afternoon, but a bunch of other cafes it has visited both in the morning and the afternoon, so you've got a good estimate of the correlation. Using the intercept at that cafe it only visited in the mornings, you can make a much better prediction of how long you'd wait there in the afternoon, because you exploit the knowledge of the population structure between these parameter types. Does that make some sense? We're going to pull this out in an extended fashion and do it computationally now. But this is cool: it gives you pooling across parameter types. So not only are slopes pooling information across one another to do adaptive regularization, and intercepts doing the same, but intercepts and slopes are pooling across each other too. And it's awesome. This is the value of modeling the covariance, and it arises purely from defining a multivariate prior for the population and then learning its shape. That's the varying slopes strategy. So here's one way you could look at it. Imagine an idealized statistical population of cafes with a Gaussian distribution of intercepts and a Gaussian distribution of slopes — on the logit scale — no, rather, I just centered these, so they're on some crazy scale, don't worry about it. The slopes have less variation; the two don't have the same amount of variation.
And I've constructed here an example where they have a negative correlation. This negative correlation is actually pretty strong, because slopes don't vary that much. It's like minus 0.7, something like that — I think that's what I drew it as in the population of cafes. So each point in here is a cafe, and the physics of waiting times, and the work capacity of the hipsters that work at the cafes, induce these negative correlations. So the idea is you want to learn about this population at the same time you estimate the intercepts and slopes for each cafe, because then you can do adaptive pooling across those types of parameters. You with me so far? We're just in concepts. Yeah, Katrina? [Question: if these are standardized, why do the slopes have a smaller range?] Well, they're not standardized. They're just — I don't know, I just drew a picture, that's all. We'll do the real thing in a second, and there's code in the chapter for it; we're going to do a full simulation of this, all on actual duration scales, so nothing will be standardized and it'll be a little easier. The units here don't matter. Okay. So let's talk about a population of cafes. This will be a two-dimensional Gaussian distribution now. To define a two-dimensional Gaussian distribution, we need a vector of means — the mean for each variable in it, and the means we're talking about now are the average intercept and the average slope. Then we need a variance-covariance matrix, which defines how the intercepts and slopes both vary and covary. And you've been doing this forever. Remember quadratic approximation for the posterior distribution? It's exactly the same kind of description. Now this will be a prior inside the model, and you'll get a posterior distribution of this distribution. I told you this was coming. You've got distributions of distributions. Xzibit is happy, right? We have pimped your model. He's just like, yo, I heard you like distributions, so I put distributions in your distributions. I'm glad people still remember Xzibit, because he is one of the great minds of our civilization. No, but he's very entertaining — says more about me than him, maybe. Anyway, here in the bottom left of this slide I'm trying to show you the basic anatomy of a variance-covariance matrix. You've seen this before, but it's worth reviewing for those of you who haven't done a lot of linear algebra in your life. In the two-dimensional case it's pretty easy. There are four cells in the matrix. Along the diagonal, from the upper left to the lower right, you have the variances, which are just the squared standard deviations. Sigma sub alpha is the standard deviation of intercepts — I'll show you how that works in a moment, but it's the same kind of parameter we had before; we square it, we get the variance. Sigma sub beta is the standard deviation of slopes; we square it, we get the variance of slopes. The off-diagonal is the covariance between the two, which is just the product of the two standard deviations times the correlation between them. So we've got three parameters to estimate to get this matrix: two standard deviations and a correlation parameter. And we need to define priors over this thing too, which is going to be one of the mental hurdles. But we'll get there. We don't have to do that right now; we'll get there in a minute.
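Just to make that anatomy concrete, here's the two-by-two case written out in R, with illustrative values for the standard deviations and the correlation (the numbers are assumptions, just for this sketch):

```r
# Illustrative values only: sd of intercepts, sd of slopes, and their correlation
sigma_a <- 1
sigma_b <- 0.5
rho     <- -0.7

# 2x2 variance-covariance matrix: variances on the diagonal,
# sd_a * sd_b * correlation on the off-diagonal
Sigma <- matrix( c( sigma_a^2          , sigma_a*sigma_b*rho ,
                    sigma_a*sigma_b*rho, sigma_b^2           ), ncol=2 )
Sigma
```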
Does this make sense for now? This is our task: we learn these parameters from the data, and we assign this distribution as a prior — a multivariate prior — to the intercepts and slopes, so we can exploit the multidimensional pooling that arises from it. So let's do this by construction, which means I want to teach it to you as a simulation, in a setting where we know what's happening, so you can see how it works. What I've done here is simulate 20 cafes from a population of cafes in which intercepts and slopes are negatively correlated. On the horizontal we have intercepts, in minutes that you'll wait for your coffee in the morning. On the vertical we have slopes, which is the difference between how long you wait in the afternoon and how long you wait in the morning. Each point is a cafe — there are 20 of them — and the gray ellipses are the Gaussian distribution they've been simulated from, so you can see its contours. You need code to draw those, because it might be something you want to do sometimes; it's really easy to draw those contours in R. There's a package called ellipse that makes it trivial, and I give you the code to do it. That's just the Gaussian distribution. A nice thing about two-dimensional Gaussian distributions is that they define ellipses at the different confidence levels. Conic sections, we love you. So we're going to say our coffee robot spends five days going, in the morning and in the afternoon, to each of these 20 cafes, and ends up with 200 observations (there's a condensed version of this simulation in the sketch just below). Then we ask what it has learned, and it reports back to us and dumps out a bunch of samples. This is how it goes in your homeworks too, right? You ask this very sophisticated question of the data, in the form of a model that you have carefully manicured, and then the answer is a posterior distribution, and you're like, shit. And, again, remember the interview with the golem — it's always like this. So we're going to do the interview with the coffee robot now and figure out what it's learned. That's the thing about probability theory, right? You ask your question in distributions and you get an answer in distributions. It's not really what you wanted, but it's what you've got. Okay. We're going to take this model piece by piece, but this is the robot's program, and this is how it learns. In this case, the model is exactly the data-generating mechanism — I know, because I generated the data. In the real world it never will be, but it's still a good way to learn about batches of parameters. The likelihood probably needs no explanation: the waiting time for observation i is normally distributed with some mean, which is a function of the covariates for that case, and a standard deviation. The sigma at the very top is the standard deviation of waiting times within each cafe — you have to think of it that way now. The linear model now has intercepts and slopes: it has varying intercepts, the alpha for the cafe at case i, and the slope, the beta for the cafe at case i, times A sub i, which is an indicator of whether it is afternoon or not. And here's our multivariate prior. There's a pair of parameters, alpha and beta, for each cafe — cafe one, cafe two, cafe three. They come in batches of two, because they're features of each individual cafe.
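Before we get to the model itself, here's that condensed simulation, using the same values as the chapter's example (average morning wait 3.5 minutes, average afternoon difference minus 1, correlation minus 0.7); mvrnorm from the MASS package does the multivariate draws:

```r
library(MASS)    # for mvrnorm

a <- 3.5                        # average morning wait (intercept)
b <- -1                         # average difference afternoon - morning (slope)
sigma_a <- 1                    # sd of intercepts across cafes
sigma_b <- 0.5                  # sd of slopes across cafes
rho <- -0.7                     # correlation between intercepts and slopes
Mu <- c(a, b)
cov_ab <- sigma_a*sigma_b*rho
Sigma <- matrix( c(sigma_a^2, cov_ab, cov_ab, sigma_b^2), ncol=2 )

N_cafes <- 20
set.seed(5)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)   # each cafe's intercept and slope
a_cafe <- vary_effects[,1]
b_cafe <- vary_effects[,2]

# 5 days x (morning, afternoon) visits at each cafe = 200 observations
N_visits <- 10
afternoon <- rep(0:1, N_visits*N_cafes/2)     # 0 = morning, 1 = afternoon
cafe_id <- rep(1:N_cafes, each=N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon
wait <- rnorm(N_visits*N_cafes, mu, 0.5)      # within-cafe sd of 0.5
d <- data.frame(cafe=cafe_id, afternoon=afternoon, wait=wait)
```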
And what we say is: for any particular cafe, the prior for its alpha and beta — the pair of them, the vector of length two — is multivariate normal with mean vector (alpha, beta), which just means the mean of the alphas is alpha and the mean of the betas is beta, and a variance-covariance matrix, which I'm going to call S here. You with me so far? This is our first multivariate prior, but you've seen these sorts of distributions before; you've played with them before. So what is S? There are lots of ways you can specify S, and I'm going to teach you the way I think is most useful, which lets you specify priors independently for the pieces of it. We're going to factor it apart. So let me explain that now on this slide — call it the covariance matrix shuffle. There are lots of things you can do with covariance matrices, lots of ways to factor them. Those of you who've had a linear algebra course recently know this; there are all these great things like Cholesky factors that can make the computation efficient. We're not going to go that route, although it would be useful if I had more weeks in the course and could sacrifice the time to do it. Let me give you a very robust and useful way to do this instead. You probably want to make inferences based on the sigmas and the correlation rho, so let's have those be our atomic parameters inside this thing. The covariance matrix itself is not a parameter. There are ways to do that — you can just make the matrix a parameter — but we're going to instead put priors on its little bits, and it'll get constructed from the two sigmas and the rho. Think about it this way for an M-by-M covariance matrix. In this case it's two by two, but as we add more slopes and let them vary, this could be 10 by 10 — if you have a bunch of predictors and you want all their coefficients to vary, along with an intercept, you've got a big covariance matrix. And that's no problem — well, it's not exactly no problem; you'll have to wait. You can go get a cup of coffee or two while your computer's running. But Stan can handle it, and this is what Stan was made to do: big, high-dimensional multilevel models. Specifying the matrix the way I'm going to encourage you to requires M standard deviation parameters — or you can do them as variances, it doesn't matter, one's just the square of the other — and you're going to need (M² − M)/2 correlation parameters, the little rhos. In a two-by-two matrix there's only one rho. Once you expand to three by three you need three of them, and so on. It goes up that way because they specify the pairwise combinations of the different effects. So you need a total of M(M + 1)/2 parameters. This gets big pretty fast; that's what we mean by high-dimensional — lots of parameters in the model. There are several ways to specify a prior on these things. The classic one, that packages like BUGS and JAGS use, is the conjugate prior. In this case it's something called the inverse-Wishart. The inverse-Wishart is — well, it's the inverse of the multi-dimensional generalization of the gamma distribution, which we haven't done a lot with. You just heard me say that and you're like, whoa, I'm never doing that. And you don't want to do that. The only reason to use the inverse-Wishart is because it's conjugate. That's the only reason.
It has no good features otherwise. It's a very annoying prior, and the only reason to use it is that it makes the sampling efficient in those conjugate samplers. But it has terrible features, especially near the boundary of zero: when a standard deviation gets near zero, it has a lot of pathological sampling behavior. So I'm going to discourage you from using it. One of the nice things about Stan is that it doesn't care at all about conjugacy — it doesn't help it at all if something's conjugate; it'll do a great job regardless. So we can do better than that. We can use informative priors, regularizing priors, with more sensible behavior. Instead, we're going to use the strategy at the bottom. This is a classic factoring of a covariance matrix, and if your linear algebra is rusty and you don't quite get it, it's okay — lots of people in this class are in the same boat, so let me explain how it works. Matrix multiplication is delightful and maddening, because it can be defined in any number of arbitrary ways. But basically, all I'm saying here is: we take this thing called a diagonal matrix, which has zeros everywhere except the diagonal, where we put the standard deviations. And we have a correlation matrix — capital R — which has ones all the way down the diagonal, because every variable is perfectly correlated with itself, and the rhos in the off-diagonal positions, which specify the correlations between pairs of parameters. That's a correlation matrix; that's usually easy to think about, right? And it turns out you get the covariance matrix back by doing the matrix multiplication in this order: one multiplication scales the columns, the other scales the rows, and it all composes back into the covariance matrix. It's just that matrix multiplication order matters; it's not like ordinary scalar multiplication. If you've had this recently, you're like, yeah, I got this, move on, Richard. You guys with me? This lets us specify priors separately on the correlation matrix and the standard deviations, and that's what we're going to do. So this line is just here to define how we build the covariance matrix in terms of the more atomic parameters. Then we have a bunch of fixed, non-adaptive priors for all those atomic bits. There are the alpha and beta, with weakly regularizing priors, and then our old friend the half-Cauchy, a good, very weakly regularizing prior, for the sigmas. This is all easy, right? There are more pieces than before, which is a little weird, but there's nothing new yet. The new bit is that we need a prior for a correlation matrix — like I said, we're going to have a prior over a matrix. It looks like a weird thing, but what it does is create a joint prior for all the little rhos inside the matrix. They look like p's, but they're rhos — the Greek letter rho. There's this very useful prior that I call the LKJcorr prior — rather, that's what the Stan team calls it; it doesn't have a name that's in general use. It comes from a computational statistics paper from 2009 by Lewandowski, Kurowicka, and Joe — that's why it's called LKJ, it's just the authors' initials. They call it the onion method, which sounds pretty cool, actually, so I might switch to that. Let me spend one slide explaining what this thing does.
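Before we look at the LKJ prior itself, here's that factoring written out in R with the illustrative numbers from earlier, so you can check it reproduces the covariance matrix:

```r
sigmas <- c(1, 0.5)                               # standard deviations
Rho <- matrix( c(1, -0.7, -0.7, 1), ncol=2 )      # correlation matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas)    # diag(sd) %*% R %*% diag(sd)
Sigma   # off-diagonal equals sigma_a * sigma_b * rho = -0.35
```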
So what does this thing do? Well, it's a distribution over correlation matrices, and it's controlled by one parameter, eta, which specifies how concentrated these matrices are towards what's called the identity matrix. The identity matrix has no correlation, right? Just ones down the diagonal and zeros everywhere else. When eta equals one, you get a flat prior over all correlation matrices of the size you specified — absolutely flat over every possible valid correlation matrix, which is awesome. It's a completely uninformative prior in the classical sense. So the marginal distribution of a single correlation parameter in the matrix — up at the top here in the right column, for eta equals one — is just uniform. That is, any correlation between minus one and one is equally probable in the prior; that's what flat means. And it's doing that for all of the correlations at the same time; the margin of each will look flat like this. What I encourage you to use instead is something greater than one, because when eta is greater than one, the prior is skeptical of extreme correlations. So this is regularization — a little bit of it. It's very weak regularization, but it's good and useful. You want to keep it off the boundaries: make it skeptical that any of the intercepts and slopes are perfectly correlated. You might get a sample in which they are perfectly correlated, or at least in which a perfect correlation explains the data best, but you should be skeptical of that, because that's overfitting, guaranteed. So use something a little bit bigger. Two gives you something really gentle, like this. By the time you get up to five, there are these flat regions near minus one and one. You should play around with it — in the book I give you some code to sample from this prior, so you can plot it and get an idea. Something around two, or four or five, depending on how complex the model is, works great. There may be rare cases where you expect really strong correlations; if you go below one, you get spiked probability at the extremes. I can't think of a case where that's useful, but someone might dream one up — if you do, let me know. It seems odd to me. So all you need to know about this is that it's a good, weakly regularizing prior over a correlation matrix. It solves our regularization problem in the matrix case. Specifying this in the R code looks a lot like the model. I'll walk you through it, because there are some new little — what do we call them, widgets? — little things you've got to do in the code. But it's the same spirit as before. I'm going to put the varying intercepts in red on this slide and the varying slopes in green, and you can see the correspondence in the linear model; it's the same as before. If you want varying slopes, just put the bracket with the cluster variable on the slope too. map2stan recognizes that and says, oh, you want a big matrix of these things, one for each unique cafe — I can do that. Then, when you specify the multivariate prior, you use the little c, the vector constructor in R: you put a_cafe and b_cafe in there, and then you bracket them both with cafe at the end. That says there's a pair of them — a vector of an alpha and a beta — and those pairs are unique to each cafe.
That's what it's saying. Then you have a multivariate normal, and the 2 on the end — dmvnorm2 — means my version that does the factoring. You can use the version without the 2, and then it doesn't do the factoring for you. This version does the construction where it multiplies the correlation matrix by the diagonal matrix of standard deviations to reconstruct the covariance matrix, so you can give it a vector of sigmas and a correlation matrix, which I'm calling Rho — that's what the bold capital R was. With me so far? Notice there's a little vector of means, a and b, also inside there. [Question: so sigma — sigma is a vector? How does it know that?] Yeah, it is a vector. It knows because there are two elements on the left-hand side of the tilde, so it knows there have to be two standard deviations. [And you're setting the prior for both of them to be the same?] Exactly — and I get to that on the next slide. Here we go: sigma_cafe, now in red — all of the sigmas get the same Cauchy prior. If you don't want them to have the same prior, you can put bracket-one and bracket-two in the priors in map2stan, and then you fix the prior for each element of that vector separately. It'll recognize that. I know, because it was maddening to debug — late night with a pot of coffee. [Question: can you code the intercepts and slopes as deviations from the overall means?] Good question. You can — same as before, you can take that mean vector out and put it in the linear model instead, and I'll do that later today so you can see an example of it. Then you just put a zero there, and it goes, oh, you want a vector of zeros — I can do that. Then you get offsets instead, and the linear model includes the overall means explicitly. You had a question about this before, right? [Question: is this generalizable to multivariate multiple regression — the same thing we've been doing, but predicting multiple outcome variables?] Let me translate that real quick, and you tell me if I got your sense right: in a multivariate multiple regression there are multiple outcome measures, and you may be using the same or different predictors to model them all simultaneously. Yes, absolutely, this is compatible with that. I had a paper last year where we did something like that. What's great about it is you get pooling across the outcome types, because this multivariate prior underneath has parameters in it from a population that includes all the outcomes on top. So data about one outcome filters through the joint probability distribution into the parameters and informs the other outcome. You get pooling in every direction, at greater than the speed of light. It's awesome. That paper I published with Jeremy was like that: two outcome types, and there's this big, like, eight-by-eight covariance matrix prior in the model that does it all. No? Okay, yeah, after that I wouldn't talk either. So this extends up to madness really fast — this is Sparta, if you will. And then Rho ~ dlkjcorr(2) is down there at the bottom. Putting all the pieces together, the whole formula looks roughly like the sketch below.
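Here's a sketch of the whole formula put together, close to the chapter's version; it assumes the simulated data frame d from the earlier sketch, with columns wait, afternoon, and cafe:

```r
m_cafe <- map2stan(
    alist(
        wait ~ dnorm(mu, sigma),
        mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
        c(a_cafe, b_cafe)[cafe] ~ dmvnorm2( c(a, b), sigma_cafe, Rho ),
        a ~ dnorm(0, 10),
        b ~ dnorm(0, 10),
        sigma_cafe ~ dcauchy(0, 2),   # sds of the intercepts and slopes (a vector of two)
        sigma ~ dcauchy(0, 2),        # within-cafe sd of wait times
        Rho ~ dlkjcorr(2)             # weakly regularizing prior on the correlation matrix
    ),
    data=d, iter=5000, warmup=2000, chains=2 )
```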
And, to follow up on Katrina's question: how does it know the dimensionality of Rho? Because it knows there are two parameters in the vector, so it knows it's a 2-by-2 matrix. You can also give it a start value and define its dimensionality that way, but it figures it out. map2stan has a limited artificial intelligence that tries to figure things out when it doesn't have the info; if you give it explicit start values, it just takes those, and Stan gets the dimensions from them. Okay, so you run this model and it samples. I should say — read the note about this — when you run this in Stan, it's going to give you a warning from the parser about the log absolute determinant of the Jacobian. There's nothing to worry about; it might freak you out at first, and there's a little Overthinking box to help you understand why. Wanting the log absolute determinant of the Jacobian has to do with changing the geometry of the space, and I do this inside the Stan model so that it samples more efficiently. That's all that's going on. Stan doesn't know that I'm playing by the rules, and so it does what I like Stan to do: it warns you. Stan is fussy — when it thinks something might be wrong, it always warns you. I really like that, actually. I like my software to be fussy. It's looking out for you, but you should check the notes for the explanation. Okay, working with these models is the same as always. You can extract samples and go to town. You know the model; you can push the samples back through it. link and sim work, mainly because link and sim don't even see the prior — they just use the likelihood, so they don't interact with the varying effects' prior in a direct way. The first thing I want to show you is the posterior distribution of rho compared to its prior. The prior there, the dashed line, is LKJcorr(2) — eta equals 2, so weakly regularizing. And the posterior distribution shows really strong evidence of a negative correlation between intercepts and slopes. No surprise, because that's how I simulated it, and we have a lot of data. But that correlation induces shrinkage in both dimensions now, and that's what I want to explain to you next. You guys with me so far? Okay, so this negative correlation now goes together with the alpha and beta estimates and the sigma estimates to define a posterior distribution of the distribution of intercepts and slopes. And that's what I'm showing you here on the left. First, let's attend to the ellipses. The ellipses are that posterior distribution — the posterior median of the joint regularizing prior for the intercepts and slopes. Then the pooled and unpooled estimates are shown on top. The blue points, as before in the tadpole example, are the unpooled estimates: what you get if you just took the data for each cafe alone and estimated its average morning wait time and the average difference between afternoon and morning. Those are the blue points in this graph, where intercepts are on the horizontal and slopes are on the vertical. The open circles, just like in the tadpole example, are the pooled estimates, and the lines connect the two estimates for each individual cafe. So what you can probably see is that there's some gravity induced by the pooling distribution here, the multivariate prior.
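As an aside, pulling that prior-versus-posterior picture for rho out of the fitted model only takes a few lines — roughly this, assuming the m_cafe fit from the sketch above:

```r
post <- extract.samples(m_cafe)
dens( post$Rho[,1,2], xlab="correlation" )    # posterior of the intercept-slope correlation
R_prior <- rlkjcorr(1e4, K=2, eta=2)          # samples from the LKJcorr(2) prior
dens( R_prior[,1,2], add=TRUE, lty=2 )        # prior, dashed
```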
Back to the shrinkage plot: the estimates shrink not exactly towards the center, but towards kind of a regression line that has the right correlation to go along with it. You can see there's a line in there that they're shrinking towards — if you squint hard you can see it. Those of you at home can't see where I'm waving my hand, right? But there's a line here that they're kind of shrinking towards. If you had data that came from this statistical population, the best-fit regression line would pass right through there; that's what these points are shrinking towards. And the shrinkage, again: the more extreme an estimate is relative to the center of the inferred population, the more shrinkage you get. So look down here at the bottom — lots of movement. I can zoom in; I should have done this before, to show you pooled and unpooled. And I've labeled the contours: 10, 30, 50, 80, and 99% of the distribution. The raw blue points that are further from the center here shrink more — there are longer line segments. You'll also notice that there's this negative slope to all of them: if you thought of those little line segments as little regression lines, they'd have negative slopes. That's because there's a negative correlation between the two parameter types. So, to understand why, and what this does to shrinkage, let's focus on this one up here. That cafe has a pretty much average morning wait time — empirically, the blue point is really close to the average morning wait time — but it has an unusually high slope, which means there's very little difference between its morning and afternoon wait times; it's unusually busy in the afternoon. So the model is skeptical of that slope estimate and wants to shrink it towards the mean, because it's an improbable slope — probably sampling variation, given the amount of data it has. So it shrinks it, and the slope goes down; it moves towards the center of the Gaussian distribution. But because intercepts and slopes are correlated in the overall population, the fact that the slope moves induces movement in the intercept, in the direction of the negative correlation. The inference is: if we got the slope wrong, we probably got the intercept a little bit wrong too — the intercept is probably bigger than it looked. And that's why the open circle is down and to the right from the blue dot. Does that make sense? That's the pooling across parameter types that arises. You see this in other places as well, but it's most obvious in cases where one of the parameters is average and the other is extreme. Then you can really see the logical consequence of what we know about the population: when we shrink one, we've got to shrink the other, and that joint shrinkage obeys the correlation structure between the two. Does that make some sense? I know this is very meta. But this is good stuff, and it gives you better estimates: you get information transfer not just within slopes, but across intercepts and slopes too. [Question: so if you specified separate parameters for each of these, it would shrink the slope but probably not the intercept?] That's right. The distribution of, say, the posterior-mean varying intercepts and slopes would still be correlated, but you wouldn't have a prior that told you what the correlation was, and you wouldn't get shrinkage in both dimensions.
But I guarantee you the negative correlation will still show up in your estimates; your model just won't understand it and won't be able to pool across the parameter types. That's the big advantage of doing this. And it doesn't cost you much in model complexity, actually, because it's adaptively regularizing: if there's not much correlation, it hardly adds any flexibility to the model. That's what's nice about it. You can see this is happening to different degrees in different places. For those cafes that are close to the middle, there's hardly any shrinkage going on at all. The ones that are out at the extremes can, in extreme cases, even move in the other direction. So it's not destiny — it's not determined where things land — but there's an overall tendency for the shrinkage to obey what we know about the structure of both types of parameters. This is the parameter scale, where the shrinkage actually happens. But we can translate it to the outcome scale. How? Just take the parameters and push them through the linear model. On the horizontal here I have the morning wait time, which is just the alpha — it's basically the same axis as the other graph, but now we're calling it the morning outcome — and on the vertical the afternoon wait time, which is the sum of the alpha and the beta. The Gaussian distribution also transforms, and now it looks different, but you still have shrinkage: all the estimates have come towards the center of gravity of the joint outcomes. There's a positive correlation expected on the outcome scale between morning and afternoon wait times, even though on the parameter scale they're negatively correlated. That's just a fact of how it works, and I put this here to point out that shrinkage happens on the parameter scale — that's where you define the priors. If you had defined the parameters on the outcome scale — so instead of a slope parameter you had two intercepts, one turned on in the morning and one turned on in the afternoon, only one on at a time — then you'd get shrinkage on that scale. It would tell you the same story, and you'd end up with the same effective predictions, but the correlation would look different: those parameters would be positively correlated in the population instead of negatively correlated. Does that make a little bit of sense? Notice they're all below that dashed line, which is the point where morning and afternoon wait times are equal. So afternoon wait times are shorter across the population. [Question: why doesn't the shrinkage appear to scale with the distance?] Well, it does, a little bit. Look at the bottom left. Some of those points are really close to the invisible regression line that things are being attracted to, so they don't move much. It's distance from that line that matters now, in the two-dimensional space. In a three-dimensional space there's a plane that everything's shrinking towards — hyperplanes, we're off in Hilbert space and all that. But that's the answer, unfortunately: the shrinkage is complicated. It's shrinking towards a structure. In the two-dimensional case there's a one-dimensional structure, a line, that things move towards; in a three-dimensional case there's a two-dimensional structure, and so on. So it's distance from that structure that matters. It's complicated, though — it's not exactly a line, it's more like a little ellipse or something.
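If you want to redo that translation to the outcome scale yourself, it's just pushing the posterior samples through the linear model — a rough sketch, reusing the post samples extracted in the earlier sketch:

```r
# posterior mean wait times on the outcome scale, one per cafe
wait_morning   <- apply( post$a_cafe, 2, mean )                  # morning = intercept
wait_afternoon <- apply( post$a_cafe + post$b_cafe, 2, mean )    # afternoon = intercept + slope
plot( wait_morning, wait_afternoon,
      xlab="morning wait (posterior mean)", ylab="afternoon wait (posterior mean)" )
abline( a=0, b=1, lty=2 )   # dashed line where morning and afternoon waits are equal
```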
Make sense? Yes? No? You're staring at me like, I got this. Anyway, you've got to be patient with yourselves about this. There's huge value in models of this type, because they use all the information in very savvy ways, and you get better estimates because they're regularizing. Regularization is great, and this is adaptive regularization in multiple dimensions. You can spend a lot of time getting the mechanics and the computation of it; the trick to grasp right now, conceptually and motivationally, is that what we can do within single parameter types we can do across them, because there's population structure in the effects. This is also true in the motivating examples I used about pharmaceuticals. A nasty fact about the efficacy of pharmaceuticals is that healthier people are helped more by them. The healthier someone is on average — call that their intercept — the better their response to treatment. This is the worst fact about medicine ever: the people that most need the help are the ones that get the least out of it. And I'm not talking about health insurance, although we could have that conversation; I'm just talking about the biology of it. If your system is functioning well, you can stand the surgery better, for example, or the pain reliever works better. If lots of things are malfunctioning at once, because you're old, then the treatments don't work as well. So you get a correlation in these things, and learning about one feature of an individual helps you predict other features, like how they will respond to an intervention. That's what we get out of this, and the posterior covariance matrix has that information in it. Okay, so here's the summary slide. We have a joint distribution of varying effects — intercepts and slopes — now, and this pools information across both types of parameters, because in the population they're correlated. You could have a sample, or a population, in which there is no correlation; then the model will learn that, and you won't get shrinkage across the types as a consequence. That's right — it's doing adaptive regularization. These models assume there might be some correlation, you give them a prior for that correlation that's weakly regularizing, and they try their best to learn the correlation structure. The correlation between effects gives you shrinkage across dimensions: shrinkage in one dimension induces some shrinkage in the other as well. And this gives us improved accuracy. In the extreme case, you could imagine — and I gave you this case before — that you didn't have data that would let you estimate the slope for some particular cafe at all. But you can still make predictions about it, because you get a prior for it from the population distribution that you've learned. That's invaluable for filling in missing data, which is what we're going to do next week — we'll be able to fill in missing values through this same technology. You with me? Yeah? You seem to be paying attention. You always pay attention, but I can see your eyes glowing; you're liking this. And again, I call this Adventures in Covariance because it's very meta: now we have posterior distributions of distributions. All right.
Let me give you the most elaborate regression example we've done so far in the course, one that combines everything we've ever done together. It's still a pretty simple model, but it has all the pieces. Let's do cross-classified varying slopes. We're going to come back to the chimpanzees data, because you already know that example and I don't have to reteach the data set. And we're going to put varying slopes on both of the coefficients in the main model of interest: the model that predicts pulling the left lever as a combination of some baseline tendency, plus a main effect of where the prosocial option was, whether it's on the left or the right-hand side, and then an interaction with the condition, that is, the presence of another individual at the other end of the table. So there are three parameters. Two of them are slopes, so to speak, and one is an intercept, and we're going to make them all varying. We're going to build a three-dimensional prior that contains them all. It just extrapolates up from the previous example. The basic way to think about this is that when we add more slopes, we just need a higher-dimensional covariance matrix. We've got an alpha and two kinds of betas, so we have a three-by-three matrix. That's all it is. And it really is trivially easy to get map2stan to generate this stuff for you.

So let me explain the meat of the model here. This is just the likelihood and the linear model, but I've broken up the linear model to make it cognitively easier to think about. You can write models this way; it works nearly always, I think. There might be a hiccup sometimes, but it seems to work all the time. You just read it from the bottom and substitute each linear model up, so each of the three at the bottom goes into the top one. We have fancy A sub i, the intercept model: it's an average intercept, plus an offset for actor i, plus an offset for the block. Remember we did blocks on Tuesday, right? Two kinds of varying intercepts, a cross-classified model. Then there's a fancy B sub P; there's a model for that coefficient too. It's the sum of an average effect, beta sub P, an offset for the actor, and an offset for the block, and those offsets are both varying effects. And then the final slope, the interaction term, is also a linear combination of three things: an average effect and two offsets, one for actor, one for block. Makes sense? This cognitively makes things a lot easier. I do it this way because, you know, I don't have enough working memory to keep a long, crazy line of R code in my head; it would go on forever. You don't have to do it this way, but at least use some white space and put returns in the line. Keep yourself sane. So in each of these three sub-models we've got an average effect to model, as well as actor offsets and block offsets. What I've highlighted there constitutes a three-by-three grid of parameters. And we're going to go high-dimensional now. Well, it's not all that high-dimensional; three is not high, right? Three is just one bigger than two. High is like 20 or 100, which you could do the same way if you were mad.
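Written out, the decomposition described above looks roughly like this. The notation is my reconstruction from the verbal description, not a quote of the slide: L for pulling the left lever, P for the prosocial option being on the left, and C for the partner-present condition.

$$
\begin{aligned}
L_i &\sim \mathrm{Binomial}(1,\, p_i)\\
\mathrm{logit}(p_i) &= \mathcal{A}_i + \left(\mathcal{B}_{P,i} + \mathcal{B}_{PC,i}\,C_i\right) P_i\\
\mathcal{A}_i &= \alpha + \alpha_{\mathrm{actor}[i]} + \alpha_{\mathrm{block}[i]}\\
\mathcal{B}_{P,i} &= \beta_P + \beta_{P,\mathrm{actor}[i]} + \beta_{P,\mathrm{block}[i]}\\
\mathcal{B}_{PC,i} &= \beta_{PC} + \beta_{PC,\mathrm{actor}[i]} + \beta_{PC,\mathrm{block}[i]}
\end{aligned}
$$

Each of the three sub-models has one average effect plus one actor offset and one block offset, which is the three-by-three grid of parameter types referred to above.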
So we're going to need two multivariate priors now, because there are two cluster types and each gets its own population distribution. We're not assuming that the features of actors are related to the features of experimental blocks, because they're determined separately. The actors have their recurrent features across the blocks, and the blocks have features that are due to things other than the actors. That's why we're modeling them as separate. The block effects are supposed to be things like weather effects, or whether they fed them that morning or something like that, and the actor effects are things like how handed they are and how much they care about the other chimpanzees. So we get two three-dimensional multivariate normal priors. For each actor, there's a batch of three parameters that describe that actor's behavior: an intercept offset and two slope offsets. And this is where I'm putting in the zeros, so that both of these priors are centered on zero, and we put the average effects in the linear model; they already were, back on the previous slide. Same for the blocks: for each block, there are three parameters, an intercept offset and two slope offsets. The mean is a vector of zeros, and there's a covariance matrix for each cluster type, modeled the same way as before.

Let me walk you through this code real quick. There's a lot of code here, but it's almost exactly what we just showed; it's just the natural logical extension of what went on before. Here I'm just highlighting for you the trick where you can make up symbols inside the model and then define them on lines below, and they just get substituted up, basically. These models read from bottom to top, essentially, so as long as the linear models are in that order, it should work. This gets compiled to Stan, where things run top to bottom, but the compilation handles the ordering; if you look at the generated Stan code, you'll see it comes out in the right order. The things inside each of these are the parameters of interest. So look at, in red, the actor effects, and in green, the block effects. You see there are two priors. For the red one, we've got three parameter types inside a vector, so it's a length-3 vector, bracketed on actor, one for each actor. Then there's a vector of standard deviations, sigma sub actor, and a correlation matrix, Rho sub actor. Same for the block effects, but with the word block substituted everywhere that there was actor before. And it really is just that easy. Again: more cluster types, copy, paste, go to town. You can do a lot of stuff with this. You might have to sample for a while, we'll go get a cup of coffee, but it'll work.

Is that a hand? No, just tracing it out, okay. You want me to wait a little while? No? Okay, sorry. Yeah, this is all in the notes, and you'll want to sit down and run it, get some sense of it, play around with the priors a bit, and see how strong you have to make them before you start nudging the posterior around. Do experiments like that to get a feel for these models. Any questions about this code, though? The only trick to note about it is this: you'll notice that I've called this m13.6prep, and I only set it to two iterations. Obviously that's insufficient, right? Two samples.
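The full code is in the book; what follows is only a rough sketch of what the whole workflow can look like with the rethinking package, using its dmvnorm2 and dlkjcorr distributions and its resample function. The object names (f, m13.6prep, m13.6), the prior scales, and the iteration counts are my assumptions for illustration, not quotes of the book's code.

```r
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$block_id <- d$block   # rename: 'block' is a reserved word in Stan

f <- alist(
    pulled_left ~ dbinom(1, p),
    # linear model broken into named sub-models, read from the bottom up
    logit(p) <- A + (BP + BPC*condition)*prosoc_left,
    A   <- a   + a_actor[actor]   + a_block[block_id],
    BP  <- bp  + bp_actor[actor]  + bp_block[block_id],
    BPC <- bpc + bpc_actor[actor] + bpc_block[block_id],
    # one 3-D multivariate normal prior per cluster type, centered on zero
    c(a_actor, bp_actor, bpc_actor)[actor]    ~ dmvnorm2(0, sigma_actor, Rho_actor),
    c(a_block, bp_block, bpc_block)[block_id] ~ dmvnorm2(0, sigma_block, Rho_block),
    # average effects and hyperpriors
    c(a, bp, bpc) ~ dnorm(0, 1),
    sigma_actor ~ dcauchy(0, 2),
    sigma_block ~ dcauchy(0, 2),
    Rho_actor ~ dlkjcorr(4),
    Rho_block ~ dlkjcorr(4)
)

# Compile once with a trivial number of iterations: the two samples are
# useless, but the Stan executable gets built and cached.
m13.6prep <- map2stan(f, data = d, iter = 2)

# Reuse the compiled model: three chains in parallel on three cores,
# with a long warm-up, and no recompilation.
m13.6 <- resample(m13.6prep, chains = 3, cores = 3,
                  warmup = 5000, iter = 12000)

precis(m13.6, depth = 2)   # posterior summaries, including the sigmas and Rhos
```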
Of those two samples, the first will be a warm-up sample and the second an actual sample, and it'll be terrible. It will not converge; I'm confident of that. But all that call does is compile the model down to a binary executable, which gets hidden in your temp directory. Then I call resample, which is also in the rethinking package. I described this in the Markov chain Monte Carlo chapter. It takes that compiled model, distributes it across cores, and runs three chains in parallel. It doesn't have to recompile the model, which is most of the time you spend waiting, probably. So that's what I do here. And I sample this thing to death: each of three chains gets 5,000 warm-up samples, and we take 20,000 samples total. This converges nicely, but you do have to sample a lot longer to get good convergence with this model, okay?

All right. Most of you, all of you, I think, probably have computers with lots of cores in them. One of your cores, as I always joke, is playing Spotify, another one is running Twitter bots or something. I know some of you have, like, eight cores in your computers, and those cores just sit around wanting to do something. So you can run Markov chains, no problem at all. My computers have four cores, and I always like to keep one core free for the operating system, otherwise things get pretty choppy. So three would be my advice. You can run more chains than cores, too: you could distribute eight chains across three cores, and as each finishes, it'll keep farming them out. It has a task list and keeps going. Does using more cores than chains help you at all? No, that's a good question. If I had two chains and three cores, I think it would just send out two tasks to two cores, and then maybe insult you. I don't know, I don't think it would actually insult you. The parallel library in R is what makes this easy. There is a little bit of footwork in doing it, but distributing processes across cores is so much easier than it was even five years ago. Five years ago I had to go into something called MPI, the Message Passing Interface, and write my own C code to do this stuff, and it was a nightmare. Now it's like, oh wow, where have you been all my life, parallel? I love you. It's just so much easier than it used to be. It's nice.

Okay, so cross-classified varying slopes. Let me try to summarize this model, and then we've got a little bit of time to introduce and motivate Gaussian processes, and then we'll spend some quality time with them in the first part of Tuesday. That'll keep us on track, actually. So there are 54 parameters in this model. I think I counted them right, but you might have to check me, because they're all over the place. Three average effects: there's an alpha and two betas. There are three times seven equals 21 varying effects on actor. Why seven? Because there are seven actors, and there are three parameters per actor. There are three times six equals 18 varying effects on block, because there are six experimental blocks and three parameters per block. There are six standard deviations, because there are three for each cluster type, so that's six total.
And six correlation parameters, because there are two correlation matrices of rank three, each of which implies three free correlation parameters inside it, as drawn by my completely useless diagram. Mystical diagram; it looks like, I don't know, a Rubik's cube. That's a correlation matrix: the diagonal is all ones, and I'm shading in how many free correlation parameters are implied. I think I did it right; check my work. Some of you are checking it, and I appreciate that. But anyway, fifty-something; there are a lot of parameters. WAIC says there are about 18 effective parameters. And remember, it's not trying to count the parameters; it's giving you a measure of the squishiness, the flexibility, of the model. Why so many fewer? The reason is that if you look at the sigmas that get estimated out of this, most of them are really small. The posterior means are shown in the precis output down at the bottom, and there's only one that's really big, and that's the intercepts across actors, which measure their handedness preferences. There is some variation in slopes, and it's probably important for some of the individuals. But in general, you get a lot of regularization from the fact that these standard deviations are small, so it creates a lot of shrinkage. So the offsets across actors and blocks are not very free; it's like they have a really strong regularizing prior on them, so they can't overfit the sample. And this is the great thing about adaptive priors: as long as you've got the computational power, you don't have to guess how much regularization you need. You can learn it this way.

Okay. You excited? Everybody, I can tell. You're excited to get outside, to get out of this room. Yes, so am I. Not because I want to run away from you guys, but because it's nice out. It's California. For those of you watching at home, it sucks not to be in California, that's all I can say. I've seen the weather on the other side of the country.

Anyway, very quickly, some multi-level heuristics before I shift into a conceptual introduction to Gaussian processes. This is my attempt at horoscopes: the most transferable advice I can give you across cases, knowing nothing about the data you're working with. It's often nice to begin with what I call the empty model, which in a sense has no predictor variables in it, but has varying intercepts on each of the cluster types of interest. In the chimpanzee data, that would be actor and block. What this does is give you a hierarchical analysis of variance, with pooling, and it gives you an idea of where the action is. Is the variation among actors? Is it among blocks? In this case, the answer would be that all the action is across actors, and there's really nothing going on with blocks. Then you might decide, okay, I'm going to put blocks aside now and focus on actors, and then you can start putting the extra things in. A sketch of that empty model is below. These are good horoscopes, by the way; you can read them in your free time. I can tell Katrina has already spotted the interesting part of the slide. Then add predictors in, and I like to let everything vary. So I like to say: let's assume there might be some variation in slopes, and let me try to measure it.
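Here is a minimal sketch of that empty model for the chimpanzee data, reusing the data frame d from the sketch above; the model name and the prior scales are my assumptions, chosen only to illustrate the structure.

```r
# No predictors, just varying intercepts on both cluster types:
# a hierarchical analysis of variance with pooling.
m_empty <- map2stan(
    alist(
        pulled_left ~ dbinom(1, p),
        logit(p) <- a + a_actor[actor] + a_block[block_id],
        a_actor[actor]    ~ dnorm(0, sigma_actor),
        a_block[block_id] ~ dnorm(0, sigma_block),
        a ~ dnorm(0, 10),
        sigma_actor ~ dcauchy(0, 1),
        sigma_block ~ dcauchy(0, 1)
    ),
    data = d, chains = 3)

# Compare the two sigmas to see where the action is: actors or blocks?
precis(m_empty)
```

If sigma_actor comes out large and sigma_block tiny, that tells you where to spend your modeling effort before you start adding predictors and varying slopes.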
If it turns out the sigma for a particular varying slope is small, then it's usually harmless to just drop it; it makes things computationally more efficient. Or you could leave it in, if you've got processor power to burn. Predictions will be pretty much identical if it's small. Predictions will be really different if the variance is big, because that means that even if the average slope is nearly zero, so there's no average effect of the intervention or treatment or exposure, there may still be substantial variation in how the individual units in the data have responded. And that can stimulate your thinking about which unmeasured covariates explain that variation in response. It implies there are interactions you can chase and measure in the next paper. Papers are what keep us alive in this business. But really, it helps you discover science; it helps you figure out the nature of what's going on.

Then always consider two sorts of posterior prediction in these models. There's prediction for the same units: you're asking what happened in these data, a direct inference problem. And then there's the forecasting issue: what might we expect for new units? That's a question about the population parameters, about the adaptive prior itself. And then, as always, you may know things that trump all of this. You may have a particular purpose, and you've got to trust your gut, because you know more about the system than any statistician does. It's your system. Questions? All right, thank you. Yeah, question.

Well, this is a case where I think it's a safe statement. If the sigma is close to zero, that means there's almost no variation across the varying intercepts for that cluster, and you're going to get the same posterior predictions whether you include it or not. That's why I say it's safe. You should try the experiment and see. One of the things I love about this modeling business is that it doesn't require you to be clever. If you want to know whether something makes a difference, do it both ways. That's what I do. I avoid trying to be clever, because you tend to out-clever yourself; at least I'm speaking for myself. When I try to be clever, I just out-trick myself. I'm going to emphasize this next week: what I love about the Bayesian approach to inference is that it obviates the need to be clever, because it's brute-force, relentless, ruthless application of the laws of conditional probability. That's all it is. If people ask me a question, I take the information stated, I condition on it, and I get an inference, and it's always right. I love it. And it means I don't have to be clever, and I don't want my career to depend upon being clever. So that's why I do it. I'll emphasize this next week, because this is what's going to help us construct measurement error and missing data models. You don't have to intuit the consequences of measurement error and missing data; you can let the model do all the logic. All you have to do is state what you know about them, and it does the rest itself. And that's what I'm encouraging you to do here: you don't have to intuit what's going to happen, just do the comparison. But I think this is the safest advice I can give you. Yeah. Did you have your hand up? No? Okay. Go ahead.

Well, then it gets tricky, because for that particular row you could just fix it at zero or do something like that.
But then it might get tricky computationally too, because now you're going to have some covariance matrix where you're plugging values in, and it would have to be structured a different way. In raw Stan code, that'd be easy to do. In map2stan, I think that'd be a chore. So if you want to do that sometime, come to me and ask; I can show you how to do it. You can build these covariance matrices from scratch inside the Stan code in a whole bunch of different ways. We're going to do that with Gaussian processes as well.

Anyway, other questions right now? Questions will occur to you, so bank them and bring them to me next week, especially as you start in on the homework for this. The homework for this content is already up on the website. There are two questions, I tell you exactly what to do in both of them, and they're both varying intercept, varying slope problems. I think you'll have fun with them; they're not so bad, but they will force you to learn this. Is this the last homework? Yes, good question, and I plan to talk about that at the beginning of Tuesday. But yeah, this is the last homework. Next week, on Thursday, you're getting the final exam to take home and spend a glorious week with. It's easier than the homeworks. It has to be. But I like you guys working in groups on the homeworks, right? So you can feel the burn. And these will be easy. Some of you are going to the Grand Canyon, assholes. No, as I said, I would go with you. You could probably do this, you know, on a donkey. Anyway.

All right, I've got five minutes. Let me spend five minutes on the beginning of Gaussian processes, and then I'll start on the formal definition of them on Tuesday when you come back. That'll set us up to keep us exactly on track for the content, actually. So here's a paper I really like, and I'm sure some of you have seen it before. It's a neat effect. I am Gen X. Any other Gen Xers in here? You're all millennials. Yeah, you're Gen X. Ooh, Gen X. It's a writer generation. And Reality Bites. Or is it sucks? Bites. Yeah, I can't remember. It does both. So a peculiar thing about Gen X Americans is that we're pretty conservative on average. I mean, not me, right? But there are always outliers. We're the most conservative American cohort still alive, age-adjusted, because there's also an age effect: people get more conservative the older they get. And you can see this in election years.

So what we're looking at here, on the graph on the left-hand side, is from a paper by Ghitza and Gelman on cohort and socialization effects in the American electorate. Each of these colors is a different election. What's shown is, across the birth years of individuals who voted in the election, the proportion of the vote they contributed that went Republican. This is what's often called a partisanship graph in political science, with the 50-50 line drawn here. All of the elections have a similar kind of shape. Now, there are offsets: look, for example, at the 2008 election, in which Barack Obama was elected; it was less conservative than basically any other election in living memory for most people. It was a landslide election, which immediately landslid the other direction at the midterms. But that's the American electorate, right? I'm still not happy, let's look at the other guy. Sorry, that's just how it goes. But what I want you to see is this incredible symmetry across a bunch of elections.
There are overall shifts, but the cohort effects are very persistent, and individuals of the same age have very similar partisanship. This is a fascinating phenomenon. So Gen X is around here somewhere, and we're this partisan lump right here that's very conservative. What's that about, you might start thinking? And what explains the other effects? Because the millennials on one side are much more liberal, and the baby boomers on the other side of me are more liberal as well. And then you have the Greatest Generation, as they call themselves, which is almost as conservative as my generation, but not quite. What's going on here, to be explained in part by the graph in the lower right, is that the party that holds the White House, and how popular they are when you get near voting age, has a lifetime-long effect on your partisanship. Maybe not on you in particular, but on the average person. Obviously it didn't on me. But on average.

So what Ghitza and Gelman did is these nice age-weighted regressions, where they tried to figure this out using age as a continuous category. Imagine individuals who were all around 18 years old when, say, Reagan was in the White House. This is the Gen X effect: Reagan, or the first Bush. The Republican Party was popular, despite the fact that it was selling weapons to terrorists and then funneling cocaine into L.A. and all the other stuff it was doing. It remained insanely popular. Some of you don't know that story; ask me sometime. But it was incredibly popular. So if you came of political age around that time, you identified as Republican, and that has had a lifelong lasting influence on many Americans. It's incredible. And if you came of political age right around the second term of the second Bush, around the 2008 election, it was the opposite effect: now you're, for a lifetime, way more liberal. And then there are other effects as well. Who's in the White House, and whether they're popular or not, has these lasting effects. It's not everything.

So what they figured out is which ages have these effects. And this is kind of a continuous category, because the idea is there's this exposure, this popular-culture exposure, around the time you reach voting age. But it's not that magically, at exactly the same age, everybody gets that exposure and no one else does. There's diffusion around it; there's a range of ages. The further you get from 18, the less you're paying attention, so you're not retuning your partisanship. That's the theory, anyway, though some people do. So there's a diffuse kind of effect that decays with distance. Some ages are politically more similar to others, the more similar the ages are in calendar years. So there's a situation here where every age has its own varying intercept, which is its partisanship, but they aren't discrete categories anymore, because there's covariance between the intercepts of different ages as a function of how close they are in years. This is what I mean by continuous categories.
And in data situations like this, you can extend the varying intercept logic to continuous diffusion processes, where there are unmeasured exposures, like who was in the White House at the time you turned 18 and how popular they were, that diffuse out as a function of distance from some particular age, person, location, or anything else you want to categorize the units in your data across. This is a very powerful family of models called Gaussian process regression. When you return on Tuesday, I'll show you how to do it. Thank you.
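As a preview of the mechanism (this is my summary of the standard setup, not something stated in the lecture itself): each unit still gets its own varying intercept, but instead of treating those intercepts as independent draws from a common distribution, you give them a joint multivariate normal prior whose covariance declines smoothly with the distance between units. One common form of that covariance function is

$$
K_{ij} = \eta^2 \exp\!\left(-\rho^2 D_{ij}^2\right) + \delta_{ij}\,\sigma^2
$$

where $D_{ij}$ is the distance between units $i$ and $j$ (in years of age, kilometers, or whatever the continuous category is), $\eta^2$ sets the maximum covariance, $\rho$ controls how quickly similarity decays with distance, and the $\delta_{ij}\sigma^2$ term adds extra variance on the diagonal. Units close together get nearly identical intercepts; units far apart get essentially independent ones, which is exactly the continuous analogue of the discrete cluster pooling from earlier in the lecture.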