When they come in, can we shame them? Okay, so join in a public shaming ritual. No, welcome back. We're going to finish talking about Adventures in Covariance today. I have a lot to get through, but it's all the same concepts you got last time. We took a leisurely stroll through random slopes, or varying slopes, or whatever you want to call them (partially pooled slopes, let's call them that) on Wednesday, and today I want to give you more examples of how to use them, to help you stretch into this and see how it works. And then we're going to talk about Gaussian processes, which are a natural extension of the partial pooling to much more interesting and commonplace things. Okay, so where we had finished on Wednesday, I had just reintroduced this data set you've analyzed before. Remind you, this is UC Berkeley admissions data from the 1970s; however, the topic is always fresh. And the topic is: is there gender discrimination in higher education? In this case, it's applications to PhD programs. The outcome variable A sub i here is accepted applications. It's a count, and these are binomially distributed, under maximum entropy, with maximum N sub i, which is the number of applications received in that department, and p sub i, the probability of an application being accepted. That's what I'm trying to estimate, as a function of a department-specific intercept, alpha sub department, and a department-specific slope, which is the difference in probability of acceptance between a male and a female application. It's the decrement or increment in log-odds that arises from the application being submitted by someone who identifies as male. Yeah, you with me? So m sub i is a dummy variable, a zero/one dummy variable. Now, what's different about this model from what we've done before is this adaptive prior, a bivariate normal distribution that bundles together the department-specific effects: the random effects or varying effects, the partially pooled parameters specific to each department. They come in pairs: there's an alpha and a beta for each department. They go into a vector, and these pairs are sampled from a bivariate normal distribution whose mean is the vector [alpha, beta], and then there's a covariance matrix, S, which models the covariance between the alphas and betas. It's modeling this covariance that lets us do the pooling across the intercepts and slopes. Yeah, you with me? So this is just like the coffee robot thing we did on Wednesday, but now it's real data rather than synthetic data. The model has the same structure, same idea. We end up with two scale parameters, the sigmas, which tell us the variation in intercepts and slopes across departments, and then a correlation matrix, which in this case is just a two-by-two correlation matrix, so it has exactly one correlation inside it. Yeah, because a two-by-two correlation matrix has ones on the diagonal, and on the off-diagonals it has correlations, so there's only one off-diagonal cell here to worry about. It's the correlation between alpha and beta across departments, right? There's some average correlation across departments. And I'll show you that correlation later, because we get to infer it; we'll get a posterior distribution for it. And then ye olde fixed priors at the bottom, right down through the bottom. These are chosen to be weakly regularizing, as always. Does this make sense? This is just review, right?
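In symbols, the model just described looks like this (a sketch; the fixed priors at the bottom are the weakly regularizing ones I recall from chapter 13 of the book, so check them against the text):

```latex
A_i \sim \mathrm{Binomial}(N_i, p_i)
\mathrm{logit}(p_i) = \alpha_{\mathrm{dept}[i]} + \beta_{\mathrm{dept}[i]} \, m_i
\begin{bmatrix} \alpha_{\mathrm{dept}} \\ \beta_{\mathrm{dept}} \end{bmatrix}
  \sim \mathrm{MVNormal}\!\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix} , \; \mathbf{S} \right)
\mathbf{S} = \begin{pmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{pmatrix}
  \, \mathbf{R} \,
  \begin{pmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{pmatrix}
\alpha \sim \mathrm{Normal}(0, 10) \qquad \beta \sim \mathrm{Normal}(0, 1)
\sigma_\alpha, \sigma_\beta \sim \mathrm{HalfCauchy}(0, 2) \qquad \mathbf{R} \sim \mathrm{LKJcorr}(2)
```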
You've seen this model before, but now it's real data. And this is what varying slopes models look like; they always look like this. The thing that's usually different about them is the top two lines. The top two lines change and you get different models, and then a lot of the stuff at the bottom is basically the same. So the code, if you're going to fit this with map2stan, you can probably guess what it's going to look like, and it looks like our previous examples. We're bracketing on department ID, and that instructs the code to make a vector of parameters with the same length as the number of unique departments it finds. So we get a vector of intercepts, one for each department, and a vector of slopes, also indexed by department ID. And then there's dmvnorm2, which is just a form of multivariate normal that takes a vector of sigmas and then a correlation matrix, because we want those to be separate parameters in the model, so we can specify priors for them separately. It's just what we did before on Wednesday. Run this model; it samples very smoothly. Now, what we're going to look at on the left is the posterior distribution of the correlation between the intercepts and the slopes. Since it's a correlation, the possible values on the horizontal axis are between minus one and one. A correlation of zero means, on average, no association: when you learn a department's intercept, you learn nothing about its slope. If it's very negative, then when you learn that a department's intercept is high, you should expect its slope to be low, and vice versa. In this case there's a lot of evidence that it's negative, but we don't know how negative. Could be very negative, could be mildly negative; actually, it could even be slightly positive. There's not a ton of information about the correlation among departments in these data. And I put the question to you: why? Jeopardy. Is that show still on the air? The reason is that there aren't very many departments. You've got A, B, C, D, E and F. Those are all the departments in the data set, and those are the ones you have to infer the correlation from. Think of it that way. You can estimate the alpha and beta for each of these departments very precisely, because there are a lot of applications per department; your sample size per department is high in these data. But the number of departments is low. So the sample size at the second level, the cluster level, is small, and that makes it hard to estimate the correlation. Does that make sense? This is very typical with data like this: you can have high power at one level and low power at another. That's just how it goes. But of course, since we're doing Bayesian inference, that's fine. When we have no power, we get the prior back, and we can test for that, so we know. You still get valid inferences. It's not like an asymptotic method, where you don't know what to think at all because the estimators are only valid asymptotically. These estimators are valid at all sample sizes. But you need to check whether the model has learned anything or not. Here, a lot has been learned. But still, this is not a very peaked posterior distribution, so you shouldn't bet your house on this correlation being negative. But it's a hint about something.
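For reference, here is a sketch of that fit in R, close to the chapter 13 code in the book (reconstructed from memory, so treat the details as approximate), along with pulling out the posterior for the correlation just discussed:

```r
library(rethinking)
data(UCBadmit)
d <- UCBadmit
d$male    <- ifelse(d$applicant.gender == "male", 1, 0)
d$dept_id <- coerce_index(d$dept)

m_ucb <- map2stan(
    alist(
        admit ~ dbinom(applications, p),
        logit(p) <- a_dept[dept_id] + bm_dept[dept_id]*male,
        # pairs of varying intercepts and slopes, pooled through
        # a bivariate normal adaptive prior
        c(a_dept, bm_dept)[dept_id] ~ dmvnorm2(c(a, bm), sigma_dept, Rho),
        a ~ dnorm(0, 10),
        bm ~ dnorm(0, 1),
        sigma_dept ~ dcauchy(0, 2),
        Rho ~ dlkjcorr(2)
    ),
    data = d, warmup = 1000, iter = 5000, chains = 4)

# posterior distribution of the correlation between intercepts and slopes
post <- extract.samples(m_ucb)
dens(post$Rho[, 1, 2])
```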
Remember, I think this is the point to say something I've said many times in this course: the business of science is not to have confident predictions. The business of science is to figure out what the data say, conditional on a model. We're not supposed to be soothsayers. When you can't reach a conclusion, it's our job to report that. So this is what you report: the uncertainty. On the right you can see the scatterplot of the intercepts and slopes for each department, done with the shrinkage contour map again. Let me walk you through it. On the horizontal is the intercept for each department. That's the average log-odds that an application from a female-identifying applicant is accepted. Sorry, that was very awkward grammar, but was it parsable? Did that make sense? Yeah. So 0 means 50%, right? Fifty-fifty. Greater than 0 means the majority of applications are accepted. So you see most departments are rejecting most applications, from female candidates and male candidates alike; it turns out there's not much difference, but we'll get to that in a second. There are two departments, A and B, which accept a majority of applications. Yeah? A and B, I think, are physics and engineering, it turns out. They've been anonymized in the data set, but I looked it up; you can find these things out in the original publications. And then all the others are different. And F is, like, awful, right? It's almost minus 3, which is something a little less than a 10% acceptance rate. It's really low. I think that's social psychology or communication, one of the two; it's very tough to get admitted. And then on the vertical is the slope. This is the difference between male and female applications. If it's above zero, that means an advantage to male applications; if it's below zero, a disadvantage. You'll see there are two departments which are a little bit above zero, C and E. See that? Sorry, zero is not labeled; I should have redrawn that axis so that zero was labeled. I apologize. But it's between 0.2 and minus 0.2, right? R has the worst defaults. Zero is special, right? You could redraw the axis. You could do it. I should have just done something about it. And then the other four departments have negative slopes, three of them only slightly. And there's just one very special department, department A, which has a radical disadvantage to male applications. So this is where there's variance in the slopes, and that's what you're seeing here: it isn't the same everywhere. The effect of an application coming from a male is not the same in all places. And we shouldn't even say "the effect"; it's a description of what happens to the chances. We don't know about the causal effects here, right? Because this could be because different kinds of guys apply to department A than to the others. It's not necessarily evidence that there's discrimination against dudes in this department. It's California, so it's "dudes," technically; it's the legal term. Actually, I remember from when I taught in California: "dude" is gender-neutral for lots of young people in California. It's just dude. Girlfriends refer to one another as "dude" all the time. And it's fun. So, for the international audience watching these videos, keep that in mind: dude is gender-neutral in California. It's fun. Does this make sense, the way this graph is drawn, what's going on, and how the varying slopes are being picked up here? Yeah, so this is a nice thing about varying slopes: you can consider these possibly different associations between predictor and outcome in different clusters.
And this gives you hints about how to investigate what's really going on. It gives you causal leverage, inferential leverage, to figure out what's happening. So you could look at these estimates in this forest plot. Maybe it's a little easier to read the slopes for the six departments at the top here. You'll see most of them are near zero and overlap a good amount. Not much evidence that being male makes much difference, and if it does make a difference, it's not a big one. But for one of these departments, department A, there's pretty good evidence. This could be spurious; it's just one year of admissions data. If you were really going to do a study of this, you should get a decade or more and see what's going on. But in this one year, that's pretty good evidence that there's something going on in that particular department, and it gives you something to look at. There's also lots of evidence of variation in the average rates of admission, and there's another causal story there, something to investigate. The identities of the departments tell you something about what's happening. Basically, the departments down here with the low values get way more applications, so they reject more. That's what's happening. And the story, which we heard before of course, is that, at least in the 1970s, women predominantly applied to those departments, and so their aggregate chances of admission to graduate school were lower as a consequence. That's the Simpson's paradox we talked about before. As I said in a previous lecture, there's some evidence that this happens in a milder form with grant applications for senior people. It's way harder to get funding in social psychology than it is in physics, and that disproportionately affects the genders because of how the genders are distributed across the sciences. So there's a lot of interesting sociology to plumb out of this. But back to statistics, the boring part. Let's move on to an example that adds another layer to this. It's the same type of model, but let me show you the most complicated model in the course. I think this is the most complicated model in the course, and it's still pretty simple, because it just takes tools you've already seen and puts them together in one model in a logical way. We're going back to the chimpanzee experimental data from before. I won't re-explain it, apologies; if you're just joining us now, you'll have to go back and read chapter 10. Remind you, though: we've got two kinds of clusters. We've got actors, which are individual chimpanzees, and we've got experimental blocks. We should cluster on both of those things, because there are repeat observations on both, and that generates what's called a cross-classified model. Now we want to do that with our slopes as well. We want to allow the effects of treatment to vary by individual and by block. Imagine that on some particular day the treatment didn't work at all because of the time it was done; the chimpanzees had all just eaten, something like that. It could happen. I've seen experiments go awry for exactly reasons like this. If all the chimpanzees had just had a bucket of carrots, and then you try to get them to do something for a grape, it might not work. That's a block effect. But then there can be individual effects. Some chimpanzees just don't care: just give me the grape already. So there could be varying slopes on actors, and there could be varying slopes on blocks.
The way we consider that possibility is, of course, by putting varying slopes on everything. Absolutely everything. Let me show you how to do that; it's just a logical extension of what you've already done. Here's the model we're going to build. Let me explain it piece by piece. It looks complicated, but this is just the linear model part. I've broken it up so that you can see how these things get constructed. The top is the likelihood. The outcome is pulling the left-hand lever. We have a logit link; we're going to model everything on log-odds, because it's a logistic regression. First, there's an intercept for pull i. The intercept is a function of a bunch of stuff: a grand mean alpha, and then offsets for actor and block. This is just the varying intercepts model we did last week. Yes? Okay, there's nodding. Thank you. Now the new part is that we're also going to have varying slopes. There's the effect of the partner being present; that's what this beta sub P is. Was there another chimpanzee at the other end of the table who could receive the other piece of food, or not? This is an important thing. Is this right? I forget which is coded as which. Anyway, there's a slope: there's a mean effect, beta sub P, and then there are offsets for actor and block. So this is how the random slope gets composed, and that sum is then the thing that appears in the linear model as a regular old slope. It's like the slope gets adjusted by offsets specific to each actor and block. And then finally, the interaction effect. Sorry, I skipped saying what P is: P is pro-social, whether the two pieces of food are on the left, and C is the condition, whether there's a partner or not. So the interaction is the bonus when the pro-social option is on the left and there's a partner present. Again, same idea: there's an average effect, beta sub P... sorry, that should be a C. I should proofread this slide. It should be beta sub PC. And then offsets for actor and block. Typos aside, does this make sense? The code is correct, I promise you. So in all, we have three average effects, three actor offsets, and three block offsets. But it's easier to think about if you split the linear model apart like this, I think. Yeah? Does this help? Instead of putting all these symbols on one line, which would be a nightmare. Try it at home. So the tricky part is: what do we do with all of these offsets? Well, now there are three effects per cluster type, so we're going to have two separate adaptive priors, just like before when we did this with only intercepts. Remember we had two normal distributions that were adaptive priors? We've still got that, but now they're three-dimensional, because there are three effects. So it's just like the model we just did, but where we had two-by-two matrices, we now have three-by-three matrices. There's no limit; you can just keep making them bigger. No problem. Yeah, until your computer runs out of memory, or you get bored and retire, something like that. So the way you could read the first one on the slide is: for each actor there's an alpha, a beta sub P, and a beta sub PC. Those are distributed according to a three-dimensional normal distribution. They're zero-centered, because we're going to model them as offsets from the grand means, with a covariance matrix for actors. That S sub actor is a three-by-three matrix. And then the same thing for blocks: for each block there's an alpha offset, a beta sub P offset, and a beta sub PC offset. Likewise, this is a different multivariate normal distribution, with zero means and a different covariance matrix.
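Written out, the pieces just described assemble like this (a sketch in the book's notation, where L_i is pulling the left lever, P_i the pro-social-left dummy, and C_i the partner condition):

```latex
L_i \sim \mathrm{Binomial}(1, p_i)
\mathrm{logit}(p_i) = A_i + (B_{P,i} + B_{PC,i} \, C_i) \, P_i
A_i = \alpha + \alpha_{\mathrm{actor}[i]} + \alpha_{\mathrm{block}[i]}
B_{P,i} = \beta_P + \beta_{P,\mathrm{actor}[i]} + \beta_{P,\mathrm{block}[i]}
B_{PC,i} = \beta_{PC} + \beta_{PC,\mathrm{actor}[i]} + \beta_{PC,\mathrm{block}[i]}
[\alpha_{\mathrm{actor}}, \beta_{P,\mathrm{actor}}, \beta_{PC,\mathrm{actor}}] \sim \mathrm{MVNormal}(\mathbf{0}, \mathbf{S}_{\mathrm{actor}})
[\alpha_{\mathrm{block}}, \beta_{P,\mathrm{block}}, \beta_{PC,\mathrm{block}}] \sim \mathrm{MVNormal}(\mathbf{0}, \mathbf{S}_{\mathrm{block}})
```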
The code looks basically the same; there's just more of it. More delightful code to see here. At the top, I've labeled things. You can put comments in these models if you want. I do, with my own models; they're replete with comments. In fact, my models start out as a list of comments, and then I fill in code between the comments. I think I made this joke in my department a couple weeks ago: I hope the science archaeologists of the year 3000 are happy that I'm putting all these comments in my code, so they can figure out what the purpose of this weird ritual activity is. But my code is just full of comments, and I encourage you to do the same. When the model gets compiled, the comments just get stripped out, so you can put in as many as you want. So the top is the likelihood, then the linear models. In map2stan you can do this: you can have as many linear models as you want. Just break them apart. So BPC is a symbol that appears in the top one; no problem, it just gets substituted. It's all good. Yeah? As many linear models as you like. And then the adaptive priors down here. It looks messy, but it's the same structure as before. You're creating a vector with the c function (it means concatenate, I think, in R), you put in the three effects, the intercept and two slopes, and you cluster them by actor. And then the dmvnorm2 line looks the same. Yeah? And how does R know it's three-dimensional now? Because there are three effects on the front. So it knows it's a three-dimensional normal distribution and sets it up correctly. If you coded this raw yourself in Stan, you'd have to specify that it's three-by-three, but map2stan just figures it out. It just guesses. It has an extremely primitive artificial intelligence that says: oh, you have three effects on the front of this, you must want a three-dimensional normal distribution. That's how smart it is. You could beat it at chess. Right? And then the fixed priors. Does this make sense? Is it familiar? This code strategy works no matter how many effects you want to put in there. The major limit is not the code; it's interpretability. I'll say that again: the major limit is not what you can do with the code, it's what you can understand. That's the barrier. But in principle, with a sampler like Stan that uses Hamiltonian Monte Carlo, you can be very unrestrained with how many things you let be random. That's not a problem. Okay. Alright. So what do we get? In total, this model has 54 parameters. Let me count them for you. It's like the Twelve Days of Christmas; see if you know that song. Three French hens: we have three average effects. Then three times seven equals 21 varying effects on actor. Why three times seven? There are seven actors and three effects per actor, so 21 actor-specific parameters. Yeah? And three times six is 18 varying effects on block: same idea, three effects per block and six blocks.
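Here's a sketch of the full thing in code, close to what's in chapter 13 of the book (from memory, so check the details against the text):

```r
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$block_id <- d$block   # rename to avoid clashing with reserved words

m_chimp <- map2stan(
    alist(
        # likelihood
        pulled_left ~ dbinom(1, p),
        # linear models, split into pieces
        logit(p) <- A + (BP + BPC*condition)*prosoc_left,
        A   <- a   + a_actor[actor]   + a_block[block_id],
        BP  <- bp  + bp_actor[actor]  + bp_block[block_id],
        BPC <- bpc + bpc_actor[actor] + bpc_block[block_id],
        # adaptive priors: one 3-D multivariate normal per cluster type
        c(a_actor, bp_actor, bpc_actor)[actor] ~
            dmvnorm2(0, sigma_actor, Rho_actor),
        c(a_block, bp_block, bpc_block)[block_id] ~
            dmvnorm2(0, sigma_block, Rho_block),
        # fixed priors
        c(a, bp, bpc) ~ dnorm(0, 1),
        sigma_actor ~ dcauchy(0, 2),
        sigma_block ~ dcauchy(0, 2),
        c(Rho_actor, Rho_block) ~ dlkjcorr(4)
    ),
    data = d, iter = 5000, warmup = 1000, chains = 3, cores = 3)
```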
So you can see right away that in a big data set, you ramp up parameters very fast. Imagine you had a substantially larger sample, like 200 individuals, with three effects per individual. This is an ordinary sort of situation in the sciences; it's not even a very big data set. Then suddenly you've got 600 effects on individuals. No problem. Stan will do it. A couple of months ago I finished up models for a project where we have 27,000 parameters. There are, I forget, something like 8,000 individuals, and then a bunch of random effects, and it samples fine. You're sampling from a 27,000-dimensional surface, and your computer can do it. It can handle it. So this model is no sweat. Where were we? Plus six standard deviations and six free correlation parameters, and that's the 54. Now, if you're classically trained, or you've had a standard statistical education, which is fine (there's nothing wrong with that, I had one myself), you can get the wrong intuitions about these models sometimes. In my experience, one of the mistaken intuitions is: oh my god, there are 54 parameters, you're going to overfit. No. The parameter count doesn't tell you your overfitting risk. It doesn't, because we have priors, and in a hierarchical model the priors are learned from the data. It's adaptive. And so if there's not much variation in a cluster, the parameters specific to that cluster type have almost no effect on the fit at all. Yeah? That makes sense because of the regularization: if a sigma ends up being estimated to be small, the estimates it governs will be shrunk towards zero very aggressively, and the model will be very skeptical that there's anything there. That's what's going to happen with block here. As you'll see in a second, there's no action at the block level. You saw that before, because it was a good experiment, right? Well, at least in that respect; I won't judge the rest of it right now. I actually think it is a good experiment, but here I'm only talking about the block effects. So you can get a hint of this: if you fit this model and look at WAIC, the effective number of parameters according to WAIC is 18. That's a lot less than 54, and the question is why. The reason is that when you look at the sigma estimates, all but one of them are very small. Sigma actor 1 (now there are indices; the 1 means the intercepts on actor, the first effect in that vector of effects specific to actor) is substantial. We saw that before, because there are tremendous handedness preferences among the individuals in this study, and that explains most of the variation in the experiment. Not all of it, because there are experimental effects in here, but most of it. But none of the other sigmas are much above zero at all. Their 89% intervals run from zero to some small number. Basically, you get the prior back with these things. Well, there's some learning; they shrink towards zero, so you do learn more than the prior, and the prior is very dispersed. But there's not much action in the block effects or in the slope variation: not much evidence that individuals differ in their response to the treatments, or that block has much effect at all. And this is why the effective number of parameters is so much smaller than the actual parameter count. Because those sigmas are small, the adaptive prior has a very small standard deviation, and that aggressively regularizes the parameter estimates towards zero. Yeah?
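You can check this yourself with something like the following (a sketch, assuming the m_chimp fit sketched earlier):

```r
# effective number of parameters (pWAIC) versus the 54 actual parameters
WAIC(m_chimp)

# the adaptive standard deviations: only the first actor sigma,
# the intercept variation, is substantial
precis(m_chimp, depth = 2, pars = c("sigma_actor", "sigma_block"))
```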
This just builds on pooling. Remember pooling? Shrinkage. There's a lot of shrinkage going on in this model. You could fit a model where you drop all these effects, and you'd get basically the same estimates, because that's what adaptive regularization does for you. It gives you good estimates, even for complex models. That's the idea; it's the whole principle behind it. Okay, I've got time, so let me talk about something I'd put in as optional. There is a very common technical issue with fitting multilevel models, especially ones with complicated covariance structures: there are different ways to specify exactly the same model in code. The mathematical definition of the model is identical, because math don't care, but the way you write the code makes a difference for how efficiently it samples. By that I mean how long you need to sit there and wait, how many coffees you need to drink downstairs before your computer finishes, how much your office overheats in the summer. In my department we have servers in the basement, with cooling towers, so that we don't overheat our offices, but I understand this may not be a luxury everyone has. You will kill the plants in your office with the exhaust heat from your computer. This is the moment when some grad students discover that their computer has a fan: that noise your computer is making for the first time ever, because it has a fan it has never turned on before. How long that has to go on makes a difference. So let me introduce you to this. This is a thing called the non-centered form of a hierarchical model, and it's a huge topic in the literature; it's actually a really common thing. In map2stan there's a convenience: to flip between the so-called centered form, which is what you used before, and the non-centered form, you just use a different multivariate normal distribution. dmvnormNC is the so-called non-centered version. There's a section about this in chapter 13 of my book. What does non-centered mean? It means you factor all the parameters out of the prior. Where do they go? They go into the linear model. Let me try to explain what's going on. First, why would you want to do this? Then I'll explain what it is. You want to do this because you get radically different efficiencies. This is that same chimpanzee model with all the random effects: three on actor, three on block, two three-by-three covariance matrices. In the centered version, the first version I showed you, the sigmas and the covariance matrices sit down in the adaptive prior; when you see them in the model, they're in the adaptive prior. We're looking at the effective number of samples for the different parameters in the model from that version, running it for, I forget how long, way too long, way more than you need, and it's around 2,000 on average. Then I use the non-centered version, and we get much higher effective sample sizes.
Which means, in principle, that if you use the non-centered version of this model, you don't have to run it very long. The effective number of samples is what you're interested in, not the actual number of samples. Why does this arise? Well, that's a complicated topic that I won't actually attempt to explain today, because it has to do with geometry and things like that. But the simple version (and you can think of this, I apologize, as magic) is that Hamiltonian Monte Carlo works best when every dimension in the posterior is approximately a unit normal. If you could get everything to be unit normal, a big Gaussian hypersphere with standard deviation 1 in every direction, then sampling would be super smooth. That's ideal, the best thing. And it turns out that by refactoring your priors, you can approach this goal. You can never get there, but you can approach it. It's like enlightenment: you never arrive, you're just trying to get a little closer in this lifetime. So here's the trick that makes this legal. I'm going to show it to you only for normals, but it works for lots of distributions: you can factor things out of them. So think about a normal distribution: y is distributed normally with mean mu and standard deviation sigma. In lots of stats textbooks (not mine, but lots of them), this will be written another way: y equals mu plus z times sigma, where z is distributed as Normal(0, 1). Why z? Because it's a z-score, a standardized Gaussian residual. And these are mathematically equivalent statements. Exactly the same. Notice that in the second version we've taken the mean and put it in the linear model. So now y is some average plus an offset, which is a scale parameter sigma, the standard deviation, times the z-score specific to that y value. They carry exactly the same information. Yeah? You with me? I know this is not why you got into science. It's not why I got into science either, but these are the science mines now. You get out your mathematical pickaxe and you factor, and there's gold in those hills, and we will learn things from it. Yes? This metaphor is not working, is it? Alright, I'll try another later. In our case, varying intercepts are the simplest way to think about this. In the centered form, you'd write the model as: mu is some actor-specific intercept plus stuff, where stuff is your predictors and your slopes and other stuff. Technical term. And then you have the kind of prior I showed you before. This is called a centered prior because the alpha and the sigma are in the prior: it centers the prior, it locates it, it puts the scale on it, in that line. Then the non-centered form, which is perfectly equivalent mathematically, looks like this: now we take that alpha and put it in the linear model (you've already done this part, right?), and then we have a z sub actor, a z-score for each actor, clustered on actor, times sigma, plus the same stuff. The stuff is the same. And now your prior is that z sub actor, clustered on actor, is Normal(0, 1). All of the parameters have come out of the prior, but these are the same model. Isn't math great? Don't you love it? Yes? No? Pages 408 and 409 of my book go on about this, because it's actually routinely useful in fitting these models.
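Here's the factoring trick as a tiny simulation, just to convince yourself the two forms describe the same distribution (nothing model-specific here):

```r
mu <- 2; sigma <- 1.5
y_centered    <- rnorm(1e5, mu, sigma)   # y ~ Normal(mu, sigma)
z             <- rnorm(1e5, 0, 1)        # z ~ Normal(0, 1)
y_noncentered <- mu + z*sigma            # y = mu + z*sigma

# same mean and standard deviation, up to simulation error
c(mean(y_centered), mean(y_noncentered))
c(sd(y_centered), sd(y_noncentered))
```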
And each of these forms is useful in different data sets because of, well, again, think of it as magic: geometry. Because of the details of the sample sizes at the different levels, sometimes the centered version is better and sometimes the non-centered form is better. For the UC Berkeley data set, for example, the centered form samples better. For the chimpanzee data, the non-centered form samples better. So you need to know both. map2stan makes it easy to flip between them, but it's nice to know these exist, because it will save you time and give you more reliable samples. Okay. So, very quickly then, the last thing to say about this. You can probably see how you can factor the mean and the standard deviation out of a normal, because that's standardization. You do this with variables all the time, right? If you had a column of values and I told you to standardize it, what would you do? Well, I hope what you would do is calculate the mean and subtract it from every value. Now the mean of the column is zero. Yeah? That's what happened when we took alpha out of the prior. That's all we've done: now the mean of the prior is zero. No information has been lost; we just smuggled that mean into the linear model. And then you should calculate the standard deviation of those mean-centered values and divide each value by it. Now the standard deviation of the column is one, but we've got to put the scale back on to make predictions. So where does the scale go? It goes in the linear model, right? It's got to go back up there in the linear model. The linear model gives you a z-score, and then you multiply that z-score by the standard deviation to get back on the original scale. You divided to standardize, so you must multiply at the end to get back on the original scale. Yeah? It's just the inverse operation. Does this make some sense? Hopefully my hand gestures are helping you understand. My family makes endless fun of me for the way I talk, because I do this all the time. All right. But it's probably not easy to see how you can take this strategy and extend it to a covariance matrix. How do you factor a covariance matrix out and put it in the linear model? That's a good question. I'm glad you asked. The answer is that we appeal to André-Louis Cholesky. Cholesky was an officer in the French army in World War I. Unfortunately, he died in battle just before the end of the war. He was a very accomplished mathematician who specialized in geodesy, the study of how to measure the Earth, and he was working on artillery. So a lot of good, useful math comes from monstrous things. He invented a method for factoring matrices that is fast and incredibly useful. It's called the Cholesky decomposition. By the way, it's a Polish name, but he was French; his parents were Polish, I think, or his grandparents, and they immigrated to France. That's how you get these things. The history is actually really quite interesting. A colleague of his posthumously published his method, and it has since become extremely useful in applied mathematics. It's used to fit linear regressions and all kinds of things now, because it's an extremely useful and efficient way to factor matrices.
And it turns out, there's this wonderful fact that if you take a covariance matrix and compute what's called its Cholesky factor, you can then multiply a vector of uncorrelated z-scores by that Cholesky factor, and they'll have any correlation structure you like. It's a way of compartmentalizing, summarizing, the covariance structure in something you can use to generate correlated sequences of random variables. That's how you use it. Again, on page 409 I give you some citations for this. Since I'm doing okay on time, here's the simplest example, in two dimensions, of how this works; runnable code for it is in the sketch below. The Cholesky factor gives you the terms you want to multiply the different uncorrelated variables by to give them a correlation structure. So in the two-dimensional case, this is what it looks like. At the top of this R code, I'm going to generate 10 to the fourth, 10,000, values from a bivariate normal distribution where the first dimension has standard deviation two, the second dimension has standard deviation one-half, and they're correlated at 0.6. That's what I want. But I'm going to start by generating totally uncorrelated z-scores. So z1 is rnorm: 10,000 random normals with mean zero and standard deviation one. z2 is another 10,000 random normals with mean zero and standard deviation one. You can probably see why they're uncorrelated: they're just different lines of code. Here's the Cholesky magic now. If you calculate the Cholesky factor for the covariance matrix implied by those two sigmas and that correlation, it tells you the operations you need to do to construct two new sequences of random variables, a1 and a2, that have those standard deviations and exactly that correlation, on average. This is what you do. You take the first one and multiply it by its sigma. That probably makes sense to you, right? It puts it on the right scale. And then the second one is the fun one, with all this stuff in the middle that the Cholesky factor gives you: it's rho times the first z-score, plus the square root of one minus rho squared times z2, all scaled by the second sigma. Yeah, it's obvious, right? Anyway, this is what comes out of the Cholesky factor. It just comes out of it. In higher dimensions you get more of these fun square roots and squares, but you get the answer, and then you literally just plug it in; there's a simple matrix algebra formula for inducing the correlation. And that's how it works. So give thanks tonight to André-Louis Cholesky; he has done us a great service. There are other ways to do this; this is just the best one. Yeah? I have a blog post about this, if you want to read even more about the wonders of algebra and the different forms of multilevel models. It seems like an exotic thing, but it's not. This is a routine thing that helps you stay sane when you're fitting multilevel models. The members of my department know this, right? It's standard stuff. I know John in particular has had lots of fun with Cholesky factors. Yes.
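Here's that two-dimensional example as runnable R code, reconstructed from the description above (a sketch):

```r
n <- 1e4
sigmas <- c(2, 0.5)   # the two standard deviations
rho <- 0.6            # the target correlation

# totally uncorrelated z-scores
z1 <- rnorm(n)
z2 <- rnorm(n)

# what the Cholesky factor tells you to do in two dimensions
a1 <- z1 * sigmas[1]
a2 <- (rho*z1 + sqrt(1 - rho^2)*z2) * sigmas[2]
cor(a1, a2)   # approximately 0.6

# the same thing in matrix form, letting R compute the factor
S <- diag(sigmas) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sigmas)
L <- t(chol(S))            # chol() returns upper triangular, so transpose
A <- L %*% rbind(z1, z2)   # correlated sequences from uncorrelated ones
cor(A[1, ], A[2, ])        # approximately 0.6 again
```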
So now, before going on to Gaussian processes, I think it'd be useful to say a little about part of the applied statistician's challenge, which is that people come to us wanting very specific advice, but we usually don't know enough about their science to give it. I always call this the horoscopic advice problem: people come and say, so I've got these data, what should I do? And I don't know. Why do I call it horoscopic? Because I don't have enough information to be fully useful. It's like doing horoscopes, right? If all I knew was your birth date, I'm supposed to forecast your life? That's how it often feels when I do statistics consulting: people tell me a few things, like, I've got a couple of variables, what should I do? And it's like, well, I don't know. Consider investing? Right? It's tough. But I do think there is some general advice that's useful, so I've tried to put the advice that's specific to multilevel models down on a slide, and a lot of it is embedded in the way I've given you homework problems, in fact, to try to teach you these lessons. So, quickly going over it: it's useful to begin with the empty model, the one that has none of the predictors but does have the clustering structure that comes from where there are repeat observations in your data. There are two reasons to do this. First, it helps you stay sane. With complicated models, you don't want to build everything at once, because you'll make a mistake, and then it will be very hard to figure out exactly which piece is wrong. Those of you who use R, or any other statistical software, know the error messages are not always so helpful: all you know is that something has gone wrong, but not what. So you want to build these models in pieces, in little bits; get the clustering structure in first and make sure that works. The second reason is that it tells you where the action is: which cluster has the most variation. That's useful as well. And then, as with that homework problem I gave you with the tadpoles, you watch the variance shrink, and you learn things from that. This is where model comparison teaches you things about the data set. Next, just for the sake of getting things to fit well, you want to standardize all your predictors before you run the model (a one-line version of this in code follows below). You don't want predictors with values like 400 on them. Standardize them so they all have mean zero and standard deviation one. It makes the computer work better, because then it doesn't have to figure out the scale of all these things separately; if you put them all on the same scale, the model adaptation is more efficient. That's all. You could get it to fit with the inconvenient scales, but do yourself a favor. It's also typically easier to specify reasonable priors once you've standardized things, because you can think about a one-standard-deviation change in a predictor. It's easier to put the prior on that scale. Use regularizing priors. This makes the model fit better, and it prevents wild overfitting as well. Then you start adding in predictors, and you vary their slopes where they're in clusters, and off you go. If you get varying effects that have tiny sigmas, you can drop them; it's harmless. You could also leave them in, and you'll get the same predictions, as I tried to show you before. Then look back at the end of chapter 12, where I talked about the sorts of posterior predictions of interest. Are your inferences about the specific clusters that were in these data, or about some generalization to other clusters? You need to be careful, when you inspect the predictions, about which of these you're concerned with. And at the bottom, as always with horoscopes: you may know things that make all of this advice silly. You're the scientist. You have to drive. I'm the back-seat statistician. Don't let the statistician drive. Okay.
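As a code footnote to the standardization advice above (assuming a data frame d with some predictor column x):

```r
d$x_s <- (d$x - mean(d$x)) / sd(d$x)   # mean 0, standard deviation 1
# or equivalently
d$x_s <- as.numeric(scale(d$x))
```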
In the last 15 minutes, I want to tell you about Gaussian processes and give you a really simple example of how they work. The situation comes up all the time, and the data on the slide illustrate it: there are predictor variables which have big effects in the data, but their relationship to the outcome is complicated and continuous. So you get repeat observations, but not necessarily at exactly the same values. I'm going to call these continuous categories, which is an oxymoron, like jumbo shrimp or military intelligence. I come from a military family; I'm allowed to make that joke. Both of my parents worked in military intelligence, actually. Jumbo shrimp, military intelligence, continuous categories. So, the point: age is the classic one. Nobody has exactly the same age, right? But your age is more similar to some people's than to others'. And it isn't that age necessarily has a causal effect on something like, here, political preferences. It's a proxy. It's something that helps us make useful predictions, because people who were born around the same time have experienced some of the same events. For my generation of Americans, the fall of the Berlin Wall was a memorable time, and also for people who live around here, if you're of a certain age. And that has lasting effects on political ideology for people's whole lifetimes. But you can't cluster people by age, because the ages are unique in your data. You could round them, but then you'd have a hundred different unique ages or something like that, and you don't want to uniquely cluster on all of those. Well, you could; it would probably work fine. But then you're throwing away the information that ages closer to one another are more similar in prior expectation. Someone who was born a year before me or a year after me should, in prior expectation, before we see their voting record, be more similar to me in political ideology than someone born 20 years after me. Yeah? That's the prior expectation, and we want to put that information into the prior somehow. Gaussian processes are a technique for that. Not the only technique, but a very useful machine-learning technique for extending the partial pooling strategy to variables of this type. And again, you want to do it because it helps you regularize. You'll have massive imbalance across ages and all kinds of other things, and you want to regularize appropriately in dealing with that imbalance. It's the same sort of issue. So you can do better using this. What these methods do is estimate functions that relate age to the outcome, and they do it with partial pooling. They're super cool. Love these things. So here's an example from Andrew Gelman and his colleagues, where they look at the example I gave: my generation, the so-called Reagan generation, Americans born around 1970, a special, special group. We're a lot more conservative on average than other cohorts, age-adjusted. Now, we all get more conservative as we age. At least all Americans do; I don't know if it's true in general, but Americans get more conservative as they age. I don't know what it is. Calcification or something like that. But it's an effect. Actually, I think there's a global tendency towards this. It's probably the accumulation of stuff you don't want to be taxed. I don't know. I should stop talking before someone sends me angry mail.
Regardless. This is what we're looking at on the left. You look at birth year, around 1970, and these different colored curves are different elections, American presidential elections. And you're looking at offsets in the proportion of the Republican vote in those years, by birth year. This is raw, empirical voting data that you're seeing. And you'll see that even though there are big shocks in different elections that move the curves up and down, when you line things up by birth year there are persistent cohort effects of birth year. Yeah. And it's not the year you're born in that matters, and this is what the rest of the paper is about: it's what's going on when you're approaching voting age. Who's in office? The thing they find, if I remember the paper right, is that it's which party controls the presidency and how popular they are; the interaction of those two things has a big effect. So there seems to be a sensitive window for the acquisition of political ideology. For the Reagan generation, there was this big boom towards Republicanism, because, you know, Reagan, the Wall, stuff like that. And then for the generations after that, there was a big hit to Republicanism. So for the Obama cohort, for example, the forecast from this is that the people who were nearing voting age, or had just reached voting age, during the first Obama election will be strongly oriented towards the Democratic Party for the rest of their lives. That's the prediction, and we'll be able to test it as time goes on. Yeah. But this is the idea of the causal story. I think this is fascinating. Maybe you find political science boring, but I think this is really interesting. I'm sure if we looked in Germany, we could find really fascinating effects like this. Of course, here you've got way more parties, so it gets harder, but I bet you could find effects. Anyway, this is something a Gaussian process can really help us with. So this is the summary slide of what I just explained. Age is one example, as I said; income is another. Individuals with similar incomes may have similar incentives in lots of ways, but income is not a discrete category, it's a continuous measure. What's a function that would let us relate income to some behavioral outcome? Or your location, where you live. Again, we're not using location as causal itself; it's a proxy for a bunch of unmeasured stuff. But this is always what clusters are. It isn't that the ID number of a chimpanzee is causal; it's that it's a proxy for the personality of that individual, which we haven't measured directly. Yeah. Phylogenetic distance, network distance, many other things have this nature. They're continuous measures, and entities close to one another should be more similar in prior expectation. So the clustering shouldn't be discrete in the prior. You could do it that way, and it wouldn't wreck things; the inferences would be okay. But the argument is that we can do better when we exploit the prior expectation that entities closer together are more similar. Okay. So a Gaussian process is the way to do that. Let's return to a previous data example, as always, to understand how this works. Remember the Oceanic tools data set? I introduced Poisson regression with it. We're going to predict the historical complexity of the toolkit in different Oceanic societies, as a function of population size and contact between the islands. And one of the potential confounds in data like this is space.
Some islands are closer to others, and tools move, because people carry them. So some small societies might have more tools because they live close to big societies, and in fact, I think that's probably true. So the question is: when we account for the spatial structure (this is sometimes called spatial autocorrelation), is there still an effect of population size? Let's see how to put that in. Again, you can't cluster discretely on location, because every island is at a unique place; your clusters would each have exactly one member. So Gaussian processes are natural for this. There are other ways to do it, but this is a really nice way that handles the regularization. What we need to make this work is a distance matrix, and that's what I'm showing you here. I have built this into the package so that you don't have to compute it yourself. But how did I make these data? By going to a website that gave me the distances between islands. That's exactly how I got it. Now, these are as-the-crow-flies distances, and Polynesian voyagers were not crows, right? So these are not necessarily the most relevant distances. We could do better. What would you want? Travel distances. And travel distances would depend upon trade winds and other things. We could figure this out; if I were actually a scholar of Oceania, I would be able to. But I'm not, so I just used a website that gave me crow-flies distances. Great circles on the globe, right? Not straight lines, because those would take you through the Earth. They're great-circle distances, at least, so it's not the worst I could do. It's just the second worst. You've got pairs of societies in here. The diagonal is all zeros, because you're distance zero from yourself; you can quote me on that. And each pair of islands has its particular distance. These are distances in thousands of kilometers. You with me? With ages, it's the same thing: you compute the distance in age between any two people. You can do it that way. Okay. So what does the model look like? It looks like a varying intercepts model. We've got a Poisson outcome up top. Poisson models conventionally use a log link, and that's what I'll use here. So the log of the average number of tools for a society comes from a linear model where there's some average number of tools, and then there's an island offset. I'm going to call this gamma now, but it's just like a random intercept; it enters the linear model exactly like a random intercept. We have one for each island. And then there's the effect of log population on the end. It's a standard fixed regression effect. Yeah, you with me? So all the action in the Gaussian process comes from how you specify gamma, and this is called the Gaussian process prior. It's still a multivariate normal: the gammas are in a single multivariate normal distribution, and the dimension of this distribution is the number of unique entities in your data set, which in this case is the number of islands. Which is how many? Let me go back: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Right? So this is a 10-dimensional multivariate normal distribution. The means are all zeros, because we're treating these as offsets; the average is taken care of in the linear model. And then there's a covariance matrix K, which is a 10-by-10 covariance matrix. This is what we're modeling.
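Assembled, the model described here looks like this (a sketch in the book's notation; the covariance formula in the last line is the one I walk through next):

```latex
T_i \sim \mathrm{Poisson}(\lambda_i)
\log \lambda_i = \alpha + \gamma_{\mathrm{society}[i]} + \beta_P \log P_i
\gamma \sim \mathrm{MVNormal}(\mathbf{0}, \mathbf{K})
K_{ij} = \eta^2 \exp\!\left(-\rho^2 D^2_{ij}\right) + \delta_{ij} \, \sigma^2
```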
We're modeling this covariance matrix now. And this covariance matrix, hang onto your hats, is specified by a very small number of parameters. So even though it's 10 by 10, and I just conditioned you to the idea that, oh my god, there are going to be so many parameters in that thing: no. We're going to calculate it from a formula, and that's what the Gaussian process does. You find the function that gives you the covariance matrix that explains the data well, given the regularization. So each entry i, j, where i and j are islands: the covariance between those two islands is modeled as some maximum covariance, eta squared, times e to the minus rho squared d squared i, j, where d squared i, j is the squared distance between the two islands. This is a Gaussian function; I'll plot it on the next slide, so hang on. This slide is just to show you the math. And rho squared is a parameter that determines how quickly the covariance declines as the distance increases. If it's a big number, then the covariance dies off quickly with distance, and proximity has very little effect at all. Put it that way. Yeah? Dominique, is that a question? Or just a thought? You'll find when you're teachers that every minor gesture from a student is something you attend to. It's like a pathology. And then this last term: if you've got repeat observations for the same entities, you want to let those vary, and that's what the sigma squared is. It's the residual variation within each unique island. In this case, we only have one measurement per island, so sigma squared doesn't matter, but in general it gives you that residual error. Okay. So, oh, I'm sorry: I always have these slides where I put all the words on them, and then I forget I have them. This is what I just said. All right, let's plot it. All the action comes from that e to the minus rho squared d squared. That's a Gaussian. Remember, a Gaussian distribution is just an exponentiated parabola. I know, it's like: well, thanks, that helps a lot, right? If your brain works like mine, that's perfectly clear; what else could you ask for? But no. All it means is that e to the minus of some squared measure of the data gives you a bell curve. That's where bell curves come from, and that's the Gaussian distribution. That's what it is. All the other junk in the Gaussian probability density is just normalization; it makes sure the thing sums to one. That's all it does. The shape comes from exponentiating this negative squared thing. So, to show you what that means for how we're modeling: on the horizontal axis here, we've got the distance between any two given islands. And what we're going to do is find a function, represented by this solid curve here, where the correlation between those islands falls off as the distance increases. What we're modeling is the rate at which it falls off. Yeah? And that's what the term in the brackets does. Eta squared just translates this correlation into a covariance; it gives you the maximum covariance, so to speak, between two islands. The maximum a correlation could be is obviously one, but the typical correlation for islands that are really close to one another may not be one, so this curve can be shrunk down. Here I'm just showing you the term in the rectangle up there. If you take the square off the d, then you get the dashed line instead, which uses linear distance rather than squared distance.
And you can do that too. That's fine; it's just a different assumption. But the squared version tends to be much easier to fit, and its consequence is that there's very little decline in covariance at very small distances, with the decline accelerating as you move away. As opposed to the linear version, where the rate of decline is maximal at the origin, which is probably not reasonable. That's my hunch about it. You with me? To really understand this, you're going to have to sit down with your console and do the calculation. This is the full model; a sketch of the code is below. I encourage you to go home, sit down with a cup of coffee, and run it. It samples in no time at all. Don't be afraid; it runs really nicely. And the model code doesn't look particularly more complicated than anything you've done before, either. Why? Because I wrote a function, GPL2, which does the hard part for you. You're welcome. The model output looks like a varying intercepts model, and you process it like a varying intercepts model. The posterior distribution comes out just like any typical model's. And so now, for each island, we've got an offset, but those intercepts have been regularized given this covariance function, which has been inferred. What does that covariance function look like? Well, we can draw covariance functions from the posterior distribution, and that's what I'm showing you on this slide. Again, distance in thousands of kilometers is on the horizontal, and the expected covariance between islands at that distance is on the vertical. Each of the thin lines is a sample from the posterior distribution, and the solid one is the posterior mean. So there's some effect of distance in these data, almost certainly. It's not huge, but there's something going on here. That makes sense. History matters; these are humans, after all. So, apologies, I'm taking a couple of extra minutes here, but let me get through this. You can then use the distance matrix to plot the correlations on the map; I show you all the code to do this in the book. Let me finish with a slide where I show you that, because you'll want to visualize these things. So now we're looking at things spatially: longitude on the horizontal, latitude on the vertical. The latitude and longitude are in the data, in the package; I give them to you so you can make this plot right away. So this is where the islands are, and I've drawn edges to connect the nodes, where the shading represents the correlation that comes from the Gaussian process prior. You can see the effect of distance from this. You see there's this little triangle here, with lots of stuff being traded around historically. So what happens to some of these islands is that they get more tools because they're near other islands that had more tools. The opposite can also happen: you can get fewer tools because you're near islands that were small. But mainly there are positive effects of being near big places. Another way to visualize this is on the prediction scale, with the relationship to log population on the horizontal. There's still a big effect of that; the effect of log population is still very strong after we've accounted for the spatial autocorrelation. And then it's the same edges, but now drawn in this different outcome space. And you can see that some of these islands are being tugged below expectation, because they're small and near one another. Some of the others, Fiji I believe, are probably being pulled up by Tonga, which is very large, and so on. So the correlations aren't very strong, but you can plot them with these edges and get some sense of what's going on in the data.
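And here, for that cup-of-coffee session, is a sketch of the full model code, close to the chapter 13 version in the book (from memory, so check it against the text):

```r
library(rethinking)
data(Kline2)              # the societies, with latitude and longitude
data(islandsDistMatrix)   # pairwise distances in thousands of km
d <- Kline2
d$society <- 1:10
d$logpop <- log(d$population)   # (re)compute log population, to be safe

m_gp <- map2stan(
    alist(
        total_tools ~ dpois(lambda),
        log(lambda) <- a + g[society] + bp*logpop,
        # Gaussian process prior over the island offsets;
        # GPL2 builds the covariance matrix from the distance matrix
        g[society] ~ GPL2(Dmat, etasq, rhosq, 0.01),
        a ~ dnorm(0, 10),
        bp ~ dnorm(0, 1),
        etasq ~ dcauchy(0, 1),
        rhosq ~ dcauchy(0, 1)
    ),
    data = list(
        total_tools = d$total_tools,
        logpop      = d$logpop,
        society     = d$society,
        Dmat        = islandsDistMatrix),
    warmup = 2000, iter = 1e4, chains = 4)

precis(m_gp, depth = 2)
```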
Again, all the horoscopic caveats apply. With other data sets, there'd be some other sensible way to do this; this may not be the right way for yours. Okay. So, apologies for going over a couple of minutes here. I hope it was worth your time. Homework. You love homework, yeah? We're going to take next week off, so you can take your time with this homework, and then we'll finish up the week after, with Wednesday and Friday lessons on chapter 14 of the book. The homework I'd like you to do to practice chapter 13 is M3, M4 and H1; I think these will be the most valuable ones for you. And then, I should say, the week after next, when you come back, we'll talk about missing data and measurement error. And enlightenment. Enjoy your week off, and thanks for your indulgence.