We're going to continue with multilevel models this week. The goal is to take the principles from last week — pooling and the shrinkage benefits it brings — and apply them to harder problems, to more complex structures. The principle of pooling can be applied not only to multiple intercepts but to other kinds of parameters as well, and you want to do it. It's not only that you can; you should. Why should you? Because you get better estimates if you do. So I want to give you a number of examples today of this strategy. It's usually called varying or random slopes, slopes being the parameters that measure the, quote unquote, "effects." At the end of the week, on Friday, we'll also talk about extending these principles to continuous categories instead of discrete categories.

Okay, some terminological background to begin. Last week we talked about models that contain varying intercepts. These are cases where the intercepts for different clusters in the data are given unique parameters, unique values in the posterior distribution, and we estimate them through partial pooling. We do that by giving them a parent prior that has parameters in it, and we estimate those parameters from the data at the same time as we estimate the intercepts. That creates pooling and shrinkage, which handles imbalanced sampling and all the other things that would normally create trouble in classical fixed-effects models. And to remind you, the assumption of a common adaptive prior arises from a quite simple and reasonable assumption: the order in which you visit the clusters should not matter, and while the clusters differ from one another, they are also alike in some way. So if you think the cafes you've already visited should give you some reasonable expectation about the next cafe you visit, then you should estimate things about those cafes with a pooling estimator. To do otherwise would be, well, bad. You'd be throwing away information, and you'd overfit each cafe.

So that's varying intercepts. Varying intercepts give you the stylized diagram in the top right of this slide: with some predictor variable on the horizontal axis and an outcome variable on the vertical, you get a different regression line for each cluster, but the lines are all parallel because they share the same slope. The only thing varying across clusters is the intercept. However, if the intercepts vary, chances are the slopes vary too, because slopes and intercepts tend to covary. That's the thing about lines: if you change the intercept, the slope usually has to change as well, otherwise the line won't pass through the data anymore. Does it have to be true? You could invent some bizarre scenario in which the slopes are all the same across clusters while the intercepts differ, but it isn't obviously the most reasonable default. So if the average can vary across clusters, then the effect of some treatment or intervention can also vary across clusters, and we want to try to understand that. That leads us to varying slopes: cases where the change in the outcome for a given change in the predictor also varies across clusters. That's the stylized diagram in the lower right. Now there's a regression line for each cluster in the data — a cafe, a school, any number of things.
And as we change the variable on the horizontal axis, we get different relationships for different clusters. In the school context, there's a famous school database that's often used to teach these concepts — I won't use it — but what you should have in mind is the economic background of students inside different schools. The horizontal axis would be household income, and the vertical axis is test scores, and each line is the regression line for one school. In some schools, household income has a big effect on a student's test scores. In other schools it doesn't — because, well, the teaching is good, something like that. That prejudges the conclusion, but that's the idea, and it's why educators are interested in these things. They want those lines to be flat, and when the lines instead go up, as they usually do, it's reason for panic. You may remember that around 2002 the PISA results came out in Germany, and there was national panic over the fact that income had a huge relationship to school performance. I remember it well; I was at the institute that ran the PISA study at the time, so all the alarms were going off. It was a full-blown panic. These are the kinds of regressions you run to look at those things.

Okay, but the general principle I want you to take away is that there's nothing special about intercepts or slopes. If you have clusters in the data — repeat observations on some units — you can take any parameter that applies to those units and split it into a big vector of parameters, one for each unit, and then estimate them with pooling. That's the general strategy, and it comes up in all kinds of models. So this is the recipe: one, split the parameter into a vector of parameters, one for each cluster; two, define some population distribution for all of them, a common prior, and estimate that distribution from the data.

The practical reasons to do this are that you get better estimates, and that you can ask questions about distributions in the population of interest rather than just average effects. This is hot right now in medicine — the idea that average effects can be misleading. Well, I say hot right now; it's been hot for decades, but I guess now people are trying to make money off it. It's not silly, though. It's a fact that lots of common pain relievers don't affect everybody the same way. Take something as common as aspirin: some people get a lot more benefit from aspirin than others do. So it makes sense to pay attention to which common pain relievers actually work best for you, and you can test people on their pain-relieving effects. In that sense, the average effect in the population isn't the only thing of importance. It can easily be true for a medication that the average person gets no therapeutic benefit from it, and nevertheless thousands of people do benefit from it. This is not unusual, especially once you start considering drug interactions — the fact that you're also taking some other medication. Sorry, I'm getting old, so I think about this all the time now, and with aging parents you think about medication a lot. This is how it goes. But this is not silly, right?
There's a tremendous potential benefit here, if we can figure out the informational requirements to make this work at scale. And it applies to all kinds of things, whatever you study. I'll try to give you an example later today — I think I'll make it in time, and if not, then on Friday — where the average effect of a predictor is basically zero. Not exactly zero; it's never exactly zero. But so close to zero that you'd think, ah, I can ignore this predictor. Nevertheless, there's enough variation across units in the effect that you don't want to ignore it. That's the thing about multilevel models and varying slopes: the average effect of a predictor variable, an X variable, can be indistinguishable from zero — precisely estimated and extremely close to zero — and yet you'd get better predictions by paying attention to the predictor. Why? Because for some units there's a big effect. Do I make any sense? I'll give you an example when we get there. And pooling estimators bring all the same benefits as before: they help us with the overfitting–underfitting trade-off.

So let me continue with the coffee robot example from the previous chapter and extend it to slopes. To remind you, this is the toy example where your job is to program a robot to go around and sample waiting times for coffee at various cafes. Don't ask why — it's a stats class; you get examples like this in stats classes. The question is how you want to program the robot to use the information, and I tried to convince you, and I hope I succeeded, that you want to program it to use a pooling estimator. When it visits a new cafe, it uses the previous cafes as a prior, and then it updates that prior given the experience at the new cafe. But hang on — it also has to update the population prior at the same time, because it has new data about the population, and then it has to update all the previous cafes as well, because the prior has been updated. How do you do all this? With a Bayesian pooling model, it all happens automatically. Bayes' formula does it for you; you don't have to be clever. You just set up the assumptions and start counting the ways the data could happen.

So let's take the same coffee robot, and now we're going to ask it to distinguish between morning and afternoon at each cafe, because cafes tend to be busier in the morning than in the afternoon. Your average wait time in the morning is longer at most cafes than it is in the afternoon. So the robot now keeps separate ledgers at each cafe for how long it took to get its cup of coffee. You can imagine the barista handing a cup of coffee to the Roomba that has rolled in. I actually Googled for a Roomba with a cup of coffee on it and failed to find one — I tried, that's all I can say. Maybe next time I teach this I'll get Photoshop out, or put a ticket in with the multimedia department. But anyway, back to statistics. On the right-hand side of this slide, I've given you some synthetic data to illustrate what I'm talking about.
There are two cafes, creatively called A and B. On the vertical axis we have the wait time in minutes, and on the horizontal are sequential visits, M for morning and A for afternoon. The data points from each day are connected with a line. You can see that at cafe A there's on average a much shorter wait in the afternoons — fewer people are there getting their morning coffee. That's probably the mechanism, but we don't know; the robot just has data like this, that's all it's got. At cafe B it's also true that the substantially shorter wait times tend to be in the afternoon, but the difference is much smaller. Cafe B is just a lot less busy, or more efficient — hard to say; it could be less popular. The waits at cafe B are really short, and the difference between morning and afternoon isn't very large.

Now, where's the slope in this? The slope is the average difference between afternoon and morning, and the slopes will vary across cafes. In fact, they have to, because the zero boundary on wait time is a real boundary — you can't have a negative wait time for your coffee, I assert, as part of the physics of the measurement. Well, with an app I suppose you could say "I'm coming for my coffee," but let's leave that out of the example. So the minimum is zero, and as your average morning wait time approaches zero, the slope has to get smaller. It has to, and that creates a correlation between the intercept and the slope. This is a routine phenomenon in linear models: slopes and intercepts often must covary just because of the basic physics of the measurement system you're studying. That's why I use this example, even though it's obviously of no scientific value.

Now, think about how we should estimate these slopes. There are two things to realize. The first is that we've got a slope parameter, a beta coefficient, and we split it across a bunch of cafes — pooling, right? You're primed for this; you know you want a pooling estimator. So we want some distribution of slopes: there should be an average slope and some standard deviation among slopes. That's an adaptive prior. It'll look just like the one we had before for intercepts, we estimate it from the data, and we get pooling. Yes, that's the right thing to do. But then there's one more step: since intercepts and slopes are correlated, you pool across them as well. This is the fun part — ready for the fourth dimension? I'll say it again: because intercepts and slopes are correlated, you should pool across the parameter types as well.

So take the perspective of the coffee robot again. Say you do a bunch of visits to cafe A one week, and the next week you're marching across cafe B, here at the bottom. On the first day that you visit cafe B, the prior for your morning wait comes from cafe A. But then you get your first data point for cafe B, and the wait is really short — in fact, you've never observed a wait time that short at cafe A.
It's the best coffee you ever had. Now, the first afternoon you visit cafe B, what's your best prior for the wait time? It's not the average afternoon wait at cafe A anymore, because you've already learned that the average wait at cafe B is short. And say there's a correlation between intercepts and slopes in the population of cafes — strictly, to make this example work there need to be many cafes, A through M, and now we're coming to N. Since there's a correlation between intercepts and slopes, your best guess for the afternoon wait is now also short, because the two are correlated. Your expectation gets dragged down. It's not just the raw slope estimate from before: you've got a prediction for the slope before you ever sample an afternoon, because you've learned something about the intercept for that cafe, and since intercepts and slopes are correlated, you can make an even better prediction for the afternoon. Does that make sense? We'll have plenty of examples with real data sets coming up, so bear with me. I understand we're in the fourth dimension here, but these are just the extending layers of what rational expectations would be, given the model.

Okay, so to think about this statistically, we have a statistical population of intercepts and slopes — these are the cafes you're visiting — and I'm picturing it here as a two-dimensional Gaussian distribution. I've sampled a lot of cafes from it and put them on the plot. On the horizontal axis we have the intercept for a given cafe, on some standardized scale — that's why zero is in the middle; it's not that you can have negative wait times, zero is just the average here. On the vertical is the slope. And they have a weak correlation — negatively correlated, but correlated. What does that mean? It means that cafes with long morning waits tend to have smaller, that is more negative, slopes — bigger drops between morning and afternoon. The slope is a signed difference, so it's hard to think about; we'll put it on the outcome scale later. Each marginal distribution, of the intercepts and of the slopes, is just Gaussian. The correlation is the extra bit of information we need to estimate inside the model.

So the pooling estimator in this case — and I'm going to spend the next 30 minutes explaining the details — is this distribution. We estimate a bivariate Gaussian distribution now. What do we need for that? Two standard deviations and a correlation. Well, and the means, but you can always subtract the means out and put them in the linear model; we don't care where the mean is, it's arbitrary, we can relocate these things wherever we want. The scale and the correlation are what we need to estimate. Does this make sense — enough to keep going, at least? You're not going to flee the room? All right. So let me walk you through a full synthetic-data exercise with the cafe robot, and then we'll do some real-data examples as well — no fewer than three examples of this kind of pooling estimator. So here's our population of cafes.
It's a two-dimensional Gaussian distribution. You've done plenty of these before — your posterior distributions have been higher-dimensional than that; you've had 20-dimensional Gaussian distributions all through the course. This is just a two-dimensional one. It's a hill, a little Gaussian hill that you're looking at. And to describe a Gaussian distribution, all you need is a location and a variance. That's it, no other information needed — that's what's wonderful about Gaussians: two pieces of information and you're done. So for a two-dimensional Gaussian, you have two means, two variances, and a correlation, and you're done. As the dimension goes up you need more; we'll talk about that later, but let's stay in two-dimensional land for now: one correlation.

So there's going to be an extra parameter inside the model to estimate, and that parameter is the correlation between the intercepts and slopes. Sometimes all that correlation is there for is to do the pooling for you, so that information about intercepts helps you estimate slopes and vice versa. Think about the cafe robot again: what if the sample sizes in the morning and afternoon were radically different for some cafe? Then you'd overfit whichever one had the smaller sample size if you didn't pool across them. You want to pool across and handle the imbalance in sampling, and this correlation parameter is what accomplishes that in the model. It lets information flow across the two kinds of parameters; it exploits the fact that there's a correlation to improve the estimates. But other times the correlation isn't just there to do partial pooling for us — it's a target of inference. We have questions about the association at the population level between these things, say between the socioeconomic status of a household and test scores. Sometimes we care about the estimate itself, something we really want to know, and then it's nice to have a parameter in the model that gives you a posterior distribution for that correlation.

Okay, so here are some simulated cafes. We're going to do a full simulated example, build up the multilevel model that does all this, and take a look at it. What I've done is sample 20 cafes from that population with its negative correlation, and then simulate data over five days, morning and afternoon — 200 observations in total. The job of our robot is to use those data to estimate not just the morning wait time at each cafe, which is in some sense the key inference in this example, but also the population distribution, because that will improve the estimates as well. What you're looking at here are the sampled intercepts and slopes: each dot is a cafe, and the ellipses show the population they were sampled from, the relative probabilities of getting cafes in different regions.

Okay, here's our model. Don't panic — I'm going to take it piece by piece. These models are always the same; sometimes they get long because you have to list all the assumptions, and the strategy in this course has always been to make you do that rather than hide it from you. It's all here.
But this model isn't structurally any different from the other models you've done — it's just longer. So I'm going to take it piece by piece and show you that you already understand it. The first bit at the top is of course the likelihood, and we'll move past that without much comment: the wait times get a Gaussian distribution, fine. The linear model is where the interesting stuff starts to happen. This is a linear regression, but now it has varying intercepts and varying slopes: there's an alpha for each cafe and a beta for each cafe, and M_i is a 0/1 dummy variable, an indicator distinguishing morning from afternoon visits.

Then we have the pooling prior, and this is the two-dimensional Gaussian distribution. The way to read it, left to right, is: for each cafe there's a pair of parameters, an alpha and a beta, collected in a vector. Those pairs of values are distributed as a multivariate normal with means alpha and beta and some covariance matrix S, which we will spend — well, the adequate amount of time talking about today; we'll give it the respect it deserves. This is just describing the hill we looked at before. It's the way to denote in your model that there's this two-dimensional hill: each cafe gets a pair of parameters, an intercept and a slope, and across a large number of cafes those pairs have means alpha and beta and a covariance between them described by this covariance matrix S. And here's the labeling of everything I just said.

So we need to spend some time on the covariance matrix. I understand, given the variation in backgrounds in the class, that some of you do not dream of covariance matrices — I can tell from some of the expressions in the room, and sorry, when you make expressions like that, I will comment on them. But these are simple things, simple descriptive engines that you actually use all the time; it's just worth having some standardized notation for what we're doing. Covariance matrices describe the shapes of these multivariate Gaussian hills — hyper-hills, I guess, if it's three dimensions or more. So suppose we have an M-by-M covariance matrix, where M is the number of dimensions. Let's think about the two-by-two case: how many parameters do you need to specify the matrix, all the covariances inside it? Well, you need M standard deviations, because for each dimension you need to say how stretched it is — the scale of each marginal dimension, the scale of the intercepts and the scale of the slopes. So for M equals two, you need two standard deviations, and if M equals 10 you need 10 of them; that part is linear, makes sense? Now the fun starts, because we also need correlation parameters. For a two-by-two matrix you only need one, because there are two dimensions, so there's one correlation describing how correlated they are with one another. But as you start adding dimensions, you're going to need more.
There's an expression for the number of correlations you need: (M² − M)/2, which is M(M − 1)/2. Since there's a square in there, it goes up fast. For M equals two, I leave it as an exercise for the student to verify that it's one — go ahead, take your time. But for three? Three squared is nine, minus three is six, divided by two is three. So you need three correlations for a three-dimensional Gaussian distribution; with four you need even more, and so on. It's because for every pair of dimensions you have to specify a correlation — you're counting unique pairs — and that gives you a lot of correlation parameters pretty quickly. In total you need M(M + 1)/2 parameters. For small matrices that isn't very many; for big matrices it's quite a lot. But that describes all the shapes and all the correlations.

That part is easy — it's just description, no big deal. The historically challenging part is specifying priors. Now we have a matrix, and we need a prior for a covariance matrix. I've been telling you all along that eventually in this course we would have distributions inside distributions; well, now we need a prior that is a distribution over matrices. And it works just like all the other priors you've done — it's just that the object being spit out of the distribution is a matrix. No problem; the math is fine with that. It's only us who have a problem with it. Your computer has no expectation about what should come out of a distribution: it can spit out a scalar, it can spit out a matrix, no big deal.

So let me spend a little time explaining this. The motivation is that this covariance matrix, in the two-by-two case, has three parameters inside it, and we need some prior expectation, before seeing the data, to program our robot with. What should it be? We want to regularize — we always want to regularize inference — and there's no sense in just slapping a uniform distribution over matrices on this; it makes no sense. So we have to do some regularization here, and there's an easy solution, which I want to show you.

Historically, and you'll still see this in people's code, what people have done is use the conjugate prior for a covariance matrix, a thing called the inverse Wishart. You should not use it — I'll just say that and expect the hate mail. The reason is that with the inverse Wishart you can't independently specify priors for the standard deviations and the correlations. Often you have different information about those things — the scales and the associations between the dimensions — and you want to specify priors for them separately, but the inverse Wishart forces you to make joint assumptions about them. It also does drastically bad things near the origin, near zero. It's not a good family of priors, and you should never, under any circumstances, use it.
You will see it, though. I spend a lot of time in paper reviews telling people not to use this prior and giving them citations — there's a literature on why you shouldn't. The good news is that the alternatives are easier to use and easier to think about. But to explain how we do it, we're going to have to do some matrix factoring. We're going to take this thing, our covariance matrix, and factor it into two other matrices.

So, back in secondary school, everybody did linear algebra, right? And then you forgot it, because you're a psychologically healthy person and you moved on — what good is math in the real world, after all? Well, now it's back, so I want to give you a very brief refresher. The reason is that we're going to factor the covariance matrix into two other matrices: one is a diagonal matrix with the scale parameters, the standard deviations, along the diagonal and zeros everywhere else — just a matrix representation of the standard deviations — and the other is a correlation matrix. It turns out you can get a covariance matrix back by taking that matrix of standard deviations, multiplying it by the correlation matrix, and then multiplying again by the matrix of standard deviations. That's just the matrix multiplication rules. We want to assign priors to the sigmas and to this thing I'm calling R, the correlation matrix. Then we can have a prior for how strong the correlations are — and I'm going to argue we should have a prior expectation that the correlations are not big, especially if the matrix is large — and we can use information about the sigmas to do regularization as well.

So how does this work? I'm going to do ten minutes on matrix algebra. It'll be gentle and fun, and you'll remember secondary school and all the fun you had there. Matrix algebra is just a bunch of shortcuts for working with lists of numbers. It's convenient, but you don't have to use it: there's nothing you can do with matrix algebra that you couldn't do with systems of equations. It's just easier — fewer steps. It's a grammar for writing compact operations on systems of equations; that's all it is. There are just a few simple rules, and those rules are structured the way they are because they compress operations on systems of equations.

So matrix multiplication is just a recipe, and it's among the easiest recipes. I was trying to think of the easiest food recipe, and I think it's an omelet. Is there anything easier than an omelet? You spill some eggs into a hot dish and wait a minute — you've got an omelet. Nobody can mess up an omelet; if you can invent a way to mess up an omelet, let me know. So: if you can make an omelet, you can multiply matrices. That's the mantra. Let's make an omelet. Our task is to multiply two square matrices of the same dimension, and the answer will be a matrix of the same dimension.
So, easiest case, nothing exotic at all. Here's the rule. You've got a matrix — call it the capital-letter matrix, with entries A, B, C, D — and you multiply it on its right by the lowercase-letter matrix, with entries a, b, c, d. The mental trick is to make a grid: put the matrix from the left along the left-hand side, the other one along the top, and your answer has four cells to fill in between them. This is the trick I was taught in secondary school. The rule is, for each of those four boxes, you pay attention to the row and the column that connect to it. For the upper-left cell, that's the first row of the matrix on the left, A and B, and the first column of the matrix on top, a and c. There is color on this slide — those of you watching at home can see it, and those of you in the room can't, because this beamer no longer shows red, apparently; I should have chosen blue. The A and B and the a and c are red — trust me, on my laptop they are. What you do is take each element in that row and multiply it, in order, by each element in the column it connects to, and then add the products together. This is called a dot product, which you don't have to remember, but it's fun to say. So that cell is capital A times lowercase a, plus capital B times lowercase c. See how that works? You move across that row and down that column, multiplying in sequence, and add. That's it — your omelet is done; you just have to do the other three cells. They work the same way: for the cell in the upper right, you move over to the column that connects to it, so it's big A times little b plus big B times little d, and that sum goes there. Then you do the calculations, and that's matrix multiplication — you're done. You are now among the 1% of all humans who know how to multiply matrices. Congratulations. One percent? Well, I don't know the percentage; it's probably less.

It's not a survival skill — it'd be better to learn how to make fire if you're going to learn something. I'm not teaching you this because you'll need to do it by hand, although it's healthy; if you want to practice on the bus on the way home, a bit of matrix multiplication is good for you. I'm teaching it to demystify the notation, so that when you see it, you know why it's like that — and not only why it's like that, but why it's a good way to represent things: it's compact, it's clear, and it's a universal way to describe your assumptions and say what's going on. This rule is the only thing you need to remember to understand how the operation works.

So let's go back to the factoring, and let me show you quickly how to apply the rule to get the covariance matrix back. We take the matrix of scale parameters — the diagonal matrix with sigma_alpha and sigma_beta on the diagonal — and multiply it by the correlation matrix. Set them up the same way: one on the left, the other on the top, and the answer, with the same dimensions, in the middle. Apply the same rule. For the upper-left cell: sigma_alpha times 1 is sigma_alpha, plus 0 times rho, which is 0.
So sigma_alpha plus zero, which is sigma_alpha — the answer in that cell is sigma_alpha. For the next cell, it's sigma_alpha times rho, which is sigma_alpha rho, plus 0 times 1, so we get rho sigma_alpha. With me? Isn't this fun? No, it isn't — but it's here not because you'll need to do it to do the statistics, just so you understand why this factoring makes sense. The bottom row is the same idea: 0 times 1 is 0, and sigma_beta times rho gives you rho sigma_beta; then 0 times rho plus sigma_beta times 1 gives you sigma_beta. For the second step, we take the answer from that first multiplication, put it on the left, and multiply it by the matrix of scale parameters again. Same deal. Now the upper-left cell is sigma_alpha times sigma_alpha plus rho sigma_alpha times 0, which is sigma_alpha squared — the variance of the intercepts, a squared standard deviation. That's what goes on the diagonal of a covariance matrix. In the lower-right corner we get sigma_beta squared, the variance of the slopes. And off the diagonal we've got covariances, and a covariance is exactly a correlation times the product of the standard deviations — that's the definition of a covariance. So this is our covariance matrix. Love it.

We want to put priors on each of the symbols inside this thing, not on the whole matrix, and that's why we factor it. Does that make sense? That's why we write it this way inside our model notation, where we build the covariance matrix: we've got the separate little sigmas and the correlation matrix, capital R, which could be any dimension. Here it's two by two, so it has one correlation parameter inside it, but it could be really big — if it were three by three, it would have three correlation parameters inside it. And we need priors for everything. The easy priors I'll skip over quickly: alpha and beta are the means of the population of cafes, and then we put very weakly regularizing half-Cauchy priors on the scale parameters again. Exponentials are also fine — sometimes better, in fact. Nothing new or exciting there.

The new, exciting thing is how to put a prior on a correlation matrix. In the two-by-two case you've got one parameter to put a prior on, so it isn't very intellectually challenging: a correlation can vary between minus one and plus one, and regularization means we want a prior that's skeptical of extreme values. That's easy — some hill that goes down near one and minus one would be a nice prior, and beta distributions work great for that. The intellectual challenge is that I want to teach you a strategy that scales to big, magnificent correlation matrices with any number of little correlation parameters inside them. And what you absolutely cannot ever do is put independent beta priors on every correlation parameter inside the matrix. If you do that, you'll get nonsense out of your Markov chain. The reason is that all the correlations inside a correlation matrix jointly constrain one another: if one of them is really big, it constrains the values of all the others.
If two variables inside a data set — just think of it in terms of a data set — are almost perfectly correlated with one another, that affects how correlated other things can be with either of them. The correlations jointly determine one another; they can't be treated as independent. So you need a prior over the whole matrix. There's just no dodging it, no simpler way to live.

Luckily for you, this problem has been solved. There are a number of solutions, and the one I want you to use is the so-called LKJ correlation prior. Let me explain it, and then eventually we'll actually get some estimates — no, we're on time, this is good. LKJ is the initials of the people who published a paper on it in 2009: Lewandowski, Kurowicka, and Joe. Joe is the last name; his first name is not Joe. It's a really nice paper where they address exactly this problem. The idea, as I said, is to have a family of regularizing priors over correlation matrices, and they give a nice construction, which we now call the LKJ family of priors. In the paper they call it the onion method, so you might call it the onion prior instead — the reason for the name isn't important. What's nice is that it has a single shape parameter, which describes either skepticism of or enthusiasm for extreme correlations.

Thinking about it in one dimension, on the right I've shown you three cases of this shape parameter. The horizontal axis is the correlation, varying between minus one and plus one, with zero in the middle meaning no correlation. What I'm showing in each density is something like a thousand samples from the prior. At the top, eta equals one, which is uniform: every correlation is equally plausible. As the dimension of the matrix goes up, though, the ends fold down even for eta equals one — they have to, because the correlations start to constrain one another. For really big matrices, like a 10-by-10 correlation matrix, the correlations all have to be small in expectation, or if any one of them is big, all the others have to be small; they jointly constrain one another in powerful ways. As eta goes above one, you get matrices concentrated around correlations of zero. What we usually want when we fit models is an eta a little above one, so the model is skeptical of extreme correlations. That's the regularization effect, and it's a good thing because it reduces overfitting; otherwise the sample will rule, and you'll get spurious, low-evidence correlations estimated. The other option — and I have yet to come up with a data-analysis situation where I'd want it, but I offer it for the sake of understanding — is to make eta less than one, in which case the prior is enthusiastic about extreme correlations. I was trying to come up with some joke about American politics here, but I failed, so that's all you get.

So, here's what the model looks like, fit to the simulated cafe data. By the way, you should run this — it's all in chapter 13. Sit down, run it through your console, and get an idea of how it works. You can simulate the data yourself, and you can do it multiple times.
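In case it helps to have something concrete to run, here is a rough R sketch of that simulation, in the spirit of the chapter code. The specific numbers — average waits, standard deviations, the correlation, the seed, and the within-cafe noise — are illustrative choices, not values dictated by the lecture, and the indicator is coded 1 for afternoon, so the intercept is the morning wait. Note how the covariance matrix is built from the scale and correlation pieces exactly as in the factoring above.

```r
library(MASS)   # for mvrnorm

## population of cafes (illustrative values)
a       <- 3.5   # average morning wait, in minutes
b       <- -1    # average difference, afternoon minus morning
sigma_a <- 1     # sd of intercepts across cafes
sigma_b <- 0.5   # sd of slopes across cafes
rho     <- -0.7  # correlation between intercepts and slopes

Mu     <- c(a, b)
sigmas <- c(sigma_a, sigma_b)
Rho    <- matrix(c(1, rho, rho, 1), nrow = 2)
## build the covariance matrix by the factoring shown above
Sigma  <- diag(sigmas) %*% Rho %*% diag(sigmas)

## sample 20 cafes, then 5 mornings and 5 afternoons at each: 200 observations
N_cafes <- 20
set.seed(5)                       # illustrative seed
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
a_cafe <- vary_effects[, 1]
b_cafe <- vary_effects[, 2]

N_visits  <- 10
afternoon <- rep(0:1, N_visits * N_cafes / 2)   # 0 = morning, 1 = afternoon
cafe_id   <- rep(1:N_cafes, each = N_visits)
mu    <- a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma <- 0.5                                     # within-cafe sd (assumed)
wait  <- rnorm(N_visits * N_cafes, mu, sigma)
d <- data.frame(cafe = cafe_id, afternoon = afternoon, wait = wait)
```

Changing rho or the sigmas here and re-running is a good way to see how the pooling responds to different population structures.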
You can change the parameters, make the correlation stronger or weaker, whatever you like. The model notation is similar to the sort of thing you've done before. There's a little color here to help you see what connects to what: the red is intercepts and the green is slopes. In the linear model we've got an intercept a_cafe for each cafe and a slope b_cafe for each cafe. When we specify the multivariate normal prior, they go inside a vector together, and the whole vector is clustered by cafe — a vector of two that's clustered by cafe. Then dmvnorm2 is just a version of the multivariate normal where you can put in a vector of sigmas and a correlation matrix instead of a covariance matrix; it deals with the factoring for you and multiplies them together exactly as I showed you on the previous slides to get the covariance matrix. The rest is the stuff I just showed you: there's our vector of sigmas, and there's the LKJ correlation prior, the onion-method prior, with eta equals two, which means it's slightly skeptical of extreme correlations — very slightly. If there are really extreme correlations in your data, this model will learn them; these weakly regularizing priors can be overwhelmed quite readily. (There's a code sketch of this model a little further below.) Regularization is super important for correlation matrices, because you need a lot of information to estimate correlations — it's much harder than estimating slopes — so your overfitting risk is really extreme and you can get all sorts of spurious inferences. I make lots of bad jokes, but one I make routinely in stats is that it's never too early to teach your child about the importance of regularizing the covariance matrix. Communism doesn't work, that's the second thing — this is a running joke Natalia and I have — but covariance matrices must be regularized.

All right, the posterior correlation. If we run this model, we get a posterior distribution for the correlation parameter, and we can just extract it, as you see in the code: there's a matrix called Rho, and the element at [1, 2] — first row, second column — is the little rho, the correlation between the two; the diagonal is ones. When we plot that correlation — the solid line here — the posterior distribution of the correlation between intercepts and slopes is quite negative, because that's what I forced it to be in the simulation. The prior is this gentle little hill; that's the onion prior with eta equals two. So it's skeptical of extreme correlations, but there's a lot of evidence of a strong negative correlation in this case.

What does that look like on the outcome scale? The effect it has on inference is to do pooling across intercepts and slopes, and here's what that ends up looking like. This is a scatter plot where the horizontal axis is the intercepts — the morning wait times — and the vertical axis is the slopes, the difference between morning and afternoon wait time at each cafe. The filled circles, like last week, are the raw data estimates.
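Before digging into that plot, here is a rough sketch of what the model and the posterior-correlation extraction just described might look like in code. It assumes the simulated data frame d from the earlier sketch and the rethinking package's map2stan, dmvnorm2, and dlkjcorr conventions mentioned in the lecture; the particular hyperpriors and sampler settings are illustrative, not prescribed.

```r
library(rethinking)

## varying intercepts and slopes, pooled through a 2-D Gaussian population
m_cafes <- map2stan(
  alist(
    wait ~ dnorm( mu , sigma ),
    mu <- a_cafe[cafe] + b_cafe[cafe] * afternoon,
    ## one pair of parameters per cafe, drawn from a multivariate normal;
    ## dmvnorm2 takes a vector of sigmas and a correlation matrix and
    ## multiplies them into the covariance matrix for you
    c(a_cafe, b_cafe)[cafe] ~ dmvnorm2( c(a, b) , sigma_cafe , Rho ),
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma_cafe ~ dcauchy(0, 2),   # weakly regularizing priors on the scales
    sigma ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)             # LKJ "onion" prior with eta = 2
  ),
  data = d, iter = 5000, warmup = 2000, chains = 2 )

## posterior for the correlation between intercepts and slopes
post <- extract.samples(m_cafes)
dens( post$Rho[, 1, 2] , xlab = "correlation" )

## overlay samples from the LKJ(2) prior for comparison
R_prior <- rlkjcorr( 1e4 , K = 2 , eta = 2 )
dens( R_prior[, 1, 2] , add = TRUE , lty = 2 )
```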
If we take only the data for each cafe and just compute the estimate with pure arithmetic, that's the fixed-effect estimate — the same estimate you'd get out of a fixed-effects model. That's what the filled points are: raw empiricism, being objective and using only the data for that cafe. The wrong thing to do, right? But that's what the filled circles are. The open circles are the partial pooling estimators, and they have shrunk — but now they're shrinking in two dimensions. Good times. They're shrinking towards the center of gravity of the distribution defined by the population, the adaptive prior that defines the statistical population of cafes. The ellipses show you that: they're the contours of the bivariate Gaussian distribution that's been estimated from the data — the hill we talked about before. The lines connect the two points for each cafe, and you'll see they're shrinking towards the center of gravity, towards a line running through the middle of this thing.

It'll probably help to think about particular points — this is what I should have zoomed in on and shown you before. The filled circles are unpooled and the open ones are pooled. Consider this cafe here at the top. It has a very extreme slope: its slope, the difference between morning and afternoon wait times, is — how to say this — less negative, closer to zero, than the other cafes in the population. It's out in the tail of the slope distribution in the raw estimate. Its intercept, however, is perfectly typical: it's right in the middle of the horizontal dimension, a perfectly typical cafe in terms of its morning wait time, and a very atypical cafe in terms of the difference between morning and afternoon. It's always busy, or always average — something like that.

Now, the shrinkage changes both the intercept and the slope. Even though the intercept is typical, it gets shrunk — well, shrunk in which direction? It gets increased. The first-order realization is that the slope has to come down. The slope comes down because it's unreasonably high; it's incredibly unlikely, probably overfit, and probably the robot just needs more data from that cafe — it's probably not actually that high. So you can see why the slope comes down. But why must the intercept also change? Because to do otherwise would be illogical. Why? Because intercepts and slopes are correlated — negatively correlated. If your estimate of the slope is too high, then your estimate of the intercept is probably too low, because they're negatively correlated with one another. So the point moves down and to the right: the pooling estimator decreases the slope a lot and increases the intercept a little. It doesn't increase the intercept much, because the intercept is typical, it's in the middle — but it still moves, right?
Because if you move the slope down, you have to retain the correlation between the two. This is a consequence — a logical, optimal consequence — of the structure of the model. And this is what I love about Bayesian inference: you didn't have to realize you needed to do this. You just set up the model and turn the crank, and the golem figures it out. Then we get to work out, after the fact, why it did that, and realize it was the right thing to do. Does this make some sense? I encourage you, in your free time, to analyze the other cafes in this plot and tell yourself similar stories about why they move the way they do. Each cafe is a precious and unique snowflake with its own story of pooling, worthy of respect.

Quite often — and this is a general theme of the course — it's easier not to look at pooling on the parameter scale but on the outcome scale. Slopes are just little gears inside the tide-prediction machine, to use that metaphor again. So plotting the slope, even though that's what's in the posterior distribution, isn't necessarily the most meaningful way to think about how the pooling operates. Let's instead think about it in terms of predictions for waiting times — use the slope to make predictions for afternoons. That's what I've redrawn on the right: it's the same information as on the left, but pushed out as an outcome prediction. The horizontal axis is the same, because the intercept gives you the prediction for the morning wait. But the vertical dimension is different now: it's the sum of an intercept and a slope, because that gives you the afternoon wait time. And again we can draw the ellipses for the Gaussian distribution these things come from in the posterior. The code to draw these graphs is in the chapter; if you like ellipses, there's a command called ellipse that will happily draw ellipses for you. And then you can see the shrinkage. Everything's climbing uphill, so to speak: if you think of the center of that bivariate Gaussian as the top of the hill, everything wants to be at the top of the hill, but the rate at which a point climbs is proportional to how far it is from the center. That's the shrinkage. The cafes that are really extreme, near the bottom of the hill, run really fast, and the ones near the top don't move much — it's the model's skepticism about extreme estimates. Does that make sense? After you sleep on it, it'll make a lot more sense; it's one of those things that works that way.

Okay, we're doing great on time, actually — this is what I hoped to get to today. So this is the phenomenon of multi-dimensional shrinkage. What happens in one dimension happens in two dimensions, and in three, and in four. You can get shrinkage across all kinds of different effects that apply to the same cluster type. Here the cluster type is cafes, and for every effect you estimate about cafes, you want shrinkage among them, because that's the optimal use of information. When you have multiple cluster types, you don't get pooling across cluster types; you get pooling within cluster types, across the effects. I'll say that again.
You get pooling across effects within each type of cluster; you don't get pooling across cluster types. Does that make sense? There's no sense in which learning about an experimental block tells you anything about a particular chimpanzee — no pooling across cluster types. But learning one feature of a chimpanzee tells you about other features of that chimpanzee, because there's a correlation structure among those features within individuals. So this is partial pooling and shrinkage just as before, and we want it because we get improved accuracy, just like before. Often, I should say, the correlations between the different effects are very small, or there's not much information, and so you don't get much pooling across effect types. This is an example where there's a lot of pooling across intercepts and slopes, but routinely there isn't much, because there isn't much structure there to estimate. In that case, treating the effects as independent is harmless — but it's not harder to fit the model that does it right, so my advice is to try the right model first. I should say that non-Bayesian packages like lme4 can have a really hard time doing this, but it's no problem if you're using Hamiltonian Monte Carlo. Well — almost no problem, a tiny problem — and on Friday I'll tell you how to solve the typical issue that comes up.

Okay, we've got a couple of minutes, and I think I can get through this example because it's structurally so easy. You remember the UC Berkeley admissions data — the saga of the dean who thought they were going to get sued? It turned out they weren't going to get sued. Here are the data again. To remind you, we've got applications within departments, and we're interested in the effect of an application being submitted by a male applicant — whether that increases or decreases the odds of acceptance. The interesting causal story with these data is that you can easily reach the wrong inference if you don't cluster on department, because the departments have radically different mean rates of acceptance, and men and women apply to different departments on average. It's a famous example of what's called Simpson's paradox in the literature. We spent a lot of time on this — half a day — last week or the week before; it feels like a year ago.

To remind you, we can do this with pooling estimators too. Here's the varying intercepts version of the model, where we have an intercept for each department and an adaptive prior for the departments: the prior for each department's alpha is normal with mean alpha and standard deviation sigma. So it's going to estimate a unique intercept for each department, but with partial pooling. That's good because, for one thing, the departments vary in how much data each provides — there's imbalance — so this is the right thing to do. (A rough code sketch of this varying-intercepts version follows below.) What about varying slopes? Well, the effect of an application being male could also vary across departments — and hint: it does. So this is a good, suspenseful moment to pause. I could finish, but I'll let you think about it. When you come back on Friday, we'll pick up with this slide and the varying slopes model.
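For reference before Friday, here is a rough sketch of the varying-intercepts version just described, assuming the rethinking package and its UCBadmit data; the priors and sampler settings are again illustrative.

```r
library(rethinking)
data(UCBadmit)
d <- UCBadmit
d$male    <- ifelse( d$applicant.gender == "male" , 1 , 0 )
d$dept_id <- coerce_index( d$dept )   # integer index for departments

## binomial GLMM: a pooled intercept per department, plus an average male effect
m_admit <- map2stan(
  alist(
    admit ~ dbinom( applications , p ),
    logit(p) <- a_dept[dept_id] + bm * male,
    a_dept[dept_id] ~ dnorm( a , sigma_dept ),  # adaptive prior across departments
    a ~ dnorm(0, 10),
    bm ~ dnorm(0, 1),
    sigma_dept ~ dcauchy(0, 2)
  ),
  data = d, warmup = 500, iter = 4500, chains = 3 )

precis( m_admit , depth = 2 )
```

The varying-slopes extension that comes up on Friday would additionally give the male effect its own department-level vector, pooled jointly with the intercepts.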
What I want to show you then is that the slope varies across departments, even though the average effect of being male is just about zero. This is one of the great things about varying slopes models: they let you detect who is helped by the aspirin, so to speak. Okay, with that, thank you for your indulgence, and I will see you bright and early on Friday morning.