Okay, let's get this party started. It's week nine — the second-to-last week. Very excited, because your powers are coming fully online now and you will be able to do really amazing things. Today we're going to take the basic varying effects strategy, the multilevel model strategy, and extend it into more dimensions. That will require some new concept formation and only a few new coding tricks, and what it's going to unleash for you is the ability to do lots of scientifically useful things.

So to remind you, last week was all about varying intercepts, the simplest sort of multilevel model, pictured on the top part of this slide. The notion is that the averages differ by cluster. A cluster could be a chimpanzee, a block, a department, whatever: there's some average rate at which a thing happens, some average value, and we want to consider that those averages differ. This could be the survival probability of tadpoles, where survival differs on average across tanks. If we estimate those averages with pooling, we get better estimates. Pooling estimators are biased estimators, but they're better than unbiased estimators. Some of you may know this terminology, which I try to avoid, about biased and unbiased estimators. Linear regression is the best linear unbiased estimator — BLUE, people sometimes call it — but there are an infinite number of estimators that beat it, and they're all biased. In statistics, unbiased is not a good thing. One data point is an unbiased estimate of the mean of a normal distribution, but it's probably not good to bet on exactly that value. That's the thing about bias and unbiasedness: by being biased, we get better estimates out of sample. That was the whole point about overfitting. So that's the partial pooling and all the wonderful stuff you absorbed last week.

We can extend this strategy to the sorts of parameters we normally call slopes. What are slopes? Slopes are treatment effects: another feature of each unit, how that feature responds to some treatment you subject it to — how a chimpanzee responds to the presence of a partner, say. That would be a slope, a treatment effect. And of course we can distinguish those among the clusters as well, apply partial pooling, and get better estimates on average, just as before. These are often called varying slopes: we allow the effect of a predictor variable to vary by cluster as well. I've tried to show this in the lower plot on the slide. In the upper plot you just have varying intercepts: across some predictor variable on the horizontal axis, each line is a different cluster, like a different chimpanzee or tank. They have different heights because their intercepts are different — those are varying intercepts — but they're parallel to one another, because in that model they all have the same response to the treatment, the x-axis variable. Does that make sense? In the lower plot we let the slopes vary by unit as well, and then you get lots of fun. You get a Jackson Pollock painting — if anybody remembers Jackson Pollock. Somebody, thank you.

So we want to do this quite often. We want to let the data tell us how much variation there is, but we want to use partial pooling, because we don't want to overfit each cluster. Same kind of story. There are lots of domain-specific scientific reasons to consider varying slopes.
Let me run through a few with you to anchor your memory. If you're studying how people respond to some pharmaceutical intervention — a drug like aspirin or paracetamol — different people respond to different pain relievers in different ways. This is well known. Not everybody responds to paracetamol, or ibuprofen, in the same way. Some people get very little therapeutic benefit from a particular pain reliever, but that doesn't mean the pain reliever doesn't work. Variation in the population is a very important thing to study: there could be zero average therapeutic response to a pain reliever while some people benefit hugely from it. There's this whole industry called precision medicine now, trying to leverage exactly this. Similarly, there are lots of educational interventions meant to help students learn. In North America, about 20 years ago, after-school programs were a really popular kind of intervention, and it turns out not everybody benefits from them — but some people do. So the average effect is not really the causal effect you're interested in. All of these examples arise because not every unit in the population has the same relationship to the predictor variable. There are interactions, and this variation is important for understanding the causal pathways we're trying to intervene on.

Statistically, the major conceptual innovation you'll have to confront — and we'll deal with it for the rest of the lecture — is this. You could treat varying slopes exactly the same as varying intercepts: make another adaptive prior for the slopes, with some mean, which is the average treatment effect, say, and some sigma, which is the variation among slopes, and run the same machinery on it. I'll say this again: you could treat the slopes as just another varying intercept, statistically — a copy of the adaptive prior with the parameters relabeled. Now you've got an average slope and a variance among the slopes, and that will work. It's not bad. But you can do better, by noticing that the intercepts and the slopes are related in the population. When you learn about the intercept of a unit, you often get information about its slope, and vice versa. They're not unrelated things. Why not? Because they're lines: when you change the slope of a line, you change its intercept. I know that's a very mathematical version of it — there are real causal reasons those parameters are related — but it shows up in the geometry we represent them with. So you want to think now about a single prior — I'll show you what this looks like when we get to the mathematics — a single prior that has both the intercepts and the slopes inside of it. We're going to have a population of features of the units. All the units in the data — whether they're chimpanzees or tanks of tadpoles or schools or what have you — have a bunch of features we could list, and there's a correlation structure among those features, because causal processes generate correlations among the features of individual units. That correlation structure lets you transfer information across the features — in symbols, it's something like the sketch below.
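My sketch of the two options, not the slide verbatim. Separate adaptive priors, one per parameter type, with no transfer between types:

$$\alpha_j \sim \text{Normal}(\bar\alpha,\ \sigma_\alpha), \qquad \beta_j \sim \text{Normal}(\bar\beta,\ \sigma_\beta)$$

versus one joint prior that carries the correlation:

$$\begin{bmatrix} \alpha_j \\ \beta_j \end{bmatrix} \sim \text{MVNormal}\!\left( \begin{bmatrix} \bar\alpha \\ \bar\beta \end{bmatrix},\ S \right)$$

The covariance matrix $S$ is what lets information learned about one unit's intercept update the estimate of its slope.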
I'll have an example of this on the following slides, but the basic idea: think about your friends or your family. There's a correlation structure among their personality characteristics and their hobbies. Say I learn that Brendan likes death metal. Sorry — did I just out you? Brendan likes death metal. You're not the only one in the room, I can tell. There are immediately lots of other things you're thinking about Brendan now, right? Other things you expect Brendan to like, or not. You can ask him after class if you want. That's the correlation structure, and in a statistical model, that's pooling: you get information about one type of parameter by learning the value of another type. That's the feature relationship. So you don't only pool within parameter types, between intercepts, but also between intercepts and slopes, and then you get an even better reduction in overfitting.

Let me give you a toy example of this graphically, and then we'll work it up into a real statistical model with some data. Think back to the cafe example from last week. I introduced varying intercepts by asking you to imagine — because you're all very fashionable globetrotters — that you're visiting all the hippest cafes in Europe. You visit a cafe in Paris, order some coffee, and time how long it takes to arrive. Then you go to one in Berlin. Before you order your coffee in Berlin, you have a prior, and this is a story that leads us to pooling, because the order you visit the cafes shouldn't matter: as soon as you get that coffee in Berlin, you should also update your estimate for Paris, even though you left it behind. That's how pooling works — it's time invariant. You pool information across all the units until the sample size within each unit dominates the population inference.

Now let's extend this example a bit further. Imagine you're all clever engineers and we're going to program a robot to visit cafes and collect data. The robot is your statistical model. How should you program this robot to learn from visits to cafes, ordering coffees, and estimating the waiting times? What we're adding to the story is that the robot is going to visit in the morning and the afternoon, and it's going to distinguish between the two — it keeps track of the time of day. And because all of you have probably visited a cafe in your life, you know cafes are busier in the morning than in the afternoon, at least most of them, as people get their coffees. So the average wait time at, say, 9 a.m. is longer than the average wait time at, say, 2 p.m.
in a cafe. So we want to make this distinction, and now we have two parameters to estimate for each cafe: the wait time in the morning, and — I'm going to code the second parameter as the difference between the morning and afternoon wait times, so it looks like a slope. You could think of this as two different intercepts; the point is that these two things are related, I assert, by the causal process of the popularity of cafes. We'll talk this through as we go. Let me show it to you. This is toy data on the right, showing two different cafes. At the top we have cafe A, which is a popular cafe: the average wait time on the vertical axis is high, and in the points joined by a line segment — that's morning to afternoon on the same day — you see that in the morning you can expect to wait longer, in the afternoon a little less. Compare this to cafe B, which is in a part of town that gets fewer visits: you get your coffee almost instantly. Both wait times are shorter, but the difference between the morning and afternoon wait times is much, much smaller in cafe B, because there's no morning saturation effect there. You're not waiting in the morning, you're not waiting in the afternoon — you're just getting your coffee. They'll still misspell your name, but you'll get your coffee right away.

Think about this statistically now. We've got some statistical population of cafes, in the middle here, and in this population there's a correlation — a negative correlation — between the intercept, that is the average morning wait time (standardized in this example, so the mean is zero), and the slope, the difference between morning and afternoon. Since there's a negative correlation in this cloud, think about how to interpret it: cafes with high intercept values have lower slopes — the difference between morning and afternoon is smaller — and the other way around. So there are two sets of parameters, the intercepts and the slopes, that we need to estimate. You could treat them as completely separate from one another, but then you'd be ignoring this correlation, and the correlation gives you information. So now we're going to get some data from this population, and I'll show you how to run it as a statistical model, and show you the value of measuring the correlation between the two.

The agent of our pooling is now going to be a prior distribution that is a two-dimensional Gaussian, and we're going to have both kinds of parameters — intercepts and slopes — inside of it. How do you specify a two-dimensional Gaussian distribution? You need two things. You need a vector of means — what are the two dimensions? intercepts and slopes — so the vector of means is the average intercept and the average slope, two parameters there. And then you need this wonderful thing called a variance-covariance matrix, which you have seen before, because in the first half of this course we did quadratic approximation all the time. Remember quap? Quap is long gone — you've moved on, you've graduated — but when we were doing quap, it always represented the posterior distribution using a variance-covariance matrix: it assumed the posterior was multivariate Gaussian and approximated it with this kind of entity. So you've already worked with these things; it was just done automatically for you. Now we're going to be more explicit about it, so you can see what they are in essence, because we're going to actually
estimate them, and we're going to have covariances as part of the posterior distribution. Now you're having fun, right? Turtles all the way down: we're going to have distributions inside distributions.

So what is a variance-covariance matrix? In the two-dimensional case, the simplest one, which I'll show you here, it's the multi-dimensional analog of the sigma inside a one-dimensional Gaussian distribution. In a one-dimensional Gaussian you have this one parameter, sigma, which gives you the spread of the measures. If you add more dimensions to the Gaussian, you need a sigma for every dimension, because every dimension could have its own spread. So the intercepts have their own spread, their own sigma; the slopes have their own spread, their own sigma; and then there's one more parameter you need, which is the correlation between the two. You could assume that correlation is zero — maybe you'd get away with it — but in general it doesn't have to be. So you need three parameters to describe the variance and covariance of two normally distributed variables, and that's what I'm showing you here. They go into this matrix, the covariance matrix. There are two sigmas: sigma sub alpha, the standard deviation of the intercepts, and sigma sub beta, the standard deviation of the slopes. If you square those, you have variances, and so in the upper left and lower right of this matrix you've got the variance of each kind of effect — the diagonal is the variances in a variance-covariance matrix. The off-diagonals are covariances, and in this case the two off-diagonal cells are the same, because there are only two effects; this will get bigger in a later example. The covariance is the product of the two standard deviations times a correlation coefficient, which I've written with the Greek letter rho — which we use for, I don't know, completely arcane reasons. Someone's to blame for this, probably Pearson. If you look in the book, I've got a box to explain why that's true. There's a very good reason the covariance is the product of both standard deviations times the correlation: it's the definition. The correlation coefficient is a thing you make up precisely so that the covariance is given by that expression — that's where it comes from, nothing else. The Pearson product-moment correlation coefficient is just the thing that makes that the definition of the covariance. There's a box in the book if you're interested in this. You don't have to be interested — I totally forgive you if you're not — but if it burns in your mind like it does in mine to know where these expressions come from, I'm here for you, and I've written a box for you.

Okay, let's do some work with this. I'm going to simulate a population of cafes — in the book I give you the code to do this, and a condensed version is sketched below. You can simulate the journey of your coffee robot as it goes around and orders coffees, and we get a finite sample of morning and afternoon wait times for 20 different cafes on five different days, mornings and afternoons. So we've got 10 data points per cafe. That's a very small sample, so we're going to do pooling when we do the estimation, because your robot is Bayesian, and you're going to program it with a varying slopes model.
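Here's a condensed sketch of that simulation, reconstructed from memory of the chapter's code — treat the particular values (3.5 minutes average morning wait, correlation −0.7, and so on) as illustrative and check the book for the exact script:

```r
library(MASS)   # for mvrnorm

a <- 3.5            # average morning wait time
b <- -1             # average difference: afternoon minus morning
sigma_a <- 1        # std dev of intercepts
sigma_b <- 0.5      # std dev of slopes
rho <- -0.7         # correlation between intercepts and slopes

Mu <- c(a, b)
Sigma <- matrix(c(sigma_a^2, sigma_a*sigma_b*rho,
                  sigma_a*sigma_b*rho, sigma_b^2), ncol = 2)

N_cafes <- 20
set.seed(5)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)  # one (intercept, slope) pair per cafe
a_cafe <- vary_effects[, 1]
b_cafe <- vary_effects[, 2]

# 10 visits per cafe: 5 mornings and 5 afternoons
N_visits <- 10
afternoon <- rep(0:1, N_visits * N_cafes / 2)
cafe_id <- rep(1:N_cafes, each = N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma <- 0.5        # residual std dev within cafes
wait <- rnorm(N_visits * N_cafes, mu, sigma)
d <- data.frame(cafe = cafe_id, afternoon = afternoon, wait = wait)
```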
In particular, this is your robot's brain, so let me walk through it step by step — you don't have to get all of this at once — so you understand the programming in the robot and why this is a good brain. There may be better brains, but this is a pretty good one. It's not a bad brain.

Of course, you know what the top is: our outcome variable W is the wait time for some cafe on observation i. Then we have our linear model, and inside the linear model we have varying intercepts — an alpha for each cafe — and varying slopes — a difference between morning and afternoon for each cafe. And there's this dummy variable A sub i which indicates whether you're in the afternoon. With me? So when A sub i is 0, the prediction is just alpha, the morning wait average; when A sub i is 1, the prediction is alpha plus beta. That's why beta is a difference — it's a slope. You with me? So far this is nothing new; it looks the same as before. Here's the action: when we do varying slopes and we want pooling across the alphas and the betas, we need this multivariate prior. Now we don't have two Gaussian adaptive priors; we have one, and it's got both things inside of it. This is the two-dimensional prior. What we say is: for each cafe there's a pair of parameters, alpha and beta, and they're distributed as a two-dimensional normal — MVNormal, for multivariate normal. Their averages are alpha and beta, and then they have this S thing, which is the covariance matrix that was on the previous slide. I have labels — sorry, here are the labels of the things I just said.

So where does this covariance matrix come from? I want to spend a few slides walking through this, because my experience with the educational system, on multiple continents now, is that whether or not you have ever done matrix algebra is incredibly random. If you go through one particular course of study you'll get linear algebra, and you'll be like, why is he teaching us matrix multiplication, doesn't everybody know this? And the person next to you has never seen it before, even though you both have PhDs. It's just a bizarre thing. So forgive me: I want to spend about five slides on what matrix multiplication is and how to deal with matrices. The thing to understand is that matrices — 'matrixes'; I'm trying to purge myself of 'matrices', it sounds pretentious — are just ways of working with systems of equations. It's just notation. There's no new math in them; it's a compact bunch of notation so you can do things faster. But if you never learned the notation, it seems like wizardry — random stuff happening — because it's compressed. It's a kind of memory compression: you don't need to write as many things down. That's all it is. It's working with systems of equations, which everybody did in secondary school.

So let's think about our covariance matrix. We need to shuffle this thing around, because it has parameters inside of it, and we're going to have to put priors on those parameters, and there's a strategic way to do this that works nicely for estimation. Think about any m-by-m covariance matrix — a square matrix, where m is some dimension; in this case it's two, and it's going to be four before the end of the day if I move with pace. This means you need m standard deviations, or variances — the squares of these will be variances — and you need (m² − m)/2 correlations. The worked count is below.
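To make the counting concrete, here's the arithmetic from the slide written out (my rendering):

$$\underbrace{m}_{\text{standard deviations}} \;+\; \underbrace{\frac{m^2 - m}{2}}_{\text{correlations}} \;=\; \frac{m(m+1)}{2} \ \text{parameters}$$

So for $m = 2$ that's $2 + 1 = 3$, and for the four-by-four matrix coming later it's $4 + 6 = 10$.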
For a two-by-two matrix that's one correlation, right? (4 − 2)/2 = 1. But it goes up: by the time you get to three or four dimensions you need more and more correlations, because there are more and more pairs to consider — the formula just counts the unordered pairs. In total you need m(m + 1)/2 parameters, which can get big. For two-by-two it's no big deal, it's three, but for bigger matrices — I'll have a four-by-four example later — you can get a lot. Still, it's a very small number of parameters compared to the actual varying effects themselves; we could have hundreds of those, or thousands if you're ambitious.

So how do you put priors on all these parameters? There are lots of different conventions. The classical convention is to use a conjugate prior, a distribution called the inverse Wishart. You should not use this. You'll see it in textbooks — it's the classical thing — and its only advantage is that it's conjugate, which means for models with Gaussian outcomes, Gaussian likelihoods, and Gaussian priors there's an analytical result for the posterior. That's what conjugacy gives you. But every other thing about this distribution is bad. It's very badly behaved; it does terrible things in terms of regularization. You should not ever use it. You'll see old examples, and when you come across a model with an inverse Wishart in it, that doesn't mean it's a bad model — just remove the inverse Wishart and do something sensible. There's a big literature, if you're interested, on what's wrong with inverse Wishart priors; it became a thing in the late 90s for statisticians to beat up on the inverse Wishart, and a whole raft of papers came out slaying the ancestors over it. It's the multi-dimensional version of the inverse gamma, and we can do much better.

In practical terms, for you as applied statisticians, the inverse Wishart is annoying because it forces you to specify a prior for the correlations and the standard deviations simultaneously — it's a prior for the whole covariance matrix. And this is bad because sometimes you have different information about the correlations than about the standard deviations. You'd like to be able to assign priors independently to each standard deviation, each sigma, and to each correlation coefficient. You can do this, but to do it we're going to have to factor — decompose — the covariance matrix into separate matrices. What I'm showing you at the bottom of this slide: our covariance matrix S — the thing on the far left, the covariance matrix with the three parameters inside it — turns out to be the product of a diagonal matrix with the two sigmas (the thing with the zeros), times a correlation matrix R (that's actually a capital rho, the capital Greek letter), times the diagonal matrix again. And I know, if you've had linear algebra you're like, of course, everybody knows that; if you haven't, you're like, WTF, why is math like this? So let me give you a little introduction to these matrices — 'matrixes', sorry, this is my program to purge myself of Latin plurals; I don't care what you say, but for me personally, I want to purge. Matrices are very nice things, because they're shortcuts: they save you some working memory, and that's why math books are full of them.
In fact, 'matrix' comes from the Latin word for mother — in late Latin it means womb, the thing something develops in. That's what a matrix is: the thing something develops in. So think about it this way: where is your prior born? It develops in the matrix.

There are a few simple rules for multiplying matrices, and these are just ways to deal with systems of equations. Let me give you, as I said, the two-minute, totally unsatisfying version of matrix multiplication, just to demystify it. Your computer will do all of this for you — you don't have to multiply matrices by hand anymore; I don't — but I want you to have some sense of it, if you're among the people who never had linear algebra, which is perfectly fine; depending on the course of study you did, you just won't have been taught it. If you've got two square matrices — which is all we're going to deal with in this course — it's incredibly easy. Imagine these two matrices: I'm writing one with capital letters and one with lowercase letters, and these are all just arbitrary numbers. We want to multiply the one on the left by the one on the top, and what I recommend when you're starting out is to set it up like this. When you multiply two square matrices you get a square matrix back, so set it up like this, and then the rule is: consider each cell in the answer one at a time, and take the row and the column relevant to it. For the upper-left cell, that's the upper row of the left matrix, as I've colored in red, and the left column of the top one — so it makes sense how you locate those. Then you sweep from left to right in the row, and top to bottom in the column, and in each corresponding pair of positions you multiply, and then you add all the products together. What does that mean? We take capital A times lowercase a, and add capital B times lowercase c, and the answer is Aa + Bc. This rule works for every cell, so if you repeat the operation for all the others you get the answer: for the lower left, capital C times lowercase a plus capital D times lowercase c gives Ca + Dc, and so on. And that's it — now you know how to multiply matrices. As I always say, if you can make an omelette, you can do linear algebra. Making an omelette is often more challenging. Right, now you're learning about me and not about omelettes. It's really just mechanical stuff; it's just opaque if you've never had a course in linear algebra.

Okay, so this is where we get this decomposition. Sometimes you'll see it written as S = SRS: the covariance matrix is SRS, where the outer S is the diagonal matrix of standard deviations, R is a correlation matrix — ones down the diagonal and the correlation parameters in the off-diagonals — and then the diagonal standard-deviation matrix again. I won't spend a lot of time going through this calculation, but I wanted to put it in a note so you can follow it in slow motion at home: you need to do two matrix multiplications, and then you get the original covariance matrix back. There's a numerical check sketched below, too.
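If you want to verify the decomposition numerically, here's a quick R check — the sigmas and the correlation are made-up numbers for illustration:

```r
sigmas <- c(1, 0.5)                  # sigma_alpha, sigma_beta
Rho <- matrix(c(1, -0.7,
                -0.7, 1), nrow = 2)  # correlation matrix
S <- diag(sigmas) %*% Rho %*% diag(sigmas)
S
##       [,1]  [,2]
## [1,]  1.00 -0.35
## [2,] -0.35  0.25
# diagonal: the variances (1^2 and 0.5^2)
# off-diagonal: sigma_alpha * sigma_beta * rho = 1 * 0.5 * -0.7 = -0.35
```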
It starts like this: you set them up, multiplying the first two — the diagonal standard-deviation matrix on the left, the correlation matrix on top — and if you apply the rule you get an intermediate result, which is not yet a covariance matrix. Then you put that on the left and the diagonal standard-deviation matrix on top, one more time with feeling, and you get the covariance matrix back. Yay. I know it's not exciting, but this is just where it comes from, and again, your computer will do this for you — you don't ever have to do it by hand; there's a matrix multiplication function that will do it whenever you want. Computers are good at this stuff. I just want it not to be magical to you, so you understand what's going on. So now we've got this covariance matrix, and it's been factored so that we can specify independent priors on the correlations and the standard deviations. That's why we went through all that nonsense. You're welcome.

So what do we do with that power? Well, here are some priors. Let me explain: we need priors for the alpha and beta means, and these are what I call mildly informative priors. I know something about cafes — I know the average wait time is not zero; I have expectations — so I'm putting some information into these priors on the averages. Then we have three standard deviations in this model: one at the top, in the probability of the data, and then a sigma for the intercepts and a sigma for the slopes.

The thing we need to spend some time on, because it's new, is: how do you put a prior on a matrix? We've got a correlation matrix. There's only one correlation in this matrix, but in general a correlation matrix could be very big, with a bunch of correlation parameters inside it. And the truth is, you can't assign priors to them independently, because there are constraints on correlation matrices: the values constrain one another. I know this is an annoying fact, but this is where the galaxy brain turns on, and you're going to thank me for all this. If you have only one correlation in a matrix, it can vary between minus one and one — those are all the possible correlation values. But now imagine you have two, and one of them is really, really big. That implies the other can't be equally large, because that would be an impossible matrix: if one pairwise correlation is really big, it constrains the correlations among the others, because the other variables also have to be correlated with those two. As the matrix gets really big, these constraints get really strong, and in a big correlation matrix it's very difficult to have many strong correlations and still have a valid matrix. You can have one really strong correlation, but then all the others have to be small. That's an absolute fact about how correlation works in populations. So you need a family of priors that deals with this fundamental problem: you can't just pick random numbers for the individual correlations, stick them in a matrix, and have a valid matrix. Some of you who've had linear algebra will know about positive definiteness — there's a mathematical constraint, which has to do with eigenvalues, that's a way to test this. You don't have to know that. You just have to know there are good reasons you can't pick arbitrary correlations for a matrix: it's impossible for everything to be strongly correlated with everything else. Can't be true — not if there's any variation, at
least. Okay, so there's this really nice family of priors, which I encourage you to use, and which I use in this book, and which you'll see in almost all of the recent textbooks in Bayesian statistics: the LKJ family of correlation priors. LKJ is named after the last names of the authors who published the paper — Lewandowski, Kurowicka, and Joe; this is just what people have started calling it. Joe is the guy's last name; I forget what his first name is. I hope it's Joseph. Joseph Joe. I should check. They call it the onion method for setting priors, so maybe we should start calling it the onion-method prior, which doesn't sound nice, but LKJ is what it's called in the software, so you should recognize it as the LKJ prior.

How does this prior work? There's one parameter that determines its shape, eta, and eta specifies how concentrated the prior is around the identity matrix. An identity matrix is a matrix with zero correlations: ones along the diagonal and zeros everywhere else. It's called the identity matrix because when you multiply a matrix by it, you get the same matrix back — linear algebra again. You should just think of it this way: if eta is a really, really big number, it means no correlations in the prior; if it's a lower number, you have a flatter prior on the correlations, and many, many more correlations are plausible in the prior. You can visualize this in one dimension quite easily, which is what I've done on this graph, for a two-by-two correlation matrix. Here the correlation is unconstrained, because there's no other correlation parameter to interfere with it. If eta equals one, you get a uniform distribution from minus one to one on the prior correlation. As you increase eta above one, you get more and more concentration around zero — the identity matrix. With eta equals two it's this gentle hill, mildly skeptical of extreme correlations; with four it's a little more so; and as eta goes to infinity, you're guaranteed to have no correlations in the prior. For big matrices, all of these prior densities get concentrated more around zero, because of the constraints that arise in high dimensions. In the book there's code to make this graph, and I encourage you to play with it and get a sense of it. Usually you want something mildly regularizing: probably the correlation is not minus one or one, that seems implausible. However, this prior doesn't put zero mass anywhere, so if the data demand it, you can get minus one or one out of the posterior.

How do we program this? You're used to these formulas by now, but there are a bunch of new things here, so I'm going to go slow and highlight each bit that's new. First thing to note: in the linear model we've got both varying intercepts and slopes. I'm calling these a_cafe[cafe] — cafe, cafe, I know that's annoying, but it's just a way to mark that these are the alphas for the cafes, and then bracket cafe is how you index each cafe. There's a vector of a_cafe parameters, one for each cafe, and then a b_cafe for each cafe as well. There are a number of different ways to program these multivariate normal priors, and you'll see a few before we're done this week, but here's the most transparent one, the one I want to start with — sketched below.
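Here's roughly what that model looks like in the book's ulam code — my reconstruction of m14.1 from memory, using the simulated data frame d from the sketch above, so the exact priors may differ slightly from the chapter:

```r
library(rethinking)

m14.1 <- ulam(
    alist(
        wait ~ normal( mu , sigma ),
        mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
        # one joint adaptive prior for each cafe's (intercept, slope) pair
        c(a_cafe,b_cafe)[cafe] ~ multi_normal( c(a,b) , Rho , sigma_cafe ),
        a ~ normal(5,2),            # average morning wait
        b ~ normal(-1,0.5),         # average afternoon difference
        sigma_cafe ~ exponential(1),
        sigma ~ exponential(1),
        Rho ~ lkj_corr(2)           # the gentle-hill prior on the correlation matrix
    ) , data=d , chains=4 , cores=4 )
```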
You put the a_cafe and b_cafe parameter vectors together with the c() combiner — it's like making a two-column list — and then, again, the bracket cafe goes on the end, so each row in this two-column matrix of parameters corresponds to a cafe. The way to read it in plain language: bracket cafe means 'for each cafe', so for each cafe there's a pair of parameters, a_cafe and b_cafe. Good? Yeah — again, from the lecture you can get the concepts, but you really have to go home and draw the owl: run the code and get some sense of what's going on. Then we write multi_normal instead of normal, and we give it a vector of means — a and b are just the means for the intercepts and slopes — and then our correlation matrix Rho, and a vector of sigmas, sigma_cafe, which will have the same length. When you do this, the multi_normal prior automatically knows that Rho should be two-by-two, because there are two means and two outcomes, and it knows how sigma_cafe has to be linked up; it sets that up for you. The rest of the priors are things you've seen before, and for the LKJ correlation it's just lkj_corr, and the two is that mild hill I showed you on the previous slide — mildly skeptical of extreme correlations, very mildly as it turns out. You with me so far? Oh yeah, I had some more highlighting here: there's Rho and there's sigma, and there's their correspondence. When this builds the Stan code, it knows that sigma_cafe has to be a vector of length two, because the multi_normal has two elements in it. It does that automatically for you — you don't have to write it here — but I'll show you later a case where we specify the length manually, because we'll have to. For now, just know it's automatic.

All right, when you run this model, what happens? Let's look at the posterior distribution for the correlation first, and then I'll show you its consequence. We run the model — this is model 14.1 in chapter 14 — extract the posterior as usual, and then I plot the density for this correlation. When you look inside the posterior, you're going to see a matrix, Rho, and it's going to have four cells — rows and columns. Here I'm showing you the density for [1, 2], that is, row one, column two, and that's the correlation. Position [1,1] is always one, because this is a correlation matrix, and [2,2] is always one for the same reason, and [1,2] and [2,1] are the same value — but they're all inside the posterior, because we sampled a matrix. It's a distribution of matrices. So I'm showing you the correlation parameter of interest: the black dashed density is our prior — that's our eta-equals-two LKJ correlation prior, the gentle hill — and the blue is the posterior distribution. Almost all of the mass is below zero: there's a negative correlation between these two things. Why? Because I simulated it to be that way. We'll see examples of natural correlation structures later. Does this make some sense?
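The couple of lines behind that density plot look roughly like this — assuming the rethinking package's extract.samples, dens, and rlkjcorr helpers; the plotting details are my guess, not the book's exact script:

```r
post <- extract.samples(m14.1)
dens( post$Rho[,1,2] , xlim=c(-1,1) )    # posterior for the correlation

R_prior <- rlkjcorr( 1e4 , K=2 , eta=2 ) # draws from the LKJ(2) prior
dens( R_prior[,1,2] , add=TRUE , lty=2 ) # dashed prior for comparison
```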
What's the consequence of this? You get pooling in two dimensions now — shrinkage in two dimensions. What I've plotted on this slide: on the horizontal are the intercept estimates for each cafe, on the vertical are the slope estimates for each cafe, drawn like a statistical population. Again, the blue points, like last week, are the raw data values: the fixed-effect estimates you'd get if you took only the data for that individual cafe and naively estimated, ignoring the rest of the sample — the so-called raw, unregularized fixed-effect estimates. The open circles are the varying-effect estimates we get out of the posterior distribution of this model.

Now let's spend some quality time with this graph — we're not in a hurry. First thing you notice: I've drawn all these ellipses on here. What are these? These ellipses are the contours of the inferred population. That multivariate normal distribution in the model — the prior — has been learned from the data: the alpha and beta means and the covariance matrix weren't known, but now we have a posterior distribution for them, so we have a posterior distribution for this population of cafes, and I've plotted the contours of that population with these ellipses. If you want to know how to do this, the code's in the book; there's a library in R called ellipse, and it does this for you — you basically give it a covariance matrix and it draws the ellipses. It's very easy: a two-dimensional Gaussian distribution implies an ellipse at any level you're interested in, so you can just draw it as a bunch of concentric ellipses. Here I've drawn — I was trying to say — the 10th, 30th, 50th (that's the median), 80th, and 99th percentile contours of this Gaussian hill that the cafes live on. This is the statistical population that generates the pooling — the shrinkage — in the population of cafes, but now in two dimensions.

Remember how shrinkage works: if an estimate is really extreme from the perspective of the prior, it gets shrunk towards the mean. Now we're in two dimensions, and the shrinkage works in two dimensions. One of the cool consequences of this is that we can highlight particular cafes, like the one I've drawn the red oval around. The interesting thing about that cafe is that it's got a very typical intercept: its intercept, on the horizontal axis, is right in the middle of the population, exactly what you'd expect — a perfectly boring, average cafe in terms of morning wait time. But its slope is unusual: it's very extreme, out on the edge of the population, towards the top of this graph. Since it has an extreme slope, the model is skeptical of it and shrinks it towards the mean slope. But the model also knows that slopes and intercepts are correlated in this population — negatively correlated — so when it shrinks the slope, it also adjusts the intercept, moving it to the right. That's why, as you follow the shrinkage line from the blue point to the open circle, you go down and to the right. Even though the intercept is not extreme, statistically the optimal thing to do turns out to be to adjust it too, because of the correlation structure between the two features. Are you excited?
Brendan's excited — thank you, Brendan. So this is a cool thing. And why is this good? It's helping you reduce overfitting: the correlation structure is there, and you shouldn't ignore it if you want to learn properly. Now you can tour through a bunch of these cafes and understand what's going on — why some are shrinking and some aren't, given where they are — but you see the general shrinkage pattern still dominates: the further any cafe is from the center of this distribution, the more it moves. That's the shrinkage phenomenon, but now they're being drawn towards a contour that runs through the angled middle of this distribution.

You can represent this on the parameter scale, which is what we've just done, or on the outcome scale. You can transform: add the slopes to the intercepts, and then you get another version of the posterior distribution where there's a parameter for each cafe for the morning wait, and a parameter for each cafe for the afternoon wait. You can always do that — I show you all the code to do this in the text — and then you get the graph on the right. It's the same shrinkage, all the same phenomena, but now there's a positive correlation. And notice that we're below the diagonal everywhere: you expect to wait longer in the morning than in the afternoon at every cafe, but there's a correlation between the two — at particular cafes you wait longer both in the morning and in the afternoon. And everything's still being shrunk towards the middle, because shrinkage is working for you. Good. You don't have to get it all right now; I just want you to get enough of the concept that you can go home and draw the owl.

Let me try to summarize this a little. What we're exploiting here is multi-dimensional shrinkage: when you have a joint distribution of varying-effect parameters, you can pool information across the types of parameters, and this is even better than pooling within one type of parameter. But it depends on the correlation between the effects: if there's no evidence of correlation between the effects in the data, you don't get pooling across the types. You still have to learn it from the data. And it has all the same advantages of pooling in general.

Okay, let me show you a grown-up version of this. Let's consider many effects and many types of clusters — well, two types of clusters; that's many, right? Two is many in this class. Let's go back to the chimpanzees data. I know you're sick of this data by now; I'm not. I love this data set with all my heart — I know each of these chimpanzees, actually, I know their names, but I'm trying to anonymize them. These chimps appear in multiple papers, actually, so you start to recognize Lefty across papers. There are four treatments in this experiment, remember, because there are two treatment factors, so to speak, combined factorially: partner presence or absence, and which side of the table the pro-social option is on. Four treatments, and we're trying to estimate average behavior in each of them. But the treatment effect — how often the left-hand lever is pulled in each treatment — can vary both by actor and by block. So let's think about this as a big varying-effects model, where we consider the effects of both actor and block on the four treatment effects. Here's how we're going to express this. At the bottom, the outcome is still zero or one:
whether the individual pulled the left lever. Now we've got four mean treatment effects, which I'll call gamma, indexed by TID, the treatment ID, from one to four. There are four different treatments, and in each treatment there's an average rate of pulling the left lever across all chimpanzees and blocks; but each chimpanzee, in each block, has a deviation from that mean, and we're going to estimate that as a parameter now. That's what the next two bits in the linear model are. We've got a matrix of effects now, alpha — and this is going to get fun, but it's all stuff you already know; the notation is the challenge. So: alpha sub actor-i comma TID-i. What does that mean? It's the alpha deviation from the mean in this treatment for actor i. For each actor, in each treatment, there's a deviation — we're letting each actor's behavior vary by treatment. So each actor gets four parameters, because there are four treatments. And how many actors are there? Seven — is that right, seven or eight? It'll show up on a later slide and embarrass me; let's say seven. So we've got seven times four alphas. And then we have beta parameters for the block effects: these are the deviations from the gamma means, for each treatment, per block. So each block gets its own deviation for each treatment — four parameters for each block, and I think there are six blocks in this data set, so six times four block parameters. We want to do some shrinkage, right? It's a lot of parameters. There's a lot of data here, but we want to do some shrinkage. Are you with me on the structure? This is the rawest empirical description of this data set yet. We're saying: for every little box that has a unique identity — a unique combination of block, treatment, and actor — we let it have its own deviation, which comes from the sum of these parameters. It's a really raw empirical description of the variation in the experimental data, but we're going to need some shrinkage to deal with overfitting.

So let's count up the parameters. We've got one matrix for each cluster type, actor and block, so there are going to be two covariance matrices in this model. We get seven times four equals 28 actor parameters, and six times four equals 24 block parameters. Each of these covariance matrices has six correlations and four sigmas. To prove this to you, I've drawn a four-by-four correlation matrix in the lower right, in the abstract: the diagonal is all ones, because each variable is perfectly correlated with itself — at least in this universe — and the off-diagonals are the pairs, and there are one, two, three, four, five, six unique pairs. These matrices are always symmetric: the upper triangle and the lower triangle are the same. Make sense? So we've got six correlation parameters to estimate per matrix. Good times. With the four sigmas, that's ten parameters per covariance matrix. In sum, we've got 28 plus 24 plus 20 plus 4 — that's 76 parameters to estimate. No problem: Hamiltonian Monte Carlo laughs at this. This is not a challenge at all; it's just warming up. This is breakfast for Hamiltonian Monte Carlo. We could do thousands, no problem — and we'd have even more parameters if we had more individuals; if you had 100 chimpanzees, you'd get lots of alphas.

In the notation, you think about these multivariate normals this way: for each actor j there's a vector alpha-1, alpha-2, alpha-3, alpha-4, one for each of the treatments. The means are all zero, because the means are handled by the gammas in the linear model, and then there's this covariance matrix, depicted on the right, to estimate. This is what the code looks like — sketched below — and we'll walk through it step by step again.
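Here's my reconstruction of that model in ulam code — essentially the book's m14.2, from memory, so treat details like the data list names as assumptions:

```r
# dat is assumed to hold: L (0/1 left-lever pulls), tid (treatment 1-4),
# actor (1-7), and block_id (1-6)
m14.2 <- ulam(
    alist(
        L ~ dbinom( 1 , p ),
        logit(p) <- g[tid] + alpha[actor,tid] + beta[block_id,tid],

        # adaptive priors: a vector of 4 treatment deviations per actor and per block
        vector[4]:alpha[actor] ~ multi_normal( 0 , Rho_actor , sigma_actor ),
        vector[4]:beta[block_id] ~ multi_normal( 0 , Rho_block , sigma_block ),

        g[tid] ~ dnorm(0,1),            # average treatment effects
        sigma_actor ~ dexp(1),
        Rho_actor ~ dlkjcorr(4),
        sigma_block ~ dexp(1),
        Rho_block ~ dlkjcorr(4)
    ) , data=dat , chains=4 , cores=4 )
```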
The first bit to notice are the actor effects, shown in red here in the linear model: we've got our matrix of alphas, alpha[actor, TID]. You want to think about the alphas as a matrix where each row is an actor and each column is a treatment — that's why it's alpha bracket actor comma treatment, and that picks out the cell that goes with this particular observation. Makes sense — enough sense to keep going? Yes? Okay, thank you. Then we define it down in the adaptive priors as a vector of four treatment effects, one set for each actor. That's what this says: vector[4] means 'I want four treatment effects', I'm going to call this parameter alpha, and there's going to be one such vector for each actor. And this is what defines the matrix. So what is the matrix? It's four things times the number of actors — the number of actors by the number of treatments, as a matrix. It's a weird way to think about it, I know, but this is the way multivariate normals are often conceptualized: a collection of vectors — a vector of four parameters for each actor. Then there's this multi_normal thing again. The mean is zero — there's a single zero in the code here, but it knows there are four elements. Why? Because you said vector[4], so it repeats the zero four times. And then there's Rho_actor and sigma_actor: it knows Rho_actor needs to be a four-by-four matrix, and it knows sigma_actor needs to be a vector of length four, and those get defined as priors down below — an exponential prior on each element of sigma_actor, and Rho_actor gets this LKJ correlation prior, here in our notation with the d in front, and I made eta four, so it's a little bit more skeptical of extreme correlations. You with me? The only new bits here are this matrix thing and the vector notation.

And now the same thing for the blocks. It's really just the same thing with beta replacing alpha everywhere — or block replacing actor. Up in the top we've now got a matrix of betas, each row a block, each column a treatment, and the same thing for the adaptive prior, now with block instead of actor: for each block there's a vector of four, drawn with its own particular correlation structure.

So here's the idea, to think about this scientifically: the different treatment effects have a correlation across actors and blocks — they have an association with one another. That's the correlation structure you're going to estimate. What does that correlation structure mean? Well, for actors, it's handedness. If you're really left-handed — say you're Lefty, actor number two — you always pulled the left lever, so there's a correlation among the treatment effects there: they're all high. Handedness generates correlation in the actor effects in this data set; I'll show you when we get to plot that correlation structure that there are really, really strong correlations in the covariance matrix for actors that we're going to estimate, and they arise from handedness. In block, there's nothing, because there are no block effects in these data — but you've got to see what it looks like when there's
nothing, as well. Sometimes there really is nothing — often, in my data sets. Okay, so: sigma_block, and then again lkj_corr. With these models, I think part of the learning is to run the model and understand it, and then come back and gradually make these lkj_corr priors stronger. You're going to have to make that number really big before it has much of an impact here, but you should play with it — try 64, 128, make it a big number — and then see what happens to the posterior. That's how you get a sense of how strong these things are, since you can't plot a four-dimensional matrix. No one can do that; no one's got galaxy brain enough. You know the joke — I think I've told it already in this course: you've got a 14-dimensional matrix, how do you visualize it? Well, you plot a two- or three-dimensional matrix, and then you stare at it hard and say '14'. No one can do this. No one knows what these things look like.

I've got five minutes, but let me get this topic started, because it's one of my favorite topics in the course: divergences again. We spent time last week talking about divergent transitions — remember, the roller coaster pops off the rail. The roller coaster really pops off the rail with these models, a whole lot; it really likes to. So you're going to have to use the non-centered parameterization quite often. Well, not that you have to — but you should. It makes the sampling more efficient, and you'll have a lot more confidence in the results. You'll see here we get 96 divergences. The cognitive challenge, though: remember, to do a non-centered parameterization, what you do is factor all of the parameters out of the prior and into the linear model. For a one-dimensional normal that's easy: there's a mean, you can take that out and stick it in the linear model; there's a sigma, you can factor that out and multiply it back in in the linear model; and you're done. It's not a big deal once you've seen it a few times; it becomes second nature. In symbols, it's the sketch below.
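The one-dimensional version of the trick, in my notation:

$$\alpha_j \sim \text{Normal}(\bar\alpha,\ \sigma) \quad\Longleftrightarrow\quad \alpha_j = \bar\alpha + \sigma\, z_j, \qquad z_j \sim \text{Normal}(0, 1)$$

Same model, but the sampler now works on the standardized $z_j$, and the mean and the sigma have moved into the linear model.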
But now we've got a whole matrix. The sigmas aren't a problem — again, we can just factor those out and put them back in as before — but you've still got a correlation matrix inside the prior. How do you put a matrix in a linear model? And the answer comes from this fellow: André-Louis Cholesky. 'Cholesky', or however you like — here in Germany we say it one way, and it's a Polish name, so we can say what we like. This is a fellow who did an amazing thing: he figured out a really cool trick for solving systems of linear equations, and we can use it to put an entire correlation matrix into a single linear model. That's how we'll do this. Your computer will do it for you, but let me spend a little time on the background, because one of the themes in this course is that I like to tell you about random mathematicians — it's just a hobby of mine; this is the history-of-science part. This fellow Cholesky is really interesting, because his work was published posthumously, but it's an incredibly famous bit of work that has had a huge impact on applied mathematics in lots of fields. The so-called Cholesky factor, or Cholesky decomposition, is a workhorse: a standard thing everybody learns, and we use it daily, all the time. It saves your bacon, and it lets us get these non-centered forms. And it's a cool story: he was an artillery officer in the First World War — as it says here, the Great War, la Grande Guerre. He died in the war, but his colleagues rescued his notes, where he had worked out, basically on the battlefield, a way to solve systems of linear equations. The cool thing about this technique is that it's a way to solve a system of linear equations by solving fewer equations than you started with — that's what makes it work. It's a really cool, efficient thing. The paper is great fun, by the way; it was written up by a colleague of his.

So why do we care about this? It doesn't feel like you're solving systems of linear equations. What you're doing is: you've got this correlation matrix, and you want to blend it, somehow, with a vector of z-scores — remember, this is the non-centered prior business, where you've just got a bunch of independent z-scores on your parameters — and get things back on the right scale. That's what his technique lets us do. Here's a little bit of our code which shows, in a two-dimensional case, how it can work. I'll walk you through this, and then I'll show you the automated version for bigger things — you never have to do this by hand; it's just matrix algebra. Imagine, for example, we wanted to simulate two vectors of random numbers that have a particular correlation structure. How would you do that? Say you just want to do this — if you're like me, you do this all the time; it's a common problem: you want random numbers, and you want them to be correlated in a particular way. Well, it turns out you can just simulate independent vectors of random numbers and blend them. So I'm saying at the top: we want 10,000 random numbers; we want the standard deviation of the first kind of random number to be 2, and of the second kind to be 0.5; and I want their correlation to be 0.6. How can I do this in a computer? Cholesky to the rescue. We imagine ourselves on a French battlefield in the First World War, and we solve a geodetic network — this is what he was doing; he was doing geodesy as an artillery officer. So Z1 is just a bunch of z-scores, 10,000 of them, and Z2 is another bunch of z-scores, 10,000 of them, with no correlation between Z1 and Z2 at all. The whole recipe is sketched below.
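Here's a sketch of that code, reconstructed from the description in the lecture, with the same numbers:

```r
n <- 1e4          # 10,000 random numbers
sigma1 <- 2       # std dev of the first variable
sigma2 <- 0.5     # std dev of the second variable
rho <- 0.6        # the correlation we want

z1 <- rnorm(n)    # independent z-scores
z2 <- rnorm(n)    # independent z-scores

a1 <- z1 * sigma1                                 # scale the first variable
a2 <- (rho*z1 + sqrt(1 - rho^2)*z2) * sigma2      # the Cholesky blend

cor(z1, z2)   # about 0: the inputs are uncorrelated
cor(a1, a2)   # about 0.6: the outputs have the correlation we asked for
```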
Then, of course, we can get A1 as a function of Z1, scaled to have the right standard deviation, by multiplying by sigma1 — that's the second-to-last assignment. Does that make sense? If you have z-scores, you multiply by a standard deviation and you're back on the natural scale, with the variance you want. And then the last assignment is this thing, the Cholesky factor, at work. I know it's a bizarre-looking expression — you don't have to understand it; just know it's a consequence of a calculation Cholesky discovered in the service of the military in the First World War — that you can do this weird blend with rho and Z1 in there, and then multiply by sigma2. And then at the bottom I show you: the correlation between Z1 and Z2 is 0 — that's zero in a computer, right, that first long number there — and the correlation between A1 and A2 is 0.6, which is what we wanted. This works, and it works for any size matrix. We can have any number of Zs here — Z1, Z2, Z3, Z4, up to Z100 — and pick any correlation structure we like, apply this technique generalized up to that dimension, and get things back on the scale we want. This is how we do the non-centered parameterization in this business.

It's 11 o'clock, so I'll stop right here, at this slide. When you come back, we'll pick up here and do this in an automated fashion — I want to show you the code to do it — then we'll get rid of the divergent transitions, and then we'll talk about Lefty again. Okay, thank you. I'll see you on Friday.