Hello. Alright, I'll let you guys finish the homework after five. Welcome back. This is the last week, and I'll try to fit in some of the more awesome and useful things that we can do. Before I get into the material, as a prelude to what we're going to do this week, I want to say that for most conventional statistical tasks, it makes very little difference what paradigm you use. The interpretations are different, and some people find one version or another more intuitive. But linear regression, for example, can be justified a dozen different ways, and all those justifications make sense within their own logical framework. That's all fine: there are a bunch of different ways to motivate the same procedure, and they give you the same behavioral outcomes — people act on the inferences produced by them in the same way. So all this philosophy stuff is just to keep philosophers employed; that's my knock on philosophy. But there are some kinds of modeling tasks in scientific estimation where the Bayesian approach has some real advantages. That is, since we describe everything with probability distributions, when we get to non-traditional, non-classical sorts of problems like measurement error and missing data, it suddenly gets a lot easier. All the stuff you've already learned just carries forward, and data becomes just a special case. Whereas in the classical approach, what arises is something that is called, in statistics, ad hockery: you make up a procedure that seems to work in the cases you study, and there are no principles that lead to it. And ad hockery is deadly. So that's one of the things we get to deliver on in this last week: to show you examples of non-ad-hockery solutions. Logical solutions, where we just make assumptions given the information we have about our data and about the processes we think generate the data, and logic figures out the implications. You don't have to be clever at discovering ad hockeries that work. I do think, however, that ad hockery is a great thing for keeping statisticians employed. The Bayesian approach is actually the simplest, because it doesn't support ad hockery — but people can't propel their careers on discovering estimators and things like that, so it's actually the least glamorous, in addition to being the oldest, statistical paradigm. Before we get into the material, a couple of quick announcements. I've updated rethinking to 1.5; I know some of you have already discovered that. I forget everything that changed in this one, but there are some usability features, a couple of fixes, and something major. I'm already working on 1.5.1 on my computer, because I've already discovered some other things to change. So I'm always one revision ahead of you, which is sometimes awkward, because you send me problems and I realize, oh wait, I already fixed that on my computer; let me roll back. Go ahead and update. And then I uploaded a new copy of the book with a fully written chapter 14, not just code — there's actual English in it now. Grammatical English, even: I think it was English before, but it wasn't grammatical. So now it's grammatical English. I like this chapter a lot. It took me a long time and many revisions to get there, but I'm proud of it; I hope you like it. So the book's almost done. Oh my God — four years on now. So let's pick up where we left off before.
I motivated the issue of getting beyond classical varying effect models, with their discrete unordered categories, with an example of political partisanship in the United States and birth cohorts. Just to quickly remind you of that: it's a fact that in the U.S. electorate, birth year is a really good predictor of your partisan offset from the mean election outcome in a presidential, national election. The effect seems to be that which party controlled the White House around the time you came of political age, and how popular that party was, interact to set partisanship for a lifetime. Not in everybody, but in a lot of people. My generation — Gen X, the Winona Ryder generation; I think Winona Ryder is my age, and Reality Bites is the anthem of our generation — is very Republican as a generation. That's because we came of age when Ronald Reagan was ending his second term, or the first Bush was beginning, and the Republican Party was popular and winning. So there's this big partisanship offset. The millennials have the opposite shift, because they came of political age when the Republican Party was in office and extremely unpopular, and then came the Barack Obama election. That set their partisanship in the other direction, probably for a lifetime. It's a powerful effect. Now, statistically, we care about effects like this because age is a proxy for a bunch of common exposures that people of similar ages share. But age is a continuous variable. No two people have exactly the same age, right? Everybody's born in a slightly different second. What you could do with age is discretize it: you could say, okay, I'm going to treat age as a cluster variable, and everybody who's 18 gets the same varying intercept. That could work. But the problem is that there's nothing in the statistical model that notices that 19-year-olds are more similar to 18-year-olds than 20-year-olds are, because all of the traditional cluster variables in varying effects models are unordered. There's no dimension along which one cluster is more similar to another, and that throws away information, because you know, before the data arrive, that 18-year-olds and 19-year-olds have more common exposures than either does with, say, 40-year-olds. You want to get that in there somehow. Why? Because you want to pool more between proximate age classes than between distant ones. So we need some extension to continuous categories. And luckily this exists — it's been around for a long time, actually. To give you a few more examples before we start developing a solution: age is a classic one, but income is also a proxy for a bunch of things which affect human behavior — consumption behavior, schooling decisions, all kinds of things. You can't measure those things, but you could account for the variation and the correlations they induce among individuals with similar incomes, if you could treat income correctly as a continuous category, a continuous proxy. Or patristic distance, for those of you who do phylogenetic work: patristic distance is the length of the path separating connected species, or languages. That's also a measure of shared similarity — common ancestry, how much of it there is — and that gives us expectations about the covariation among species.
Social network distance — those of you who are the suffering social scientists in this course do sometimes think about this — is another one. You measure the distance a different way, but it's a dimension along which individuals who are closer in their pairwise distances can be more similar in a bunch of things: they share information and exposures, for all kinds of reasons you may not be able to measure, and you want to control for that. And there are lots of others; once you get into some examples, you'll see these issues arise in many, many cases. There are no obvious cut points in these continua, but we do have a priori reasons to expect that individuals with more similar values of these variables — or rather, closer to one another in the distances between their values — are going to behave more similarly on some outcome scale of interest, and we'd like to get all the advantages of pooling in this kind of continuous dimensional space. So how do we do that? The very common and practical approach — practical meaning easy to compute, easy for your computer — is called Gaussian process regression, which is a big family of really cool machine learning techniques. But like a lot of machine learning techniques, it has a perfectly logical Bayesian representation as well, and that's the version I'm going to give you. Let's do it with a data set I've already taught you — saves me the time of teaching you a new one. So let's go back to the oceanic societies data. Remember, the motivation of these data is that there are sizes of tool kits — you can think of them as complexities of tool kits — across the different historical oceanic societies, and there's an underlying cultural evolutionary model which suggests that the magnitude of population size should be related to the complexity of tool kits. And that's true in a very strong way; we saw before that they're associated very strongly. What we left out before, though, is that societies which are geographically close to one another share lots of unmeasured common exposures, which may also account for differences in their tool kits. Some of these might be geological. Oceanic islands vary a lot in what they are: some are basically glorified coral reefs with people living on them — sounds romantic, but it's not; nothing grows, it's horrible, and then the typhoon comes and you're dead, it's horrible — while others, like Hawaii, are massive volcanic constructions, really impressive pieces of geography with complex geology and good soils, at least on some of the islands, and so on. Lots of variation in the raw materials to make tools from. There's very little metallurgy in Oceania — a lot of the technology is worked stone — so it's very important that some islands have good sources of toolstone, while others, like the coral ones, have basically none and traded for it. The particular shellfish available may provide good materials for making fish hooks and other things in some places, and in other places they won't. So those are unmeasured covariates that might affect technology, and the idea is maybe we can capture some of that with location: islands that are closer to one another may share some of those commonalities. The other thing — maybe our first chance in this course to deal with it — is direct contamination of one outcome by another. We've been assuming all along in this course that every case is independent of the others; this is a standard
statistical assumption. The way to think about independence, in that statistical sense, is that it just means conditional on the predictors, the cases are independent of one another. Obviously there are correlations across cases induced by similarity in predictors; that's the whole point of modeling, to be able to make predictions like that. But once you know the predictors, the idea is that the cases are independent, and any correlations between different individuals in a data set are due to those predictors — not due to, say, one outcome actually directly contaminating another. But with human affairs, like tool kits: tools can move across islands, and islands that are close to one another can actually share technologies. With Oceania there's good archaeology on this, and we know that neighboring societies traded tools. So there's true contamination, where the outcome variables actually cause other outcome variables to increase by proximity. Spatial proximity can give us a way to deal with that statistical non-independence of the second kind. The first kind of non-independence is non-independence before you've conditioned on predictors. The second kind is the really awful kind, where your outcome directly affects somebody else's. This is like cheating on a test: your answer resembles your neighbor's because you copied. That would be a case where one outcome actually causes another outcome, and that's the horrible kind of non-independence in statistics. These sorts of models we're going to develop help us deal with that. You still have to get the model right — you have to understand how the contamination happens and model it — but it will let us deal with that true non-independence. Okay, so we're going to use space as a crude proxy for all these things. And by space here I mean sort of as the crow flies — actually, I think it's as the Boeing flies, because I got these distances from an airline travel database. They're great circles on the globe, basically; they really are like airline travel distances. I looked them up — there's actually a great online database of this stuff where you can put in locations and get distances for all the oceanic islands. So in the rethinking package there's this pairwise distance matrix, which I've displayed in the lower right of this slide. These are distances between pairs of societies in the data set, in thousands of kilometers. These are big distances, because hey, it's the Pacific Ocean — that's half the world, so there's a lot of distance here. You've got everything from Hawaii, which is distant from everybody, 5,000 or more kilometers from everyone, down to little clusters of islands like this triad of Malekula, Tikopia, and Santa Cruz, which are all less than 1,000 kilometers from one another and historically very integrated: even though they were separate chiefdoms, they had a lot of contact. So these are the distances we're going to use to model the covariation among the different islands. That's how we're going to set it up. Does the motivation make sense, before we get to the technology? With me on motivation? Okay. These distances — in other contexts, if you had individuals, you'd want pairwise differences in age, and that would help us model covariation in individual political attitudes given the dissimilarities in their ages, or in their incomes, or things like that. So you construct these distances based upon your theory about what causes covariation between pairs of units in the data. Make sense?
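If you want to look at this matrix yourself, it ships with the rethinking package; a minimal sketch:

```r
library(rethinking)
data(islandsDistMatrix)        # pairwise great-circle distances, in thousands of km
round( islandsDistMatrix, 1 )  # Hawaii's row is large; Malekula/Tikopia/Santa Cruz all < 1
```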
So, in community ecology — for the ecologists in the room — community compositions are often modeled this way; it's a common application of this stuff. You get all kinds of effects: you spray one agricultural plot, and the neighboring plots get effects of that, their pest distributions change. The spatial effect helps you understand that as well; you get these contamination effects. Okay, and in the social sciences, criminology: they use models like this for policing. Policing in one neighborhood spills crime over into neighboring neighborhoods — that's the effect I usually see in the literature. You police all the crime out of one neighborhood, and it spills over into other places. A big problem in Mexico, as I understand it. Okay, so let's start with the familiar part of this model. This will look like a regular old Poisson GLM — we did this before — and the only thing that's really different now is this gamma, indexed by island i. This is going to be our island offset. These are still varying intercepts, but they're going to come out of what's called a Gaussian process, which we'll define on the next slide. They give you an offset from the expectation based upon a common mean alpha and the fixed effect of log population for that island. Make sense? So all the mystery comes later in the model; the top part is a good old Poisson GLM, nothing new about it at all. So what is this gamma thing? Well, it comes from what I call the Gaussian process prior. This is another multivariate normal prior — you're enjoying these in your homework, I know, because I'm getting lots of emails about it, and I'm trying to respond quickly; I think I've answered them all. There will be more coming. Once you wrap your head around these things, they're great, by the way, so don't be frustrated; it's just the next intellectual hurdle. So this is just yet another multivariate Gaussian prior, and you want to think of it as a prior over every outcome in the data set simultaneously — every single outcome, every society in this case. There are only 10 in the data set, so it isn't that mind-blowing; that's why I like this teaching data set. But you can do this with a thousand — you can run Gaussian processes on a thousand observations, no problem. Gaussian is easy; that's why we like Gaussian. So for all 10 societies, gamma is a vector of 10 varying intercepts, but now every island is its own unique snowflake: it has a pairwise distance from every other island, and we want to model the covariation among all 10 of them simultaneously. Here's how we're going to do it. The first thing we do is center this normal distribution at zero, which just means the gammas are all offsets from the mean alpha. This is on the log count scale, because the model has a log link. And then there's this thing, bold K, which is a covariance matrix. Multivariate normals are defined by a vector of means — in this case 10 zeros — and a 10-by-10 covariance matrix, and all of the action in Gaussian process regression is in modeling the covariance. Here's the shift to catch, the thing you want to wrap your head around: up to this point in the course, most of the action has been in modeling the mean. Even in all the GLMs we did, we spent all our time making linear models, on some transformed scale, logit or log, for the mean of the outcomes, whatever the distribution was — Poisson or binomial or whatever. Now the action is not only at the mean.
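Written out — this is my reconstruction, but it matches the model as described here and in the book chapter — the whole thing is:

\[
\begin{aligned}
T_i &\sim \mathrm{Poisson}(\lambda_i) \\
\log \lambda_i &= \alpha + \gamma_{\mathrm{society}[i]} + \beta_P \log P_i \\
\gamma &\sim \mathrm{MVNormal}\big( (0,\dots,0),\ \mathbf{K} \big) \\
K_{ij} &= \eta^2 \exp\big( -\rho^2 D_{ij}^2 \big) + \delta_{ij}\,\sigma^2
\end{aligned}
\]

where \(T_i\) is total tools, \(P_i\) is population, and \(D_{ij}\) is the distance between islands i and j in thousands of kilometers. The last line, the covariance function, is the part walked through over the next few slides.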
There's still some action at the mean — log population is in there — but now most of the action is going to be in the covariance, and we're going to model how these things covary with one another. It turns out, because of interesting properties of Gaussian distributions, that you can move the action out of the mean and put it in the covariance and get the same effective predictions. What this buys us is that we can go to really high dimensional spaces and do all of the outcomes jointly in a pretty easy way — easy for your computer, that is. That's what we're going to do, and we're going to use the most conventional Gaussian process definition of the covariance matrix. Unlike the covariance matrices you're working with in this week's homework, with varying slopes models, the number of parameters doesn't rise with the dimension, because we're modeling the covariance with a very small number of parameters that define how covariation between any pair of societies decays with the distance between them. We just make that definition and estimate the parameters that define its shape. There are really only three parameters in this conventional function, and on the next slides I'll explain them to you — but in our example only two of them are in play, so there are really only two parameters we have to estimate to model all the covariance among all 10 of these things. But you should know there are a bunch of different assumptions you could make here; this is only the customary and easy-to-fit one. If you have a theory that gives you something better to do, then by all means — as I keep saying — break with convention and use your domain knowledge to do something better. For example, in phylogenetics, phylogenetic regression has the same Gaussian process in it, but the definition of K, the covariance matrix, is different: it's based on a Brownian motion model, giving the expected covariation between a pair of species from their patristic distance. It's a different function — I think in that case it usually has one parameter; Pagel's lambda is about the only thing in those matrices, usually — but it's the same inspiration. What changes is just the model of how the covariance decays with distance. So let's spend some time on this. And by the way, I got all these nice pictures from a Google image search of Polynesian islands, for the most part. I hope this relaxes you as we go through; I found it very relaxing when I made the talk this way. I think this is a Tongan resort. I want to live here — "I want to go to there," as Tina Fey would say. Looks pretty nice. Let's go there and do math. Next time we have this course, can we do it there?
Wouldn't that be good? So, we're going to define a function for every cell in this 10-by-10 matrix, and we say: for a combination of societies i and j, what is their covariance? It's defined by this function, so let me walk you through the steps of it. First, K_ij is just the covariance between islands i and j. The first parameter is eta, and you can think of eta squared as the maximum covariance between any pair of islands: as the second part of the expression goes to 1, you get the maximum covariance. So as islands get really close to one another, this is the limit of how correlated they can get and still be different islands. The next part, the action part, is the thing in the exponent: we raise e to the minus rho squared — rho is the second parameter we'll be interested in — times capital D-squared-ij, where D_ij is the distance between islands i and j. It's the entry in that matrix I showed you a few slides back, in thousands of kilometers. It's squared because that induces a particular shape — in fact a Gaussian shape, which I'll show you on the next slide. And rho squared affects the rate of decline with distance: if it's a big number, the covariation between any two islands declines rapidly as they move apart; if it's a small number, covariation can be sustained over vast distances. We're going to estimate eta and rho from the data. David? The maximum covariance of what? Of pairs i and j. But across a bunch of them — you're solving that with the whole equation, and then coming back around on a subsequent iteration and putting it into eta? I haven't understood the question. So K_ij is the covariance, and the covariance is a function of the maximum covariance plus everything else? That's right — and we don't know the maximum covariance until we estimate the whole thing. It's a parameter; eta and rho will have posterior distributions. It's still full Bayesian inference. And, just to remind everybody, because I have to, with my mantra: what does that mean?
It means it counts up all the ways the data could happen, given each value of the parameters, and then it ranks all the different parameter values that way. That's all the posterior distribution is: the relative rankings of the number of ways the data could be produced. That's all it does. It's magic but not magic — it's the dumbest kind of robotic thing, but fabulous at the same time. So we're going to get posterior distributions for both of these. In the code, for any particular sample from the posterior Markov chain, eta and rho have fixed values; they're plugged into this formula, and it computes the whole covariance matrix. Then a likelihood is computed from that for whatever set of gammas is proposed by the chain, and the iterations go through the algorithm. Depending on the MCMC algorithm, that happens in different ways: HMC is really good at this stuff, but Gibbs sampling does a great job with these too. Actually, the best way to fit these is not even to go full Bayesian with MCMC — well, if everything's Gaussian. This one won't be, because we've got a Poisson at the top, but if the outcomes are also Gaussian, I wouldn't even use Stan: I'd probably use GPstuff. Google it — there's this great package called GPstuff for Gaussian process regression; it's really good, really nice. So, let's explain this middle part. This functional form is often called the L2 norm, for reasons you don't need to know; that's just often what it's called. A norm is just a distance function in analytic geometry — that's all it means — so it's a way of constructing a distance, of saying how far apart things are. And it's a squared distance function. What this induces is a half-Gaussian decay of covariance with distance. Remember, all a Gaussian is, is an exponentiated negative parabola: if you take something, square it, put a negative in front of it, and exponentiate it, you get a bell curve. That's all the Gaussian distribution is — and there are deep maximum entropy reasons for that, to do with parabolas and the definition of a variance, for people who are interested, but you don't need to know that. On this plot, the solid curve is an example of what this squared distance function gives you; here I've set eta squared to 1, and I forget what I set rho squared to in the example — probably just 1. What I want you to see is the comparison to the linear case. The linear function would just be: get rid of the squared distance and put in the absolute distance. In the squared case, for units that are really close to one another, covariance doesn't decay very rapidly at first — there's a kind of plateau, like the middle of a Gaussian distribution — and then the decay of covariance with distance is fastest at intermediate distances, not at the closest ones. Whereas with the linear one, covariance decays fastest immediately out of the gate, so there's no close field of interaction where units are bound together. Why use one over the other? Well, they do have different properties. In particular, the Gaussian one is easier to fit; I think that's actually the reason it's used so much. And in terms of general social science, it makes sense to say that for really close things there's a field of common exposures — a close region — that we're trying to estimate, and the Gaussian function will be way better at it.
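To see the two shapes side by side, here's a quick sketch; the parameter values are just the ones guessed at above (eta squared and rho squared both 1):

```r
# squared-exponential (half-Gaussian) decay vs linear decay of covariance with distance
curve( 1*exp( -1*x^2 ), from=0, to=4, lwd=2,
    xlab="distance", ylab="covariance" )   # K = eta^2 * exp( -rho^2 * D^2 )
curve( 1*exp( -1*x ), add=TRUE, lty=2 )    # absolute distance inside the exponent
```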
If you have a physical science background, you can think of it as being a bit like the inverse square law. It's not the same relation, but it's this idea that if you're flinging darts out into a high dimensional space, the rate of hitting different cells decays as the inverse square of the distance. Some of you have physics backgrounds — I know there's at least one person in the room. The inverse square law is what all electromagnetic radiation obeys, and so does gravity, because there are particles flying out and spreading with distance, and that induces the inverse square relation. These covariance functions don't work exactly that way, of course — we don't have the basic physics of these systems — but it makes sense to say that the decay with distance is not linear, because of scattering. The Gaussian kernel has more of a scattering flavor, in a sense, than the pure exponential one does — though the inverse square doesn't actually look like either of those, so maybe it's a bad example. Okay, back to the function, and the last little bit on the end. This is called the jigger term — jigger means wiggle, or play; it's a mechanical term, like something loose in your motor that's shaking, jiggling. When the distance is zero, the whole thing in the exponent evaluates to one, because e to the zero is one — anything to the zero is one, nothing special about e — so you get eta squared. And then delta_ij is, let's call it, an indicator function: when i and j are equal it evaluates to one, and when they're different it evaluates to zero. So all it does is turn sigma squared on and off. Sigma squared is the additional variance of two units that sit exactly on top of one another — zero distance apart. The purpose of this is that if you had multiple observations from the same society in the data set, this would allow them to vary from one another by some excess amount. It's the diagonal variance term in the variance-covariance matrix. In this data set it never matters, because each row is a separate island and we only have one sample from each island. When would that not be true? Say we actually had a time series for each island — from the archaeological record we could say that at 100 BC the tool kit was this complex, and at 1000 BC it was that complex. Then you'd have multiple observations from the same islands; obviously they're zero distance from one another, but you still expect them to vary by a certain amount. So this parameter, sigma squared, is the variance among units that are exactly the same. Does that make sense? It's going to end up not mattering in this model, because we don't have more than one observation from any unit in the data. So let's put it all together. This is what the model looks like, and I don't think there are any surprises once you get below the definition of the covariance matrix — there's just a bunch of fixed, non-adaptive priors. But the Gaussian process prior in there — is this going to induce pooling? Absolutely it will. And all the other things too: there's no imbalance in sampling in these data, but if there were, it would handle that as well. It's the continuous category extension of the varying effects approach. We haven't gotten to how slopes get in here yet; I mention in the notes how you can extend this — actually, I think I get to it at the end of today. So let's carry forward with this and see how to do it in code.
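Here's a sketch of the fit, essentially the model from the book's chapter (the priors shown are the book's choices):

```r
library(rethinking)
data(Kline2)                  # oceanic societies, with coordinates
data(islandsDistMatrix)
d <- Kline2
d$society <- 1:10             # index for the varying intercepts

m.gp <- map2stan(
    alist(
        total_tools ~ dpois( lambda ),
        log(lambda) <- a + g[society] + bp*logpop,
        g[society] ~ GPL2( Dmat, etasq, rhosq, 0.01 ),  # Gaussian process prior, L2 kernel
        a ~ dnorm(0,10),
        bp ~ dnorm(0,1),
        etasq ~ dcauchy(0,1),
        rhosq ~ dcauchy(0,1)
    ),
    data=list(
        total_tools = d$total_tools,
        logpop      = d$logpop,
        society     = d$society,
        Dmat        = islandsDistMatrix ),
    warmup=2000, iter=1e4, chains=4 )
```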
This is a case where doing it in map2stan hides some of the mechanics, but it makes it a lot more convenient for you. I have this template called GPL2 — that's the Gaussian process, L2 norm — where you give it a distance matrix. And notice I've fixed sigma squared, because we don't want to estimate it: it never matters for the likelihood, because there are never two observations from the same island. All that GPL2 thing does is define the covariance matrix — it handles the K_ij line for you, constructs it — and then it's really just a multivariate normal, just like the ones you've been using. It could be any kind of distance you want: abstract distance, whatever your theory tells you is the right kind of distance to use. This is routinely used in all kinds of situations where it's just any old predictor, and you compute the squared distance between the values of that predictor for any two units in the data. People do that all the time, rather thoughtlessly I think; if you have a good theory, you could probably do better. Especially in this data set — these aren't Euclidean distances, because they're great circles on the globe, but even so they can't be the right distances for social interaction, because there are trade winds and currents and other things that induce stronger connections among some islands. Hawaii is even more distant from the other societies in this data set than the as-the-Boeing-flies distance indicates, because none of the trade winds or currents will get you there from the rest of Polynesia. Hawaii is just out in the middle of nowhere, with nothing else in the area on the way to be discovered. People got to Hawaii really, really late compared to the rest of Polynesia — the anthropologists over here know this; it's the famous business about tracking how rats got to Hawaii and things like that, rats and pigs and so on. What was I going to say? Yes: I encourage you to demystify this a little and make sure you understand what's going on. After you compile this model, there's a function in the rethinking package called stancode. You give it the fitted model, and it shows you the code that is actually being used to define the Markov chain. What you'll see in there is a loop that defines the K entries: it just iterates over the lower triangle of the matrix and defines them all. It's exactly that mindless — it just writes out that function. So when you want some other covariance function, you can start with that Stan code template and put anything in there you want. That's really all there is to it. Yeah, question? So this process you're putting in there is taking place in the varying intercepts we were talking about — yes. And is there a reason you couldn't have — you could, absolutely. If you had other clusters, other kinds of clustering going on, like language groups or something, you could still cluster on those as well. Then it's like the cross-classified model: there's an offset due to spatial proximity, and offsets due to other things you may want to cluster on that are more traditional discrete unordered categories. Absolutely, you can do them all at the same time. In a practical sense you might reach a threshold where it all explodes and grinds to a halt and you drop out of school, but in general you can do it, and people do it. This lives well with all the other stuff.
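To do that demystifying yourself — m.gp here is whatever you named the fit above:

```r
# print the Stan code map2stan generated; look for the loop
# filling in the lower triangle of the covariance matrix K
stancode( m.gp )
```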
It's like an Erector Set: you can plug all these little modules into your fancy model and make it do things. It might overheat at some point. There's a hand — yes? So basically the 0.01 doesn't do anything, it never affects the likelihood of anything; why do you have to include it? Because it's part of the definition of the function. The function allows you to have a parameter there; when you fix it to a value, that value goes into the definition, but it never affects the likelihood, because there are never two observations at zero distance. If you leave it off, you'll get an error, because the function expects four arguments. Does that answer the question? Basically it's there because this is a template that's trying to make things easier, and so it's not ideal for all purposes. It's very common, by the way, for people to define this process this way and then fix the jigger part to some constant — extremely common. The jigger part doesn't matter much, especially if you've got another variance parameter somewhere in the model, because that will soak it all up; you have to be careful about identifiability too. But this thing fits great, and I say something in the book about how the effective number of parameters is small — we've only added two parameters, even though the model has a 10-by-10 covariance matrix, so it's not parametrically very rich. The coefficients, just to think about them for a moment: these gammas are the offsets. They're the continuous category, Gaussian process equivalent of varying intercepts for each of the 10 island societies. It's hard to interpret them, though, because they're on the log count scale. I don't know about you, but I have a hard time thinking on log count scales. It's the magnitude of a count, so bigger is more and smaller is less, but depending on the values of the other parameters, the scaling could be a big deal or not so much, because it all gets exponentiated once you get back to the count scale. So this is another case where you want to push predictions out and see what the model says on the outcome scale instead. In particular, it's really difficult to interpret eta and rho directly. Notice I fit them on the squared scale, because why not. Obviously eta squared is some covariance, but it's on the log scale, because it's the covariance of logged offsets, and it's pretty hard to interpret that. It's a number — is it big, is it small? Well, it depends on the scale you measure distance on. Interpreting coefficients is hard; tables just don't tell the story, especially in a model like this, where eta and rho only have meaning in combination. Together, any pair of values for eta and rho defines a covariance function, which defines the covariation with distance, and that's the thing we care about. So before we look at predictions on the outcome scale, let's look at what the model says about the decay of covariance between pairs of societies as they get farther apart. We now have a posterior distribution of covariance functions: there are functions in your posterior distribution, a whole distribution of them. For every unique combination of eta and rho there's a covariance function — well, I shouldn't say unique; there are probably a bunch of different values of eta and rho that give you the same function. I haven't thought about it; it's a good algebra homework problem for a different course. But anyway, any combination of eta and rho implies a covariance function K, and we can plot that out.
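The code to do this is in the book; roughly, you draw samples and overlay curves, continuing from the fit above:

```r
post <- extract.samples( m.gp )

# posterior median covariance function (the thick curve)
curve( median(post$etasq) * exp( -median(post$rhosq) * x^2 ),
    from=0, to=10, lwd=2, ylim=c(0,1),
    xlab="distance (thousand km)", ylab="covariance" )

# overlay 100 covariance functions sampled from the posterior
for ( i in 1:100 )
    curve( post$etasq[i] * exp( -post$rhosq[i] * x^2 ),
        add=TRUE, col=col.alpha("black",0.2) )
```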
You know the definition of K — it's at the top of this slide — you plug in eta squared and rho squared, ignore the jigger part, and the curve function in R is sufficient to do it. What I'm showing you on the right-hand side of this slide: the thick curve is the posterior median, just to give you some idea of the center of gravity of the posterior distribution. And then there's a whole range: there are a lot of posterior functions with less covariation at close distances, a lot with more, and a really long tail of extraordinary covariances over distance — but those are unlikely, pretty sparse in the distribution. Does this make sense? So again: what is this saying? You can get a little bit out of it. For example, by the time you get to about 4,000 kilometers, the vast majority of the posterior distribution has decayed to nearly zero covariance, and by the time you're at 6,000, there's only a tiny wisp — like 1% of the covariance functions — with any covariance beyond that. Hawaii is out at that kind of distance, but there are many pairs of islands closer to one another which do have some covariance. And notice that at the median there's a lot of covariation between islands 1,000 kilometers from one another — a lot. It's on the log scale? Exactly. This is controlling for log population? Exactly. As an exercise for the student, I recommend taking log population out of the model and running it again, to see what the covariance function looks like. What we should expect is that it goes up at all distances, right? Because big islands are near one another in this data set, so once you put log population in there, you take out some of the variation that would otherwise be explained by this spatial relationship. So this is kind of what's left over — what can't be accounted for by log population. You guys with me?
Yeah? Alright. Okay, so the next step in crawling toward understanding this model: let's take the posterior median covariance function and see what correlations it implies between islands. We look at correlations because they're scale-free: we get rid of however much total variation there is in tool complexity in the data set and just think, in a scale-free sense, about how correlated the different islands are with one another, as an expectation of their distance. That's what this calculation is — again, the code to do it is in the book, and it's very straightforward; I apologize for the blurry table, but this is what it looks like. So, for example, up here there's this triad of islands — Malekula, Tikopia, and Santa Cruz — which are close to one another, all less than a thousand kilometers apart, a little triangle in the South Pacific that I'll show you on a map in a moment. They're pretty highly correlated, even after accounting for population size differences; these are small islands, but they do vary in their population sizes and in their tool kits. There are some islands, like Yasawa here, which are not very correlated with any of those, because it's far from them — and not much with any of the others either; you get a little bit of correlation with the closest neighbors. Hawaii is really far from everybody, so it's zeros every place. Special Hawaii. And then there's a big field of things in between: Fiji correlated with Malekula and Tikopia over here, and very highly with Tonga over here. Tonga and Fiji are actually close to one another — two big rising chiefdoms, and they definitely interacted. There was a lot of water between them, but they had great boats. It all depends: for me it's a terrifying amount of water; for them, maybe not so much. This helps you a little; you get a bit closer to what the model is saying: at close distances there's a lot of expected correlation, consistent with geographic proximity, unexplained by the other things we have in the data set. Islands near one another are more similar than expected by chance, and that similarity goes along with proximity in a strong way. So that gives you a better way to understand it. Now let's do plotting on the outcome scale. I'm not going to show you the code for these plots, but it's in the book; there's no sorcery involved. Let me explain what I've done. I've taken the correlation matrix computed on the previous slide, and I've plotted the islands by longitude and latitude — which are in the data set, conveniently for you — and scaled the sizes of the points by log population. That's why Hawaii is a lot bigger than the other places, by the way; if you scale by raw population, the whole map is just one big white blob, Hawaii is so much bigger than all the others. The lines connecting the islands have darkness proportional to the squared correlation, just because that makes it easier to see what's going on. So here's our little triangle in the South Pacific, very highly correlated — all of those are greater than 0.8 correlated with one another, hence the dark lines. Fiji and Tonga are about 0.75 correlated, if I remember right, something like that, and Fiji is about 0.5 correlated with Malekula and Tikopia. That's what it looks like there, so you can see how it lays out. The Trobriands and Manus are the other strong correlation in the data set.
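The calculation and the map, condensed from the book's code and continuing from the sketches above (the island abbreviations are the book's):

```r
# push the posterior median parameters back through the kernel
K <- matrix( 0, nrow=10, ncol=10 )
for ( i in 1:10 )
    for ( j in 1:10 )
        K[i,j] <- median(post$etasq) *
                  exp( -median(post$rhosq) * islandsDistMatrix[i,j]^2 )
diag(K) <- median(post$etasq) + 0.01

# convert covariances to correlations
Rho <- round( cov2cor(K), 2 )
colnames(Rho) <- c("Ml","Ti","SC","Ya","Fi","Tr","Ch","Mn","To","Ha")
rownames(Rho) <- colnames(Rho)
Rho

# map: points scaled by log population, lines shaded by squared correlation
psize <- d$logpop / max(d$logpop)
psize <- exp(psize*1.5) - 2
plot( d$lon2, d$lat, xlab="longitude", ylab="latitude",
    col=rangi2, cex=psize, pch=16, xlim=c(-50,30) )
text( d$lon2, d$lat, labels=as.character(d$culture), cex=0.7 )
for ( i in 1:10 )
    for ( j in 1:10 )
        if ( i < j )
            lines( c( d$lon2[i], d$lon2[j] ), c( d$lat[i], d$lat[j] ),
                lwd=2, col=col.alpha("black", Rho[i,j]^2) )
```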
About that Trobriands–Manus correlation, I have a hunch about what it is: I think it's the absence of toolstone. Those are miserable places with no good stone, so it's hard to make tools, and they have small tool sets, as I remember the data set. Does this make sense? This gives you an idea of what's being said. The problem with this representation is that you can't see tools, and that's the thing being predicted — the correlations are about tool complexity, but you can't see it on the map. So to complement that, let's look at it the traditional way we looked at it before: log population on the horizontal, total tools on the vertical. The lines mean the same thing, but now the islands are located at their combinations of log population and total tools, and the dashed trends are the counterfactual expectations for some new, unmeasured society, just using the mean — averaging out the offsets, thinking about the average society: what's the relationship between log population and tool complexity? So there's the central dashed line, and then, I think, 95% intervals of the mean around it. There's a lot of variation in this. And what you see is things like this: the triangle of Malekula, Santa Cruz, and Tikopia sits below the expectation. What the correlation has done is drag them all down, like gravity, toward simpler tool kits. This doesn't tell us what the cause is, but it suggests there's some common relation among them: given their population sizes, their tool sets are simpler than their population sizes say they should be — given what we learned about population size from all the islands together. Does that make sense? Those are their offsets. Fiji is an interesting case, because it's got gravity from, you know, the three dumb brothers over here with their simple tools — it's being pulled down by them — and it's being pulled up by Tonga, which has a much more complicated tool set for its population size than you'd expect. So Fiji is pulled by both; it's correlated with both of them. What the model is saying is that Fiji has an offset that basically sets it on the expectation. But if Tonga were gone, the model says, Fiji would be below the mean, pulled down by the others; and if, in particular, Santa Cruz were gone, then Fiji would be above the mean, because it's close to Tonga. Whether it's a contact effect or a geographic effect — something about raw materials or whatever — we don't know; the data don't say. But that's how the model sees this. These varying intercept offsets deal with the whole matrix at once, and you've got these diffuse influences pouring out in every direction from all of the pairs in the data set. The model is just describing those correlations; it's up to you to figure out what they mean, if anything. Does that make some sense?
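The plot just described, roughly as the book does it and continuing from the sketches above (the lecture guesses 95% intervals of the mean; the width here is your choice):

```r
# counterfactual mean relationship, averaging out the spatial offsets
logpop.seq <- seq( from=6, to=14, length.out=30 )
lambda <- sapply( logpop.seq, function(lp) exp( post$a + post$bp*lp ) )
lambda.median <- apply( lambda, 2, median )
lambda.PI <- apply( lambda, 2, PI, prob=0.8 )

plot( d$logpop, d$total_tools, col=rangi2, cex=psize, pch=16,
    xlab="log population", ylab="total tools" )
lines( logpop.seq, lambda.median, lty=2 )
lines( logpop.seq, lambda.PI[1,], lty=2 )
lines( logpop.seq, lambda.PI[2,], lty=2 )

# overlay the pairwise correlations again, now in outcome space
for ( i in 1:10 )
    for ( j in 1:10 )
        if ( i < j )
            lines( c( d$logpop[i], d$logpop[j] ),
                   c( d$total_tools[i], d$total_tools[j] ),
                lwd=2, col=col.alpha("black", Rho[i,j]^2) )
```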
There's an appendix in the original paper with a ton of ecological covariates that they looked at, and mainly that was a wash, if I remember right. They didn't do a Gaussian process regression — that's my hobby, to add it. Well, I had already added the data set for the Poisson chapter, and then I needed a Gaussian process example, and I thought I'd do this. I should probably ask Michelle if she wants to publish it; that would be useful. I guess I'm about to publish the book, and it will be in there anyway. What would be awesome is if the ecological covariates washed all of this out, and you could remove all the spatial contamination — or what looks like spatial contamination — just by knowing the actual ecology of the islands. That would be cool. But I don't know; there are only 10 cases, so at some point there's a limit to what we can do. Anyway, other questions about this? No? Let me give you the mind-expanding summary before we transition to the next topic. Gaussian process regression has a bunch of different applications, and they often look completely different from one another, but under the hood it's the same idea: you're modeling the covariance in a high dimensional distribution, and essentially the model predicts all of the things at once. In this case it was parameters — you're predicting all those gamma parameters at once, they have a common prior — and the action in your model goes into the covariance matrix, not into the mean. This isn't completely different from what we've done: varying effects have an element of this, where we start modeling variation and how it's structured hierarchically, but this really takes it to another level. Takes it to 11, as we say in my house — people know that; if you don't know that reference, shame on you. So: there's a big literature where people study seasonality, particularly in the social sciences, using periodic functions in the covariance matrix definition — sines and cosines are the easiest. Think about time as a distance. You don't expect two measurements to get continuously less similar the further apart in time they are, because there are seasonal effects: eventually it's spring again, and the phenology of the plant returns. This is also true of all kinds of things with people. Conceptions are seasonal, for some reason — kids are conceived in a particular season, in North America at least; it's a mystery, but it's descriptively true. So there's a seasonal effect, and cyclical functions in the covariance matrix are extremely useful — famously, for predicting birthdays. There are these recurrent effects of the calendar: people don't give birth on weekends, for some reason, and babies don't tend to be born on Christmas — apparently, if the doctor won't see you, you will keep the baby inside, is basically what I think happens — and lots of babies are born on Valentine's Day; it's a famous effect. And even prior to these interventions, some of this was apparently true. So you can usefully model all kinds of seasonal, cyclical, periodic effects over distance by using sines and cosines inside your definition of K; there are tons of examples out there in the literature, and it's very useful to do. I already mentioned phylogenetic, or patristic, distance — there's a big literature on this. They don't usually mention that those models are Gaussian process models, but they are. What the phylogenetic tree is used for is to define a set of patristic distances.
The patristic distance from one species to another is the full path length through the tree: you just count up the branch lengths, going up and down. You make a matrix of those distances, and then you define a model of how phenotypic differences between species, or genetic differences, accumulate over time given the patristic distance — molecular clocks come into play, and models of selection, whichever of those you adopt. Then there's a function for taking those patristic distances and defining K for each pair of species, and that's how all those models work, in the mainstream work at least. There are some other kinds of model types, but nearly all of them are special cases of a Gaussian process as well, so you can unify them under this. Social networks, again: the network distance is the thing you compute, given different theories about how things move through networks — like the total path lengths that connect things in the network. Social network distance is metaphorical distance, but in physical networks you can actually measure it with flow: how long it takes for chemicals to diffuse through plumbing, say. You can do those things this way, but for information sharing in social networks this also works, and it's used quite a lot — again, the covariance function may not be exactly the same; sometimes people use the L2 norm one, and sometimes they don't. And then there's a very broad use where people use Gaussian processes to do splines — a form of non-parametric regression that's extremely useful, but still the same kind of model. What they're doing is this: inside the traditional linear model part of the regression there's nothing but a mean, and sometimes not even that — the mean is just zero — and they use terms like the one I've got at the bottom here, (log P_i minus log P_j) squared, for every covariate in the data. Between any two cases i and j, you just construct that distance, and it all goes into the covariance model. And it turns out you get splines estimated for the data from this, from the distances, because cases that are close in their values of the regressor on the x-axis are more similar in their outcomes. It's a very robust and powerful form of Bayesian splines, used a ton in machine learning, very useful. And you can add a bunch of predictors in, which leads us to the last thing: often we have more than one covariate. That's okay — just stick them all in the distance function. Obviously theory matters, but here's the most common approach. In the L2 norm, inside the exponent, we have a bunch of terms, and there are different rhos: a rho for D, which is the relevance of geographic distance in this case, and then we could take log population out of the typical linear model and insert it in here, and then the model says that islands with similar populations are expected to be similar. You'll get very similar predictions, but there will be pooling. This is the continuous category equivalent of varying slopes; it's the way to do it. So you get two relevance parameters, and it estimates them separately. These kernels can be quite complex — they can have periodic components; lots of fun stuff goes on with that.
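For reference, here's my reconstruction of the multi-covariate L2 kernel just described, with one relevance parameter per dimension — the exact form on the slide may differ:

\[
K_{ij} = \eta^2 \exp\left( -\rho_D^2 D_{ij}^2 \;-\; \rho_P^2 \left( \log P_i - \log P_j \right)^2 \right) + \delta_{ij}\,\sigma^2
\]

And, since periodic components came up a moment ago, one common way to get seasonality in — this is a standard periodic kernel, not something shown in the lecture — is a sine inside the exponent, where \(t_i\) is the time of observation i and \(T\) is the period:

\[
K_{ij} = \eta^2 \exp\left( -\rho^2 \sin^2\!\big( \pi\, (t_i - t_j) / T \big) \right)
\]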
Stunned? Okay — my goal here is to expose you to this and give you an idea of what's going on when you see it. There's nothing mystical about Gaussian process regression, and I think it's going to become increasingly popular, now that it's very easy to do on desktops — or cell phones, for that matter. So you're going to see it a lot more, and I want you to be able to ride that wave rather than be intimidated by it. That's the idea. Okay, any questions before I transition to the next part? Alright. For the rest of today's lecture I'd like to start the last major unit of the course. We'll get through about half of it today, I think, and do the second half on Thursday, and that will leave us time for the debriefing at the end, and to explain your final exam and stuff like that. What we're transitioning to now is the last working chapter, chapter 14, where we expand the horizon of what we can do by reconsidering what measurements are and how we can get the error in them into the model, and also by dealing with the fact that sometimes we don't have complete cases. We don't have the same amounts of data for all cases, but we'd like to use all the data, and in traditional approaches we don't have a way to do that. The nice thing about the Bayesian approach is that it gives you an easy way. Before I go on to flatter the Bayesian approach more, though, I should say where it doesn't give you extra power. I was explaining to someone in my office hours recently that there's this sweet spot in Bayesian inference where data sets are just the right size. With tiny data sets, it basically doesn't matter what you do, because you can't learn much anyway — so I don't care what model you use; you might as well throw darts at paper and say something. With tiny data sets it doesn't much matter, so you don't get extra power. Or, another way to say it: with tiny data sets, the prior is what matters most, and no statistical paradigm is free of that — even the ones that don't require you to define priors; the prior is still lurking there. In the intermediate range of data set sizes, Bayesian inference really comes into its own and has huge advantages, because computationally we can still do it: you can get full Bayesian inference, you can get exact posterior distributions, and it's easy to incorporate some of the issues that are uncomfortable for classical statistics, like the ones we'll deal with today — missing data and measurement error. With really giant data sets, millions of rows, it's computationally too expensive to do full Bayesian inference — you have to finish your PhD, and you can't wait for it to finish sampling all that data — so you're going to have to use some approximation or some other technique. As I said in the book, I think that because of this, approximations to, or alternatives to, Bayesian inference are always going to be necessary. Can you use a cluster computer or a supercomputer to run some of those?
Maybe — the question was whether you can use a cluster or something to run these things. It's hard to parallelize Markov chain computations. I'm hesitating; I'd say right now it's hard, but people are working really hard on it. Now that statisticians are done fighting over whether it's okay to be Bayesian — they spent, like, the whole last century doing that — they're knuckling down to work on computation, and there's been a huge amount of advance since the 90s on how to sample from these nasty likelihood functions that we need. So people are working hard on how to parallelize sampling from these distributions, and maybe in five years the answer will be different. Right now, basically: if you had a really, really super fast computer, you could run one chain really fast, and maybe that would be okay. But usually what we get in cluster computing is just distributed processing, and it's very hard to gain efficiency in this business from that, unfortunately, right now. So I think we're still in that sweet spot era — your data sets are in the sweet spot, so it's okay. But if you've got a 10-million-row consumer database, I would say don't be Bayesian; use something else, boosted regression trees or something. You still need regularization — that doesn't change. What I love about the culture of machine learning is that it's really into regularization, and that's what I like about the Bayesian approach too. But anyway — the world according to me. Alright, so let me introduce pancakes. These are pancakes. I can see you squinting — my spouse was like, those are not pancakes. Well, we'll call them pancakes; it's the best I could do. So these are three pancakes that I made, let's say. When I started cooking them, the skillet was too hot, so my first pancake is burnt on both sides — that's what the lines are, not my boot print. The second pancake is only burnt on one side: it's good on the down side. And the third pancake is edible; it's just right. So there are these three pancakes. You come over to my house, I've made these three pancakes, and I serve you one at random — because I'm an asshole. All you know is that you've gotten one of these three pancakes; it's on your plate, you can only see the top side, and the top side is burnt. And I want you, before you turn the pancake over, to tell me on the basis of probability theory — this will be on your final exam — what's the probability the down side of this pancake is burnt? Shush, those of you who know the answer — shush. I want people to think about this for a second. This is a classic probability paradox — really classic, like back-when-French-gamblers-were-figuring-out-probability-theory classic; Joseph Bertrand is the guy who posed it. Just think about it for a second: what do you think the probability is? Nearly everyone — let's say 80% of people who take probability theory courses — gets this wrong. The common response is one half. So if you were thinking one half, you're a normal human being; and if you weren't thinking one half, you might still be wrong — hang on. I said one half when I first saw this, so no shame. What I want to show you is what I've learned to do over the years with probability theory: never trust my intuition. It's absolutely terrible. My brain evolved to, like, hunt gazelle on a savanna or something; it wasn't evolved to do probability theory, and so, like most people, my intuition is bad. The nice thing, I think, about the Bayesian approach to inference is that you just apply probability theory: take all uncertainty and define it as
a distribution. You don't have to be clever; you just have to state the information you have and then let probability theory discover the implications. And I like that, because I'm not clever. If you're infinitely clever, then you can do very well in some other paradigm. But if, like me, you don't feel particularly clever about these things, the Bayesian approach has some real advantages. So let me walk you through what I mean about not being clever. In this case, I mean it's just applying the axioms of probability. Don't even stop and try to use your intuition; just resist. What's the definition of conditional probability? Conditional probability is this thing that lets us take what we already know and condition what we want to know upon it. That's what conditional probability means, and there are rules to do this, and they're easy. When you run your Markov chain, that's what it's doing: it's conditioning the model on the data. The data is what you know; you'd like to know the parameters, and that's what the posterior distribution gives you. The pancake case is a little bit similar, so let me run you through it real quick. Here's the definition of conditional probability in this case. We want to know whether the downside is burnt, and we know the upside is burnt. So by the definition of conditional probability, the answer is the joint probability that both sides are burnt, in other words the probability that it's the first pancake, the one burnt on both sides, divided by the probability of any top side being burnt, whichever pancake was dealt to me. So now we just have to figure out these components and plug them in. Just remember this rule and then brute force it; I call this being ruthless. The top part is easy enough; we'll get to that in a second. The bottom part is the only tricky part. Just remember this is an unconditional probability, so that means we average across all of the things that it could be conditional upon. There are three possible pancakes, and each has its own probability of having the upside be burnt. If it's the first pancake, the burnt-burnt pancake, that's what the BB means, then it's guaranteed that the upside is burnt, whichever side is up. For the second pancake that I made, there's a half chance that the top side is burnt and a half chance that it wouldn't be. And with the last pancake, if I dealt you that one, which obviously didn't happen, there's no chance that you would get a burnt side up. So this is all we need to calculate the probability that a burnt side is up, and it is one half, because the probability of each pancake is one third. With me? I know, this is fun. And then the probability of getting the all-burnt pancake is also one third. So the numerator is a third, the denominator is a half, and the answer is two thirds. If anybody got that, congratulations, you're awesome. And there are a whole bunch of probability paradoxes like this which only seem paradoxical because we use our intuitions to solve them. And modeling is often like this, in the sense that the output of a model can seem paradoxical because it violates our intuition. But if you believe the information you put into the model, then the only thing wrong here is your intuition. Now, sometimes your intuition is good, and you do have to check these things, so I'm not saying the model always wins. But what I value about pure probability is that I don't have to be clever; none of us do. We just have to know the information and state it, and then all the implications are discovered by the calculus of probability. It's the logic of science.
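If you don't trust the arithmetic, you can check it by brute force. Here's a minimal simulation sketch, not from the lecture slides, in the spirit of the pancake simulation in the book: deal pancakes at random, randomize which side is up, and count.

```r
# Simulate the pancake problem: pancake 1 is burnt/burnt,
# pancake 2 is burnt/unburnt, pancake 3 is unburnt/unburnt.
sim_pancake <- function() {
    pancake <- sample(1:3, size = 1)                   # serve one pancake at random
    sides <- matrix(c(1, 1, 1, 0, 0, 0), 2, 3)[, pancake]
    sample(sides)                                      # randomize which side faces up
}

# each column is one serving: row 1 is the up side, row 2 the down side (1 = burnt)
pancakes <- replicate(1e4, sim_pancake())
up   <- pancakes[1, ]
down <- pancakes[2, ]

# among servings where the up side is burnt, how often is the down side burnt?
mean(down[up == 1])   # converges on 2/3, not 1/2
```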
So I want to show you some examples that are useful in this sense, where we discover implications of things we know about our data that would be very hard to figure out unless you were infinitely clever. And there are infinitely clever statisticians out there who have found non-Bayesian solutions which are basically the same as this, but they had to be really clever. The Bayesian approach is unglamorous in the sense that you just define the information, and then it's the model that's clever. So it's not nearly as glamorous; we don't need to be clever. I want to show you two really useful, practical examples that all of us face all the time in our data sets: measurement error and missing data. We'd like to incorporate uncertainty in measurements into our analyses. Usually we just ignore it: yeah, that wasn't measured very precisely, I'm just going to take the mean. We all do this. I've done it. It's bad. We should use all the information, and error is information; we should figure out what it means. The second is missing data. We'd like to use all the data, but what most of us do is throw away any case that has even one missing value, because that's all classical methods can do. And that's data deletion, the complete case treatment, and it loses a lot of power. In the average case, under benign conditions where the missingness is random, you just lose power; it doesn't bias estimates. But more realistic cases are probably worse. So we'll look at both of these. We'll get started on measurement error today and do missing data first thing on Thursday; well, we'll probably finish measurement error on Thursday and then do missing data.

So first, measurement nearly always entails error. The error can be reduced to be quite small in nice benign cases, but often it can't be. Almost all regression models assume some kind of error process on the measurement. In a classical linear regression, that sigma, the residual variance, is neutral about where it comes from: some part of it might be measurement error on the outcome or the predictors, or it could just be stuff we've left out of the model that's actually causing things. Remember, randomness in this class is a property of information, not of the world. It's not that the data is actually random; something made it the way it is, this is a deterministic universe, I hope, but we don't know the cause, so we call it random in statistics. What's random is the information, remember. So there is error, in a sense, accounted for in all statistical models, because none of them expect to predict everything exactly. This is not Newtonian mechanics, where you expect to be able to predict exactly where the cannonball lands; this is a different business. But often we have cases where the error isn't constant across observations. We're going to work with a case of that today, and also a case of error on predictors, which we'll probably get to first thing on Thursday.

So let's start with the case of error on an outcome variable. We're going to return to the divorce data set from way back in, when was this, chapter 5? Like forever ago, right? You guys were naive back then; now you're Bayesian rock stars. Remember the Waffle House density, correlated with the divorce rate. Okay, so we're going to focus on the divorce rate problem. A thing about the data that we ignored at the time is that divorce rate is measured with error. It's just a sample from each state, not the total population.
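If you want to follow along, the data ship with the rethinking package. A quick sketch of loading them and peeking at the relevant columns:

```r
library(rethinking)

# the Waffle House / divorce data from chapter 5
data(WaffleDivorce)
d <- WaffleDivorce

# Divorce is the measured divorce rate; Divorce.SE is its standard error,
# which varies across states (smaller states, bigger errors)
precis(d[, c("Divorce", "Divorce.SE", "MedianAgeMarriage", "Marriage")])
```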
So there's a column in the data set, the divorce standard error, which is the expected standard deviation of the sampling distribution of the value that was reported. And what's important is that the error is not constant across states, because small states have less certain estimates; they have bigger standard errors, because there's less data from them. So tiny states like Wyoming have big standard errors on their divorce estimates, while big states like New York are measured extremely precisely. And I think New York actually has a reasonably low divorce rate; for the longest time it was illegal to get a divorce in New York, which I think is why they all went to Vegas to get divorced. But that's an aside. When you look at the plots, median age of marriage is shown in the top panel against divorce rate for each state, and the vertical lines are the standard errors of each measurement. You can see there's a lot of heterogeneity in those errors, and some of that heterogeneity is correlated with the divorce rate that's been estimated as well. This is going to turn out to matter, because some of these points should have more weight in the regression than others, since some values are known with more certainty. And there are a bunch of ad hoc, non-Bayesian procedures that do just that: they construct weights based upon the precision of the estimates, and those can work reasonably well. But you have to be clever in order to do the derivations. We're going to be un-clever. We're just going to state the error, and we're going to let the model figure it out, and it will reveal some interesting implications. The bottom plot is just showing you that the standard deviation is related to log population, the magnitude of the population of each state. So California is all the way over there on the right, the biggest one, with a pretty low divorce rate, because its median age of marriage is high.

So here's the basic approach: we're going to treat the true divorce rate as unknown, which means it's a parameter. Everything you don't know in a Bayesian model is a parameter. That's 50 parameters, one for each outcome that we'd like to know. We're going to estimate a posterior distribution for each, and we're going to predict that posterior distribution with a Gaussian likelihood in the model, just like before. So here's the assumption, the way we get the measurement error information into the model: the observed rate is understood to be sampled from a Gaussian distribution for which we do not know the mean; the mean is the true value. What we've observed is the outcome of a sampling process with error. If the error is reported as a standard error, that means it's the standard deviation of a sampling process. We don't actually know that it's Gaussian, but we've only got two moments, so the best maximum entropy can do is call it Gaussian. We've got no other information, and assuming any other distribution is illogical here; it would imply you have other information, and we don't. So it implies there's a sampling process which has produced this observation; it has some unknown mean, which is the true value, and then there's some standard error, which we do know. In this expression there's data on the left, there's a parameter in the middle, and then there's data again in the standard error. This states the information we have to estimate the thing we don't know. And in the context of the full model, it looks like two regressions in the same model, estimated simultaneously.
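In symbols, and this is my reconstruction from the spoken description rather than a verbatim copy of the slide, the full model looks something like this, where D_EST is the vector of unknown true divorce rates, A is median age of marriage, and R is the marriage rate:

```latex
\begin{aligned}
D_{\text{EST},i} &\sim \text{Normal}(\mu_i, \sigma)
  && \text{regression on the unknown true rates} \\
\mu_i &= \alpha + \beta_A A_i + \beta_R R_i
  && \text{median age at marriage and marriage rate} \\
D_{\text{OBS},i} &\sim \text{Normal}(D_{\text{EST},i}, D_{\text{SE},i})
  && \text{measurement: observed rate sampled around the true rate}
\end{aligned}
```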
At the top are the estimates, the true divorce rates in each state, and that's the same regression we had before: there's an effect of age at marriage and an effect of the marriage rate, and we're interested in how each of those contributes partially to predicting divorce rate. You remember that story from back in chapter 5. Then there's this second thing, which also looks like a likelihood, where the observed value is predicted by a unique mean for each state, a mean we don't know, but with a known standard error. And so we get a posterior distribution for the estimates, the same quantities being predicted by the regression at the top. It's a little bit meta, I know, but this is exactly like everything you've done before. You just now have to appreciate that data, in the Bayesian paradigm, is a special case of a probability distribution where all the mass is piled up on a single value. When you're sure that a random variable has a unique value, that it's been determined, that's a delta function: you get a spike of probability mass on a single value. All the things we called data before in this course are like that. Measurement error spreads the uncertainty out over multiple values: you're not sure exactly where the value is, but you know it's in a range of possibilities, and that's what the standard errors let us define. Now, looking at this, I don't know about you, but I don't know what the implications are. But logic will figure it out, or rather your computer will figure it out; no need to be clever. And, oh, this is where I had some animation. The divorce rate estimate appears twice in the model. That's okay; it doesn't break anything. It has implications in both places, and because of that, information is going to flow between both of those Gaussian distributions. You can think of these as two different likelihood functions: there's a likelihood for each estimate and a likelihood for each observation, and the rest is just logic, as they say. All this does is state what we know and what we want to know. And fitting this with map2stan is just a matter of typing it; it gets passed off to Stan exactly as it looks, and it fits fine. The only thing to notice is that I turned off the WAIC calculation, because my WAIC function is not set up to recognize that the outcome is a distribution, so it won't properly integrate over it. You can compute WAIC in this case, but there's an extra loop involved, to integrate over the posterior uncertainty in the thing you're predicting. So if that didn't make sense, don't worry about it, like oh my god, is this on the test? No. But that's why I've turned off WAIC, just so it doesn't throw an error and reduce you to tears or something.
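For reference, here's a sketch of the fit in map2stan, along the lines of the chapter 14 code in the book; div_est is the vector of 50 true-rate parameters, and starting them at the observed values helps the chains:

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

dlist <- list(
    div_obs = d$Divorce,            # observed divorce rate
    div_sd  = d$Divorce.SE,         # its known standard error, one per state
    R = d$Marriage,                 # marriage rate
    A = d$MedianAgeMarriage         # median age at marriage
)

m <- map2stan(
    alist(
        div_est ~ dnorm(mu, sigma),          # regression on the unknown true rates
        mu <- a + bA*A + bR*R,
        div_obs ~ dnorm(div_est, div_sd),    # measurement process: observed around true
        a ~ dnorm(0, 10),
        bA ~ dnorm(0, 10),
        bR ~ dnorm(0, 10),
        sigma ~ dcauchy(0, 2.5)
    ),
    data = dlist,
    start = list(div_est = dlist$div_obs),   # start the 50 estimates at the observations
    WAIC = FALSE,                            # as discussed: the WAIC code can't handle this outcome
    iter = 5000, warmup = 1000, chains = 2
)
```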
Was that a hand? Yeah? So, like, which part means you have to turn off WAIC? The divorce estimate, being uncertain, has a likelihood, right? WAIC is computed over predictions; it's on the prediction scale. Here the thing you're trying to predict is your estimate, and that's uncertain, so you have to integrate over the uncertainty in the data. How does it know what to... I'll answer that question after class. Sorry, it's a great question: if you do sim or link, it guesses what you're trying to predict. It knows the model, yeah, so it gets the likelihood. Well, it's not the first thing you typed; the thing that looks like a likelihood is what it pulls out. It's using a very error-prone artificial intelligence to figure that out, but it knows the model because you typed it once. It won't work for all model types, though, and this is a case where it won't. I suppose if I toiled long enough I could make it work; that's what it needs to do, I suppose.

Alright, let me show you the consequence of this. On the left, what I'm showing you is the relationship between median age of marriage and divorce rate in the raw data. The open circles in the left-hand plot are the observed estimates, what we called divorce rate before, and the bars are their standard errors, plotted against median age of marriage. There's a strong relationship. On the right, the points are now the posterior distributions of divorce rate, and I think the bars are standard deviations of the posterior distributions as well, so that they're on the same scale. What I want you to see is that they've all shrunk. One of the reasons is that you've got states like the one in the lower right of the left-hand plot, which are extreme values; they have pretty extreme divorce rates. And in many cases, the points that are farthest out are the ones with the most uncertain estimates. And so pooling, yes, our friend pooling, pulls them towards the regression line, with the consequence that they have less weight. Information flows out of the estimates into the regression model, but also out of the coefficients: the line induces gravity, shrinkage, on the estimates of divorce rate and pulls them towards the regression line. It's the same shrinkage phenomenon we observed with varying intercepts, but now in another context. And again, you don't have to plan for shrinkage; you just get it for free as an implication of the logic of the model. What happens in this case, because of where the biggest errors are located, is that the blue regression trend here, the one with the less steep slope, is the one we get from the model with error on the divorce rates. So it has moderated the posterior estimate of the association between median age of marriage and divorce rate. You can think of it this way: the naive estimate that didn't account for measurement error got too excited by the outlying states. They had highly uncertain estimates, and we ignored that, so we got misled; we were overconfident. But I should say, measurement error won't always result in a moderation effect. It depends upon where the errors are. In this case it results in a moderation of the association, but that won't always be true; it depends upon the exact data and what's going on. Does this make some sense?
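As a usage sketch, assuming the fitted model m and data frame d from the block above, you can pull out the posterior true-rate estimates and see the shrinkage directly:

```r
# assumes the map2stan fit `m` and data `d` from the earlier sketch
post <- extract.samples(m)

# posterior mean of each of the 50 true divorce rates
div_est_mu <- apply(post$div_est, 2, mean)

# observed estimates versus posterior estimates:
# points pulled off the dashed identity line have been shrunk toward the regression
plot(d$Divorce, div_est_mu,
     xlab = "observed divorce rate", ylab = "posterior divorce rate")
abline(a = 0, b = 1, lty = 2)

# the most uncertain observations move the most
plot(d$Divorce.SE, abs(d$Divorce - div_est_mu),
     xlab = "standard error", ylab = "shrinkage (absolute change)")
```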
Yeah, this is pretty deep, I know, all this error on the outcome, so let me try to summarize it for you real quick. As I just said, the association could have increased instead; it all depends upon where the small states are and what their median ages of marriage are. I won't go into the details; I was tempted to talk about Rhode Island for a second, and things like that, Wyoming. But the divorce rate estimates do move from the observed values, and the reason is shrinkage and pooling. Small states have highly uncertain rates, so they have low influence on the regression, and the large states transfer information to them; they inform them. How do they do that? Through the regression line, because the regression line is what the model knows about the relationship between the predictor and the outcome. So the model is trying to get a posterior distribution of the divorce rate in each state, and that distribution has been informed, improved, by what you've learned about the relationship between divorce rate and median age of marriage from all the states together. Again, you'd have to be infinitely clever to figure all this out on your own. In this case you just define what you know about the error, and the model figures it out, purely logically. I value that hugely. And information also flows the other way, right? The regression line is obviously informed by what we know about the states, and the states with uncertain posterior distributions, wide posterior distributions, inform it less. All of that happens simultaneously in the joint probability distribution that is the model. Okay, this seems like a good place to stop. Questions? No? Having fun? This is like the bonus round for the course here, because you're not going to do a homework on this. Let me put this slide up just to say this is where we're going to pick up on Thursday: we're going to stick with this example, and we're going to add measurement error on yet another variable in the dataset, so we'll have two variables with measurement error. We'll keep going. Alright, thank you guys.