Welcome to lecture 13 of Statistical Rethinking 2023. When I was a kid we were really into books. It was the early days of the internet and I guess, you know, books were pretty cool. So at the library we used to exchange these Choose Your Own Adventure books. They came in lots of exciting titles like Mystery of the Maya, Island of Time, Cup of Death. What was great about these books is that, well, they were distracting and entertaining, but they're the kind of book where you get to choose branching paths, like the garden of forking paths, the garden of forking data from the first half of the course. You reach a particular page, you're given a choice, you choose which page to turn to next, and the story continues from there. Many, many stories are embedded in a single book through this branching path mechanism. So for this particular book, The Island of Time, this is its map, and there are 12 possible endings. From a causal inference perspective, these books are relevant because you can explore counterfactuals, see what would have happened if you had done something else, the sort of thing we can never really do in real data analysis. Not all of these adventures are as simple as the previous one. The Mystery of the Maya, for example, has many, many more possible endings, 39 in total, many more branching decisions that lead to those, and the possibility of returning to the start at various points, for looping paths. A directed cyclic graph, if you will. Why am I talking about kids' books? Well, in this course, I encourage you to follow this idea of drawing the Bayesian owl: there's this five-step plan to success where we start with a clearly defined theoretical estimand.
Then we use that to sketch out some causal models and make them into real generative models that can produce synthetic data. Then we use those two steps to build statistical models, through the logic either of do-calculus or by expressing the generative model directly as a statistical model. Then we need to test: quality assurance is necessary. We use simulations from the generative model to validate the estimator, and then, only then, are we ready to analyze the real data. The disservice of this outline, of course, is that you know just from the course material and the homework assignments that it's much more complicated than this. There are branching paths. It's not a straight line from one to five. There are lots of little subjective decisions that have to be made in between. It's much more like this, the mystery of the model. And there are many, many possible endings, more than 39 for sure. But reassuringly, when you reach a bad ending and your model doesn't work, you can return to the start and fix it. And those are the tools it is my responsibility to teach you. Part of that is, when drawing a real owl, for example, there are many technical skills involved, technical details from the choice of pencil to how you hold it, and everybody knows this in every art form. The analogy in statistical modeling is these little things about estimators, coding them, and making them run, like the partial pooling thing that I showed you in the previous lecture last week. These are things that don't really appear in the generative model, typically. But if we leave them out, we leave information on the table. We could do better if we pick up that information and incorporate it correctly. So I want to introduce you to your multilevel adventures, the sorts of branching paths that will lead you to good estimation. And from this point on in the course, I think the foundations are in place.
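The fourth step, validating the estimator on synthetic data, can be sketched in miniature. This is a hypothetical Python sketch, not code from the course; the true rate, the sample size, and the tolerance are all made-up values chosen for illustration.

```python
import math
import random

random.seed(1)

# Hypothetical generative model: each person shows the outcome with a
# known probability. The value 0.3 is an assumption for illustration.
true_p = 0.3
n = 2000
synthetic = [1 if random.random() < true_p else 0 for _ in range(n)]

# A deliberately simple estimator: the sample proportion.
p_hat = sum(synthetic) / n

# Validation: because we simulated the data, we know the truth, so we
# can check that the estimator recovers it to within sampling error.
se = math.sqrt(true_p * (1 - true_p) / n)  # standard error of the proportion
assert abs(p_hat - true_p) < 0.05  # roughly a 5-standard-error tolerance
```

Only after an estimator passes checks like this, on data where the answer is known, would we point it at the real data.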
Once you've got the basics of partial pooling, you've got a really strong foundation in advanced statistical modeling. And from there, you can choose to invest as little or as much as you want in any particular subarea. There's no obligation to learn it all. The way I think of it is that this course, up to the first multilevel modeling lecture, is a foundational thing that most behavioral scientists and biological scientists need. After that, you choose to specialize. So from here, I'd like to propose some choices, some branching paths, for all of you who've been following along in the lectures or taking the course. The first, of course, is to return to the start. Start again from the first lecture, come back through, take notes on what you didn't understand the first time, and really reinforce your foundation, because that foundation is the most important thing. After you've repeated that foundation, you'll feel much better, because you'll understand it much better the second time through. It'll give you a warm glow. And then you'll be in a much better position to decide what you want to do next. Second, you could, for the rest of this course, skim and index what remains. Don't put too much pressure on yourself, don't try to learn all the details. Just sit back, be entertained, see the kinds of applications that are possible, and index them in your mind, so that if you come across a research problem where you need something like that, you can come back to this material and go as deeply as you need to. Third, you could pick and choose. You could skip over whole examples, I will not be offended, and engage only with those topics that interest you, because from here on out it's going to get more specialized. This lecture and the next are not too specialized, but it will get increasingly specialized in the week after, when we'll look at topics like social networks and phylogenies, and it's perfectly fine not to be interested in those topics.
But if you are interested in those topics, then wait for those and focus on those. You have no obligation to focus on the other pieces. The fourth option is what I call the Bayesian flow, which is what I've been encouraging from the beginning. Try to learn enough in each part just to keep moving, just to hang on, and that's fine. I think it's actually a bit foolish to try to understand everything on the first go. Learn just enough to keep moving, so that you're learning a little bit, and then you can do it all again at some point. Or, better yet, stop and engage with your own research problems when you feel ready, and if you reach a stopping point, a block, a wall in your own research, then you can come back here and find some help. So the distinction that I need to prepare you for, so that you can branch out and choose your path, is this distinction between clusters and features in multilevel models. Basically every kind of advanced statistical model or machine learning model is some kind of multilevel model, so this distinction is very useful no matter what you end up doing later. Clusters are the groups, the subgroups, in the data: things like tanks in the tadpole example, stories in the trolley example, individuals also in the trolley example or in the tadpole example, or departments in the Berkeley admissions example.
These are subsets in which there are multiple observations. And then there are features, and these are aspects of the model, things we'd like to estimate, quite typically, that may vary by cluster. The way we program these different things into the models is the thing you want to get clear, and so I'm going to do another lecture today which basically repeats and gets a bit more involved in the material from the previous lecture. So in a sense there's not going to be anything new, and yet I hope it feels like it's a bit new, because it's good to reinforce this basic distinction and understand what's going on. This will give me an opportunity to start introducing some of the branching paths and the machinery of really making this stuff work. So, on the machinery, when you choose to add more clusters: often in a particular problem there's not just one kind of cluster that you want to structure by, not just tanks, not just stories, but also individuals and other things, maybe not just regions of countries but countries themselves as well. When you add more clusters to the data, I actually think this is the least complicated thing. You need more index variables, so it's just categorical variables like before, and you just need to add population priors for each. This is not the hard problem at all; it's just copying and pasting and renaming parameters, for the most part. Adding features is where the subtlety lies, and that's what we're going to talk about today. When you add features, you add parameters, sometimes quite a lot of parameters, hundreds or thousands or tens of thousands, and what this means is there are more dimensions in each population prior, because there are more aspects of each cluster which can vary, and this leads to additional complexity and additional interpretation issues. So I'm going to slowly move through an example, which I'll build over this lecture and the next, to help you understand this engineering, if you will, and why it's really useful for research.
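The copy-and-paste nature of adding clusters can be seen in a small generative sketch. This is hypothetical Python, not the course's R code; the cluster counts, the prior scales, and the second cluster type (call it families) are all assumptions for illustration.

```python
import math
import random

random.seed(2)

# Two cluster types, each with its own population prior. Adding another
# cluster type is just another index variable plus another prior.
n_districts, n_families = 5, 8
a_bar = -0.5                   # average log-odds of the outcome (assumed)
sigma_d, sigma_f = 0.7, 0.4    # scales of the two population priors (assumed)

# One varying intercept per cluster, drawn from its population prior.
a_district = [random.gauss(0, sigma_d) for _ in range(n_districts)]
a_family = [random.gauss(0, sigma_f) for _ in range(n_families)]

def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Each observation carries an index into every cluster it belongs to;
# the linear model just sums the relevant varying intercepts.
def p_outcome(district, family):
    return inv_logit(a_bar + a_district[district] + a_family[family])

for d in range(n_districts):
    print(d, round(p_outcome(d, 0), 3))
```

The point of the sketch is the structure: a second cluster type only adds one more index lookup and one more pair of hyperparameters, which is why duplication and renaming get you most of the way.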
One of the things that helps me not get too lost in the possible branching paths is to keep reminding myself what I'm trying to do. Right, this is step one of drawing the owl. And varying effects, for us, are a way to try to estimate unmeasured confounds. This is quite often what they're for: confounds in the soft sense of competing causes, or confounds in the hard sense of common causes that are unobserved. So the varying effects strategy, to remind you: we have unmeasured features of the clusters. We believe they exist, or we're worried that they exist, and these leave some imprint on the data. But we have multiple observations from each cluster, and this gives us the possibility to actually estimate things we have not measured. The reason we use partial pooling is because it gives us better estimates of those things, because it borrows strength across the clusters. From a predictive perspective this is important because it gives us better estimates and makes us more accurate in our predictions; it gives us regularization. From a causal perspective, there are inferential threats: the things we have not measured are the most terrifying things, but repeat observations within clusters give us some hope. You've already seen some examples of this. If you think back to the introductory causal inference lectures, there was this example of the influence of grandparents' education on the education of their own children and their grandchildren, and I introduced the idea of this haunting, that neighborhoods induce shared exposures between parents and their kids. This is the thing that stops us from doing a mediation analysis here, from measuring the direct effect of grandparents on their grandchildren. But if we have families in the same neighborhoods, then we have repeat observations on those neighborhoods, and then we can use varying effects estimates to estimate those unmeasured features of neighborhoods, potentially. In the trolley problem example, I ended that example by
talking about the fact that each individual had responded to, I think, 30 different trolley problems, and individuals vary a lot in how they make use of the subjective scale, and that adds a lot of noise to the responses. This is a competing cause. If we could estimate those individual features, let's just call it personality, that affect how reactive they are to the scale, that would help us get better estimates of the treatment effects. I think it's also quite likely in that particular experiment that these individual features also affect participation, and therefore those unmeasured individual features may be a confound, so it's even worse than that: there's sampling bias through those personality features. A political science example: I had a colleague back in California who was really interested in why some countries go to war, and why some governmental forms go to war. There's this kind of folk saying that democracies never go to war against one another. So you can imagine, in a schematic, you've got some time series, and there are periods in which different nations are at war with one another, and they have different governmental forms, G1 and G2, at that time, and you've measured some other stuff about them, like their economies and so on; those are the X1 and X2 variables of countries 1 and 2. The problem in inference here, which makes this a sort of never-ending debate, is that there are lots of potentially unmeasured things about nations that can influence all of these variables and therefore are confounds: things like their geography, their natural resources, their cultural history, and so on. And these are things that we might be able to estimate with repeat observations. So the point of all these examples is to say that we're interested in varying effects both from a predictive perspective, because they regularize, and from a causal inference perspective, because they're a chance for us to estimate unobserved, unmeasured confounds.
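The logic of estimating unmeasured cluster features from repeat observations can be sketched numerically. This is a hypothetical Python sketch with made-up cluster effects, and it uses a simple moment-based shrinkage weight rather than a full Bayesian model, just to show the direction partial pooling pulls the estimates.

```python
import random
import statistics

random.seed(3)

# Unmeasured cluster features (assumed values): each cluster shifts a
# continuous outcome by its own effect, which we never observe directly.
true_effects = [0.0, 0.5, -0.5, 1.0, -1.0]
n_per_cluster = 4      # few repeats per cluster, so raw means are noisy
sigma_obs = 1.0        # within-cluster noise scale (assumed)

samples = [[t + random.gauss(0, sigma_obs) for _ in range(n_per_cluster)]
           for t in true_effects]
raw_means = [statistics.mean(s) for s in samples]
grand_mean = statistics.mean(raw_means)

# Moment-based shrinkage weight: how much of the spread in raw means
# looks like real between-cluster variation rather than sampling noise.
var_between = statistics.variance(raw_means)
var_within = sigma_obs ** 2 / n_per_cluster
w = var_between / (var_between + var_within)

pooled = [grand_mean + w * (m - grand_mean) for m in raw_means]

# Every pooled estimate sits between its raw mean and the grand mean:
# the clusters borrow strength from one another.
for raw, pool in zip(raw_means, pooled):
    assert abs(pool - grand_mean) <= abs(raw - grand_mean)
```

In a real multilevel model the pooling weight is learned jointly with everything else, but the direction of the pull, noisy cluster means shrinking toward the population mean, is the same.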
There's an alternative approach called the fixed effects approach. If you watched one of the bonus rounds from last week, I described it there, but if you didn't, very quickly: fixed effects are varying effects with an infinite standard deviation, and so they don't do any pooling at all. In effect, they use only the data from each individual cluster to estimate the features of that cluster. This leads them to overfit, yes, but they have some advantages in dealing with group-level confounds, as I described in that bonus round. They have the disadvantage of not allowing you to study cluster-level causes. In the bonus round I argued that there are plenty of times when fixed effects are fine. In this lecture I'm going to show you one of the reasons that, in realistic research with finite sample sizes, they're often quite impractical. And the good news is they don't offer any unique benefits over varying effects; we can deal with group-level confounds with varying effects as well, as I described in that bonus round. But in general, with all these terms flying around, like fixed effects and varying effects, don't panic. Draw your assumptions, make the generative model, and focus on that to get your thinking straight first, and worry about the estimator later, because most of the big problems in research are about getting the story straight. Okay, but there are lots of practical difficulties, and that's sort of what we're focusing on here. Varying effects are great as a default. I often tell people that we shouldn't be making excuses to use varying effects, we should be making excuses not to. And sometimes there are good excuses not to, and sometimes it's the practical difficulties. For example, how do we use more than one cluster at the same time? As I told you, you duplicate, but it's easier said than done in many cases; you need examples if you're going to make progress there. Calculating predictions with varying effects gets trickier, because now you have to talk about at what level you're making predictions. Are you making
predictions for whole new clusters, or are you making predictions for new elements inside of previously observed clusters? Those are really different kinds of predictions. I'm going to say that again, because I know this is weird. Imagine we were making a prediction for a new tadpole in an existing context that we had varying effects estimates for. That's a different kind of prediction task than making predictions for new, unobserved tanks that tadpoles might appear in. We need to use different parameters from the models to make those different kinds of predictions. And then there's just this issue of drawing the owl: how do we get the chains to sample efficiently? As you make models more complicated, this gets harder, but don't worry, I am here, you don't have to go it alone, and I'll show you the tricks. Fourth, group-level confounding. This is a real threat in models, and it's often ignored in the varying effects literature. It's something to really think about, and in the bonus round from last week I showed you one effective way to deal with it that's essentially equivalent to the fixed effects approach. The example I want to stick with in this lecture and the next is the 1989 Bangladesh fertility survey. Bangladesh is a very densely populated country, and in the 1980s a typical woman, by the time she finished her reproduction, would have had seven or eight kids, I think that was the median, and these days it's around two. So there's been a radical demographic change in the last decades in Bangladesh, as in many parts of the world, much of Asia. There are many good reasons to study this change, both descriptively and causally: descriptively because countries need to understand what's happening to them so that they can plan appropriately, and causally because as human scientists we want to understand why this happened. So I'm going to give you a relatively modest data set here that's in the rethinking package, it's data(bangladesh): 1,934 women from 61 districts
in Bangladesh, and Bangladesh is a highly variable place, with cities and rural areas. The outcome we're going to be interested in is contraceptive use. This was a survey from the late 1980s, when contraception was just being heavily pushed by the government. We have some other variables we're going to think about too, just to make this spicy: the age of the woman who was interviewed, how many living children she had at the time, and whether her location within the district is urban or rural. Lots of people request more advice about how to draw DAGs, this is extremely common, and my response is always: well, you need domain expertise. That's true, but I think there are some heuristics as well that can help. So let me try, with this case, to show you how I would go about drawing a DAG with just these variables. Of course there are other variables that matter, but hang on, we're going to get there. Let's not punish ourselves, let's start simple. The variables we're going to nominate so far: we've got contraceptive use, that's our outcome of interest; we've got the age of the woman; how many living kids she has; urbanity, I'm calling it, does she live in an urban area or not, or in principle you could make that a continuous variable, how urban the space is or distance from an urban center; and then the district, which is a variable that's just an index, but remember, we're interested in it because it's going to represent unmeasured things about districts that may affect all the women and all the families within them. The first thing you can do is focus on the causes of interest. What are the causes of interest, why are you doing this research at all? In this case, we're going to think about the idea that we would like to, at minimum, describe the association between age and contraceptive use, and between family size, how many kids the woman has, and contraceptive use, and possibly even estimate causal effects of these things, although, as we'll see,
that's not so easy to do. Then there are competing causes, other things that influence the outcome that we're not necessarily directly interested in, but that we may want to stratify by so we can get better estimates or deal with confounding. So naturally the district may influence contraceptive use, because there may be many things about the economy of a district, or its resources, or its population density, or its cultural history, which may also influence the behavior of people in it. And urban spaces and rural spaces are quite different as well. Then there are relationships among those causes. In this case, nothing influences age. This is a great thing about the variable age: nothing influences it, because it's just a clock, and unless you have a time machine, you don't have an arrow into age. But age can influence other things, like how many kids you have: the longer you've been alive, the more kids you can possibly have. And urban living may also influence kids, because of the cost of them. And the district you're in could easily influence urban living, because some districts don't have cities. You could draw more arrows here. I'm not making a strong argument for this particular diagram, I'm trying to stimulate your imagination, but I think these relationships are a plausible start, so that we can do something useful and educational. There are also unfortunate relationships. You shouldn't just draw the stuff that's easy for you. Imagine the haunting, imagine the ghosts in the dark. So for example, here's a group-level confound: the unmeasured things about districts may also influence features of individuals in those districts, not just the group-level variables like whether there are cities. For example, there may be things about the history of particular districts which influence family size; it could be the ethnic composition of a district, for example. And then imagine stuff you haven't measured that may also haunt you. In this case, families can be quite
large, and remember I said that around 1980 a typical woman would have had seven or eight kids. So there are going to be a bunch of sisters within a family, and they may be similar in their contraceptive use and their family sizes because of common socialization. This is another kind of cluster variable, but if we don't have it in the data set, then it could be a confound. Okay, let's get started. Don't panic, we're going to start real simple and build up one step at a time. We're just going to think about building the tadpole-level version of this data set, and what I mean by that is we're just going to cluster by district and describe the variation in contraceptive use by district. This is already a lot to do, and there's lots of advanced statistical machinery in it already, so we don't need to push too much harder until we've got this done. Often, when you're doing a data analysis on a structured data set like this, this is the place to start: get your varying effects structure in place, get it to work right, because there's often some tinkering to do, as I'll show you, and then worry about the causal effects second. But whatever you do, never try to go straight to the end point and put in all of the variables you think you need. You've got to build it one step at a time, because something's not going to work, and if you've done it all at once, you won't know what's not working. One step at a time: test, and keep going. So our estimand is very modest right now. It's just contraceptive use in each district, and we're going to use partial pooling, and you'll see that's necessary for this survey, because the coverage is not very uniform and in some districts there just isn't a lot of data. So we're going to estimate a varying intercept for each district, and from my perspective this is really just another chance to help you understand partial pooling, because let's face it, it's weird. Here's the model we want. This has the same structure as the tadpole model from last week, but still I
want to remind you of the pieces and what they mean. The first line is the distribution for the observed outcome variable: it's Bernoulli with some probability p sub i, where p sub i is a function of the log-odds parameter alpha, the log odds of contraceptive use in each district. Then we have our regularizing prior for districts. This is a prior with parameters inside of it, and that's the third line of this model. These are the alpha j's, and then alpha-bar is the average district, the log odds of contraceptive use in the average district, and sigma is the standard deviation among districts. Here's the code, and it looks very much like the code from last week, but I want to call your attention to just one thing, and we'll get to it in a moment. The top line is not new at all: this is our Bernoulli outcome, a zero-one outcome for contraception, because that's the way it was recorded. Then we have our link function, logit, for the probability, and we have a as our log odds, bracketed by district D. And here's the new bit: I'm defining the vector a not by bracketing with D, but by explicitly declaring its length as 61. There are 61 districts in this survey, but not all of them were surveyed, and so for some of them you don't have any data, and if you bracket by D in that case, you're going to end up with an error, and I wanted to save you from that problem. But otherwise it's all the same: you've got a-bar and sigma inside the normal, and then the priors for those two parameters. Those parameters are often called hyperparameters; hyperparameters are parameters that determine other priors. It's a weird thing, I know. And then sigma gets the exponential. If you run this model, you will not encounter any difficulties, and it samples extremely efficiently, but you do get a terrifying number of parameters. I think there are 63 parameters in this model. That's really not that many compared to things that we could do, but it's not the kind of thing where you can just stare at the
coefficient table and understand it. As always, we need to push out predictions and understand from the posterior predictions what the model thinks. To foreground that, you have to appreciate the variation in the amount of sampling that was done in each district. In some of the districts there are almost 120 women sampled, like district one, and in other districts there are very few, like district three. I think there are two women sampled in district three; district 49 has three women, for example; and poor district 54 has none. There are definitely women living in district 54, but none of them wanted to talk to a fertility researcher. This variation is the sort of situation where partial pooling is a huge help, because there's much more evidence for some districts, and so we can be very confident about the estimates there, but much less for others. So let's think about posterior predictions, and I'm going to layer them on. Here's the raw data. These black circles are calculated just by taking the number of women who reported using contraception in each district, where the districts are arranged on the horizontal axis from one to 61, and dividing that by the number of women who responded at all in each district, and then you get a proportion reporting, and that's what these black circles are. You can see, for example, that in district three all of the women who responded, which I think is two, reported using contraception. Then we put on the posterior means for each district. These are the partial pooling estimates, just the posterior mean for each district, and as you should expect by now, just like in the tadpole example, they're shrunk towards the mean. That's the effect of pooling information across districts: districts that don't have a lot of data get pooled more. And then here are 89% posterior intervals, to give you an idea that there's uncertainty about these. We don't know the mean, right? There's still a lot of uncertainty in each of these cases. And I want to spend a
little bit of time studying this graph so you understand what we called shrinkage last time: why some of these estimates, some of the red circles, which are the posterior means, are very far away from the black circles, which are the empirical means, while others are right on top. This has to do with, well, you guessed it, how much data there is in each district. So let's take some interesting cases here and just label the sample sizes. Starting on the left, for district three only two women responded, that's where that two is on the black circle, and you'll see that the model is not fooled by this. It is not at all confident that all women in district three use contraception. It would be a foolish model if it thought so. However, if you used a fixed effects model, that's what it would think. Then, moving from left to right, we've got some districts with really low reported use, 10 and 11, and they have modest sample sizes as well, and their estimates are shrunk towards the mean, so the model does not think that almost no one uses contraception in those districts. There's a low one with 14 observations in the middle, and all the way on the right you'll see the ones at the bottom: district 49 with only four women, and then one with six and one with 10. I've also highlighted a couple of districts that have large sample sizes and have sort of pulled their district-level estimates up above the mean; you'll see those labeled with the 35 sample size and the 45 sample size. More women in those districts use contraception than was typical in Bangladesh in 1989, and the model has agreed to be pulled up, because there's more evidence in those cases. Okay, the point here is just to understand that all of this is logical. The model is just following the logic of probability theory and doing what's necessary, and that's why the shrinkage happens, and it's also why sometimes it doesn't. An interesting case is this district
right over here, and you'll notice there's no black dot in this column, where I've marked no data, because that's the district where there's no data at all, district 54, I think. Nevertheless, there's a posterior distribution for it. Remember, the minimum sample size for Bayesian analysis is zero, because you have a prior. In this case it's an informed prior, because it's been estimated from all the other districts, and so that estimate, the posterior mean at the red circle and the 89% interval for the district where we have no data, is in essence a prediction that has been educated by the data from all the other districts. But it's not that the model thinks this district must be like all the other districts. It uses the variation as well, because remember, we're estimating that sigma parameter, the variation among districts, and that's why the interval is a bit wider, as you can see, than for the typical district: the pink band is wider than it is for most other districts. Okay, that's partial pooling, and after a while you get really used to it and you come to expect it. One of the funny things about it, of course, is that it's a reason we don't want the model predictions to exactly recapitulate the sample. I'll say that again: we don't usually want the posterior predictions of the model to exactly recapitulate the sample, because remember, there are features of the sample which are not regular, and we're trying to regularize. That's why we use this approach. Note that we have done no inference yet, so we can think about that next. What about urban living? Urban spaces impose different economic costs on people, and crowding, and there are cultural effects of urban living as well, and opportunities for labor which compete with reproduction, and so on. There are many, many theories about this, but typically urban populations have smaller family sizes and greater uptake of
contraception than rural populations. So let's take a look. Can we at least describe the association between urban living and contraceptive use in Bangladesh in 1989, and maybe squint really hard at it and convince ourselves it's a causal effect? The issue here, right away, is that district features are potential group-level confounds, and different districts have different levels of urban development, so we need to also stratify by district, and we want those varying effects in place. Ideally we'd use the Mundlak machine that I talked about before, but I don't want to overcomplicate this example, so if you're interested in that, you can go back to that bonus round and take a look. The total effect of U, the variable which indicates how urban the place where the woman lives is, passes also through K, kids, in my DAG. You see that it's got a direct effect and an indirect effect. So this is just a reminder: you don't want to just throw everything into the same model. You've got to think about which estimand you're after, and only use the right adjustments for it. If you did stratify by kids, you would block part, perhaps most, of the causal effect of urban living. Here's the model. There are a lot of choices about how to parameterize this, and I'm going to choose one that I think is the most broadly useful, because it's a structure that lots of people use when they make these sorts of models, and that is to add a slope, if you will. So U sub i in the data set I've given you is an indicator variable: it's zero when the woman does not live in a city, and it's one when she does. And we're going to have a coefficient in front of that indicator, beta, and there's going to be a different beta for each district, a whole new vector of parameters, because we're going to let the effect of urban living vary. The variable U sub i effectively just turns beta on in different lines for different women, depending on whether they live in a city or not. Nothing changes for
alpha sub j we add beta sub j it's the same kind of structure we've got a beta bar which is the mean effect of urban living for across districts and then a scale parameter tau is that weird cute-looking t there and tau is just like sigma it just has a different name so alpha sub j is is the regularizing prior for rural beta sub j is the regularizing prior for the urban effect it's the difference between rural and urban contraception rates on logout scale within a particular district j and then we have the averages and the standard deviations as before so in a sense it's it's the same model we've just got some duplication and renaming that's been done and the code shows that think quite clearly we have another vector of link 61 but it's named B and it's got its own mean and standard deviation okay when you run this not all is well you're likely to get some scary messages like this warning 4 of 2000 transitions ended with a divergence and then there's a link never a good sign warning three or four chains had an e b f m i whatever that means less than 0.2 is that good it sounds bad there's a word warning in front and again there's a link never a good sign what's gone wrong here well actually the Markov chain is is fine in this particular case but it's worth paying attention to these warnings because there's a way to fix them and when you fix them you'll have a lot more confidence that the estimates are good and it'll take less time to get them it'll make it more efficient if you look at the pracy output for this model you'll see that the n f the effective sample size and the our hats are not great especially for tau tau has where we took 2000 samples here and tau's effective sample size is 45 that sounds bad and it's our hat is way above one so and if you look at the trace plot and the trunk plots these are unhealthy chains yeah now if you run this model long enough you're you're going to get the right posterior distribution in this particular case but in some cases 
when you see warnings like this, no matter how long you run the model, it's not going to fix it. So it's worth taking a little bit of time to figure out how to recode the model so that it runs better. But let's do that after a break. Go back and review the first half of this, think about your branching paths again, maybe go find a copy of a choose-your-own-adventure book, read it through, and come back whenever you feel like it. I will still be here.

So before the break I introduced this idea that we could fix the problems with the Markov chain in the Bangladesh model that includes urban versus rural. The trick for doing it is going to seem weird, I'm just going to tell you right now, and I'm not even going to explain why it works in any deep way. At the end of this week I'll do a bonus round to explain the details in more depth, I promise, but for now I want to keep moving, so we have a good flow and you can get the top-level concepts instead. The problem in this particular model arises from the fact that there are priors inside of priors; that is, we have parameters that define the shape of the priors for other parameters. There's nothing wrong with this. In fact, it's essential for making varying intercepts work. But it can be challenging to sample, because it creates awkward spaces for Hamiltonian Monte Carlo to cruise around in. This kind of prior, where parameters appear inside the prior for other parameters, like the priors shown in red on the screen, is called a centered prior. The idea is that there are parameters which center them, or locate them, in particular places. But we can re-express this exact same mathematical, statistical estimator without centering the priors. Here's the idea; I just want to give you the intuition. There's this thing called a z-score. The z-score is a standardized Gaussian deviation. You can calculate it for any sample from a Gaussian distribution by subtracting the mean of that distribution and then dividing by the standard deviation. So, this formula on the screen: z, or zed if you prefer, sub j is equal to alpha sub j minus alpha bar, so we're taking the particular value alpha sub j and subtracting the mean, and then we divide the remainder by sigma, and that's called the z-score. It's a deviation in a standardized normal distribution. It's used in a bunch of statistical tests. You can do this with any normal distribution; it's a perfectly harmless transformation, and you can always reverse it. In fact, to reverse it you just use this formula: alpha sub j is alpha bar plus the z-score for that j times sigma, and then you're right back on the original scale.

So we can use this trick to do all of the posterior updating on the z-scores, and what's nice about that is that the z-scores don't have any parameters inside them, because they're normal(0,1) in the prior. I'll say that again: we can use this trick to re-express the model so that the Hamiltonian Monte Carlo doesn't have to move around inside the distribution of alpha sub j. It just has to move around inside the distribution of z sub alpha,j, and that has no parameters in it; it has no hyper-priors. But it's the same model. I know this is a super weird trick, but it works. So we re-express the model in the so-called non-centered version on the right of this screen. These two models are the same model; they are mathematically identical. But when you use them in your computer to do Markov chain Monte Carlo, they're not equivalent. They reach the same answer eventually, but the one on the right is a lot more efficient. So let's code the model that way. The code looks worse, because it's got some extra lines, but it's much more efficient. The only thing to note here is that I have expressed these deterministic relationships for the alpha vector and the beta vector, same length, and I put "save" in front of them so that we get them back in the posterior distribution, even though they're not true parameters, because they're strictly functions of the other parameters: alpha bar, beta bar, z sub alpha, z sub beta, sigma, and tau. This model samples much better. And again, the less efficient model works; you just would have to run it much longer to have the same confidence in the posterior samples you get.

Okay, I know this is a strange trip, and it's just a weird fact about scientific research, as in art, that there's a bunch of technical stuff that's really annoying, that you have to deal with to make things work and get the final beautiful product. That's what this non-centering trick is like. You build your varying effects model based upon scientific principles and interests, some desired estimand, and then you have to wrestle with a Markov chain, like this glassblower has to wrestle with all the details of melting temperature and ovens and putting things together in the right way and how much lead is in the glass and so on. There's just no way to avoid that. If you want the beautiful end product, you have to deal with the technical monsters in between, and that's what this varying effects trick is like. There are other things about estimation and finite samples which are similarly just not what we got into this business for, but in the end it's all worth it.

Okay, and this is what we get now. I'm repeating that same kind of structured plot from before, where I'm showing the posterior predictive distributions for each district, but now I'm splitting it by rural and urban. So in the top plot we have rural: all the districts are arrayed from one to 61 on the horizontal axis, and then probability of using contraception on the vertical. This is not the same plot as before, because we've taken out the urban parts. And then on the bottom we have the urban parts of each district, the same sort of plot, shown in blue. There's a lot of complexity here and a lot to investigate, and we're not going to try to analyze this in too much detail. Just a few things I want
to point out. First, let me label the really extreme empirical values, the black dots, with their sample sizes, like I did before. This is the action of partial pooling, and there are a couple things to note here; this is another chance to appreciate how partial pooling works. In the cases where the sample sizes are small, you get a bigger difference between the red circle and the black circle, because there's more shrinkage towards the mean, because there's less evidence in that particular district. So you'll see, like in the top plot, where you get sample sizes of four or seven or six, there's a greater distance between the posterior mean in red and the raw empirical mean in black. The same is true on the bottom. We have even more extreme small sample sizes on the bottom, a number of districts where there are only two or three women from urban areas, because those districts aren't very urban; they only have small towns in them, really. And this illustrates another fact about partial pooling: as you begin to cut up a data set by stratifying by various predictors, and you will need to, because you have some particular estimand in mind, you're going to get smaller sample sizes in each unit, and in that case partial pooling becomes increasingly valuable, because the shrinkage guards against overfitting. The overall result here, as you probably appreciate already, is that there's more contraceptive use, often a lot more, in urban areas. But there's also a lot of variation in the urban areas, and it sort of seems, from eyeballing it, like there's more variation across urban areas of districts than there is across rural areas of districts. Let's just say that, on average, women living in urban areas use contraception more, but they also vary more.

So we can take a look at the sigma and tau parameters for these features, and that's what I've done here in the bottom half of this slide. These are the posterior distributions for sigma and tau, that is, the standard deviations of the rates of use across rural areas in red and urban areas in blue. And then I've added the prior here. This is something I often like to do in my own projects: superimpose the prior on a posterior distribution to make sure that the data did some work at all. Right? You want to see the posterior move away from the prior. So what's going on here, as you can see, is that there's actually a very wide range of standard deviations consistent with urban, and many of them are quite large. There's much more evidence about what's going on in rural areas. So this is not a confident assertion that there's more variation across urban areas, because there's just less data from urban areas: Bangladesh in 1989 wasn't very urban at the time. You want to think of this as another example of the case where sometimes what the posterior distribution tells you is that it doesn't know how much variation there is. That's what you're seeing for the urban areas; it could be consistent with variation that's lower than, higher than, or the same as rural.

Okay, a final thing to say about this, as a bridge to the next lecture. A more natural way, perhaps, to plot these posterior predictions is to have a graph like the one on the screen, where the horizontal axis is the probability of contraceptive use in the rural area and the vertical axis is the probability of contraceptive use in the urban area, and each point is a district. So the horizontal axis is computed using the alphas, and the vertical axis is computed using the alphas plus the betas for each district. What I've shown you so far is just the posterior means, and those dashed lines that intersect are showing you the 50% mark on each axis. So to the left we have less than half of women using contraception, to the right we have more, and above the horizontal line we also have more than 50% using. The first thing to appreciate, of course, is what you saw on the previous slides: there's more contraceptive use in urban areas. The other thing, though, that's revealed now, that was quite hard to see in the previous slides, is that there's a strong positive correlation between contraceptive use in the rural and urban areas within a district. I'll say that again: there's a strong positive correlation between contraceptive use in the rural and urban areas within each district. That's why this cloud of points goes up and to the right, and that correlation is extra information that we have not exploited in any particular way yet. I'm going to show you that it's also a feature of the uncertainty about these points. So what I'm going to do next is impose the whole posterior distribution on here, in the form of these 50% compatibility regions. This is not a beautiful plot, I know, but what are these? These are 50% compatibility regions of the joint uncertainty from the joint posterior distribution of the alphas and the alphas plus betas in each district, just to show you that they also tilt up and to the right. There's a correlation in the uncertainty as well. This is a little easier to appreciate if we sub-sample down to about six of them; you can see it better here. This correlation, both in the posterior means and in the whole posterior distributions of these parameters alpha and beta, is information that we can use to make even better estimates, actually. But I'm going to leave that story for the next lecture, because it's a bit involved. The good news is there's more information on the table, and we can do even better by tinkering with the machinery, using the right pencils to draw the owl.

Okay, to summarize: what I've tried to show you here is an example of how to develop additional features in a multi-level model, and how to begin to think about the covariation among those features. In this case the features are rural versus urban within each district, and in the next lecture I'm going to show you how to get some extra value out of the joint uncertainty in the features. Okay, we're in week seven.
We're doing yet more multi-level models, and this will be a launching point for going forward into special topics that use these estimation technologies in future weeks. I'll see you next time.
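[Editor's sketch] The z-score trick described before the break can be checked numerically. This is a minimal illustration, not the lecture's own model code; the function names and the hyper-parameter values below are invented for the demonstration, and only NumPy is assumed:

```python
import numpy as np

def noncenter(alpha, a_bar, sigma):
    # Forward transform: z_j = (alpha_j - a_bar) / sigma.
    # The resulting z-scores have a Normal(0, 1) prior with no
    # hyper-parameters inside it, which is what makes sampling easier.
    return (alpha - a_bar) / sigma

def recenter(z, a_bar, sigma):
    # Reverse transform: alpha_j = a_bar + z_j * sigma,
    # which puts the values right back on the original scale.
    return a_bar + z * sigma

# Illustrative hyper-parameter values and 61 district intercepts,
# echoing the 61 districts in the Bangladesh example.
rng = np.random.default_rng(13)
a_bar, sigma = -0.5, 1.2
alpha = rng.normal(a_bar, sigma, size=61)

z = noncenter(alpha, a_bar, sigma)
alpha_back = recenter(z, a_bar, sigma)

# The round trip is lossless: the centered and non-centered versions
# describe the same model, just in different coordinates.
assert np.allclose(alpha, alpha_back)
```

In a non-centered model the sampler updates the z-scores directly, and the alphas are computed deterministically afterward, which is why they come back as saved quantities rather than true parameters.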
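[Editor's sketch] The shrinkage pattern in the posterior predictive plots, more pooling toward the mean when a district has fewer observations, can be caricatured with a precision-weighted average. This is a simplified normal-normal approximation, not the lecture's actual binomial model; the function name and all numbers are made up for illustration:

```python
def partial_pool(raw_mean, n, grand_mean, sigma_between, sigma_within):
    # Precision-weighted compromise between a district's raw mean and the
    # population mean. Small n gives a small weight w, so the estimate
    # shrinks further toward grand_mean; that shrinkage guards against
    # overfitting the noisy small-sample districts.
    w = (n / sigma_within**2) / (n / sigma_within**2 + 1 / sigma_between**2)
    return w * raw_mean + (1 - w) * grand_mean

# A district with 2 women shrinks much more than one with 100 women,
# even when both have the same raw mean (all values illustrative).
small = partial_pool(raw_mean=1.0, n=2, grand_mean=0.0,
                     sigma_between=1.0, sigma_within=1.0)
large = partial_pool(raw_mean=1.0, n=100, grand_mean=0.0,
                     sigma_between=1.0, sigma_within=1.0)
assert small < large  # the small-sample estimate sits closer to the mean
```

This is why the gap between the red posterior mean and the black empirical mean is widest for the districts labeled with sample sizes of two or three.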