Welcome to lecture 18 of Statistical Rethinking 2023. One of the difficulties in teaching this course is that I have to find a way to move your mind away from the narrow perspective of introductory statistics and help you see that those kinds of approaches and models are very special cases. It's not the man who is moving but the river, and once you realize things like this, lots of solutions to ordinary, realistic scientific problems open up. There are still coding challenges, but the conceptual part is the most important one, and that is to get your mind to move, and to stop seeing only the river or only the man. In the previous lecture I showed you a much more realistic kind of analysis. Now, finally, as we're almost done with the course, we come to a data set that has lots of missing values. To remind you, this is the primate family tree. At the end of each branch there is a species, and the dots represent different variables. There are just three of them, and we'll become reacquainted with these variables later in the lecture. Where there are missing dots, there are missing values: we simply don't know the values for those particular species; they haven't been measured. I'd like to think that in the previous lecture, when we spoke about measurement error, I started to convince you that an observed variable is a very unusual thing, which is to say, an observed variable with no error or uncertainty of any kind. In this lecture I want to double down on that intuition and convince you that missing data is quite typical. In fact, it's the ordinary case. Observed data, that is, data that we treat as if it had no uncertainty at all, is a very special and unusual case in real research. Sometimes it does occur, because our measurement device is really, really good and we're measuring exactly what we want to measure, but most of the time we're just fooling ourselves into believing that we have measured something with no error at all.
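Before going on, here is a minimal sketch of why that self-deception matters (my own illustration in Python; the course materials use R, and nothing in this block comes from the lecture itself, including all numbers). Treating a noisy measurement of a predictor as if it were exact attenuates a regression slope toward zero:

```python
import random
import statistics

random.seed(3)
n = 1000

def ols_slope(x, y):
    # ordinary least-squares slope of y on x
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# True relationship: y = 1.0 * x + noise
x_true = [random.gauss(0, 1) for _ in range(n)]
y = [xt + random.gauss(0, 0.5) for xt in x_true]

# A noisy measurement of x, treated as if it were exact
x_obs = [xt + random.gauss(0, 1) for xt in x_true]

slope_true = ols_slope(x_true, y)  # near the true value 1.0
slope_obs = ols_slope(x_obs, y)    # attenuated, near 0.5
print(round(slope_true, 2), round(slope_obs, 2))
```

With measurement noise as large as the true variation, the estimated slope is roughly halved, which is the classic attenuation factor var(x) / (var(x) + var(error)).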
So measurement error, the topic of the previous lecture, is a set of methods for confronting that uncertainty directly when we have some information about the measurement process, so that we can put bounds on the uncertainty of a measurement. In this lecture we're going to think about data that are completely missing, that is, cases for which we have no measurement at all. This is what is usually called missing data: some cases are unobserved. Now of course there could be whole individuals, or cats, or whatever it is you're measuring, that are also unobserved, and you could call those missing data too, but we usually think of that as a sampling problem. I think it'll start to soak in that it's really the same kind of problem, and we can approach it through probabilistic modeling as well. In either case, missing data are not really missing when you have a generative model, because just like in the measurement error case, we do know something about them: there are implications that come from the generative model for cases we have not observed.
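As a small sketch of that idea (again my own Python illustration, not course code, with invented numbers): once a generative model relates x and y, a case whose y is unobserved still has an implied predictive distribution, pinned down by the relationship estimated from the observed cases.

```python
import random
import statistics

random.seed(1)

# Generative model: x ~ Normal(0, 1), y = 0.7 * x + Normal(0, 0.5)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 0.5) for xi in x]

# Suppose case 0's y is unobserved; the model still implies a
# distribution for it, informed by the other observed cases.
xs, ys = x[1:], y[1:]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)
resid_sd = statistics.stdev(
    [b - my - slope * (a - mx) for a, b in zip(xs, ys)])

# Implied predictive distribution for the missing y[0]:
# Normal(pred_mean, resid_sd), approximately
pred_mean = my + slope * (x[0] - mx)
print(round(slope, 2), round(resid_sd, 2))
```

A full Bayesian model does this updating automatically and also carries the uncertainty in the slope itself; the point here is only that the observed relationships imply both a center and a spread for the missing value.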
We know, for example, the constraints on the variable, because we know something about its nature. So suppose we were studying human height. You haven't measured a particular individual's height, but you know that it's positive; that's a constraint. You also know something about the valid range: it's not going to be seven meters. The other thing you often know about missing cases is their relationship to other variables, and you know these things through the causal model. Those implications, carried through probabilistic modeling, can be very powerful and help you make use of the available measured data, when otherwise people just tend to throw it away. So in this lecture I hope to show you how we approach this, first conceptually: how do we draw assumptions about missing data into causal models? Then we'll talk about the sorts of procedures that people use to deal with missing data, because in most real research you will have some. One of the things that's often done is simply to drop all cases (those could be species, or countries, or whatever the unit on each row is) that aren't complete, that have any missing values in the variables of interest. Sometimes this can be justified, but it can only be justified by drawing causal assumptions; there's just no way out of it. And the same causal assumptions that allow you to drop cases with missing values will also permit you to do something called imputation, which lets you make use of the observed variables for the cases that are incomplete, and therefore recover some efficiency in estimating some of the parameters in the model. Let's start with the concepts up through the break, and then after the break we'll do some calculations. Okay, so let's begin with a totally abstract parable about missing data. Let's suppose, as is often rumored, that students don't like to do their homework and
sometimes the homework is not turned in, and there are various reasons for this. So we're going to look at a simple DAG here, where there are students who cause homework, and we're interested in the quality of homework from different students, that is, in grading them. But we don't get to observe all homework from all students, because some students just don't turn it in, and we don't know how good their work would have been if they had done it. So I'm going to draw a dotted circle around homework, meaning we haven't observed the true vector of homework values. Instead we've observed a descendant of it, H*, which is also a vector, but it's got missing values in it: for the students who didn't turn in their homework, there are just blanks. This is what we usually mean when we say missing values: we've observed some cases and not others. Now, what can be done about this? Should we just ignore the students who have not turned in homework? What should we think, counterfactually, about their homework? Well, it depends upon what we believe is causing homework to go missing. Let's consider the simplest case. Say the students say, "my dog ate my homework." Here's the dog. This is a mechanism that leads homework to go missing, and so the descendant H*, the vector of homeworks with missing values, is influenced both by the true homework that was submitted (or would have been submitted, had the student done it) and by the dog that eats homework. In this particular DAG I want you to pay attention to the dog, that is, the missingness mechanism in general, and to what is influencing which homework the dog eats. In this particular case the dog is, as we say, random: no other variable in the graph influences it. To figure out whether any bias will arise from the missingness when analyzing only the present samples, we need to obey the do-calculus: analyze the graph with your eyes and think about biasing paths. Is there any biasing path connecting the homework to the cause of interest,
the treatment of interest, that is, the student's ability S? And there isn't. I'll show you an example in the next slide where there is, and you'll see how to work with that, but in this case there's not, because the dog is just a competing cause: it eats homework at random, it's not influenced by the student's ability, and we don't expect a bias in most cases. In this circumstance, as we say, the random dog is usually benign. I do say usually, because for particular measurements, sometimes even randomness can be bad. Here's a simple simulation, so you can play with this and think about it causally. There are a hundred students and a hundred dogs. Simulate standardized student abilities in variable S; then their homeworks are correlated with their abilities; and then the dog eats 50 percent of the homework, totally at random, without any regard for the quality of the homework, as dogs do. Then we have an observed variable H*, which has those missing values in it, and we can plot it. What I'm showing you on the right is the total sample in black, and the incomplete sample in red, and if we draw regression lines through these, you'll see that the slopes are about the same on average. You can run multiple simulations here to play around with this and reassure yourself that I haven't picked any special simulation. We don't expect a bias from missingness when the dogs are random like this. We do lose data, though, and that's not good: we lose efficiency and we lose precision. Okay, let's modify the parable of the dog and the homework, and draw an arrow from student ability to the dog, meaning the dog is influenced by the student's ability. Maybe, for example, students who study a lot work really hard on their homework and don't pay enough attention to their dogs. This makes the dogs grumpy, and the dogs eat the homework more, as an act of revenge. This is a possibly biasing path. Now, ignoring the missingness
mechanism, that is, the reasons the dog eats some homework and not others, could result in confounding, in the general, plain-English sense in which I often use that word, meaning that you will be misled. So this is a possibly biasing path. The way to think about this kind of DAG is that the dog eats conditional on a cause of homework. It doesn't have to be the student, but the student in this case is the cause we're interested in, and so this is the most alarming sort of situation. It can be benign, and I'm going to show you a benign case and then a less benign case; it depends upon the measurement scales and the functions that relate the variables. Unfortunately, you just have to think harder about the generative process in this business. So again, a simulation: 100 students, 100 dogs, 100 homeworks, the same relationship between student ability and homework quality. Now the dog is going to eat 80 percent of the homework, but only for students who are above average, that is, S greater than zero, because those students neglect to play with their dogs. Then we plot, and what I want you to see is that the black points (the total sample) stand alone at the higher values of S, because the red points (the observed sample) have been depleted in that range. I'll say this again: the black points for the total sample stand alone on the right-hand side of the graph, for students of above-average ability, because the observed sample, the red points, has been depleted in that range. Now, for linear functions like I've assumed here, you see the two regression lines, and it doesn't have much of an effect, because it's linear: it doesn't matter what value of S you're at, there's the same relationship to the quality of the homework on the vertical axis. But that's a special case that has to do with the functions I've chosen. The linear additive case is about the most benign situation you could assume, which is why people often assume it, but it's very implausible for lots of
situations, including homework. So let's imagine something a little more realistic: as student ability goes up, there's a ceiling effect on the quality of the homework. Eventually you just get everything right, and it can't get any better; we're not grading on penmanship here. So now the dog eats the homework of students who study too much, just as before, but there's a nonlinear ceiling effect on homework. I've chosen a particular exponential function for this, just to make a curve. So if you look at the black points plus the red points, that's the total sample, and you'll see how homework eventually stops getting better with increasing S; it starts to level off. Now we deplete the sample again for the students of high ability, because their dogs are upset with them (it's an act of vengeance), and we observe the red points. We fit straight lines, and you'll see we do get the wrong answer here, because we're using the wrong function. This is just to remind you that part of the reason we go from DAGs to writing generative models with functions is because the functions matter. Okay, this can unfortunately get worse, and often is worse in real data. Now let's draw an arrow from the homework itself to the dog. What could this possibly mean? Well, say the dogs have a preference for a certain quality of homework, bad or good. In particular, say students who do bad homework, and are aware of this, tend to feed their homework to their dogs. Now it's the quality of the homework itself, via the students, that causes certain homework to go missing. This situation is very bad. Now you've got a biasing path for sure, and unless you can model the dog, you can't close it; there's nothing you can do. The dog eats conditional on homework itself. This situation is usually not benign; in fact, it's a kind of nightmare among statistical problems. And again, a simulation: on the left I chose
linear functions again, back to the simple case, but now the dog eats 90 percent of the homework that is below average. So students are feeding their below-average homework assignments to their dogs. We plot again on the right, and the black plus the red is the total sample, but we only observe the red, because the bad homeworks, on the bottom half of the y-axis, are depleted. We fit two lines and we get a bias. I hope this is intuitive to you; it tends to make sense. This situation is very hard to recover from, unless you can model the missingness mechanism and you know something about the functional forms. Let me try to summarize. These are three idealized cases, and they're certainly not all the situations you can be in, because DAGs can get complicated and you can have missingness in multiple variables, so don't think this is the universe of possibilities. But these are the stereotyped cases, and you want to start by being easy on yourself and learning the stereotyped cases, so you can scaffold up your conception of the problems. The first one I showed you, at the top here, is what I call dog eats random homework. Sometimes in statistics this is called missing completely at random, but that's not a very memorable phrase; think about the dog: the dog eats random homework. In this sort of situation, if you want to drop incomplete cases, that is, cases where there are missing values, that's usually okay. It won't result in a bias, but it does result in a loss of efficiency. If we had other variables here that we needed to adjust by, we would be throwing away information that lets us estimate coefficients for those adjustments, and that's a shame; we want to use all the information on the table. But it's not wrong to drop incomplete cases. Number two, dog eats conditional on cause. In statistics this is sometimes dreadfully called missing at random, which again is
a completely uninformative sort of name. Think about the dog: the dog is eating conditional on a cause, some variable that we have observed, a treatment or something we want to adjust by. What we need to do is condition on that cause, stratify by that covariate, in order to deal with the missingness, but we must model it correctly, and that was the point of my example where there were diminishing returns on the quality of homework. There's no get-out-of-jail-free card here. And then finally, level three of the inferno; I hope you never find yourself here. The outcome of interest itself is causing missingness. It's extremely difficult to deal with situations like this, but they're not rare. Now, the dog story is obviously comic; it's meant to entertain you, draw your attention, and be memorable, so you can come back to these stereotyped cases. But there are lots of variables whose own value causes them to go missing. I'll say that again: there are lots of variables where the value itself increases the probability that you will not observe it. Think about income. You're trying to do a survey on income, and people in certain income ranges will be motivated either to refuse to answer or to lie. Now, a lie is not missing; you might call it measurement error, but that just makes it worse than missing in many ways. This sort of thing is not so unusual, unfortunately. I say at the bottom of this slide that this is usually hopeless. That's a word I don't tend to like to use, because I'm a hopeful person, but this is usually hopeless. You just want to report it honestly, say that this is a likely problem, describe the sample you have, and say you can't draw causal conclusions because of the missingness. That would be the honest thing to do, I think, in most cases. But there are times when hope springs forth and you can do something about it, if you know enough about the missingness mechanism. A really common case of this is
survival analysis. There was a bonus example in an earlier lecture, when we analyzed the black cat adoptions, which used survival analysis, and in survival analysis we nearly always have missing values: there are adoptions which have not happened yet. But we know enough about the missingness mechanism, and we have strong enough distributional assumptions about the adoptions that haven't happened yet, that we can recover those cases and do something about it. So it's not always hopeless. What do we do in those cases? Well, in all of these cases, the standard cases (dog eats random homework, dog eats conditional on cause) and exotic cases like survival analysis, what we want to do is something called imputation, or a more mathematical trick which achieves the same goal but avoids having to develop probability distributions for the variables themselves, something called marginalization. What does Bayesian imputation mean? It means we compute a posterior probability distribution for each missing value. Remember, in Bayes an unobserved variable is a parameter, and the special, rare case is a perfectly observed variable. We call that data, but data is in the minority in science; usually we have imperfect measurements, and those things require distributions. Information goes into those distributions, of course, and gives us priors, but then we need to update through a generative model. So we're going to do Bayesian imputation in this lecture, and I'll show you how it works. In principle, you've already done Bayesian imputation in this course, when we did multilevel models. Remember the Bangladesh fertility example: there was a district which we had no data for, but we nevertheless got a prediction for it. That was Bayesian imputation; it just happens automatically, and I'll show you some examples after the break. The next sort of thing, which achieves the same result, is this thing called marginalizing unknowns. Often we don't really care about the posterior
probability distributions of the missing data, but we want to leverage probability theory so that we can use the efficiency of all the observed data without throwing away incomplete cases. In that case we can do this marginalization trick: we don't have to bother with computing posterior probability distributions for the unobserved values; instead we can just average over them, using the probability distributions of the other variables. I know that sounds weird, but it's an automatic sort of thing you can do with probability models. I'll say a little more about that in a moment, and in principle you've already done it, in the previous lecture's misclassification example; I just didn't call it that. Okay, why does Bayesian imputation work, this idea that we can compute the posterior distribution of the unobserved values? Well, because if you have a causal model of all the variables, and that's what a Bayesian generative model is, a joint probabilistic model of all the variables at once, then there will be implications for the values that are missing, because the available evidence gives you probability distributions for the coefficients that tell you how the variables are related to one another, how they're associated. That lets you do the imputation, and lets you put bounds on the plausible values of the unobserved variables. Conceptually this is weird, but technically it's even weirder, so I'm going to go slow here. But don't worry, this is a very common sort of thing to do; it's not at all avant-garde. Sometimes you don't need to do imputation at all; it's just unnecessary. If you have discrete parameters, for example, then we almost never bother with imputation, because it's quite expensive and difficult to do the sampling. The misclassification example in the previous lecture was a covert example of this, where I had missing X values, that was the extra-pair paternity assignment of each child,
and I didn't bother to compute probability distributions for those; I just averaged over them using that little probability tree. That's marginalization. Sometimes you want to do imputation, though, because it's actually easier than the marginalization, and one example of this is in survival analysis with censored observations. If you have a complicated hazard function (the hazard function is what determines the scheduling of the event, that is, the time to event), something more than a simple exponential, the marginalization can be very computationally difficult and sometimes quite unstable, and in those cases it's actually easier to treat the censored cases as missing values and model them as parameters. There's a coded example of how to do this in the script that goes with this lecture. For marginalization there's a fully worked example in the book, and I apologize for not putting it in this lecture, but the lecture would be three hours long if I did all the interesting material in the missing-data chapter. I'm sorry, it just has to be this way, but please take a look at that example: it starts with the generative model and builds up to the marginalization code, and talks about numerical stability issues and the whole thing. The focal data example for this lecture is just going to be the primates data from the previous lecture, but now we're going to deal with the missing data. We're going to revisit that analysis, and we're going to do imputation for the half of the data that are missing. You might be saying: that's a lot of data, can we really do that? Of course you can. Remember, in Bayes the minimum sample size is one, or actually zero, because you get predictions just from the priors, and all of this also goes for imputation. There's no magic number which tells you how much data you're allowed to impute. If you don't learn anything from the sample, the posterior won't be different from the prior, and we can
always check that. So just relax and trust the axioms. To remind you about these data: we've got 301 primate species, with lots of missing data in the three variables of interest, that is, body mass in grams, brain volume in cubic centimeters, and typical social group size. There's also measurement error here, and lots of potential for unobserved confounding. I want to spend a little time talking about this sample conceptually before we get back to the data. To remind you what the tree looks like, I started the lecture with this picture of the 301 species. If we did a complete-case analysis, which is what we did before, then we end up dropping half the species, and this is what we're left to analyze, just those. You see you end up with big gaps in certain parts of the tree, because there are some species which are just not as glamorous as apes. Apes are quite well measured, but even in the apes we're missing a lot of data. So we're going to impute some primates. The key idea, as I keep saying, but maybe it's good to hear it over and over again, is that we already have probabilistic information about missing values, because of the relationships among observed values in the sample, and that lets us leverage all the data that we have observed. So say there's a species where we haven't measured body mass, but we have measured their group size. We want to use that group size to inform the coefficients in the model, not throw it away, and the imputation will let us do that. The good news is that it happens more or less automatically from the same statistical model. You don't really have to do anything different when writing the mathematical version of the statistical model to justify imputation, because in Bayes there's no deep conceptual distinction between data and parameters. A missing value will be a parameter; an observed value will be called data; but they're just unobserved and observed, that's all there is. And so whether something's a prior or a likelihood, well, that's
whether the river is moving or the man is running in it. It's your mind that moves, not the distribution. So let's take this slowly. As I say at the bottom of this slide, this is conceptually weird, but you get used to it after a while. Do your meditation; think about the man in the river catching fish; it'll come to you. You can pivot your mind's eye back and forth between the duality of parameters and data. The coding, though, is awkward, and there's just no way around being honest about that. I will always be honest with you: the coding is awkward, it just is. Sometimes packages will hide this from you, and that's great, but at some point you have to face the technical awkwardness; there's just no way around it. But worry about that after the break. Here, before the break, and we're not long from it, let's just focus on the conceptual issues of drawing the missingness into this particular case. Okay, to remind you, here's the DAG I had drawn in the previous lecture (actually it wasn't the previous lecture, it was two lectures ago). We've got group size, brain size, and body mass, and there's some shared evolutionary history which acts as a bundle of unobserved confounds; that was the justification for including phylogeny, and we used a Gaussian process to do that. However, all three of these variables, group size, body mass, and brain size, have missing values, and so what we want to do is draw that on the DAG. We haven't observed the full G vector; we've only observed G*, which is group size with missing values, and it's influenced by some missingness mechanism, which I write here as little m sub G. Now, what is influencing missingness? The typical assumption is that it's totally random, but that's very unlikely in evolutionary ecology studies. We know scientists don't study species at random, and so the species we have data for are a non-random sample of all the primates. For example, there's a very strong bias among
primatologists to study species that are closely related to humans. Why? Well, because we're basically narcissists, and that's the whole premise of my field of anthropology: a deep-seated and unembarrassed narcissism. But the consequence here is that the smaller and less related to humans a primate is, the greater the probability that we're missing some variable for it. Another possibility is that larger species are just easier to observe, so one of the other variables in the data, like body mass, might influence missingness on group size. Imagine small-bodied primates: they live in trees, they're hard to observe, and we don't know their group sizes. That seems plausible as well. But something large like a chimpanzee is easy to spot; you can count them; you know how many chimps are in a group. Another possibility, just to give you night terrors, is that the variable itself, group size in this case, influences its own missingness. What would that mean? Well, imagine solitary species are less studied, and so we don't know their group sizes. These sorts of things can happen, and so they're worth considering. Maybe you know enough scientifically that you can reject this option, but it's the sort of thing that is an assumption, and it's an assumption that can't easily be tested with the sample. So all these arrows are potentially in play. Whatever assumption you end up making, what you need to do is use the causal model to justify an analysis procedure. Either you admit defeat and simply describe the sample, without claiming that any causal estimates can be made, or you argue that here are the assumptions that are necessary to interpret the results as causal. These are all good options; you just want to put your assumptions on the table, because assumptions are what license conclusions. If your conclusions would hold under any assumption, that doesn't sound like a conclusion, does it?
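The three hypothesized mechanisms for group size can be sketched in a toy simulation (my own, in Python rather than the course's R; the structure and all numbers are invented for illustration). The complete-case average of group size G stays honest under random missingness, but shifts when missingness depends on body mass M or on G itself:

```python
import random
import statistics

random.seed(7)
n = 300

# Hypothetical primate-like data: body mass M influences group size G
M = [random.gauss(0, 1) for _ in range(n)]
G = [0.6 * m + random.gauss(0, 1) for m in M]

def observed_mean(values, drop):
    # mean over the cases that are NOT dropped by the mechanism
    kept = [v for i, v in enumerate(values) if not drop(i)]
    return statistics.fmean(kept)

# (1) missing completely at random: drop ~50% regardless of anything
mcar = observed_mean(G, lambda i: random.random() < 0.5)

# (2) missing conditional on a cause: small-bodied species unobserved
mar = observed_mean(G, lambda i: M[i] < 0)

# (3) missing conditional on the variable itself: solitary species unstudied
mnar = observed_mean(G, lambda i: G[i] < 0)

print(round(mcar, 2), round(mar, 2), round(mnar, 2))
```

Only the first mechanism leaves the complete-case average of G near its true value of zero; the other two push it upward, because cases with small body mass or small group size, which have below-average G, are preferentially deleted.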
Arguments always depend on their premises, so lay out your assumptions, lay out your premises, and let your colleagues inspect them. I'm not meaning to argue here for any particular version of this DAG, because I don't know what causes the missing values in this sample; I just gave you some hypotheses. But I want to show you that if you were willing to say it was something like evolutionary history, or body mass, influencing missingness in group size, then we can proceed, but only because we have drawn out those assumptions, and we can prove that there's a way forward from there, using the generative model and testing the analysis pipeline. The reason we can make progress is that once you've got the generative model and you program it as a probabilistic model, the missing values, even though they're uncertain, that uncertainty will cascade through the whole model in exactly the right way. You don't need to be clever or intuit how this works; just trust the axioms, and all the necessary constraints on the information will be obeyed, and the posterior distribution will take it all into account. So with that, I think we should pause, and I really encourage you to review the first half of this lecture before you continue after the pause. This is conceptually strange, I admit. It took me a long time to wrap my head around missing data and imputation when I started, so I'm completely sympathetic. So do the review, jot down what's confusing, and bring that confusion to me if you can't resolve it on your own. Then take a break, take care of yourself, and when you come back, I will still be here. Before the break I had added missing values to group size, but obviously all three of these variables, G, M, and B, have missing values, so I've added the analogs for them: a B* for the observed brain sizes and an M* for the observed body masses, and then missingness mechanisms for each of them. If we assume that missingness is totally at random, that is, there are no causes into
those little m variables on the graph, then we know a model that will deal with this, because we built it back in the Gaussian process lecture, when I introduced phylogenetic regression. We have a covariance kernel K that is informed by the phylogenetic distances D among the species, which we get from the consensus primate phylogeny, and then we stratify by body mass, because we believe it is a confound: it sits at the middle of a fork between G and B. Okay, that's just review. If we were only modeling it this way, this is the sort of reduced graph, in the way you think about it: G and M are influencing B. This is a sub-model of the whole thing, containing only the causes that we're interested in. We're not modeling the influence of body mass on group size, and so the DAG at the top of the screen has deleted all the arrows that are not represented in this statistical model. We don't necessarily need them; the do-calculus says we don't need to estimate those relationships, given our assumptions, to estimate the influence of group size on brain size. But to do imputation well, we often need the whole graph and the other relationships, because the DAG simultaneously implies relationships among the other variables. We can focus instead on a different estimand: the influence of body mass on group size. It too will be confounded by phylogeny, that is, evolutionary history and shared environments, lowercase u in this graph, and so there would be another simultaneous phylogenetic regression that we could run to estimate the influence of body mass on group size. And of course there is also the influence of evolutionary history on body mass itself. Neither G nor B, by assumption in this example, influences body mass, but evolutionary history does, and so we might want to estimate that as well. For the evolutionary biologist, this would be an example of trying to measure the phylogenetic signal on body mass. All three of these
regressions, three big Gaussian process regressions, coexist simultaneously in the DAG, and we can run them simultaneously. That will be how we do imputation on all these variables at the same time, in a way that is perfectly compatible with the assumptions in the causal model. Any other ad hoc approach to doing this is going to go wrong somewhere. And this is a point at which I want to note that even though I'm using the word imputation, and this word is used in non-Bayesian missing data methods as well, what we're doing is different. Non-Bayesian imputation does not involve assigning probability distributions to unobserved values, because those methods don't do that; instead it involves simulating data sets using a generative model and then running the analysis multiple times. We're not going to do that. We're going to run the analysis once, and we're going to get posterior distributions for all of the unobserved values simultaneously, compatible with the generative assumptions.

Okay, a model like this, three simultaneous Gaussian processes, is not something to take lightly. You don't want to try and build it all at once; you want to take small steps, and even then you're going to fall down. Making these models is hard, and it's hard for everybody when you start out. So you take tiny little steps, you want a friend walking beside you the first time you try it, and even then you might stumble. When your friend laughs, they're not laughing at you, they're laughing with you, and they're there to pick you up. Eventually, with practice, you can build it all together and get it to run, and in our business we all fall down, so don't feel bad. So we're going to take it slow. I'm not going to show the whole thing all at once; I'm going to build up little subcomponents of this model. And a reminder: if I were doing this as part of a research project, I would also have a generative simulation with synthetic data, and I would be testing each little step of building the estimator, the statistical model, as well. I have to leave that out of this lecture because it would be three hours long, but that's the sort of thing that's really worth doing. It's not a trivial case to do a synthetic simulation of phylogenetic data, but there are packages to help you do that.

Okay, we're going to go slow. The first thing we're going to do is ignore all the cases with missing brain values. Why? Well, that's the outcome, and for the species with missing brain values the model will, at the end, be able to make predictions; but we don't anticipate getting any value out of imputing the missing cases in the outcome. Those are just predictions. We then want to impute G and M while ignoring the models for each, which is almost certainly the wrong thing to do, but I want to show it to you because, first of all, this is how you build up code: you start with things that are slightly wrong but have some structure to them, so you can get the machine running, and then you add in the complexity in layers. It will also turn out to be very useful for showing you the consequences of adding the causes of G and M into the imputation; you'll understand this when I get there. Then we get to step three, and we impute group size using the model, and you'll see the consequences of that. What I mean by "using the model" is using the causes of G, which are both body mass and evolutionary history. And finally we'll get to step four and do it all, and I'll show you the results that arise from all of that. I will not walk you through the detailed code of the last model, because it would take up multiple screens, but all the code is in the script for this lecture.

So what does it mean to ignore cases with missing B values? Well, we can take a look at what's missing. Here's a table where we see what's present for all the complete cases, the species where brain size is present, and you'll see that there are patterns of missingness as well. What we're
left with after we reduce ourselves down to all the species where we have observed brain sizes. In this little table, TRUE means observed and FALSE means missing, so you can see that we're missing a lot of values for group size, and there's some correlation in the missingness among these variables as well. But this is what we're left with. And that 151 number is the count of cases where both M and G are observed, which is what we saw before.

So now let's impute group size and body mass, ignoring the models for each. What do I mean by that? Well, what I've got on the screen right now is the full thing, the full-luxury Bayesian approach we're going to end the lecture with. When I ignore the models, that means I just treat G and M as if they were standard normals, with no causes at all. This is not the right answer, but it is almost always the right place to start when building a fancy model like this. And go easy on yourself, because you're going to slip on the ice; you've got to take small steps, or your fall is going to hurt a lot more.

So when G_i is observed, this is the thing to think about: that distribution, Normal(0, 1), assigned to G_i. Some of the G_i values have been observed; they're measurements. So when one is observed, conventional statisticians would call this a likelihood, a likelihood for a standardized variable; it gives us the probability of observing that measurement, and it's okay to assign it to a standardized variable, because we can just standardize group size. When G_i is missing, however, since we're Bayesians, we get to call this the prior. But again, this is your mind moving and not the variable; is the man moving, or is the stream moving? These categories, likelihood and prior, are features of our mind, not features of probability theory, and we can exploit that dualism. We only need one definition, whether G is observed or missing; it's just that sometimes, when it's missing, there's a parameter in that place.
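The mechanical trick here, splicing the imputation parameters into the observed vector so the rest of the model sees one complete variable, is simple enough to sketch in a few lines. Here is a minimal Python version of that bookkeeping; the function name and the toy numbers are made up for illustration, not taken from the lecture script:

```python
import numpy as np

def merge_missing(x_obs, x_impute):
    """Combine observed values and imputation parameters into one vector.

    x_obs:    observed (standardized) values, with np.nan marking missing entries
    x_impute: current values of the imputation parameters, one per missing
              entry, in the order the missing entries appear
    """
    x = np.array(x_obs, dtype=float)
    miss = np.isnan(x)
    x[miss] = x_impute  # the missing slots are parameters, not data
    return x

# Standardized group sizes for five species, two of them unmeasured
g_obs = [0.3, np.nan, -1.2, np.nan, 0.8]

# One draw of the two imputation parameters (here from their Normal(0,1) prior)
g_merged = merge_missing(g_obs, [0.1, -0.4])
# g_merged -> [0.3, 0.1, -1.2, -0.4, 0.8], a complete vector that the
# linear model can use exactly as if every species had been measured
```

In the real model the imputation parameters get updated by the sampler at every step, so the merged vector changes from draw to draw and the posterior for each missing slot is built up automatically; here they are fixed numbers only to show the splicing.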
And so we will get a posterior distribution for it; in this case Normal(0, 1) will be the prior, but that will not necessarily be the posterior, because it gets updated. And when it's observed, it will inform the coefficient for that variable. So this looks like typical code; it's the same kind of code we used in the Gaussian process lecture. This is a Gaussian process regression where the phylogenetic distance matrix is Dmat. No surprises, I hope. You run this and you get a huge vector of imputed values. You get two imputed body masses; remember, there are only two missing body masses in this case, because body mass is much more often measured, since it's easier to measure than group size, not being a behavior. And then there are a bunch of missing group sizes that have been imputed as well. You don't want to stare too hard at this table, because it'll drive you to madness; that's not how we interpret the outcomes of these things. We want to plot posterior predictions, so let's do that.

Let's put body mass on the horizontal here, versus brain volume on the vertical. What's the relationship between these things? You see the black points for the observed cases: there's a very strong correlation between body size and brain size. This is totally unsurprising, because bigger bodies have bigger brains. For the imputed values, I'm showing you the posterior distributions in red, and each of those is a single point (remember, there are only two); the circle is the posterior mean, and I think those are 89 percent intervals. You'll see they follow the trend, even though they both had the same Normal(0, 1) prior, and that's because the coefficient was estimated. The imputed values follow the trend because we know the brain volume for each species, and that gives us information about its body mass.

Now what about the relationship between body mass and group size? The model we just ran is silent on this, because there's no coefficient to connect these two variables, even though we believe body mass is related to group size. It's certainly strongly associated in the sample, and you can see that in the black points on this graph, the raw data. There is an association; it's not nearly as strong as the association between body mass and brain size, but they're associated. But look at the imputed values in red. Those are just posterior means; each of those is an uncertain point, remember, but I'm plotting the posterior means for ease of visualization. You'll see they don't follow the association trend between these two variables. We've left some information on the table, and this is a consequence of ignoring the full generative model. So we're going to fold that information in now, by putting the generative model into the statistical model, and this will change.

Okay, what happens to the estimated causal effect of group size on brain size as a consequence of this imputation? Well, it gets a lot smaller, actually, as you see here. The black posterior distribution on the right is the so-called complete-case analysis, which is what we did in the previous lecture, and the red is after we've done this imputation. This is the kind of thing I would call naive imputation, because it ignores the relationships among the right-hand-side variables; we just treated them as Normal(0, 1) in the prior, and that's almost certainly a mistake, because the DAG says it's a mistake. But this is the first step, and it's still right to do it first, both because it helps you get the code to work (remember the bear on the ice: take small steps, you will fall down), and also because then we can compare later, and we can learn from that comparison. So let's do that. We're still going to ignore the model for body mass for the moment, remember, small steps, but we're going to add the generative model for group size, which means I'm
going to fill in the regression in the middle here, the center column, so that there's a coefficient for the effect of body mass on group size. And again we have a phylogenetic covariance matrix, but it's a different one, with its own parameters. It uses the same distance matrix, the phylogenetic distances among the species, but it has its own parameters; see, they all have a subscript g on them, and that's because the phylogenetic signature for different traits can be different.

Okay, now we run this, and at the top I do it without phylogeny, which is to say body mass is influencing group size, but I don't worry about phylogeny confounding group size; we ignore that issue. So there's only one covariance kernel in there, one Gaussian process line in the middle, and that's this model here. You can see the model on G, with the brain model just above it; the G model is just an ordinary normal regression, not one of those multi-normal Gaussian process monsters. Again, this is not where we want to end up, but we take it step by step. You would do the top model here and get it to run and mix, and then you would put in the Gaussian process, which is what I have at the bottom, where I convert that g ~ normal to g ~ multi_normal, and we have another Gaussian process covariance kernel, K_G this time, the covariance kernel for group size. It looks the same, but it has its own parameters. That weird rep_vector thing in the middle is because we just want the means here to be zero, so that there's only phylogeny in this version, so that I can contrast the two. Again, this is not where we want to be, because I've taken out the influence of body mass in the bottom model, but this will let me contrast the effects of considering phylogeny versus the effects of considering just body mass. Then we can combine these two things in the same model, just by putting the line from the top, the mu equals a_G plus b_MG times M, into the bottom model as well.

So I'll show you all the combinations, and again I apologize for taking all these tiny steps. Actually, I don't apologize at all; I'm not sorry. I'm taking all these tiny steps with different sub-models because this is the way you should develop stuff. Even though you know where you want to be, you can't get there right away; you would take too big a step, and you'd fall down. Okay, so let me show you. This is what we had before, the relationship between body mass and group size; the black points are the observed values, the red points are the posterior means of the imputed values, and this is obviously not so great. It'd be nice if they followed the trend, so let's do that. On the right I'm going to layer in the other effects. Here are the posterior means of the imputed values for the model that only considers the influence of body mass on group size, ignoring phylogeny. You can see that it now captures the trend; it follows a regression line that's clumsily drawn through all those points. But notice that it's still quite weird, in the sense that group size has a very odd distribution, and this is one of the issues, because there are lots of solitary species there at the bottom. Those are the species where the adult females live alone, which is what we call solitary in primatology, and there are a lot of solitary prosimians and also some apes. Our imputed values are not doing a very good job of accommodating that sort of thing. That's not too surprising, but it's the sort of thing you want to remark on.

Then we consider phylogeny only, ignoring the influence of body mass, and these are the blue points: the posterior means of imputed values using phylogenetic covariance. That is, when we haven't observed the group size for a particular species, the imputation of its group size is informed more by its close relatives on the
phylogeny. I'll say that again: when we haven't observed the group size for a particular species, the imputed value is informed more by its close relatives on the tree. And now you see there's a lot more structure here. This is because there's a lot of phylogenetic signal in these variables, and so it captures a regression relationship without assuming anything about a linear relationship between these two variables; it's just the information in the phylogeny. Here's a way you can think about that: we plot the Ornstein-Uhlenbeck Gaussian process kernel here, and you see that there's a lot of phylogenetic signature for group size in these data, and that's why the imputations for nearby species are hugging one another so much in the graph on the right, the blue points.

Okay, now the purple points. I know this is one of the uglier, if not the ugliest, data visualizations I've ever done in this class; I'm always trying to outdo myself. To remind you: the red is the relationship that ignores phylogeny and only pays attention to the influence of body mass, the blue is phylogeny only, ignoring body mass, and now the purple is phylogeny plus body mass; blue plus red makes purple. And you see this is very much like the phylogeny-only case, because the phylogenetic covariance is so strong that it really dominates the imputations here. It moves a little bit; you see the purple points shifted slightly towards what you might call the regression line, but the phylogenetic information really dominates the imputation. And that wasn't our assumption; it's a natural consequence of the generative model.

So now let's summarize the inference from all of this. Remember, in the previous lecture, when we only did the complete-case analysis, which I label "observed" here, the black posterior distribution, that was the largest effect, a strong effect of group size on brain size. The different kinds of imputation models only reduce it. The one that reduces it the most is the one that ignores phylogeny, and as you add in phylogeny it moves back up a bit, but overall, doing the right thing, the honest thing, in imputation reduces the strength of the evidence that there's a strong causal relationship between group size and brain size.

Okay, we're not done, because now we need to do the M model too, since body mass also has phylogenetic signature. To remind you: the left column is the model we started with, the brain size model, really the model from the Gaussian process lecture. The center column is the model we just focused on, where we're modeling the influence of body mass on group size and simultaneously the phylogenetic signal, to deal with phylogenetic confounding between these variables. And now, on the right, to get the body mass imputations right, we also want to think about phylogenetic signal, and we can do that at the same time. This is a very big model; I'm not going to show it to you, but it's in the script for this lecture.

And here's what we get. If you want to see the details, they're in the script, as I said. At the top of this slide I'm showing you posterior distributions for the regression effects, and I'm contrasting the complete-case analysis with the full-luxury imputation, but on the same model. When I say complete cases, I mean the full model that considers phylogenetic covariance matrices for all three variables, brain size, body mass, and group size, simultaneously. I show that in black in each case, and the red is the full-luxury imputation. In the top left you see the effect of interest, as it were, the one that motivated this example: the effect of group size on brain size. It's essentially unchanged from the complete-case analysis. So you might feel a little disappointed; you went through all that effort and you got the same result. But listen, this is duty. It's not enough to say, well, I didn't do the right thing because it might not
matter; you have to show that it doesn't matter. It's an issue of professional responsibility, and sometimes it really doesn't matter, and that's because the generative model says you could get away with the complete-case analysis in the first place. But you have to show it. For the other effects, it does matter a little bit, and so there might be something going on here that's worth following up on if you were really into this question. The effect of M on B has gotten larger after imputation, and the effect of body mass on group size has gotten a little bit smaller. This almost certainly has to do with the non-random missingness in these variables. It's the kind of thing that, if I were interested in this sort of project, I would explore through synthetic data simulation, to see the kinds of biases that arise from complete-case analyses when you have these sorts of sampling artifacts.

At the bottom we're showing the phylogenetic signatures, so to speak. On the left, the phylogenetic signature for brain size is very small, but remember this is after accounting for the other things in its equation, that is, net of body mass and group size; there's essentially no covariance remaining among the brain sizes of primates after considering those things. In the middle is the phylogenetic distance kernel for body mass, and this has changed: there's less phylogenetic signature after doing the imputation, and this is one of the things that changes the coefficients at the top. And on the right, for group size, the imputation leaves the kernel essentially unchanged.

So I hope I convinced you that this is worthwhile, despite it being incredibly awkward. You learn things from the contrast between the complete-case analysis and the imputation modeling, but it's also duty. The key idea here is that you already know things, or rather probability theory knows things, about missing values, and it knows those things because you have taught it a generative model, and therefore it can deduce posterior distributions for the unobserved values. If you believe the model you teach it, then you can believe those distributions. But remember, when we say we believe the model, this is a very small-world statement. There's always model uncertainty that, at least in my opinion, cannot be easily put under the umbrella of probability theory. There's a vast model space of unimagined models still, and it's a different kind of creative, really artistic, exercise to come up with scientific models and interrogate them. The little deductive part of the small world, where we build posterior distributions from specific generative assumptions, is indispensable, but we need to bounce back and forth between that highly deductive and objective process, Bayesian updating on specific models, and the more subjective and imaginative part of science, which is most of science, theory construction and debate, in order to build the small-world part.

In this lecture we had another example of thinking like a graph: we had multiple equations, and we used them simultaneously. Eventually, when you see enough of these examples, it starts to sink in, it starts to seem natural to you, and doing isolated, single-equation regressions will seem very weird. I hope this example also convinced you that modeling the relationships among the predictors is a very good idea. If they're in your model and you start doing imputation, you probably want to start thinking like a graph immediately, so that you're modeling the covariation among the predictors. That will give you partial pooling; in fact, I didn't make this point during the lecture, but it will give you partial pooling among the imputations across the variables. And a final thing: even if it doesn't change the result, it doesn't mean you wasted your time, because you did your duty. Remember, do the analyses that you would like your colleagues to do. You don't want your colleague to tell you that they didn't do something that they knew was the right thing to do because they thought it would be hard and it might not change the result; that's never a professional excuse.

Okay, I hope that was useful. We're closing in on the end of this course. Next week I will start off by talking about something I call Generalized Linear Madness, by which I mean I will introduce scientific models which are not typical statistical models, but actual scientific models based upon premises about the things we're modeling. I hope to see you there.