Let's do this. Welcome back, everybody. We're in the last week, and this is a lecture I'm very excited about, because we get to take all of the amazing things you've learned in the previous nine weeks and put them into one model. Not one model, actually, but a few models, and we'll do some very useful things with all the Bayesian machinery.

Let me start with a data story. I'm hesitant, as always, to bring up American politics, just because, well, just because. But this is an abstract and really interesting example of a general phenomenon that doesn't only apply to American politics. There's a lot going on in this graph, so let me help you read it. On the horizontal axis is the birth year of voters, the year they were born. It's like age, but anchored to the calendar. On the vertical axis is the Republican vote share among individuals born in that year. You can't go from this graph to who won the election, because there's no voter turnout in it and no age distribution of the population. For each birth year, it's just the share of the vote that people born in that year gave to the Republican presidential candidate in different elections. As you may know, every four years there's this international event: an American president is elected, and there's a big betting pool that goes on around it. Each of the colors here is a different election year: 2000, 2004, 2008, and then 2012 in blue. As you get closer to the present, younger people are allowed to vote. Well, not younger exactly; it's that people born in more recent years become eligible, so the curves extend as a consequence, because people turn 18, and 18 is the legal voting age in the United States.

If you plotted this by age instead, you wouldn't see the phenomenon you see in this view. What you see here is that the hills and valleys of these curves line up across elections, even though in any particular year the whole curve can move up and down quite a lot. 2008 was an election some people here will remember: there was a candidate who was historically different, and it was a big movement election in the US, so the whole curve sits much lower, with less Republican vote share in all age groups that year. Nevertheless, the peaks and valleys are in the same places. What does this mean? It means there are what we call cohort effects in the American electorate. Something special about the year you're born in seems to set your average political preferences for life, because these patterns aren't changing. So what's going on? Of course it's not literally the year you're born in; it's something that happens to people born in the same year at some later point in their lives. Probably not in their birth year, because if you've been around a baby lately, they're not very aware of politics. So it's probably not acquiring political ideology in the year of birth; they're not paying attention.
It's something that happens later, and this graph comes from a cool paper from 2014 by Ghitza and Gelman, where they fit a big hierarchical Bayesian model to figure out what age it is that sets people's lifetime political orientation. The robust explanation seems to be that what matters is who is president when you get near 18 or turn 18: what party is that person from, and are they popular. From that you can predict an age cohort's political orientation, at least in the United States, for the rest of their lives. So this is a weird effect. Looking on the right, what they've done is fit a parameter for every age to estimate the weight of the president's popularity when you're that age, and how that predicts your later political behavior. You'll see there's a spike around 18. It's not perfectly at 18, because some people become politically aware earlier and some later; some people don't vote until they're 23 or so, which happens very often in the US, since 18-year-olds actually don't vote at very high rates. So there's this general curve, but 18 is a very special age, because it's when you become legally able to vote, and things that happen when you're older have much less effect.

Age is a predictor variable you've used in models before, but its effect is not necessarily linear. Here it's like a category, the year you're born in, but a category with similarity between neighboring values: 18 is more similar to 19 and 17 than it is to 25. So it's an ordered category where every particular member can have its own quantitative effect, and if we want to take effects like this seriously, because they exist, then we need statistical machinery to deal with what I'm going to call continuous categories.

I love this example, so let me help you understand it. If you look at 1950 on the left graph, there's a trough. Why? That's the Nixon effect: Americans born around 1950 are less conservative than Americans my age, the 1970 cohort, the Winona Ryder cohort. When you see me, see Winona Ryder: same generation. Nobody objects, so that's fair. It's the Reality Bites generation, if anybody knows that movie, and my generation is the most conservative American generation. That's the Reagan effect: when we came of political age, Reagan was president and he was winning the Cold War. Remember, there was a wall and all this other stuff, tear down that wall, and I don't mean to give him credit for all of that; it's just that in the American imagination Reagan gets credit for everything. So if you turned 18 around that time, there's this whole narrative that plays out, and I'm sorry to say that as my generation ages into political power, the US is probably going to get more conservative as a consequence. Anyway, it's a very cool statistical effect and a very cool way to uncover it.

Continuous categories like age are everywhere in data sets. You can assert that they have linear effects, but then you're immediately giving up on the kinds of non-monotonic cohort effects we just saw, and any number of other interesting patterns. Income can work like this: there's no reason that every additional percent of income should have the same quantitative effect, yet income values that are similar to one another probably have similar effects. So it's an ordered category, but it's continuous, just like age.
Location: how is location a category? Well, it's a proxy for stuff you haven't measured. Plants or animals that are near one another are exposed to common confounds that we haven't measured, so location is a proxy for those common exposures. Phylogenetic distance is very analogous to location: phylogeny is a proxy for unmeasured things that make closely related species similar to one another. Social network distance works the same way. You can find continuous categories all day long if you think about it for very long. There are no obvious cut points, but the idea is that similar values are, well, similar in their effects. If you're going to do this, you effectively want an infinite number of categories, and you probably want some pooling. So that's what we're going to do: we're going to make models today where there is an infinite number of categories, the category is continuous, and we'll fit arbitrarily shaped functions like that age curve on the previous slide, but with pooling, and it'll work great. This is a very common approach in machine learning called Gaussian process regression, which is not a very helpful term, I admit; it doesn't tell you anything about what's going on, except that the word Gaussian probably gives you a clue that there's going to be a normal distribution somewhere in here. There will be; a very, very big one, in fact.

Let me spend the rest of today showing you two examples, with code, of how to do this. The first one deals with spatial autocorrelation, an incredibly common phenomenon. Whether you're a biologist or a social scientist, this is a routine issue: shared space is a proxy for unmeasured confounds that you might want to deal with. So let's go back to the oceanic societies example from earlier in the course and think about spatial autocorrelation in that data set. To remind you, we were interested in predicting the number of tools on each island as a function of its population size, because there's a cultural evolutionary model that leads us to believe that larger populations have more tools, but with diminishing returns. One of the problems is that islands that are near one another can get tools from neighbors who invent them. Oceanic societies were not isolated; they're on islands, but they were very good sailors, and to say they were isolated from one another is just wrong. Maybe Hawai'i, okay, Hawai'i was by itself, but they brought a lot of tools with them, so there's still an effect of contact, and certainly down near Tonga and the others there was tremendous historical contact and flow of tools and even political systems around. So distance between islands is a proxy for contact, and it may be a better one than that dummy-coded high and low contact variable I used before. Can we use it that way? We can, and we'll do it as a Gaussian process regression.

Here's the idea. You construct a distance matrix: a big matrix, shown at the bottom of this slide, of all the pairwise distances in thousands of kilometers between the different island societies. We could probably do better than this with some real GIS work, thinking about sailing distance. These are as-the-airplane-flies distances; I got them from an airline route database, so they're just great circles from island to island. As the crow flies, I'd say, except that no crow can fly this far; it would collapse into the ocean at some point.
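If you want to follow along in code, here's a minimal sketch of loading the data, assuming the rethinking package's Kline2 and islandsDistMatrix objects that the book uses for this example:

    library(rethinking)
    data(Kline2)                 # tool counts, population sizes, and locations for 10 societies
    data(islandsDistMatrix)      # 10 x 10 matrix of pairwise distances, in thousands of km
    round(islandsDistMatrix, 1)  # zeros on the diagonal, symmetric off the diagonal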
If we knew something about actual sailing routes and trade winds, we could probably do better, but this will suffice to show you how the method works. You'll notice the diagonal is all zeros, because every island is exactly on top of itself. Off the diagonal, it's a symmetric matrix, so the upper triangle is the same as the lower triangle, and each cell is the distance in thousands of kilometers between two societies. This is like our map of confound threats: what could make the tool counts correlated between islands that are closer together, where there is more movement? So let's build this matrix into the model.

To remind you, this is the model we're going to use, the one I called the scientific tools model, because it's way better than the other one. We've got the expected number of tools, lambda, which is linked to the actual outcome T, the measured historical number of tools in each society. There was a dynamic model that gave us an equilibrium prediction, shown here as the definition of lambda: alpha is a proportionality constant for the innovation of tools you get from each person in the population, P is the population size, beta is the elasticity, the diminishing-returns rate on population, and gamma is the loss rate of tools, because tools fall out of use and are forgotten, and the more tools you have, the more you lose.

We can get the Gaussian process into this model by adding a factor on the front, so let me walk you through it. I've added this exponent, e to the k of society. The k of each society is like a varying intercept: a parameter estimated for each society, with a normal prior. We exponentiate it so it's positive, because lambda has to be positive, and so it becomes a multiplicative factor. Here's how to think about it: when k is zero, e to the zero is one, and that means exactly as expected by the model; the prediction for that island isn't inflated or deflated at all. If k is negative, you exponentiate and get a number less than one; for example, e to the minus 0.5 is about 0.6, which means 60 percent of the expected number of tools. So this factor on the front decreases, increases, or leaves unchanged the expectation from the other parameters. We're going to fit this using that matrix of all the islands, which will hold our correlated effects: some islands will get inflated because they're near islands with lots of tools, others will get deflated, and those inflations and deflations will be stored in k. If this were a linear model, you wouldn't need the exponentiation; you'd just add the k on.
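To make the factor concrete, here's a tiny numeric sketch; the parameter values are made up purely for illustration, not estimates from the data:

    alpha <- 0.25; beta <- 0.3; gamma <- 0.1   # illustrative values only
    P <- 5000                                  # population size of one society
    lambda0 <- alpha * P^beta / gamma          # equilibrium expected tools, no offset
    lambda0 * exp(0)      # k =  0  : exactly the model's expectation
    lambda0 * exp(-0.5)   # k = -0.5: about 61% of the expectation
    lambda0 * exp(0.5)    # k = +0.5: about 165% of the expectation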
Okay, so now the fun part, the Gaussian process part: where do we get these k's from? The answer is from one big 10-dimensional normal distribution. This is the Gaussian part of Gaussian process: you have one big prior that holds all of these varying effects, these varying factors. There's a vector of all the k's, ten of them here because there are ten island societies in the data set; it's a small data set, which is good to teach with. And these all come from the same multivariate normal prior, just like the varying slopes we worked with last week, except now we have a 10-by-10 covariance matrix K. With me so far? How do you build this thing? That sounds bad: a 10-by-10 matrix, won't that be a lot of parameters? No. It will have a very small number of parameters, and we're going to generate the whole matrix from the distance matrix we had before, using a few parameters that describe how the correlation falls off with distance. There are lots of ways to do this; I'm going to show you the most common one, the so-called L2-norm kernel shown at the bottom of the slide. Each cell of the matrix K, so K_ij, the cell where the i-th island and the j-th island meet, is the covariance between those two islands, and it's given by this expression, which I'll walk through on the next slide. What I want you to see now is that it has only three parameters in it. You could have a 300-by-300 matrix, which we will by the end of the day, and it would still have only three parameters, because the whole covariance matrix is generated from the distance matrix.

So here is what the L2-norm covariance function means: K_ij = eta-squared * exp(-rho-squared * d_ij^2) + delta_ij * sigma-squared. On the left we've got the covariance between islands i and j, say Tonga and Hawai'i. The eta-squared term out front is the maximum covariance between any two islands; we're going to fit that from the data. It's multiplied by this weird-looking thing in the middle, e to the minus rho-squared times d_ij squared. Take it one piece at a time: rho-squared is the rate of decline with distance, which we also fit from the data, and d_ij squared is the squared distance between islands i and j, which comes from the matrix we feed in as data. This middle piece is really the Gaussian part: you may know that you get a Gaussian distribution by taking e to the minus of something squared, which is what gives the bell shape, and so this gives you a Gaussian fall-off of covariance with distance. The last bit is often called jitter in the literature, this delta_ij sigma-squared. Delta_ij is a very special function, like an identity function: if i and j are equal, delta_ij is one; if they're different, it's zero. All it does is switch sigma-squared on and off, and sigma-squared is the additional variance of an island with itself. If you had multiple observations of a single island, you'd need this term so they aren't all predicted to be exactly the same, and it helps keep the matrix positive definite. We won't need it here, because we only have one observation per island.

Let me show you the working part of this function, the middle piece, and the shape it generates. It generates a Gaussian decline with squared distance, which is the solid curve in this plot. The horizontal axis is the distance between two islands in thousands of kilometers, and the vertical is the covariance, standardized here as a correlation; I've just made up some values for rho to show you the shape. If you use the squared-distance function on this slide, you get that solid curve, which starts with a very slow initial decline, so islands that are really close to one another can have a high correlation.
It doesn't have to be one; you can set the maximum wherever you want, but the shape will be the same. Then the decline accelerates, like a Gaussian curve does as you move away from the mean; this is like half of a Gaussian distribution, the familiar bell curve. Once you're out in the tail, this accelerating rate means you reach the flat part really fast, so you can get a very sharp decline with distance. If you choose a linear distance instead, say you take the function at the top and remove the square from the d, so it's just the absolute distance between two islands, you get the dashed curve. There's nothing wrong with that; you could use that model, but it has a different set of assumptions in it: the rate of loss of covariance is the same everywhere, so effectively the amount you lose is largest at the beginning, which I'd assert is usually not how things go. Usually there's some radius of common exposure and then it declines. But this is a scientific question, and you're free to do all kinds of things here as the science requires; the L2 norm is just the most common.

Okay, let's put it all together into one terrifyingly large model. I've put pictures of beautiful island paradises on all these slides to soothe you during this lesson. Really, all we're doing to make a Gaussian process is in the prior. These are varying effects models where the varying effects are drawn from one giant Gaussian distribution, which has a distance matrix inside of it. That's all they are. The rest of the structure of the model is everything you've seen before; they're just GLMs. It's the prior that does all the work and creates the continuous category. Here the distance is spatial, but we could do ages too: you could have all the pairwise differences in age between all the individuals in a population, and then each individual's particular effect would be a varying effect drawn from this giant Gaussian distribution with a big covariance matrix that has three parameters inside it. That's the Ghitza and Gelman paper; that's how you do these things.

So here's the model, very quickly, starting from the top: the Poisson, then lambda with the k's inside it as a multiplicative factor, then our vector of k's is multivariate normal, and the action is in defining the covariance matrix, capital K, which is the next line. Then there are some priors down at the bottom: we have to define priors for eta-squared and rho-squared. The first thing you'll ask is why these things are squared. They don't have to be; it's the convention, and I want to teach you the conventions. In some original paper they were squared so they'd be positive, and you could handle that other ways, but we don't need to fight with those issues here, so I'll stick with the conventions. The important thing is that you can define eta-squared and rho-squared directly on the squared scale in your model, and that's perfectly fine, and it's what we're going to do. But we want to simulate from these priors to see what they imply. Same business as always: what does it mean to assign Exponential(2) to eta-squared and Exponential(0.5) to rho-squared? What kind of covariance functions does that imply? How do you figure that out? You simulate.
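Here's a minimal base-R sketch of that simulation, using the L2 kernel and the Exponential(2) and Exponential(0.5) priors from the slide; the object names are just mine:

    # squared-exponential (L2) kernel: covariance = etasq * exp( -rhosq * d^2 )
    kernel <- function(d, etasq, rhosq) etasq * exp(-rhosq * d^2)

    n_samp <- 50
    etasq <- rexp(n_samp, rate = 2)     # prior for the maximum covariance (mean 0.5)
    rhosq <- rexp(n_samp, rate = 0.5)   # prior for the rate of decline (mean 2)

    d_seq <- seq(0, 10, length.out = 100)   # distance in thousands of km
    plot(NULL, xlim = c(0, 10), ylim = c(0, 2),
         xlab = "distance (1000 km)", ylab = "covariance")
    for (i in 1:n_samp)
        lines(d_seq, kernel(d_seq, etasq[i], rhosq[i]), col = "gray")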
Of course, in the text I give you the code to generate this plot, and it works exactly as you might imagine: you sample random exponentials for eta-squared and rho-squared using the rates on the slide. Remember that the mean of an exponential is one over its rate, so a rate of 2 gives a mean of one half, and a rate of 0.5 gives a mean of 2; the rate is the inverse of the mean. We sample random values for both parameters, plug them into that K_ij function, and get covariances that decline with distance; then we vary the distance and draw a curve for each sample, and that's what you see here. I think this is 50 samples from the prior: distance in thousands of kilometers on the horizontal axis, and random covariance functions from the prior on the vertical. You can see that in this prior most of the curves fall off very rapidly. The prior allows distance effects from very small, down near the bottom, to moderately strong, though the maximum covariance is actually not so big on this scale, and all of them drop off pretty fast. So this prior says there's probably not much contamination between islands, but it's a weak prior, and you'll see the posterior looks really different from it.

How do we code this model? As you might expect, the only thing that really changes is that you stick this Gaussian process in, and to make computing the covariance matrix easier there's a helper function in ulam that you'll see in the middle of the slide. First, we've got these k's for each society, a vector of length 10, one for each island; this is just like a varying intercept vector, like the ones you've seen before. They come from a multivariate normal distribution with a covariance matrix SIGMA, and SIGMA is a 10-by-10 matrix defined by the helper function cov_GPL2. What does that mean? We're computing a covariance matrix, and GPL2 stands for Gaussian process, L2 norm, meaning it uses the Gaussian kernel, as it's called. All this function does is loop over the pairs and compute that K_ij formula from the previous slides. You could write those loops yourself inside ulam if you wanted, but that would be maddening, so this is just a helper to save time. If you look at the Stan code for this model, you'll see there's a loop that recalculates this matrix on every leapfrog step of every trajectory and takes its derivatives; that's how the Hamiltonian trajectories work here. It's a lot of derivatives, I know, but Stan can do it. What do you have to give this function? A distance matrix, which here I've called Dmat, the matrix I showed you before, built from an online airline database of distances between oceanic islands in thousands of kilometers; then eta-squared and rho-squared, which are just parameters; and then the 0.01 on the end is the sigma-squared jitter term, which does nothing in this model because we don't have repeat observations of the same societies. It can't be zero, though; it has to be a little bit above zero, and we don't estimate it because there's no data here to estimate it from. Then the priors at the bottom are the ones from the previous slide. Good, let's build it.
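Putting it together, the model code looks roughly like this. It's a sketch close to the model in the book's Gaussian process chapter, assuming the rethinking package's ulam and cov_GPL2; the data-list names are mine:

    dat <- list(
        T = Kline2$total_tools,
        P = Kline2$population,
        society = 1:10,
        Dmat = islandsDistMatrix )

    mGP <- ulam(
        alist(
            T ~ dpois(lambda),
            lambda <- (a * P^b / g) * exp(k[society]),     # multiplicative society offset
            vector[10]:k ~ multi_normal(0, SIGMA),         # one giant Gaussian prior
            matrix[10,10]:SIGMA <- cov_GPL2(Dmat, etasq, rhosq, 0.01),
            c(a, b, g) ~ dexp(1),
            etasq ~ dexp(2),
            rhosq ~ dexp(0.5)
        ), data = dat, chains = 4, cores = 4 )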
Okay, what happens? Cool stuff happens. This samples with no problem; a 10-by-10 matrix is nothing you're afraid of anymore, you've done way bigger matrices, and by the end of today we'll do bigger still. When you look at the precis output for this, it's wholly uninterpretable. This is the tide prediction engine all over again: there's no way you can understand very much from it. You can read the effective sample sizes, and the a, b, and g down here in some crude way, but remember this is the scientific model, so none of those is like a slope; you still need to plot predictions. So, same as always: to understand a model, you push the posterior back through it and get predictions, or retrodictions as it were, of the sample. What you can understand directly is the k values. They're on the log scale, since they're exponentiated inside the predictor part of the model, so any island with a k near zero, like island number two here (sorry, I forget which island number two is), has a tool count exactly as the model expected. If k is negative, that island is being dragged down by some neighbor; if it's bigger than zero, it's being dragged up by some neighbor, away from what the model would otherwise expect. So those make sense. But eta-squared and rho-squared at the bottom are completely uninterpretable, for reasons I'll spend some time on now.

We want to plot these to get some idea of the posterior covariance. At the top, to remind you, is the gpl2 Gaussian kernel covariance function. On the left I'm repeating the prior, and on the right are 50 samples from the posterior distribution. The posterior covariance is relatively small on the raw covariance scale, but on average it declines more slowly with distance than the prior did, so there are long-distance effects of moderate covariance dragging the tool counts around. I also show the two lines from the precis output for eta-squared and rho-squared, and there's no way you can look at that 0.18 and 1.39 and figure out these shapes, for a very good reason: these two parameters are strongly correlated in the posterior, because they multiply in a complicated way inside the function at the top of the slide. They're not independent in their action; you can make one bigger and the other smaller at the same time and get almost the same covariance function. For example, on the left of this slide I'm plotting the posterior distribution of eta-squared on the horizontal against the posterior distribution of rho-squared on the vertical. These are just samples from the Markov chain, and this is what non-independence looks like for zero-bounded parameters: as eta-squared gets big, rho-squared gets smaller, and as rho-squared gets smaller, eta-squared gets bigger, which on average is a negative correlation. It reflects the fact that you can get very similar covariance matrices by making one of them big and the other small.
So you can't interpret them independently of one another; in the precis output they're useless on their own. You've got to push them back through the model. I know I've been telling you this about ordinary regression coefficients for weeks now, and I hope I've convinced you: trying to read how a model behaves by looking at each parameter by itself is folly. It just does not work. We want to understand this model, and we don't really care about those covariance functions in and of themselves, although we can learn from them; they give us an expectation of how far contact networks extend for particular pairs of islands. But we can also think about this on the outcome scale, by computing the implied correlations among the islands and then putting them on a map. So let's do that now. We take the posterior distribution of the covariance functions from the previous slide, and for each pair of islands we compute the posterior mean correlation between them. How? We know their distance, which is the horizontal axis on the right-hand part of that earlier slide, so at the posterior mean we can compute their expected covariance and then standardize it to a correlation, and that's what this big matrix is. The code to do this is in the book, and it's no more complicated than what I just described: you compute that K_ij function for each pair of islands at the posterior mean, and you get this big matrix. It's a correlation matrix now, so the diagonal is all ones, because every island is perfectly correlated with itself, and the off-diagonals of this symmetric matrix are the expected correlations in tool counts due to the location effects. If you look at the far-right column, or the bottom row, that's Hawai'i (I'm trying; there's a glottal stop in that word), and Hawai'i has essentially zero correlation with everybody except itself, because Hawai'i is really far away from everything else. It took the Polynesians forever to find it, because it was off all the trade winds; they had populated the rest of Oceania, and then it took forever to find Hawai'i.
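Roughly, that computation looks like this; the posterior means here are close to the ones quoted above, used purely for illustration:

    etasq_hat <- 0.18   # illustrative posterior mean for eta^2
    rhosq_hat <- 1.39   # illustrative posterior mean for rho^2
    K <- etasq_hat * exp(-rhosq_hat * islandsDistMatrix^2)   # posterior-mean covariances
    diag(K) <- etasq_hat + 0.01                              # add the jitter on the diagonal
    Rho <- round(cov2cor(K), 2)                              # convert to a correlation matrix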
So what does this look like on a map? This data set gives you the latitude and longitude of the different islands, so you can just make a scatterplot of them, which is what I'm showing here: longitude on the horizontal and latitude on the vertical, the world's least high-tech map of the Pacific Ocean. The size of each point is the population size of that island, so Hawai'i is the biggest point, because Hawai'i had an order of magnitude more population at European contact than the other island societies in Oceania. The line segments connecting the islands are shaded with intensity proportional to the correlations in the matrix we just calculated. You can see that islands near one another have stronger correlations. There's a triad down there, Santa Cruz, Tikopia, and Malekula, with very strong correlations; there's a kind of gravity effect, lots of historical trade and contact because they're close, and their tool counts are more similar, deviating from what you'd expect from their population sizes alone. Tonga has an effect on Fiji, similarly: Tonga was the next biggest after Hawai'i and had imperial ambitions, the king of Tonga and all that, and there are other cases as well, like the Trobrianders, who had a lot of trade. You're seeing that it falls off with distance.

We can also plot this with log population on the horizontal, which lets us look at the hypothesis of interest: log population is the original reason we cared about this data set. Again the points are island societies, sized in proportion to their populations, and the trend plotted here is the average prediction from population size alone, from the posterior. There's a strong relationship with population size, which doesn't surprise you anymore, and the line segments show the same correlations as before, from the same matrix. You can see things getting dragged around as a consequence: Tonga had lots of tools, and the model thinks Fiji was dragged upwards from its population expectation because of it. Okay, questions? Right, that one has an easy answer to a complicated question: very little. In this case the average effect, that elasticity, doesn't change very much; you can go ahead and make the comparison. But you do estimate the covariance effects. One way to think about it is that spatial autocorrelation is a threat to the hypothesis that population size matters, and this model shows the hypothesis survives that threat: there is still spatial autocorrelation in the data set, but it doesn't remove the population effect. It's a potential backdoor; you might think of it that way.

All right, let me show you another example. Before I get into it, I want to give you this slide listing a bunch of things you can do. If you Google Gaussian process regression, you will find huge numbers of results; this is a big-deal machine learning approach used in lots of areas. Sometimes it's called Bayesian non-parametric regression, which is the world's worst statistics phrase, because obviously Bayesian models have parameters. What does non-parametric mean? It means an infinite number of functions are being considered, and that's literally true: you're used to considering an infinite number of regression lines, and these models consider an infinite number of splines that pass through the continuous categories and then select among them with regularization. They're very popular in big machine learning projects, and you'll find tons of tutorials online and lots of really diverse applications. For example, there are lots of periodic data sets, seasonality data sets in ecology or social science, where the covariance function has sinusoids inside of it to capture the periodicity: every winter, things happen. There's a cool analysis of human births in the Gelman et al. Bayesian Data Analysis book, and yes, human births are highly periodic, analyzed using Gaussian process regression with periodic covariance functions. So that's one kind of application.
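As a sketch of what such a covariance function can look like, here is the common exp-sine-squared form (not necessarily the exact one used in that analysis):

    # periodic kernel: observations at the same point in the cycle covary strongly
    kernel_periodic <- function(d, etasq, period, lengthscale)
        etasq * exp(-2 * sin(pi * d / period)^2 / lengthscale^2)

    d_seq <- seq(0, 730, length.out = 300)              # e.g. days apart
    plot(d_seq, kernel_periodic(d_seq, 1, 365, 0.5),    # yearly seasonality
         type = "l", xlab = "days apart", ylab = "covariance")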
Phylogenetic distances, which is the example I'm going to turn to next, work like this as well: all phylogenetic regressions are special cases of Gaussian processes, with particular strategies for building the covariance matrix. Social networks are a kind of distance too, and so are all kinds of smooth curves like that age example I showed you. There's a person in my department, Cody Ross, who analyzes child growth data this way, using Gaussian process curves, so you don't have to make assumptions about the exact functional shape; you do it with regularization instead. And you don't have to use only one distance matrix: you can use several at the same time and estimate weights for how important each one is. I won't walk you through the formula at the bottom, except to say that if you're interested, this is called automatic relevance determination, another terrible machine learning term. It just means estimating the importance of each kind of distance for the covariance between items, so you could, for example, have a genetic similarity matrix and a spatial similarity matrix and estimate their contributions.

Okay, let's turn to phylogenetic regression. In evolutionary biology it's a big deal, when you do species comparisons, to worry about how long ago each pair of species diverged from one another. If you're outside of biology, this is sometimes a surprise, though it makes sense once you've heard about it, because in the social sciences, in my experience, people rarely worry about this: you just compare countries and don't worry about the fact that they're next to one another or share recent cultural influences. You see this all the time in journals. But anthropologists and biologists have worried about these problems, shared history, common descent, for a long time. Why worry? Because it introduces a bunch of potential backdoors into inference; it's like a proxy for shared exposures. Some of those shared exposures are just the genes you have in common because of your common history, but there are lots of other things, like the kinds of ecologies you live in, that also create common exposures, and you may not have measured all of them. Phylogenetic history, the time since you diverged, is a proxy for a bunch of those inferential threats. So biology journals are full of phylogenetic regressions; this is now a standard part of the comparative method. I'm going to show you the most basic form of it and then show you how flexible the approach can be. There are lots of ways in the literature that people smuggle, as it were, phylogenetic information into some kind of generalized linear model. The most basic is the Brownian motion model and something called phylogenetic generalized least squares. We're not going to use generalized least squares, we're going to use a posterior distribution, but the model form will be very similar. The Brownian motion model is the simplest, but also the world's goofiest evolutionary model: no trait actually evolves this way. I'll explain what it means as we go, but it's a place to start, and there was a time when it was all you could do with the computers available. There are other processes, huge numbers of them, like Ornstein-Uhlenbeck processes, which are like Brownian motion but with the variance constrained: the variance can't keep inflating forever, because there's some attractor that pulls the traits back.
And there are a huge number of other hypotheses. If you've got some process for how evolution works along a branch, it implies a particular covariance structure among the species. That's the idea: any particular phylogeny, plus a model of how traits evolve on it, implies a covariance structure, and that's how all of these regressions work.

So let's use this primate data set I've shown you before in the class. It's called Primates301; the 301 is the number of species in the data set. There are more primates than this (there are probably twice as many lemur species as actually appear on this tree; Madagascar is lousy with lemurs), but these are the ones we have data on. If you load Primates301_nex, you get the tree in nexus format; those of you who work with phylogenies will know the format, and you can load it in and plot it. We're going to use the ape library to manipulate this nexus tree, and it will do things like draw this totally inscrutable phylogeny with 301 primates on the tips, to give you some idea of what's going on. For the primatologists in the room: the apes are that little cluster in the upper left. The apes are kind of the losers among the primates; they've been going extinct at rapid rates for a long time, since the Miocene. Just to the right of them are the Old World monkeys, the big winners, a recent massive diversification, especially the cercopithecines, the macaques and baboons and so on. The galagos and lorises are on the right-hand side as you go around the clock, a small but very deeply diverged group. Then the lemurs, lots of them, recent radiations, all on Madagascar. Then this tiny little tarsier branch at the bottom as we round the half-hour mark; there are only a few tarsier species, and they're very different from one another. And then the so-called New World monkeys; I put that in quotes because it's the European colonialist view, but it's still the standard term in anthropology until we figure out something else to say for the Americas and the non-Americas. The New World monkeys live in Central and South America, and there are a lot of them as well. And where are we? We're in here: you see Homo sapiens right next to Homo neanderthalensis, right next to the best species name, Gorilla gorilla gorilla, just the world's best species name. I'm an anthropologist, so I could draw this part of the tree from memory, but we're going to use the whole tree.

We're going to be interested in some comparative questions about body size, brain size, and group size. The anthropologists in the audience will know this, but I hope everybody else will be interested in the hypothesis too: there's a popular hypothesis in evolutionary anthropology that one of the things a large brain does for you is make it possible to live in big groups; it makes you clever, and this is one of the things intelligence is for. You'll find lots of papers analyzing the so-called social intelligence hypothesis in various forms; it's kind of a squishy family of hypotheses. We're interested in the association between brain size and group size, and the first thing you have to think of is that there's a backdoor. If we have the idea that a big brain
influences group size, because a big brain makes it possible for you to tolerate more conspecifics, then there's a backdoor through body size. Brain size and body size are correlated, because body size causes brain size; that's an uncontroversial statement, I think, since it's like a growth factor: it makes everything bigger. And body size might also have a direct causal influence on group size. There are lots of mechanisms; one is that it changes the kind of ecology you live in. Larger primates live in different places, in particular they're more terrestrial, and that changes a lot about your life, which might also influence group size; a lot of the cercopithecines, for example, are terrestrial and live in really big groups. Now we have to trim the tree, because there isn't data on all of these variables for all the primates; brain size in particular is missing for a lot of the smaller primates, because nobody was interested in their brains for some unfortunate reason, though we should measure them. If you trim it down, I think you end up with 151 species that have all three variables measured: brain size, body size, and group size. I'm going to use these to show you the world's simplest phylogenetic regression, so you can understand what it is, and I hope it will seem a bit deflationary as well. It's a cool technique, but there's nothing magical about it, and there's certainly no evolution in this model. It purely treats phylogeny as a squishy measure of common exposure; there's definitely no super-science going on here. It's useful, but it's really geocentric, just like a lot of things we've done before: it is not an evolutionary model. I'm not trying to say don't do it; I'm trying to say you don't have to do things this way.

To get there, first I want to show you an ordinary linear regression, but expressed in the world's least convenient form, because once we've done that, all we have to do to make it a phylogenetic regression is change one line. So let me show you an ordinary linear regression in the strangest way possible. The way to think about it is that all of the outcomes for all the species are one outcome: they're one big vector of group sizes. That's what we're going to predict, log group size, as a function of log body size and log brain size. Don't think of every species as an independent observation; think of the whole set as one draw, as if you sample all the tips of the tree simultaneously from one giant multivariate normal distribution. Every linear regression can be expressed this way: you take all of the outcomes and put them in one big vector, which is what bold G is here, all the group sizes, and you make the likelihood multivariate normal. Then there's a vector of means mu, which is the same as before, each species with its predictor equation and its predictor values, and there's a covariance matrix S. In a standard linear regression this covariance matrix S has sigma-squared along the diagonal and zeros everywhere else, and the way we construct it, as the slide shows, is S = sigma-squared times I, where I is the identity matrix shown in the lower right of the slide. If you multiply the scalar sigma-squared into the identity matrix, you get sigma-squared along the diagonal, which is the residual variance expected for every species, and everything else is zero: no correlations, and you're back to an ordinary linear regression.
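Written as code, it looks roughly like this; a sketch close to the form in the book, assuming the rethinking package, where d holds the 151 complete cases from Primates301 and Imat is the identity matrix passed in as data:

    N_spp <- 151
    dat <- list(
        N_spp = N_spp,
        G = standardize(log(d$group_size)),   # log group size (outcome)
        M = standardize(log(d$body)),         # log body mass
        B = standardize(log(d$brain)),        # log brain size
        Imat = diag(N_spp) )                  # identity matrix: no correlations

    m_lm <- ulam(
        alist(
            G ~ multi_normal(mu, SIGMA),
            mu <- a + bM*M + bB*B,
            matrix[N_spp,N_spp]:SIGMA <- Imat * sigma_sq,   # sigma^2 on the diagonal only
            a ~ normal(0, 1),
            c(bM, bB) ~ normal(0, 0.5),
            sigma_sq ~ exponential(1)
        ), data = dat, chains = 4, cores = 4 )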
And I know: why would you ever write it this way? The answer is that now we get to a phylogenetic regression by changing one thing, and that is S. We replace S with something that has information about the phylogeny inside of it, so you get correlations on the off-diagonals, and then you have a phylogenetic regression. But as written, this is a linear regression, like every linear regression you've ever done; you would just never do it this way.

Here's how you do this in the code; it really is just one line. There's a matrix with the number of rows equal to the number of species, 151 in this case, and the same number of columns, so a 151-by-151 covariance matrix. No problem, we can do this. The diagonal is sigma-squared and everything else is zero; Imat is the identity matrix, which I pass in as data. In this case it's just a linear regression. When we run it, we find a very strong association between brain size (capital B is the slope for brain size) and group size, and a slightly negative relationship for body mass, which is pretty hard to interpret. You're used to this from linear regression: each coefficient is conditional on the other, so after you've taken out the brain-size effect, larger species are actually in smaller groups than expected. But the total causal effect of body mass is definitely to be in bigger groups, because there's that backdoor through brain size. So this is only the direct effect: in this model the direct effect of body mass is negative, but not the total causal effect; bigger species live in bigger groups in this tree for sure, and you can get that just from the averages. If you take brain size out of the model, you'll see that the coefficient for body mass is positive.

The inferential threat, of course, is this haunting phylogeny at the bottom, along with u, which stands for all the stuff we're scared of, all the unknowns. But phylogeny gives us a tool, an instrument if you will, for measuring those things in a sense, even though we can't measure them directly: phylogeny is like an exposure that leads us to expect increased covariance among more recently separated species. Then you've got to go from the phylogeny to a covariance structure, and there are lots of choices, and they all imply different evolutionary models. The simplest is the Brownian motion model, the one implied by Felsenstein's and Grafen's original tools for this: independent contrasts and the phylogenetic regression both assume the Brownian motion model. What does it mean? There's no selection, and the traits just wander randomly; it's a Gaussian random walk. That's what Brownian motion is. It was named after a biologist called Brown, who looked through a microscope and saw particles jiggling around, which is why it's called Brownian motion; it's named after a person. Here it means Gaussian, normally distributed, wandering, and when that happens, the covariance between any two species declines as a linear function of the time since they diverged.
I'll say that again: under Brownian motion, the expected covariance between two species declines linearly with the time since they diverged. That's what I'm trying to show you on this graph. We compute the implied covariance matrix and plot it against the phylogenetic distance between pairs of species in this data set, and this is what you get. How do you get phylogenetic distance? You trace the branch lengths: the total branch length between any two species is their phylogenetic distance. Species at a phylogenetic distance near zero, which in primates is a lot of pairs, because there's been a recent diversification, especially of the monkeys, so many primate species are closely related compared to the rest of the mammals, are all the points at the top left of the plot, and then it declines as you go out. At the far right, I think those points are everything compared to the tarsiers. Only the anthropologists here know what a tarsier is; everybody else, Google it, they're cool.

As I say, nobody in this part of biology really likes the Brownian motion model, but it's easy to use and it's a reasonable place to start, and the literature started there because it was what you could do with primitive computers. I was rereading the original papers this weekend to prepare this lecture and make sure I didn't say anything too wrong, and in Alan Grafen's original phylogenetic regression paper there's a note at the end about how to get the custom program he wrote to do all this: you mail him a floppy disk and he would send you a copy, and it ran on some system I didn't even recognize, something that doesn't exist anymore. Now of course you just run this in R like everything else, but that's how these things got started; it was like a hobby at Oxford.

So you can get the covariance matrix basically by inverting the phylogenetic distance matrix, and that's really all you have to do. Once you've got it, we rescale it into a correlation matrix, so that we can fit the residual variance, and then we just pass it in: we replace the identity matrix with this correlation matrix that comes from the phylogenetic distances, and that's the Brownian motion model. That's it. We run this model, and the bold part is the only thing that changes: an R matrix is passed in now, a 151-by-151 correlation matrix that comes from the tree. And things change a lot. Brain size is nothing now, almost exactly zero, and body size is now positive, because all that backdoor association that was coming through brain size has been knocked out, and larger species really do live in bigger groups in primates. So this phylogenetic control says no: the correlation between brain size and group size in this data set exists because closely related species have similar brains and similar body sizes, and it could easily just be a confound of shared phylogenetic history.
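Concretely, here's roughly what building that matrix and swapping it in looks like, using the ape package; a sketch, assuming the rethinking package's Primates301_nex tree and a vector spp_obs of the 151 species names that match the tree's tip labels:

    library(ape)
    data(Primates301_nex)
    tree_trimmed <- keep.tip(Primates301_nex, spp_obs)   # drop species without data
    V <- vcv(tree_trimmed)                   # Brownian-motion covariance implied by the tree
    dat$R <- V[spp_obs, spp_obs] / max(V)    # rescale to a correlation-scale matrix

    # in the model, only the covariance line changes:
    #   matrix[N_spp,N_spp]:SIGMA <- R * sigma_sq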
To give you an idea of what's going on, you can plot this trimmed tree and just swap the tip labels out for group sizes, which is what I've done here. I know you're not going to spend time reading every tip, but I want you to see that there's a lot of phylogenetic signal in group size in this data set, and that it's strongly associated with body size. At the bottom there's a bunch of prosimians, a bunch of lemurs, and the solitary species are in there, and they're closely related; you get these clusters in many parts of the tree. I'm pointing with my hand, since I don't have a pointer, but look at the gibbons in the upper left, at about the 50-minute position if you read the circle like a clock. Something like half of all ape species are gibbons, so there are lots of gibbons and siamangs up there, they live in small groups, and they're closely related. So there's lots of signal on group size here, it's also associated with body size, and that's what this Brownian motion model is picking up. The cercopithecines likewise: they live in bigger groups and they're closely related.

Let's do one more thing before I let you go: the Gaussian process version of this. The Brownian motion model is weird in the sense that it says similarity declines linearly with time since divergence, and the only way to get that is if there's no selection. With anything else, you get some acceleration of the decline with distance: closely related species can be very similar, but after some amount of phylogenetic distance you don't expect any of those confounds to be shared anymore. You can fit a bunch of functions that behave that way, and there are lots of options in the literature. Let's just do the Gaussian process, because it considers a huge number of functions, an infinite number in fact, and we can simply estimate, in this data set, the covariance function that declines with phylogenetic distance. Since I've already shown you the world's least convenient linear regression, again we only have to change one line. We take the line that computes the covariance matrix, the 151-by-151 SIGMA, and make the right-hand side the cov_GPL2 function from before. The distance matrix is now the phylogenetic distance matrix that comes from the tree; it's not from flying across the Pacific Ocean, it's from the inference about the time since divergence of these species. Then we have eta-squared and rho-squared to estimate from the data, and we can see what the covariance in group size looks like as it falls off with phylogenetic distance. Same model as before.
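In code the change is again essentially one line, plus passing in the phylogenetic distance matrix; a sketch along the same lines as before, with the distances scaled so the maximum is 1 and with placeholder exponential priors on eta-squared and rho-squared (the lecture doesn't specify which priors were used here):

    Dmat <- cophenetic(tree_trimmed)                  # pairwise branch-length distances
    dat$Dmat <- Dmat[spp_obs, spp_obs] / max(Dmat)    # scale so the maximum distance is 1

    m_gp <- ulam(
        alist(
            G ~ multi_normal(mu, SIGMA),
            mu <- a + bM*M + bB*B,
            matrix[N_spp,N_spp]:SIGMA <- cov_GPL2(Dmat, etasq, rhosq, 0.01),
            a ~ normal(0, 1),
            c(bM, bB) ~ normal(0, 0.5),
            etasq ~ dexp(1),      # placeholder prior, for illustration only
            rhosq ~ dexp(1)       # placeholder prior, for illustration only
        ), data = dat, chains = 4, cores = 4 )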
You run this, and I show you the precis output at the bottom. Brain size is still basically nothing, and it's more uncertain now; the body-size effect is even stronger than before; and if there's any relationship with brain size, it's negative. This is not looking good for the hypothesis, I'd say, though I'm not trying to take a stand on it. There are lots of other confounds; the mess with this literature is that there are a million confounds and nobody knows which direction the arrows go, because species aren't really described by DAGs. That's the problem: there's all this reciprocal causation, an organism is like a machine in which everything is jointly engineered, so it's hard to interpret these things. But it's important to have the example and to understand the impact that phylogeny, as a proxy measure of confounds, can have. We plot the covariance function here, and it is definitely faster than linear. One here is the maximum phylogenetic distance in the data set; it's just standardized. This is definitely not a linear decline. So the evidence is that, yes, for closely related species there's a lot of correlation in group sizes in this data, but it declines very fast. That's a very different result, and I'm not arguing this is the right thing to do, but I think it makes a big difference exactly what you assume about how you get a covariance matrix from the distance matrix.

There's no single right way to use a phylogeny, and I want to emphasize that, because I think when you're first taught phylogenetic regression it's sometimes taught as if there were one right way to use a phylogeny to control for confounds when you do species comparisons. That's not true. It's just like all the other varying effects stuff: it's a varying effects model, plus some hypothesis about how shared descent is correlated with confounds, and we don't know exactly how that works, so it's a domain for lively debate. You could instead just use a big varying effects model where genus and family and infraorder and so on are random-effect categories, and you'd get basically the same result, I think. That's kind of a dark thing to say, but there's nothing special about this machinery; it's just a way to get an index on confounds.

Let me try to summarize. There are lots of different ways you can build covariance functions from phylogenetic information. One of the interesting ones is variable rates on branches: some parts of the primate order have evolved faster than others recently, and the covariance should decline faster in those. Think about the apes: there are a bunch of apes in this room, and you're really, really different from the other apes, to the extent that lots of people just drop humans from these data sets, because otherwise they have too much leverage. Humans are weird in so many ways, but there are other species that are weird in these ways too, and some lineages are very conservative. The different species of gorilla, for example, diverged from one another longer ago than bonobos and chimpanzees did, so they're really different from one another in evolutionary time; some parts of the same order have evolved much more slowly than others. The way we calculated the covariance here assumes everything moves at the same rate across the whole tree, and we know that's not true, so there's lots of work in the literature on those issues if you're interested.

A big threat that I'm fascinated by goes under a horrible Greek word, homoplasy; if you've ever studied cladistics, you know the field is full of uninterpretable Greek words. What I mean here is incomplete lineage sorting. You can have a single tree that describes the gross demographic history of a bunch of species, but then no trait fits the tree: not a single character across all the tips will fit that tree without conflict. The reason is that species don't split in an instant, like a plutonium atom decaying; there's a long period of hybridization, and lineages may not even bifurcate cleanly. You can have a whole region of multiple species, like the baboons, just mating around for thousands of years with genes flowing through, and as a consequence, within a single species, at some loci you are more closely related
to one species and at other loci to another. This is true in humans; the anthropologists will know this. For a slight majority of your loci, something like 60 percent, you're more closely related to chimpanzees; for most of the rest you're more closely related to gorillas; and for some very small fraction, like some of the blood group molecules, you're more closely related to orangutans. That's perfectly understandable, because of the way evolution works: you get incomplete lineage sorting in real speciation processes. The way we built these models assumes that doesn't happen, and dealing with it is a big topic in the phylogenetics community now, because it means you can't reconstruct internal nodes correctly from a consensus tree; you've got to think about there being multiple trees, and the character conflict is itself informative. Sorry, this has been a little sermon, but I get really excited about these things, because there are cool problems to solve, and I know all of you will go forth and solve them so I can retire. And there are equilibria as well; there's real biology here, and what we really want is a model that predicts, if you engineered an organism, how all of these things trade off and mutually constrain one another. People work on those things, real life-history models, and that would produce group size not as a trait but as an emergent outcome of all of those more fundamental strategies.

One last thing: I have this line on the slide about p-values being weird. I always think they're weird, but in this case they're super weird, because there's no null model. There are a bunch of different phylogenetic ways to construct covariance matrices, and none of them is a unique null. So think about it: someone says, do a null hypothesis test of whether phylogeny explains the data. Hang on, pump the brakes: what do you mean, the null model? There's no unique null here. This is not like two treatments in a psychology experiment, where there's a clear null; there's no clear null in evolutionary biology. Okay, I will stop there before I get any more exercised about phylogenetic regression, and when you come back on Friday we will finish the course. Thank you.