Welcome to lecture 16 of Statistical Rethinking 2022. In this lecture, we'll extend the covariance modeling tactics from the previous lecture to deep and wide problems in space and time. To motivate it, let's start with something a little bit more familiar. The English language is a mess. Its spelling is largely unpredictable. All of the words on the screen here contain the letters O-U-G-H. However, at least in my dialect, every one of them is pronounced differently. From the upper left: though, tough, trough, through, thought, thorough, bough, hough, hiccough, and lough. What happened to this language? Is it just some cruel joke perpetrated by the British Empire on the rest of the world? Probably not. Only probably.

Of course, the answer to how it got this way is history. Lots of accumulated small events in the lives of people and how they speak. Waves of invaders in the British Isles, bringing with them their languages and their place names and their systems of spelling. And this has made the British Isles linguistically interesting, to say it nicely. Place names were replaced by different waves of invaders, like this friendly Viking here. And nice Roman names were slowly corrupted by English pronunciation, such that Worcester, which you might read as "War-chester" from the spelling in the right column, gradually shrinks to "Wooster" and will eventually be unrecognizable. All languages work this way, of course, and English is just a particularly prominent and exaggerated example. The processes that create these sorts of patterns in phenomena, particularly cultural phenomena, are the accumulation of many small events. And we're going to look at a couple of examples of things like that today.

So in the previous lecture we were in Nicaragua. Now we're going to travel to Oceania. We're going to look at a different sort of social network on a much, much bigger scale, and that is the network among the Oceanic island societies that we examined in a previous lecture, when I introduced Poisson regression. Over time, of course, these island societies have had interactions. They've never been completely isolated from one another. And so when we look at data, whether archaeological or contemporary, on the languages and material cultures of these places, those data are of course partly explained by the historical associations among these places: their trade, their commerce, and their demographic historical connections. How do we account for those very important causal influences when we analyze the technology, the tool counts, that we examined in the previous lecture? This is difficult, of course, because all of the individual causes and events are basically lost and unavailable to us. We don't have the history of all the voyages and contacts. They did not keep records, diplomatic journals of such things. But this does not mean we can't make progress, because we do know things about the islands and their potential network of associations, despite that.

So to remind you, the data of interest here are the Kline data set. And I have another data frame called Kline2, which contains a couple of new variables we're going to need for this analysis. To remind you, the estimand here is the influence of population size on the evolution of technology, the idea being that larger population sizes produce higher innovation rates and help populations sustain larger collections of functional tools.
The confound here that I failed to model in the previous lecture, because we didn't yet have the tools to do it, is spatial covariation. Islands that are close together share unobserved confounds like raw materials and resources that require particular kinds of technology. And they also share innovations through social connections, through their social network, as it were. But now this is a macro-scale social network, not an individual-household sort of scale. And these influences have taken place over hundreds of years. How can we model this?

First, let me revisit the conceptual causal graph for this example and just add the unobserved confounds, to show you what we're worried about. We're concerned that spatially patterned unobserved confounds are influencing both the technology directly and also potentially the populations as well, through different resource bases and geographical features. And so if we try to estimate the influence of population size on tools, this is a threat that biases our inference.

We had used a model before, to remind you, that was based on a very simple innovation-and-loss model, and I repeat it here. The governing equation at the top says that the change per unit time in the number of tools in a society is a function of the inputs. That is the innovation rate: there's some parameter like alpha here that is the innovation rate, and then the population P and some diminishing-returns exponent beta, such that as populations get very large there's effectively no additional innovation from adding more people. And then the losses are subtracted from this, and losses are proportional to the number of things that can be lost. And I showed you how to solve for the equilibrium number of tools for a given population size and combination of parameters, and then we inserted this into our GLM.

Okay, how do we get spatial covariation and spatial confounding into a model like this? Let's back up a little bit and think clearly about how we're going to do that. And it'll make it a little bit easier if we take the population mechanics out of this at first, develop the spatial covariation strategy without that, and then fold it back in. I want to do this also because this is the way you should actually work on your own problems. You develop each little bit of the machinery of the golem and test it, make sure it's working independent of the other bits, and then you eventually combine the pieces together into the model you believe you need to get your estimate. You know what your destination is, but you can't go straight there. You've got to build and test incrementally. So let's do that. Let's build and test first the spatial covariation part, and then we'll fold that into our previous model.

So what we're going to build here will at first just look like an ordinary varying intercepts model. I've taken the outcome variable T sub i, the number of total tools in each society. This is a Poisson variable: it's a count, it has a minimum of zero, it's an integer, and it has no clear upper bound. And we're going to give this a typical log link for the rate, with some mean alpha bar. And then each society gets its own deviation from this mean, and these will be partially pooled. So we'll have a vector of all of them, and we're going to draw them now from a multivariate normal distribution, not just an ordinary normal.
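Putting what was just described into notation (my own transcription, following the previous lecture's innovation-loss model; the slide itself may use slightly different symbols), the dynamic model and its equilibrium are:

\[
\frac{dT}{dt} = \alpha P^{\beta} - \gamma T
\quad\Longrightarrow\quad
\hat{T} = \frac{\alpha P^{\beta}}{\gamma} \quad \text{(set } dT/dt = 0 \text{ for the equilibrium)}
\]

and the distance-only, varying-intercepts version of the tools model is:

\[
\begin{aligned}
T_i &\sim \operatorname{Poisson}(\lambda_i) \\
\log \lambda_i &= \bar{\alpha} + \alpha_{S[i]} \\
(\alpha_1, \ldots, \alpha_{10}) &\sim \operatorname{MVNormal}(\mathbf{0},\, K)
\end{aligned}
\]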
And the reason we're doing this is that the multivariate normal, as you know, has a full variance-covariance matrix. So now each of the societies can have a particular covariation with any of the others, and that's what's going to let us model the spatial influences if they exist. So in this multivariate normal prior, we'll have a vector of zeros. It's all zero-centered, because alpha bar gives us the centering of the predictions. And then we have the covariance kernel, capital K here. This is often called a kernel in this business, which of course literally means a seed: it's at the core of what we're doing. And it's just a covariance matrix. What's going to make it special, as you'll see, is how we're going to construct it.

But let's think about the traditional thing you would do. If this were an ordinary covariance matrix, it would be monstrous. It would be a 10 by 10 covariance matrix, because we have 10 societies to model. There would be one variance to estimate along the diagonal, sigma squared. And then there are 45 covariances, one for each pair of societies. I probably don't need to assert that with only 10 data points, it is not very practical to expect to estimate 45 of these covariances. But we're going to do it, and let me show you how.

The trick is Gaussian process regression. I'm going to build this up in the abstract, show you how these sorts of things work in general, and then we're going to come back to the Oceanic islands example and fold it in. What is a Gaussian process? A Gaussian process, if you look it up on the internet or in a textbook, is, and I quote, an infinite-dimensional generalization of multivariate normal distributions. What does that mean? What it means is actually pretty simple. Instead of trying to estimate all of the entries in a large covariance matrix, a potentially infinitely sized one, we can use a little function, a function of a small number of parameters, that specifies every entry in the covariance matrix. I'll say that again. Instead of trying to estimate a parameter for every covariance in a conventional covariance matrix, which in the islands example would be 45 of them, we can use a function of a few parameters which specifies a pattern of covariation over the entire matrix. And this allows us, effectively, yes, to have infinitely sized covariance matrices. That's what the infinite-dimensional generalization refers to. This is a very powerful approach. It's a standard feature of machine learning techniques, and I'm going to show you how to apply it to a couple of different problems in this lecture. The first is the Oceanic tools, and then after the break, we'll talk about evolutionary history and phylogeny.

So as I said, what this kernel function does is give us a covariance between any pair of points, and it does so as a function of the distance between the points. This distance can be metaphorical or statistical. We can construct it a number of different ways, depending upon the details of the problem. In our Oceanic islands case, it really is spatial distance, but it could also be difference on any other metric of interest, like temperature. It could be time. Many of the most common everyday applications of this technique use time differences, for example. Navigation systems use that.
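As a toy illustration of that idea (this is my own sketch, not code from the lecture), the same two kernel parameters can fill in a covariance matrix of any size, and draws from the implied multivariate normal are the wiggly functions shown in the animations that follow:

```r
# Toy sketch (not lecture code): a kernel with two parameters specifies
# every entry of a covariance matrix, whatever its dimension.
kernel <- function(d, etasq = 1, rhosq = 0.5) etasq * exp(-rhosq * d^2)

x <- seq(0, 5, length.out = 100)              # 100 hypothetical locations
D <- outer(x, x, function(a, b) abs(a - b))   # pairwise distances
K <- kernel(D)                                # 100 x 100 covariance matrix

# draws from MVNormal(0, K) are candidate Gaussian process functions
library(MASS)
f <- mvrnorm(3, mu = rep(0, length(x)), Sigma = K + diag(1e-6, length(x)))
matplot(x, t(f), type = "l", lty = 1, xlab = "x", ylab = "f(x)")
```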
In general, what you can conceive of is that the Gaussian process is a way to generalize our varying effects strategy, and its partial pooling advantages, to continuous categories, things like space and time and differences in age. What we mean by continuous categories is that things differ in their state, where they occur or when they occurred, and things that are closer in those states share more unobserved influences with one another. They're more similar, and we'd often like to model the similarity among all of the objects that are ordered along this continuous dimension and do partial pooling among them. But we'd like the partial pooling to be more local, because, of course, individuals of similar ages share more of those unmeasured common influences, things that are closer together will share more of those unmeasured influences, and so on. And this is what this strategy lets us do. It lets us extend the categorical varying effects approach, with its partial pooling benefits, to continuous dimensions.

Let's look at some pictures and get an idea how these things actually behave, and what an infinite-dimensional multivariate normal distribution looks like. It's not that terrifying. What I'm showing you on this slide, on the left, are some possible Gaussian process functions. In the simplest case, what these functions need to do is simply map two variables to one another, map inputs X to outputs Y, where X is the dimension of interest. So you can think of this as space or time on the horizontal axis, and then some observable measure on the vertical. And in the example here, all I've done so far is give you one point from this process, that black dot in the middle labeled one, and then three possible functions that relate X to Y. In Gaussian process regression, there are an infinite number of flexible splines that will pass through the points, or near them, and relate the variables X and Y. What governs or constrains the shape of these splines, these curves on the left, is the kernel function, which I'm showing you on the right. And the kernel function's job, as I've said previously, is to relate distance, on the horizontal axis there, the X variable in this case, to the covariance between points. This will make more sense once I start the animation.

So right now we've got only one point, and I'm drawing possible Gaussian process regression functions from the prior distribution. Now I've added another point, point two here. Let's pause the animation for a second and talk about what's happened. Once you've added the second point, the functions need to pass through that point as well. This fixes the Y value at that X value, but it also constrains where the functions can pass in between the points. And how wiggly and bendy the functions can be is constrained by the kernel function on the right, which specifies the covariance between points one and two, and all the points in between, because this is an infinite-dimensional multivariate normal. We haven't seen realizations of the points in between, but the model already has expectations about how they will covary. And that, effectively, on this graph, governs how bendy the function can be. So I've added the distance between points one and two to the plot on the right. Let me show you how they relate.
So if we measure the distance between points one and two on the horizontal axis on the left, and we map that distance onto the horizontal axis on the right, which is the distance axis, then we use the kernel function, the black curve, to get the covariance between those points. And the model uses this to construct an infinite number of possible Gaussian process functions that could explain these data. I'm showing only three here. This seems a bit mystical perhaps, but it's exactly the same Bayesian updating approach we've been using for weeks now. All these models are doing is using the definitions of the probability distributions in the model and the constraints imposed by the data, and that tells you the relative ranking of the possible functions that can explain these points.

Let's start up the animation again. You can see now that it wiggles a lot less between the two points, because of the assumed covariance between them. We add a third point, and now we've got an anchor down near the bottom as well. And I've added two more pairs, two more distances, to the kernel function on the right, so you can see what's going on. Points two and three are expected to covary less because they're further apart, and so their covariance, according to the kernel on the right, is much smaller. But point one in between, which covaries much more with both of them, keeps the Gaussian process functions from wiggling all over the place in between. This is how Gaussian process functions work in the simplest case, where the kernel function is known.

Before we talk about learning the kernel, though, let's think about measurement error. In typical applications, and in fact some of the most classical applications of Gaussian processes, the points aren't known with certainty, because they're produced by noisy instruments, or with observation error, or just some variable process independent of the Gaussian process itself. So now I'm showing you an animation where each point has some gray compatibility region that represents its uncertainty, its measurement error, if you will. And now the job of the Gaussian process is to do partial pooling in the presence of this measurement error. So when we add the second point, you'll see it flops around within the gray regions, because of the uncertainty about where exactly the points are. And the same when we add the third. This is a way that Gaussian processes are often used to deal with instrumentation issues and telemetry and the like. But you'll also recognize this as a very standard statistical problem, where there's some variance on the measurements that we need to adjust for.

What this does, effectively, is create local partial pooling. And this will be much easier to see if we add a fourth point. So now I've added a fourth point on the left there, near points one and two but below them. And you'll see that, because of the particular covariance kernel we've assumed on the right, the curves just aren't flexible enough to bend all the way down to point four. And because of the measurement error that's been assumed, the gray bars, the model thinks that four is a measurement error. It will effectively partially pool predictions in that region up towards the splines, to be more consistent with one and two. Of course, one and two are being pulled down a little bit as well, because they're not measured perfectly. But there are two of them and only one of point four, so they overpower it.
Point three is not being partially pooled very much at all, because it's not local to that group of three points, one, two, and four. And this is the effect of partial pooling.

Now what I've done is change the shape of the kernel on the right so it can be more flexible, and you'll see this results in less partial pooling, because the curves can be bendier, or wigglier, I think the technical term is. Now I've reduced the maximum covariance on the kernel, and you'll see that this also changes the curves: the kernel still allows a wide range of functional shapes, but because the maximum covariance is quite small in this example, the functions can't stray very far. And then they become even stiffer when I make the kernel flat, as before. And you can see how both the maximum covariance and the shape of that curve, as it declines with distance, govern how bendy, how wiggly the Gaussian process functions can be. And that affects how the partial pooling works. So effectively, when we do Gaussian process regression, we choose functions that specify the kernel on the right, and that kernel then controls the partial pooling that happens on the left. But we have to learn the kernel at the same time as we're learning the functions. So let's look at that now.

Here's a full example where we're learning everything at once. We're learning the functions, of course, on the left. We're learning their error, and I have an error bar now that corresponds to each of the functions, because how wiggly the function is affects how much error we need to assume about the points. And then on the right, I'm simultaneously learning the kernels, and there's a different covariance kernel for each of the functions we're trying to learn. Of course, there are an infinite number of these wiggly things inside the posterior distribution. I'm only showing you three of them and animating among possibilities. And you can see that as the kernel changes shape, it changes the allowable functions on the left, and that has implications for the amount of error, and vice versa. All of this stuff is being considered simultaneously inside the posterior distribution.

It's a little easier to appreciate the connections between these different unknown components of a Gaussian process regression if we look at just one curve at a time. So here I've done that, and I want you to see that the error bars, the wiggliness, and the shape of the covariance kernel on the right are intimately related. A change in one implies changes in the others. Now I've added the fourth point, and you can see that the Gaussian process wants to be a lot wigglier, and that has implications for the curve. If the function is wiggly, then the error goes down. And if it gets flat, because the kernel gets flat, then the error must go up in order to be consistent with the data. So in this sense, as I said before, Gaussian processes sound like a very fancy thing, and I guess they are. They're infinite-dimensional machine learning techniques. But they're learned from the data in the same way as every other model we've done in this course, simply using probability theory, sometimes called Bayesian updating.

Okay, let's get more pragmatic now. To make this work, you have to choose a function for the covariance kernel. And by far the most common choice, because it is extremely effective for partial pooling and learning in many systems, is the quadratic kernel, or the L2 norm kernel, as it's sometimes called. And this is a very simple function.
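Written out (in one common parameterization; the slide's exact scaling convention may differ, for example by a factor of two in the denominator), the quadratic kernel described next, and the Ornstein-Uhlenbeck kernel that follows it, look like this:

\[
k(x_1, x_2) = \alpha^2 \exp\!\left(-\frac{(x_1 - x_2)^2}{\sigma^2}\right) \qquad \text{(quadratic, or L2)}
\]
\[
k(x_1, x_2) = \alpha^2 \exp\!\left(-\frac{|x_1 - x_2|}{\sigma}\right) \qquad \text{(Ornstein-Uhlenbeck, or L1)}
\]

where \(\alpha^2\) is the maximum covariance and \(\sigma\) sets how quickly the covariance declines with distance.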
I have on the slide here that each element of the covariance matrix K, written K(x1, x2) for two points at locations x1 and x2, is given by this function, which looks extremely similar to the function for a Gaussian distribution. That's why this is called the quadratic kernel. And I show you the covariance function on the right: as the distance increases, the covariance declines, and it declines in a Gaussian shape. You need two parameters to specify this. You need, as I have it here, alpha squared, which is the maximum covariance, and sigma squared, which is a scaling parameter, a length parameter it's sometimes called, that determines how quickly the covariance declines with distance.

Another very common covariance kernel, which we'll look at in the second half of this lecture after the break, is the Ornstein-Uhlenbeck kernel, also sometimes called the L1 norm. This is extremely similar to the quadratic, except we're not squaring the distance inside the function; we're merely using the absolute difference. And this means that the maximum rate of decline in covariance is now at the origin, and we get a steeper descent, not a gentle Gaussian slope.

A third kind of kernel, which I'm not going to show an example of in this lecture, there's simply not enough time, is a periodic kernel. Lots of stuff in nature is cyclical, and there are lots of distances which loop around, like time. Now you may think, what's cyclical about time? Does Richard believe in reincarnation? I will not say. However, that's not what I'm referring to. What I mean is that often we have data with diurnal cycles in it, or annual or seasonal cycles. This is common both in the social sciences and in ecology. And in those cases, we want to fit functions that have orbits, essentially, built into them. Gaussian processes are often applied to these sorts of problems as well, and here's a very flexible periodic kernel at the bottom of the slide that allows you to fit flexible functions through diurnal cycles or annual cycles.

Let's take the quadratic kernel and stick it into the tools model. All we need to do is say that we're going to define each entry ij, where i is one island and j is another, in our covariance matrix K as this quadratic kernel function, where now eta squared, because I already used alpha, is the maximum covariance when islands are essentially right on top of one another, and a parameter rho controls the rate of decline. The input into this is the pairwise distance between i and j, and that is squared. And we have a handy table of pairwise distances. I've provided it for you as part of the rethinking package, measured in thousands of kilometers. The scale here is not so important. You can rescale these distances; you just also have to rescale the priors.

And so this brings up the question: what do the priors on eta squared and rho squared imply? I've chosen Exponential(2) and Exponential(0.5) in this case. And you're probably anticipating what I'm about to say: to understand what these priors imply, you really need to simulate from them, because they interact strongly, in nonlinear ways, to produce the shape of the covariance function. So let's look at that. You know how to do this; you've done lots of prior predictive simulations in this course. Here's the code for this one: we're going to sample from the distributions of eta squared and rho squared 30 times each.
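A sketch of that prior predictive simulation (in the spirit of the lecture's code; the plot limits and colors are my choices):

```r
# Prior predictive simulation for the covariance kernel:
# eta^2 ~ Exponential(2), rho^2 ~ Exponential(0.5), 30 draws each
n <- 30
etasq <- rexp(n, rate = 2)
rhosq <- rexp(n, rate = 0.5)

# plot the implied covariance as a function of distance (thousands of km)
plot(NULL, xlim = c(0, 7), ylim = c(0, 2),
     xlab = "distance (thousand km)", ylab = "covariance")
for (i in 1:n)
    curve(etasq[i] * exp(-rhosq[i] * x^2), from = 0, to = 7,
          add = TRUE, col = rgb(0, 0, 0, 0.4))
```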
And then I'm going to plot the implied covariance kernels on the right. You can see what these priors imply: they allow a wide range of possible covariance kernels. Some of them have covariance that decreases very slowly with distance, but most of them have some high covariance at close distances that declines relatively rapidly. Overall, a very wide range of possible covariance kernels. It's a good idea to experiment with these priors and see how they change the results of the analyses we're about to do, because Gaussian process regressions are often quite sensitive to these priors. Maybe not in ways that will change the qualitative conclusions of your results, but often there will be regions of the kernel which are not well informed by the data, and you can only discover that by changing the prior and doing some kind of sensitivity analysis.

Okay, we're ready to go. We've plugged in the covariance kernel, and we now have an infinite-dimensional multivariate normal. Where are the infinite dimensions? We could invent a new island at any location and the model would already have a prediction for it. It's implicitly contained inside our multivariate normal prior, because the covariance kernel can expand to any size without any new parametric complexity.

So let's run this model. We have the Kline2 data set and the islands distance matrix, and we put them in a data list and pass it to ulam. The only new bit of code here is not the multi_normal part, you've seen that in previous lectures; it's this covariance kernel for the Gaussian process. There's a convenience function in ulam for this, cov_GPL2, a covariance kernel for a Gaussian process with an L2 norm, which just uses the formula you see on the right to construct the entries of K. So K is a 10 by 10 matrix, and all of its entries are defined by taking the distance matrix D and the parameters eta squared and rho squared and executing that function.

You can run this model and look at the precis output, and it will not be very illuminating. We get an estimated varying effect for each of the 10 islands, and these are partially pooled, but they're locally partially pooled. To the extent that the covariance kernel thinks that islands that are close together are more similar, the ones that are close to one another will be shrunk closer to one another. And that shrinkage is controlled by eta squared and rho squared. If you're like me, there's no way to tell by reading this table what any of that implies, so we're going to simulate from the posterior distribution.

What I'm showing you on this slide, on the left, is the covariance kernel again. The black curves are samples from the prior, which we did previously, and the red curves are samples from the posterior. And what you see is that the model has learned from these observations that the typical maximum covariance is lower than in the prior: islands that are close together are similar to one another, but not completely. And in general, after a couple of thousand or three thousand kilometers, there's very little residual covariance from space, according to this model. We can zoom in on the map on the right, which tries to represent these covariances with line segments between the islands. This is a longitude-latitude map of the locations of the islands, very fancy, I know.
The line segments between them are, in a sense, a totally saturated social network of possible historical linkages, linkages arising both through spatial connections and through common exposures and resources. And you'll see that the darker line segments are the larger covariances. I've scaled this so that total black is the largest covariance in the posterior, and then it fades to white as it gets lower. So you can see that the islands that are close together, Malekula, Santa Cruz, and Tikopia, but also Lau Fiji and Tonga, are more similar than expected, according to this model, given their spatial locations. There's some covariance between the others, like Manus and the Trobriands, but none of these have any residual similarity to Hawaii, as Hawaii is very, very far away from all the others. This model has no explanatory variables in it other than space right now, so you can think of this as just a raw assessment of possible spatial confounding: how similar are the different islands, given their locations?

So let's fold in the explanatory part now. To do this, what I'm going to do is take our previous model with population size and its elasticity beta, and I'm just going to multiply it by e to the alpha sub S, as you can see in the lambda line there. What this does is effectively make these varying effects that come out of the Gaussian process kernel into proportional adjustments for the expected number of tools at equilibrium. Remember, this expression, alpha bar times P to the beta, divided by gamma, was the equilibrium expectation from that little dynamic model. The alpha sub S values are zero-centered Gaussian deviates, and when we exponentiate them, they're all positive. If alpha were zero, then e to the zero is one, and that means the expectation is unchanged. If alpha is below zero, the multiplier is between zero and one, and you get a smaller expectation. And if alpha is greater than zero, the multiplier is greater than one, and that increases the expectation. So in this way, these Gaussian process varying effects, which are locally partially pooled by distance, can adjust the expectations from the model and measure the spatial covariation that deviates from the theoretical expectations based only on population size.

So we run this model now. I'm comparing what I call the empty model, which is the model that did not have population size in it but does have the spatial Gaussian process kernel, that's in black there, 30 samples from the posterior, with the new posterior distribution from the population model, the model that has both population and the spatial kernel, in red. And you can see that what the model has learned, after introducing population, is that there is less residual covariance to explain. That's because population explains a lot of the structure in these data. So the red curves are lower, but not completely flat. There's still some evidence that there may be some additional similarity because islands are close together; it seems likely. And on the right, I'm showing you now, not on a map scale, but log population on the horizontal against the total number of tools, the same islands again, and again a kind of network among them, where the intensity of the line segment reflects the residual covariation. The trend in blue is the population trend in the posterior, and you'll see that it's just as strong as before. Okay, that's a lot of work.
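For reference, here is a hedged sketch of the full model just described, written in the style of the rethinking package's ulam code. Variable names like total_tools, population, and islandsDistMatrix come from the Kline2 data objects; the priors and the cov_GPL2 helper follow the book's version of this model, so details may differ from the exact lecture script.

```r
# Sketch: Oceanic tools model, with the innovation/loss equilibrium
# scaled by Gaussian-process varying effects over spatial distance.
library(rethinking)
data(Kline2)
data(islandsDistMatrix)
d <- Kline2

dat_list <- list(
    T = d$total_tools,
    P = d$population,
    society = 1:10,
    Dmat = islandsDistMatrix )   # pairwise distances in thousands of km

m_tools_GP <- ulam(
    alist(
        T ~ dpois(lambda),
        # equilibrium tools (a*P^b/g), adjusted by exp(k[society]), the GP varying effect
        lambda <- (a * P^b / g) * exp(k[society]),
        vector[10]:k ~ multi_normal(0, K),
        matrix[10, 10]:K <- cov_GPL2(Dmat, etasq, rhosq, 0.01),
        c(a, b, g) ~ dexp(1),
        etasq ~ dexp(2),
        rhosq ~ dexp(0.5)
    ), data = dat_list, chains = 4, cores = 4 )
```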
And that was our spatial confounding example. I think we need a break. Take a quick look through the slides, familiarize yourself with the structure and the basic purpose again, and then take a walk. When you come back, I will be here.

Welcome back. In the second half of this lecture, we're going to look at a problem analogous to spatial confounding, which is phylogenetic confounding and covariance. This is the primate evolutionary tree, or at least a statistical reconstruction of it. The primates evolved when the dinosaurs were still around, some 80 million years ago, and have since diversified into lots of different types. In the sub-parts of this tree, you have groups which are more closely related and share their own peculiar features: the apes, of which we are a member, which are larger bodied and have longer lives; the African and Asian monkeys, a very large, diverse, and successful group, both arboreal and terrestrial; the mainly South American, but also Central American, monkeys, which are mainly arboreal and smaller bodied; and then the prosimians, the tarsiers, and oh so many lemurs, all of them on Madagascar, and then finally the mainly nocturnal galagos and lorises.

We're interested in this problem. Well, I'm interested in this problem because I'm an anthropologist, but you should be interested in it because there are lots of scientific problems with this structure, where a bunch of historical events have created patterns of similarity and relationships among different entities, in this case the species on this tree, and the branching relationships by which they share historical lineages have constructed, to some extent, patterns of similarity. How do we account for that when we study the features of these organisms, or these entities, in other ways? It's like the spatial confounding example in the sense that the details of all that history are lost to us. But we still know, at the large scale, the macro scale, that there should be a pattern of covariation that arises from those historical processes. And so we can have some success, even if we don't know all the details of the past, just like in the Oceanic islands case.

So we're going to work with the Primates301 data set in the rethinking package. This is data provided by Sally Street and her co-authors from their 2017 paper, the citation given at the bottom. In this data set are life history traits of these 301 species. We're going to focus, for the sake of the example, on three of them: the adult body mass in grams, the brain volume in cubic centimeters, and the average size of social groups. All of these things vary a lot in primates. There are very small ones, like many of the arboreal prosimians. Brain size especially varies tremendously, even after accounting for body size differences, and group sizes vary a lot as well. There are solitary primates, and there are primates, like humans, that live in very large groups.

This data set is also a very realistic, perhaps the most realistic, sort of data set that I've introduced in the course, but we won't deal with all that realism yet. We're going to postpone that. There's a lot of missing data: not all of the variables of interest have been measured for all of the species, because that's just how research is. There's also a lot of measurement error in some of these things, and almost certainly unobserved confounding as well. We're only going to try to deal with the unobserved confounding today.
I'm now going to push those other substantial and interesting problems aside. Okay, so we have 301 species, but it's only 151 of them for which all three of the variables of interest have been measured. So we're going to work with only the complete cases. This is something that needs to be justified, and we will in a later lecture. What I want you to notice for now is that the missing cases are not randomly distributed through the tree, at least not plausibly so. There's much better coverage for the apes, due to human narcissism, quite likely. No, there are legitimate research reasons that we study the apes more intensely than the other primates. But there's reasonable coverage for most of the groups. The lemurs are dramatically understudied, in my opinion. So the missingness is not randomly distributed through the tree, and as we'll discuss in a later lecture, situations like this are potentially hazardous, can introduce their own biases, and must be approached, also, from a causal perspective. But let's put that aside for the moment, so that the lecture is not too complicated.

Let's work with these 151 species. I've redrawn the tree here and tried to plot the character states with symbols at the tips. This is not an easy way to read data, and I don't mean it as a serious way to do the analysis, just to show that there are three variables and to give you some quick idea. If you focus, for example, on the left, on the apes, the filled circles are relative brain sizes, the open circles are body sizes on the log scale, and the triangles are group sizes, also on the log scale. And you'll see that a number of apes have bigger brains on average, but also larger bodies. Some of them, like the chimpanzees, live in quite large groups compared to many others. But lots of monkeys also live in large groups, and so on. Of course, we can't just do the statistics visually here, because there's a causal structure that relates these variables to one another, and we're going to have to think about that. There is no general statistical solution to figuring out the relationships among these variables, whether they're on an evolutionary tree or otherwise.

As you might expect, all three of these variables are very strongly related to one another across the tree. On the graphs on the left, I'm just plotting the values shown at the tips for the three pairwise combinations, and you'll see that there are strong correlations in all cases. There are lots of causal structures which can produce this, so this is not a shocking thing to you. As you know, correlation is common, and it is often not driven by the causes we care about, at least not directly.

What do people do when they analyze data like this? Well, I'm sad to say that the status quo in evolutionary ecology is not good. The status quo in evolutionary ecology is to do what I call causal salad. Causal salad means tossing factors into a regression, often a regression with phylogeny included, and interpreting every coefficient as causal. Now, if you've come with me this far in the course, you know that this is not legitimate. There is no situation in which that makes scientific sense. But it is commonplace in evolutionary ecology, in comparative methods, to try to explain the patterns of covariation among the traits of different organisms using this approach. It is not legitimate, and we need to do better. And the way to do better is not some statistical technique that needs to be introduced, but thinking causally about what's going on.
Phylogeny is something you'll often hear people talk about as needing to be controlled for, and they use phylogenetic regression of some kind to control for phylogeny. But as you've learned, adding controls can be dangerous as well as helpful, and what a control means has to be interpreted very carefully, and can only be interpreted from a causal perspective. So phylogenetic regression, just like ordinary regression, still requires causal thinking. Let's do a little bit of causal thinking.

We're going to use these three variables to address something known as the social brain hypothesis. This is a long-standing and somewhat popular, well, it used to be popular, hypothesis about the evolution of primate brain size. The idea is that many primates live in social groups. Most mammals don't. Most mammals really don't live in big groups at all; think about a bear, for example, it's not really a group-living organism. Primates are conspicuous in that many of them, although not all, do live in groups. And the demands of social life may favor cognitive skills that will be reflected anatomically in larger brains. So there's interest in the arrow from G, group size, to B, brain size. But of course, things like mass are fairly obvious biological confounds of this relationship, because mass can influence brain size, larger organisms have bigger brains, relatively speaking, and mass can also influence group size, because, for example, larger monkeys can't live in trees as well, and they're more likely to be terrestrial, and if you're terrestrial, you can have bigger groups. So there are paths of that sort that connect even non-obvious things like body mass to things like the size of your social group.

There are lots of different ways to connect these three variables, and the phylogenetic comparative literature will take triplets or quadruplets, or even larger collections of variables, and put them in regressions without considering anything about the relationships among the predictor variables themselves. I know you know about this, because you've come with me this far in the course, but this is really dangerous. So just as an example, I've pushed the basic confound example, where M is a confound between G and B, to the left, and I want you to very quickly, with your eyes, because you're a professional at this now, analyze these other examples I've given you.

First, in the middle, there's still an arrow from G to B, and that's our interest. But now there's a mediating path through body mass, and body mass is confounded with brain size. The right thing to do in the middle graph, if that's the causal graph you subscribe to, is to ignore body mass. The reason is that part of the total causal effect of group size on brain size goes through body mass, and therefore estimating the estimand only requires a regression that uses G. Furthermore, if you do stratify by body mass, you introduce a confound through the unobserved confounding between body mass and brain size. So again, controls are not always safe. There are bad controls and there are good controls, and the only way to decide is to state your assumptions clearly in the form of at least a heuristic causal model, but hopefully something more elaborate as well. And then finally, on the right, I've done something really naughty here and reversed the causation between brain size and group size.
There's no reason it can't be like this, or, more likely over evolutionary time, there may be reciprocal causation between both of these variables. In this case, all kinds of things can go wrong, because you've even chosen the outcome wrong, and stratifying by G means stratifying by a collider. There is no interpretation without causal representation, and that goes for this business as well as all of the examples we've done so far. Okay, enough of my sermon. Let's get back to the group size, brain size, and body mass example.

What we need to do to augment this is make it a bit more realistic. As I said, there's unobserved confounding in data sets like this. This is not an experiment. This is nature. There is no experimental way to study the relationships among the traits in the primate family tree. We're stuck with observational science, and that means we're stuck with confounds, and we've got to do our best to deal with that if we hope to learn anything from these data. What I mean by unobserved confounding is that there are almost certainly common environmental exposures and other things which are also reasons that these variables can covary across species. Some of this could be, as I said, common environments. Organisms with particular traits tend to stay within particular niches, or "neeshes" if you say it that way, and that gives them common selection pressures, which can maintain their traits or the covariation among multiple traits. We can't see those things in the graph. Maybe we can get data on those things, but you need a good theory to do that. In our case, this is unobserved confounding.

Another source of unobserved confounding is, of course, history. That is, there are innovations on particular parts of the tree that are shared among the lineages descended from that part of the tree. This will also create patterns of covariation which are, in a sense, unobserved confounds: reasons that group size and brain size could be associated which don't have to do with the causal effect of group size on brain size. These sorts of things are really common in evolutionary systems, and this is a famous kind of confound in evolutionary ecology that people have thought about a lot. But let's think about it in a way that's not often thought about. It is thought about this way, but not often. Let's think about it in terms of the whole sequence of causation over evolutionary time.

The way that these historical confounds get built into systems is not mysterious at all. Imagine just a cartoon example. We start with some basal little mammal there, like this little long-tailed rat. At some point in time, it's the only member of what will eventually be a big, expansive, diversifying clade. But early on, it has some particular group size, here g sub 1, and some particular brain size, b sub 1. After some time, there's diversification, and now there are two members of this clade, numbered 1 and 2, and we have this kind of merged, synthetic graph here. This is a DAG to represent the causal relationships through time, from the bottom to the top, among these variables. So g1, from the first time period at the bottom, influences the group sizes of both 1 and 2 at the top, and that's because there was a common ancestor there that has contributed variation to both. And likewise, b1 at the bottom has influenced both b1 and b2 at the top, again because of descent, of lineage relationships.
But we're asserting that there's a causal influence of group size on brain size over time, and that means that the group size back in time, at the bottom of the graph, has also influenced the later brain sizes, as indicated by the red arrows. And we can keep going, adding more time up towards the top of the slide, yet another step, and here I have no branching after this, just continued evolution along the lineages. But still, the group sizes in the previous time period, now not from the bottom, because that time is gone, but from the middle rung, are influencing brain size again. And then at the top, and I stop here only because I've run out of slide, we can have more branching. We get another species there, number three, and additional causal relationships among these variables that reflect both the causal effects of one variable on the other and the through-time causal effects of descent within lineages.

What happens in this kind of system, of course, is that in each horizontal slice through this DAG, there are patterns of expected covariation among the variables, which have to do with all the arrows that are beneath it, but not above it. And so this is a sense in which causal assumptions, as always, give implications about the kinds of data we're likely to see and help us design statistical approaches for avoiding being confounded. We could just apply the ordinary backdoor criterion and everything else if we had observed all those variables, but unfortunately this never happens. Well, I shouldn't say never, but for primates it never happens. All we get to observe are the tips, this contemporary slice of organisms. We don't know the causal relationships, the red arrows, and we don't know the lineage relationships, the black arrows, and we have to infer all of that.

This gives us two problems that have to be worked on simultaneously, two conjoint problems. The first is, of course: what is the history of the black lines on this graph, often called a phylogeny, a tree-like structure that represents, at some scale of abstraction, the lineage relationships going back in time between the different groups of organisms? And then the second: how do we use it to model causes? Let me just take a couple of minutes to talk about each of these in a general sense, and then we're going to push forward and analyze the primate data.

So first, what is the history? This problem of inferring historical branching relationships among organisms is one of the primary ones in evolutionary biology, going back a long way. Although this technique of representing historical relationships actually started in linguistics, where it was originally used to describe relationships among languages through time, it really flourished in biology, and especially in the genomic era, where high-quality genome sequences have provided really nice, and often much revised, historical pictures of the relationships among organisms, how quickly they've diversified, and so on. This is one of the ways, for example, that we've learned that primates are probably much older than we thought they were, and that they coexisted with dinosaurs. Humans, not. Humans are very recent, just to be clear. Okay, but problems with inferring phylogenies remain, and there's a lot of innovation still to go here. And there are some problems which are probably insurmountable, to be honest. I don't say that to be negative. It's just to be realistic.
It's good to know which problems you can solve, so you can put energy into those, and which are lost in the mists of time. Even in the best cases of high-quality data, substantial uncertainty can remain in the shape of the branching relationships among species. There are parts of the tree which simply cannot be inferred. That's just true, and it's important to know that the job of statistics is not to give us the answers we want. It's to tell us what answers can be justified. I'll say that again. The job of statistics is not to give us the answers we want. It is to tell us which answers can be justified by the model and the data.

Second, evolutionary processes are complex. They're not stationary. Evolution does not proceed in all parts of the tree, and through all times, in the same way. And that's one of the things we want to learn about, in fact. But since so much of the tree is lost, and it is difficult to connect fossils to living organisms, this is a very hard problem.

Third, there is no one phylogeny, no one branching diagram, which is going to be correct for all of the traits that we might want to describe about a group of organisms. And the reason is that species are abstractions. They don't really exist. What exist are individuals, and lineages apply to individuals and their patterns of descent from parents to offspring. A group of organisms that are similar to one another we might call a species, but that is a social construction, a very useful one. One consequence of this individual reality of evolutionary histories is that one particular trait can branch and split into separate lineages while another trait branches and splits into other, separate lineages. This is called incomplete lineage sorting, and it's extremely common in natural systems. Famously, there are many parts of the human genome for which we are more closely related to gorillas than we are to chimpanzees, while for the rest, for the most part, we are more closely related to chimpanzees than to gorillas. And there are even some parts of our genome where, plausibly, we're more closely related to orangutans than to either chimpanzees or gorillas. This is not a mysterious thing at all. Evolutionary biologists know why it happens. But it means that if we just infer one tree, there are some kinds of tasks we might try to do with that tree for which it will be ill-suited.

There's this little niche literature on cultural and linguistic phylogenies. As I said, phylogenetic reconstruction began in linguistics quite some time ago, and evolutionary biology sort of took it over from there. But phylogenetic reconstruction in cultural and linguistic domains has not really flourished and become a big, mature field like it has in evolutionary biology. And I think that's a shame. I think there's still a tremendous amount of potential there, but it hasn't been realized yet. So what I say on this slide is that cultural and linguistic phylogenies remain incredible, and I use this as a double entendre. Incredible in the sense that there's a lot of value to them. There are histories to cultural and linguistic things, because there is shared descent in cultural systems; there's really something to that metaphor. On the other hand, the ways that we need to reconstruct cultural and linguistic histories have got to be different from the way we reconstruct genetic histories, and currently that's not so much followed.
People use software built for the analysis of genomes, but they put in binary cultural trait data or linguistic data, and the software doesn't know it's not genes and is perfectly happy with that. I think this is a necessary phase that the field has to pass through, but it's going to have to get better. And the consequence of this is that most social scientists and humanists and linguists don't pay any attention to cultural and linguistic phylogenies. I think that's a shame, because they will live up to their potential eventually.

Whether we're talking about genomics or linguistics, the basic truth is that phylogenies do not exist. Like social networks, they are complex, structured abstractions that compress the sample and give us ways to make inferences about it. But we must use them carefully, because if you assume there's a phylogeny, then you will always find one. The data will be reflected in it, but it may not reflect the causal processes you're interested in. This is not to insult phylogenies or say they're bad. Abstractions are essential and good. Just as with social networks, you simply have to remember that phylogenies are not real.

Okay, second item. How do we use a phylogeny once we have one, this abstraction that represents the history of a group of species as a branching process? How do we use it to model causes, or how can it help us model causes? There's no universally correct approach, because there's no universally correct causal model. You have to think about causation, posit an evolutionary process, and then use some logic to justify how your statistical approach deals with the issues presented by that process. However, the default approach in the evolutionary sciences is to use a regression technique which is actually an example of Gaussian process regression, although it's not usually explained that way. It's often called phylogenetic regression, but I'm going to explain it as a Gaussian process so you can see its unity with the other methods we've introduced. And this is going to provide a way for us to get the accumulated historical relationships among the species into our causal graph.

Let's begin, though, by thinking about an ordinary regression, like this one on the screen. This would be just an ordinary regression of brain size stratified by group size and mass. We do this because our estimand is the effect of group size on brain size, and we stratify by body mass M because we need to close the backdoor path that goes through body mass. This is just an ordinary linear regression, like from early on in the course, nothing new. The first thing I want you to consider is that we can rewrite this model, and we can rewrite any linear regression, so that the top level is a multivariate normal. This is exactly the same model as on the previous slide, but now the whole vector of brain sizes, all 151 of them, is represented by the symbol B at the top. And we say that this whole vector of length 151 is drawn from a multivariate normal of dimension 151, with a vector of means mu, also of length 151, because each species has its own expected value for brain size, and some covariance matrix, capital K. This, as you can probably guess, is going to become a Gaussian process kernel, but it's not yet. It's just an ordinary covariance matrix right now. And we define it as the world's simplest covariance matrix, there in that line: K equals that big bold capital I times sigma squared. What is this thing? The capital I is something known as the identity matrix.
It's a matrix that's full of zeros everywhere except along the diagonal, where it has ones. This is the matrix version of the integer one: if you multiply a matrix by it, you get the original matrix back. Then we multiply this by sigma squared, and all that does is put the variance along the diagonal. So now we have a big covariance matrix where all of the covariances among observations are zero and the variance is the same for each observation, and that's exactly what an ordinary linear regression assumes. So this is exactly the same model as previously, just written in a bit of a weird way. But there's a reason to do this, because we're going to use it to move forward.

Here's the code for both of these, just to show you how it's done. On the left we have the classical regression. I expect I don't have to explain that code to you. I set up the data list, and we're using standardized log versions of each variable: mass, brain size, and group size. Then on the right we have the equivalent code in multivariate form, where there's a multi_normal distribution at the top and we build up that K matrix literally just by multiplying the identity matrix by the squared standard deviation. You run these models and you get exactly the same results, and I mean exactly. Usually with Markov chains you get somewhat variable results, but these models are really identical, so even the widths of the compatibility intervals are the same in this example.

Okay, that's the first level. Now, what we want to add to this, remember, is that in our causal graph, shown on the right here, we've got these unobserved confounds U, which are influenced by the history. One way to think about the implications, as we've done sometimes in the generative examples in this course, is that you can think of these unobserved confounds as just things in the model that are particular to each case, but that have some covariance structure that has to do with the common causes. So each species i has a u sub i value, which is unobserved, and our idea is to get some information about the structure of covariation among these u sub i's from the phylogeny. I'll say that again. Our tactic here is to posit that each species has some unobserved u sub i which is influencing its expected brain size, and that there's a pattern of covariation across species in these unobserved u sub i's that is influenced by phylogeny. So how do we get the phylogeny to produce all these u sub i's, 151 of them? That's what we're going to do.
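Before turning to the phylogeny, here is a hedged sketch of the two equivalent regressions described a moment ago, in the style of the rethinking package's ulam code. The column names body, brain, and group_size come from the Primates301 data frame; the priors and object names are my assumptions, not necessarily the lecture's exact script.

```r
# Sketch: an ordinary linear regression and its rewrite as a
# multivariate normal with covariance K = I * sigma^2.
library(rethinking)
data(Primates301)
d <- Primates301
dd <- d[complete.cases(d$group_size, d$body, d$brain), ]  # 151 complete cases

dat <- list(
    N_spp = nrow(dd),
    B = standardize(log(dd$brain)),
    M = standardize(log(dd$body)),
    G = standardize(log(dd$group_size)),
    Imat = diag(nrow(dd)) )

# ordinary regression of brain size on group size and body mass
m_lm <- ulam(
    alist(
        B ~ dnorm(mu, sigma),
        mu <- a + bG*G + bM*M,
        a ~ dnorm(0, 1),
        c(bG, bM) ~ dnorm(0, 0.5),
        sigma ~ dexp(1)
    ), data = dat, chains = 4, cores = 4 )

# the same model, rewritten with a multivariate normal at the top
m_mvn <- ulam(
    alist(
        B ~ multi_normal(mu, K),
        mu <- a + bG*G + bM*M,
        matrix[N_spp, N_spp]:K <- Imat * sigma_sq,
        a ~ dnorm(0, 1),
        c(bG, bM) ~ dnorm(0, 0.5),
        sigma_sq ~ dexp(1)
    ), data = dat, chains = 4, cores = 4 )
```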
First, let's think about a phylogeny. This is just a random phylogeny, not the primate phylogeny. If you were going to simulate evolution on this phylogeny, taking the phylogeny as known, what would happen? There at the center you'd start simulating, and traits would come out and they'd branch, and then innovations would be shared among all the descendants that follow from that branch. So for any particular branching structure there will be different implications of the same evolutionary process, and the patterns of variation we see across species are products of both the evolutionary process we assume and the particular historical events that have led to branching. What I'm showing you now is a bunch of evolutionary scenarios simulated on this tree. Each flash is a different evolutionary simulation, just for two binary traits in red and blue, but I'm really simulating on the tree, so innovations, whether a flip to red or a flip to blue, are shared by descendants until some other flip happens. As a consequence you see this clustering in parts of the tree, because most of the branching in this particular tree takes place in the outer part of the tree, near the tips. But if we had a different tree, say this one, where there's a lot more branching deep in time, and we run the same evolutionary process, we'll get a different pattern of variation at the tips: now much less clustering, much more uniformity across the whole tree, because many of the innovations happen deep in time and are shared, whereas on the left the innovations are happening in more recent time and are not shared across the tree. So the combination of the tree and the evolutionary process gives us an expectation for the pattern of covariation across the tips, and this is the insight that's going to get us some way along to where we want to go. The combination of an evolutionary model and the tree structure has implications for the pattern of covariation at the tips. In particular, under almost any reasonable evolutionary model, and I've never seen one that doesn't produce this, the covariation between any pair of species will decline with what's called their phylogenetic distance. What does that mean? The phylogenetic distance is the branch length on the tree from one species to the other, meaning you start at one tip and you crawl down the branch to the next fork, and then you go down as deep as you have to to find the last common ancestor of the two species, and then all the way back up. The length that you've walked along those branches is the phylogenetic distance, and that distance is relevant because it's the amount of time in which accumulated changes could have happened that would make the species different from one another.
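Computing these distances is mechanical once you have a tree. Here is a minimal sketch using the ape package and the Primates301_nex tree that ships with rethinking; the rescaling by the maximum distance is a convention I'm assuming, so that the largest distance in the data equals 1.

```r
## Sketch: pairwise phylogenetic distances (summed branch lengths) between all observed species.
library(rethinking)
library(ape)
data(Primates301)
data(Primates301_nex)   # a consensus primate tree distributed with rethinking

d <- Primates301
dstan <- d[ complete.cases( d$group_size , d$body , d$brain ) , ]
spp_obs <- as.character( dstan$name )

tree_trimmed <- keep.tip( Primates301_nex , spp_obs )   # keep only the observed species
Dmat <- cophenetic( tree_trimmed )                       # tip-to-tip branch-length distances
Dmat <- Dmat[ spp_obs , spp_obs ] / max(Dmat)            # order to match the data, rescale to [0,1]
```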
But you can't do anything with that distance unless you have an evolutionary model to go with it. You need to say something about rates of change along branches, what the trait space is, and a bunch of other stuff. There's a huge range of models you could explore, and people sometimes do, but for common sorts of regression analyses in evolutionary ecology there are essentially only two approaches in wide use. The first is to assume Brownian motion, which is a neutral evolutionary model that's unconstrained. Under Brownian motion, the covariance between any two species declines linearly with their phylogenetic distance. I'm trying to show you this in cartoon form on the right of the slide, where phylogenetic distance is on the horizontal, covariation is on the vertical, and the red line is the Brownian motion implication. The other common approach is to assume a process, actually a family of processes, known as Ornstein-Uhlenbeck, which is a kind of damped Brownian motion. I mentioned this process already when I talked about Gaussian process kernels, because it's a very common choice there as well. When we say that the Ornstein-Uhlenbeck process, often called OU, is damped Brownian motion, what that means is that it's a random walk near some particular mean, or multiple mean values, but if it gets too far it tends to be pulled back, and so the variance is constrained. What this means, when we project it onto phylogenetic distance on a tree, is that covariance declines most rapidly for organisms that are closely related, and for organisms that are very distantly related it declines more slowly. That's what I'm showing you with the blue curve on the right, and that is, yes, a Gaussian process kernel, and that's what we're going to put into our model. So we're just going to take out our simple K matrix and replace it with the Ornstein-Uhlenbeck kernel. Remember, the OU kernel is very much like the quadratic one we used before the break, but the distance is not squared, and that's what makes the decline of covariance fastest at the origin. Then we add some priors for eta squared and rho at the bottom, and I'm using what are called half-normal distributions. A half-normal is a normal distribution that is only defined for positive values, and that's because both eta squared and rho have to be positive here. Why have I chosen the particular values, 1 and 0.25 for the maximum covariance prior and 3 and 0.25 for the rate prior? By doing prior predictive simulations, so let's take a look at that now. If you draw covariance functions from these priors, you'll get the graph on the right. I think this is 30 or 50 draws from the prior, shown as the red lines, and you can see that the prior expectation is that closely related species are more similar, but the amount of similarity can range very widely, and it declines such that there's some covariation over the whole tree, though at about halfway through the maximum phylogenetic distance it's quite small. This is just a prior, and what we're going to do is look at how the posterior differs from this prior as we put in explanatory variables. That means we need to start with an empty model that doesn't have anything to stratify by. We're just going to have the outcome of brain size, all 151 measurements in a long vector, and we're going to model the covariation among those measurements as it's related to the phylogenetic distance between each pair of species. So yes, this is a 151 by 151 covariance matrix, but don't worry, it's a Gaussian process, which is infinite dimensional. We're going to run this model, and it already has effectively infinite dimensions inside of it, because for any phylogenetic distance we could put in, we'd get a prediction. We don't have to derive some new covariance estimate for that particular pair of species; it's already in the Gaussian process. So here's the simple OU model of brain size, just raw phylogenetic regression with no explanatory variables, and you see here the Gaussian process kernel construction, cov_GPL1, for the L1 norm. The OU kernel is often called the L1 kernel because the distance enters without being squared, and cov_GPL1 is just a convenience function that ulam provides to help you compute it, but it's just the formula on the right.
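Here is a minimal sketch of both steps just described, assuming the Dmat built above: drawing covariance functions from the half-normal(1, 0.25) and half-normal(3, 0.25) priors, then the empty model built on the cov_GPL1 convenience function. The plotting details and the 0*M trick for making mu a vector are my own choices rather than the slide's exact code.

```r
## Prior predictive for the OU / L1 kernel: k(d) = eta^2 * exp( -rho * d )
n_draws <- 30
etasq_p <- abs( rnorm( n_draws , 1 , 0.25 ) )   # ~ half-normal(1, 0.25): maximum covariance
rho_p   <- abs( rnorm( n_draws , 3 , 0.25 ) )   # ~ half-normal(3, 0.25): rate of decline
plot( NULL , xlim=c(0,1) , ylim=c(0,1.5) ,
      xlab="phylogenetic distance" , ylab="covariance" )
for ( i in 1:n_draws )
    curve( etasq_p[i] * exp( -rho_p[i] * x ) , add=TRUE , col=col.alpha("red",0.4) )

## "Empty" phylogenetic model: no predictors, covariance from the OU kernel
dat_list$Dmat <- Dmat
m_empty <- ulam(
    alist(
        B ~ multi_normal( mu , K ),
        mu <- a + 0*M,    # the 0*M term just makes mu a vector of length N_spp
        matrix[N_spp,N_spp]: K <- cov_GPL1( Dmat , etasq , rho , 0.01 ),
        a ~ normal( 0 , 1 ),
        etasq ~ half_normal( 1 , 0.25 ),
        rho ~ half_normal( 3 , 0.25 )
    ) , data=dat_list , chains=4 , cores=4 )
```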
We run this model and take some draws. Again the prior is shown in black, this is what was shown as the red curves on the previous slide, and the posterior draws are in blue. You can see that the model has learned that there is less influence of phylogeny, according to this model and data, than the prior expected on average, but still some, and importantly still some over the whole tree, although it is very weak for distantly related species. Now let's actually introduce the predictor variables and stratify simultaneously by group size and body mass, the log of each. The model looks essentially the same; I've just augmented the linear component and added priors for the regression coefficients, nothing too surprising here. And this is your phylogenetic regression, a very standard phylogenetic regression of brain size on group size while controlling for body size as a confound. Let's look at the posterior distribution of the phylogenetic distance kernel, the covariance matrix. To remind you, the black is the prior that we started with originally, the blue is what we got from the so-called empty model that contains no predictors, and now in red is the model that stratifies by mass and group size, and it's flat against the bottom. After stratifying by body mass and group size, there's essentially no phylogenetic signal. That doesn't mean the phylogeny doesn't matter in primates, of course it does. It means that after accounting for patterns of covariation according to group size and body mass, there's nothing left for the observed distances among the species to explain. That's all it means. This is a common kind of regression effect, right: the total variation in the data is still the same, we're just partialing it among different components. In this particular example it's body mass that actually has this effect, so I leave it as an exercise to the viewer, but if you take group size out of this model you get essentially the same flat red set of kernels down there at the bottom. Okay, now let's look at the posterior distribution of the coefficient on group size, which is our estimand, remember. What I'm showing you on the right are two posterior distributions. Let's think about the black one first. The black density is what I call the ordinary regression, the regression of brain size on group size while controlling for body mass, which did not include any information about the phylogeny, just an ordinary multiple regression. You can see that it's substantial: all the posterior probability mass is above zero. After including the OU process, as informed by the primate tree, it's essentially halved in size, more than halved on average. So a much smaller effect, but not removed, right, a much smaller effect. This could still of course be a result of statistical bias, or additional unresolved confounding, or some particularly wicked pattern of missing data or measurement error, or any number of other things. But research programs like this are always slow and progressive. No single analysis is going to tell us what goes on. We have to make a model, state the assumptions clearly, interpret the results openly, and think about the next steps.
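For completeness, here is a minimal sketch of the full model just described, plus the comparison of the group-size coefficient with and without the phylogenetic kernel. It assumes the dat_list, Dmat, and m_ols objects from the earlier sketches; the model and object names are mine.

```r
## Full phylogenetic regression: brain size on group size, stratified by body mass,
## with the OU / L1 kernel carrying the covariance implied by the tree.
m_full <- ulam(
    alist(
        B ~ multi_normal( mu , K ),
        mu <- a + bG*G + bM*M,
        matrix[N_spp,N_spp]: K <- cov_GPL1( Dmat , etasq , rho , 0.01 ),
        a ~ normal( 0 , 1 ),
        c(bG,bM) ~ normal( 0 , 0.5 ),
        etasq ~ half_normal( 1 , 0.25 ),
        rho ~ half_normal( 3 , 0.25 )
    ) , data=dat_list , chains=4 , cores=4 )

# Posterior for the effect of group size, with and without the phylogenetic kernel
post_ols  <- extract.samples( m_ols )    # ordinary regression from the first sketch
post_full <- extract.samples( m_full )
dens( post_ols$bG , lwd=2 , xlab="effect of group size (bG)" )
dens( post_full$bG , lwd=2 , col=rangi2 , add=TRUE )
```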
Okay, phylogenetic regression is a huge field, and I think its best days are ahead of it, actually. There are lots of lingering problems with these methods. The first, of course, is that phylogenies are uncertain, as I said before, and we'd really like to get that uncertainty into the procedure. There are ways to do that: you don't have to assume a fixed phylogeny, but can instead have a big posterior distribution of phylogenies and work with it that way. Of course, the best thing would be to infer the phylogeny simultaneously, in the same model as you're doing the causal modeling. I don't think I've ever seen that done, but there's no reason in principle it could not be done. Second, it seems reasonable that for many evolutionary scenarios DAGs are a really bad way to represent them, even if you've got some crazy structured DAG like the one on the right here, and that is because the traits influence one another reciprocally over time. We can represent those kinds of reciprocal relationships in DAGs; it would just mean, for example in the DAG on the right, that you also have arrows going from the Bs to the Gs, and that's fine, and evolutionary processes work that way, but then the DAG becomes a huge mess. There are other options; we can model these things. This doesn't stop the effort, but it means we have to think more carefully about what we're doing, and it means the models we'll use will not be classical regressions in any sense where there's a clear outcome and a clear predictor structure. Just a couple of examples to indicate that this is a very progressive and promising area of research. On the left is an interesting new technique from Ringen, Martin, and Jaeggi, just from last year, where they take a drift-diffusion kind of model, with a couple of dynamic equations, ODEs that evolve in time, and use it to model the evolution of trait dynamics on a tree. What this lets them do is deal directly with reciprocal causation and estimate the functions by which the different traits influence one another over time, while still inputting the phylogenetic structure and making full use of it. And then on the right is an approach which sidesteps the phylogeny problem altogether and goes straight after the core issue. Remember that what we're trying to do is understand how traits are functionally related to one another, how they induce selection pressures on one another, and this is a kind of design perspective, the adaptationist perspective, which has been so productive and empirically successful in evolutionary biology. There's this exciting body of work from Mauricio González-Forero and Andy Gardner which leverages this using something called the optimal life history approach. What they do is solve for optimal life histories of organisms, meaning the optimal lifespan, the growth rates that are optimal for that lifespan, and the investments in different body tissues at different stages of life, all of which co-evolve. They solve for, if natural selection can get its way, what all these variables would look like together as a constellation, given different assumptions about the energy constraints and so on of the organism. So this is a direct and transparent optimization perspective that tries to answer questions about, for example, how brain size and group size should be related, and it doesn't need to use the phylogeny. But of course it can inform the phylogenetic analysis, because models like this one, which inherently posit reciprocal relationships among all the traits, can be embedded in the phylogeny analysis. Once you have a phylogeny and you have data on the organisms, you can fit these models to the data, and that is actually what González-Forero does. Okay, this has been a long lecture on Gaussian processes, but I've really only scratched the surface, so let me try to summarize a bit.
Gaussian processes are, in the simplest sense, and in the sense that is most useful to many scientists, a way to do partial pooling for continuous categories, things like time and distance and age and so on. They are very general approximation engines, and since we need good, regularized approximation in science, they are a very attractive family of methods for doing it. However, there's no escaping causation, as always, and your causal theory should inform your choice of covariance kernel. There are different choices, and you can build your own as well if your theory demands it, but you should be able to justify the covariance kernel on some grounds, whether causal or merely pragmatic, and be clear about the justification. These sorts of models are often sensitive to the priors that give the shape of the kernel, so you should choose wisely. What does that mean? You should make some synthetic data, you should do prior predictive simulations, and you should have some way to understand the relationship between the prior distributions and how wiggly the functions can be at different sample sizes. Okay, there are other methods with Gaussian processes I just want to mention at the very end here, so you know that the field is bigger than the simple examples I've given. There's a very widely used method of Gaussian process regression called automatic relevance determination, or ARD. This is a form of Gaussian process regression where there are multiple distance dimensions inside the kernel. A plainer way to say this is that you take everything you'd ordinarily put in the linear model of a regression and you put it inside the covariance matrix, and then what the model does is fit weights of importance for each of the predictor variables, each of the distance dimensions, for shaping the covariances among the observations. This is a very successful and powerful way to do regularization when you have many possible predictors, and it's very widely used, at least in machine learning. There are also multi-output Gaussian processes. Gaussian processes, just like other kinds of varying effects, are not restricted to drawing single values for each of the cases; you can draw whole vectors from the kernel as well. If you're interested in that, you can just google multi-output Gaussian processes; there's an example of working code in the Stan user manual. And now, just to mention it, one of the most classical uses of Gaussian processes is to do telemetry and navigation. That is, you're getting measurements of position or speed or temperature through time from some set of instruments, and those measurements come with error. Gaussian processes were developed in part to do partial pooling on those estimates in real time, as the system is evolving, to very quickly deal with the measurement error and do effective system control. The most famous example of this is the so-called Kalman filter, which is named after a Hungarian mathematician, but now it's implemented in all kinds of machinery: radar and GPS devices and all kinds of things use algorithms which are essentially Gaussian processes that run in real time. This issue of having to deal with error correction and measurement error is an important one that we're going to take up next week. Okay, so that has been week eight of Statistical Rethinking 2022. As I said, next week we're going to start up with measurement error, and we're also going to address missing data as a special and particularly disruptive case of measurement error. I'll see you there.