Welcome back everybody. I'm still Richard, this is still what we're thinking. We're in the second week. This is the last week before we go on break for the winter holidays, and we'll resume in January. The schedule is up online. I want to get a lot done this week to build your anticipation for all the wonderful things to come in the new year. What we're going to do this week is the foundations of what are usually called regression models. I want to enter into them in a slightly unconventional way. I think one of the exciting and frustrating things about being a scientist is that we're always dealing with explanations that are higher in dimension and more complex than what we can measure. I would say that our senses are impoverished compared to the complexity of the phenomena we're trying to explain. A famous example of this in the history of science is, of course, the motion of the planets in the night sky. Many of you will recognize what this is. This is Mars doing its dance across the sky over many nights. There's a little loop. All of the planets out from us do this; they do these little dances in the night sky. This is called retrograde motion. It's a really cool phenomenon. All of you are prepared to explain this, I'm sure, as an illusion caused by our own joint motion. We're moving, and so is the other planet. Both of these objects are orbiting in ellipses around a large hot ball of gas called the sun, and that creates the illusion that Mars is doing a little loop-the-loop in the sky. It's not. It's moving in an ellipse, and so are we. But the relative velocities of these objects can create these illusions. Famously, there's an explanation of this motion that doesn't invoke any of that, but invokes something way more fantastic and interesting. It's called the geocentric model. It's usually attributed to an Egyptian astronomer, Claudius Ptolemy, who lived from 90 to 168. Lots of people developed these models; he just published a big compendium of them. These models are really accurate. They're really incredible scientific accomplishments, and they predict, to very small fractions of a degree of arc, where the planets are going to be on particular nights. They're full-blown mathematical models, and of course they're also wrong. You can find the planets in the sky with these models, which is what they were built to do, but what you cannot do with these models is get a probe to Mars. They do not predict where Mars is in the solar system. They predict where it appears to be in the sky. This is like a regression model, is what I'm going to argue. Regression models, like the geocentric model, are incredibly accurate for particular purposes. They're also deeply mechanistically wrong, and you have to keep that in mind when you use them. This won't be an argument not to use them. It will be an argument to keep in mind the small world, large world distinction. Let me build this up a little bit by flattering the geocentric model a little more. If you wanted to insult a scientist, you might say they have a geocentric view of the matter. That sounds like an insult, but actually it means they're very sophisticated: they have an incredibly elaborate approximation engine that can predict exactly where things appear to be. The reason the geocentric model works so well is that what Ptolemy and others had stumbled upon is a very general system of approximation, what we would now call a Fourier series.
What they used is a series of circles embedded in circles, what are called epicycles. You can see them on this slide. This is a very simplified, reduced version of the Ptolemaic model. Earth is kind of in the middle, but not quite; that's required as well, you can't be exactly in the middle. Then a planet like Mars is orbiting the Earth, and it's orbiting another circle: its orbit orbits. You can embed circles on circles on circles and have epicycles on epicycles. The more of these circles you use, the better approximation you can get to any cyclical function. This is called a Fourier series. It's actually used a lot in the applied sciences to create functional approximations to anything that has periodic behavior, like an orbit. You can get any quality of approximation you want; you just need more circles. This is why the geocentric model is so good. It's still good. Anybody here who's been to a planetarium: that's a geocentric model. You're still in the middle of the room in the planetarium, and the sky is moving. You need the Ptolemaic model to build a planetarium. This is how these things work. We're not here to build models of the solar system. That would be fun. We're going to build models of many diverse things. We don't use Fourier approximations typically, but we do use another kind of approximation, these things called linear regression models. This week I'm going to build up from the foundation as if you knew nothing about regression. I know all of you know a lot about regression actually, but I'm going to start over again and build it up in a different fashion. I want you to keep in mind that linear regression models are incredibly useful. They're capable of describing and predicting lots of things, but if you use them without wisdom, all they do is describe the data you feed into them, and then they'll make terrible predictions. You can't get your probe to Mars with a model that you don't understand causally. Some of the delivery there is going to have to wait until the new year. The material this week gives you the structure of a basic regression model; when you come back in the new year we'll do more complicated regression models and, in particular, how you interpret them causally, given that really they're description engines. All a statistical model can do is describe things, and then there's extra scientific information you need to make causal inferences from them. With that looming over us, let's think about just describing what a linear regression is. These are simple little statistical golems that model how the mean of some measure changes when you learn other things about the cases. You model the mean and variance of some normally distributed measure; I'll have examples of this in a moment. The mean is always modeled as some additive combination of weighted variables. Typically we assume there's a constant variance, but you don't have to; if you want to change that, I can show you how to do that too. I've put a picture of Gauss here. There used to be a much more interesting form of money in this country, I think, and you could use it to help you cheat on math tests, because the 10 Mark note actually had the Gaussian function printed on it, right there. I'm sure somebody probably has these at home. I have one at home. Was it 2001, 2002 when we switched? Something like that. This distribution is named after Gauss. He didn't name it after himself, and he wasn't the first to come to the distribution.
It's named after him because he did so many marvelous things with it. Lots of people had used it before. In particular, and relevant to linear regression, Gauss is one of the inventors of linear regression, and he had an entirely Bayesian argument for it, because this was before the anti-Bayesian movement of the late 19th and early 20th centuries. When Gauss was doing math, everybody was Bayesian. He has this Bayesian argument for what we'd now recognize as the normal or Gaussian error distribution plus ordinary least squares approximation. He was using it to do astronomy, because he realized he needed to make some money, so he predicted when a comet would return. He was in his 20s or something like that at the time. These sorts of distributions, the normal or Gaussian distribution, appear all over the place in nature, and it's a very interesting scientific and philosophical question why that is the case. Let me take maybe five minutes here to try to deliver some intuition about why normal distributions are so normal that they're all over the place, and how they arise through very basic physical processes. Then, once you understand the physical reason you see them everywhere, there are also epistemological, or information-theoretic, reasons that we may want to assume something is Gaussian. Let me try to get both of those arguments across for you. The first reason that normal distributions are so normal in statistics is that they're very easy to do calculations with. They have lots of really nice properties. They're additive; that's one of the things that's really, really nice about them. Every other distribution you can think of is in some way less convenient than a Gaussian distribution. It's just how it is. It's like the universe is being kind to us that normal distributions are so normal, because it's easy to do math with these things. The second reason is that they're very common. Let me try to give you an intuition for that, and then we'll get to the third explanation. Imagine a football pitch. If I were to take everybody in this class out to the field, we'd all line up on the midfield line, and I'd ask each of you to take a coin out of your pocket and flip it. The rule will be: if you catch it and it says heads, you take a step left; if it says tails, you take a step right. So everybody does this, and you take some steps. Yeah? And then we do it again. Same rule. You flip your coin again and you catch it; if it says heads, you take a step left; if it says tails, you take a step right. Again. And again. We can do this a few hundred times, and you will scatter across the field in various ways. The question is: we then measure your positions relative to the midfield line. Where are you on the field? We plot the frequency distribution of those positions. What will that distribution be? I guarantee you it will be almost exactly Gaussian. The reason is that this is how Gaussian distributions are born. They are born from adding together fluctuations. So let me just show you what it looks like in simulation. This is all at the start of chapter four of the book as well; this is figure 4.2 in the book. This is a simulation of our soccer field experiment. Everybody starts at position zero; that's the midfield line. Start flipping your coins. After four flips, four steps — I don't know how many simulated students are on this slide, a hundred or so. Apologies. A large number.
And the black line just traces one particular student. So everybody's oscillating. Sometimes you take two steps left and one step right, and so on. But a pattern forms in the aggregate. What happens, even after four steps, you see at the bottom: you're starting to get this peak with little tails. It isn't really very Gaussian yet — it's only been four coin flips — but you're getting some spread. Then after eight, now it's pretty Gaussian. And after 16, only 16 coin flips, this is a very Gaussian, very classical kind of bell curve indeed. And it will stay that way. It'll get wider and wider over time. Eventually you leave the soccer field and wander into the Aldi, or whatever the stereotype is. But you never leave the bell curve once you're in it. It just gets wider and wider over time. Why does this happen? There are lots of mathematical theorems that describe this. But the intuition I want you to get is that what's going on in these processes is that there are lots of little fluctuations. Each coin flip is a fluctuation in your movement, and the fluctuation can be up and it can be down. In the long run, fluctuations tend to cancel, and that's why the mean is zero. If you get a string of lefts, then eventually you're going to get enough rights that you tend to course correct, so that the average particle — that is, student, in this example — ends up near the middle. Maybe not exactly at the middle, but near the middle. I know this is confusing. Another way to think about this is that there are lots of different sequences of coin tosses. After 16 flips, the number of possible sequences of heads and tails is very, very large; the combinatorics make thousands and thousands of possible coin sequences. But a very, very large number of them exactly cancel one another, and in fact no other sum has more paths leading to it than zero does. There are more paths that will give you zero than any other kind of path. I know this is weird, but it's just the combinatorics. Then there are a huge number of paths which give you plus one or minus one, and a slightly less huge number that give you plus two or minus two, and so on, falling off in the classic bell curve. The Gaussian arises as a consequence of this canceling of fluctuations. Any time you have a process that adds little fluctuations together — almost regardless of the underlying distribution — you get a bell curve. That's all it takes. It's just the canceling of fluctuations in the long run. And that's why such a huge diversity of natural systems produce bell curves: because that's all it takes. Genetic systems: you've got a bunch of alleles, they modify your growth, and height is approximately normally distributed. It has long tails, but it's pretty normal in the middle. We don't need to know anything about the architecture of growth or genetics or anything; we just need to know that those are little fluctuations, and they tend to dampen one another, and so you get a bell curve from it. Does this make some sense? Yeah. It kind of sinks in after a while.
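Here's a minimal sketch in R of the soccer-field walk just described, if you want to try it yourself. The number of simulated students and the number of flips are my choices; the book's figure 4.2 code does something very similar.

```r
# A minimal sketch of the coin-flip walk: each student sums 16 steps of +/- 1.
n_students <- 1000
n_steps    <- 16

pos <- replicate(n_students, sum(sample(c(-1, 1), n_steps, replace = TRUE)))

plot(density(pos))   # already very close to a bell curve after only 16 flips
```

Even with only 16 flips the frequency distribution of positions is already bell-shaped, and adding more flips just widens it.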
In the new year we'll loop back to this in a later chapter and talk about how all the other common distributions in statistics also arise through processes like this. They're not exactly the same as the Gaussian process — that's why they look different — but they are processes like this, through which some information is preserved in the system through the aggregation and other information is lost. And what you're left with are particular shapes, which are called maximum entropy distributions. The Gaussian is the most famous one. So for the kinds of processes in nature that are going to produce a normal distribution, addition is our friend. And what is addition? Addition is the mathematical operation where, when you make a composition out of things, order doesn't matter. That's how you define it mathematically. So any process in nature that makes compositions where the order the things enter doesn't matter is going to tend to produce bell curves when you aggregate. Which I think is super cool, because I'm a nerd. But this is a really cool fact. One of the things about it that I think is nice, too, is that lots of things are approximately additive, and so you get really good Gaussian approximations for a wide range of phenomena. For example, products of small deviations are also approximately additive. There's a box in the book where I show you how to prove this in code, if this isn't intuitive. So there are lots of multiplicative interactions which will also produce bell curves, which is also a neat thing. And actually, for the genetics example I gave, given what we know about gene regulation, it's probably products of small deviations that are going on there. It's not that the effects are additive, it's that they're multiplicative, because that's how growth works. And logarithms of products are, as many of you know, sums. Measurement scales are a human invention; nature doesn't actually care about them. They don't exist. Measurement is an epistemological thing. So you can measure things along a logarithmic scale as well, and lots of measurements won't look Gaussian on one scale but will if you take their log. What are logarithms? They're magnitudes, and magnitudes are just as good as any other measurement scale. Lots of things are more natural to measure on a logarithmic scale, like the strength of an earthquake. So what I've just tried to deliver to you is a little bit of intuition about what I call the ontological perspective on normal distributions. There are lots of processes that produce them, and these are the processes which add fluctuations together. When you add fluctuations, the fluctuations tend to dampen one another, and you end up with a symmetric curve, even if the underlying fluctuations are not symmetric, which I think is super cool. There's code in the book you can simulate this with. What's neat about this, and frustrating at the same time, is that you end up with very little information about the underlying generative process. When you have a bell curve, nobody intuits that it means anything particular about the process that generated it, because you know bell curves are common. When you see that heights are normally distributed, that doesn't prove or falsify any particular theory about the genetic architecture of height. You learn basically zero about it. This is both cool and terrible. It's cool because there's a process by which all that information about the underlying process has been shed, and all that's preserved is the mean and the variance. That's all you've got left, and that's why you only need two numbers to describe a normal distribution, the mean and the variance. That's all the information that has been extracted from the underlying thing. That's all addition does: it keeps those two moments and nothing else. And I think that's cool.
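Going back to the claim a moment ago that products of small deviations are approximately additive, here's a minimal sketch you can run to check it. The number of effects and their sizes are arbitrary choices; the book's box makes a similar comparison.

```r
# Products of small multiplicative effects are approximately Gaussian;
# products of large effects are Gaussian only on the log scale.
n <- 1e4

small_growth <- replicate(n, prod(1 + runif(12, 0, 0.1)))   # 12 small effects
plot(density(small_growth))        # roughly a bell curve

big_growth <- replicate(n, prod(1 + runif(12, 0, 0.5)))     # 12 larger effects
plot(density(log(big_growth)))     # bell curve appears on the log scale
```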
What's terrible about it is that you can't go from the frequency distribution back to the process. You have to do some science, make some harder measurements, figure out more, dig into the details of it. As a footnote here, this is true of lots of distributions. All the maximum entropy distributions have the same property, that lots of different processes will produce the same frequency distribution. Remember the example I gave last week with the neutral theory, and testing whether there's selection or not: power laws arise through lots of processes, and the fact that something is a power law tells you almost nothing about the underlying process, except that it has high variance. That's really all you learn. The power law literature is a bit of a mess, but saying you tested some theory because you found a power law is like saying you tested some theory because something is bell-shaped. It's just not powerful. The other perspective on this — and to some extent the detail of it is going to have to wait until later in the course, when we do all the other distributions and there'll be some consilience to this, and I'll give you a proof of what I'm going to say here — is epistemological. If you're building a model and you want to be as conservative as possible, and all you're willing to say about some set of measurements, like height measurements, is that they have finite variance, then you should use the Gaussian distribution in the model. Even if it turns out that they're skewed or something else, the Gaussian distribution will cover a wider range of values than anything else with the same mean and variance. It's the most conservative distribution you can assume that has that variance. Any other distribution will be narrower; it will have more information built into it. So this is the most conservative assumption that you could possibly make, and so it's a very good assumption to use when you don't have additional scientific information about some measure. It's the right thing to do, conservatively. Does that make sense? I say that expecting the answer to be no. As I said, I'll give you a proof of this later on in the course. All the other maximum entropy distributions also have this kind of property: they're the flattest distributions possible given a set of information constraints. Those are things you know about the data before you see the actual values, like that they're all positive. From that you can derive the most conservative distribution that would fit those measures. Those are the distributions we use in statistics; they're all maximum entropy distributions. The Gaussian is just the one for when all you're willing to say is that there's a mean and a variance and nothing else; then the Gaussian is the best assumption in terms of being conservative, right? It assumes the least. It assumes nothing but that there's a mean and a variance. Okay, so let's build some actual models. I've got another half hour here. I think if you're like me, your early statistics training involved learning lots of little individual procedures. The chi-square test, I think, was the first statistical procedure I did as an undergraduate. We did some dissections, spilled blood on the table, and then we had to do a chi-square test about it. That's sort of how it worked, right? And that's normal. That's how it goes. What I want to do is convince you that basically all of those things are linear models.
And what you should do instead, to preserve your own sanity and make yourself more productive in modeling, is just learn the linear modeling strategy and not worry about all those little procedures. Everything like ANCOVA, ANOVA, MANOVA and all those things are just linear models. T-tests are linear models. So we're going to build up linear models from the ground up here, and I'm not going to loop back and show you how to build a MANCOVA or whatever. You don't even need to worry about that. That's some procedure that people used to build into software in a particular way, but you can build the model you need for your thing without worrying about selecting from a menu of different options. To ease into this, think back to last week. Here's the only statistical model you've seen so far in this course, the globe tossing model. We're going to write out almost all the statistical models in this course in the same standard mathematical notation. It's just an abbreviated way of communicating to your colleagues what the model assumes, and we're also going to write these in our code so that it gets reinforced. To remind you, in this case we've got three variables in this model: a count of water observations from tossing the globe a number of times, that's W; the number of times you tossed the globe, N; and some parameter to estimate, p, the proportion of water on the globe. The little tilde — we see W tilde Binomial — you can read as "is distributed", so W is distributed binomially. That's the data distribution, or the likelihood, with arguments N and p: that is, N trials, each one having chance p of success. And then p is distributed — we have a prior distribution — uniformly between 0 and 1, which says we put equal weight on all the possible values. Remember, if you've done your homework already, you've seen the consequence of changing that; you can do better than that. Is this okay? You with me? Remember this? Yeah. So the same thing is true for more elaborate models. In general, in a linear regression model, or any model, what you do is you've got a list of variables that are going to participate in the model. Some of these are things you will observe, like the count of water from the tosses, and some of them are things you can't observe, like regression slopes. They're fundamentally not observable entities; they're rates, things you have to calculate, like the proportion of coverage on the globe. So we have to list these variables and then define them. In a linear regression model, just like every other model, there will be more of these symbols, because you could have lots of variables participating in the explanation, but it's the same business: you just have to define each of them. The motor of these linear regression models is the second line in these sorts of things. You don't have to understand this one yet — we're going to build up a simpler one, I just wanted to show you a full-blown example — where there's some mean of the normal distribution at the top, nearly always called mu, and it's got some equation that defines it in terms of some other variable that you've observed, here X. X is some explanatory thing that will make you famous, because you will show that X predicts Y and it'll be on the cover of Science. Sorry, I shouldn't be so snarky in the morning. But I want you to notice that at the bottom I've also said that X has a distribution, because it does. All these variables have distributions.
Usually we don't worry about assigning a distributional definition to the observed variables that we aren't trying to predict, but they do have distributional assumptions, and we're going to do some really cool stuff with this fact later on in the course, like measurement error and missing data. The fact that all the variables have distributional definitions means you know things about them, and if you can put that in the model, you can get some automatic inferential power that you were missing before. Let's do the world's simplest linear regression, and since I used height as an example of a nice Gaussian variable before, let's use height as the example. These height data come from Nancy Howell's now-classic evolutionary demography book on the !Kung San, and they're built into the rethinking package. This is a minimal data set; I've taken away lots of the fun variables — I think there's a fuller version that's got everything else, if you're interested — and it's just height, weight, age, and then an indicator of whether the individual is male or not. This is 544 individuals. We're going to focus just on height for the moment, and just the adult heights, and think about how we would build a model to describe the Gaussian distribution of these heights. So here's the distribution of the height data on the right, measured in centimeters, for the sample. What we're going to do is say some variable h sub i, the height of individual i — this is our outcome variable — is distributed normally with some mean and some standard deviation. There's nothing special, I'm sure you all appreciate this, about the particular letters I'm using; mu and sigma are just conventional for means and standard deviations. Use whatever you want; it really doesn't matter. The code absolutely does not care. You're not sinning if you use something else. You'll make a statistician upset, and that would be a bonus, right? Feel free to troll the statisticians. But it's important to have the skill that you can see this thing and read it to yourself and say out loud what it means. This is just a language. This is not code — it'll be code in a minute — it's just a form of communication, a simplified notation for scientific communication. Now we've got three variables here; one is observed and two are not. Mu and sigma have not been observed, and we have to infer them from h, the thing we have measured. But mu and sigma still need definitions, because this is a Bayesian model, so they need prior distributions. We're going to assert two here and then look at what the implications of these priors are. For mu we give it a normal distribution with a mean of 178 centimeters and a standard deviation of 20. What does this mean? Mu is the central, the average height in this population, this Kalahari population. What is 178 centimeters? It's my height. I'm 178 centimeters tall — 178.5, and that .5 is precious — so I'm using myself as a prior. Now, this is probably not the best prior, because I'm a little taller than a Kalahari forager, but that's why it's there. And that standard deviation of 20 is on the mean; it's not on the population, it's on the mean, and that's very generous. We're going to simulate from these priors in a second. And then for sigma, uniform between 0 and 50.
This prior says way less than what we probably know. Sigma is the scatter of heights in the population, and this prior says we know nothing except that it's less than 50; we'd certainly know more than that. Let's simulate from these priors and see what happens. What we're going to do is what's called a prior predictive simulation, and we're going to do a lot of this in the course. At the very end of last week we did posterior predictive distributions: when you've got a posterior distribution, you can push it out through the data model and make simulated observations. You can do that with the prior as well, and this is the best way to figure out what a prior means before your model has seen the data. What does it believe? It's not what you believe; it's what the model believes. You're coaching this thing. We're designing the model based upon the scientific information we have about the phenomenon, to try and build good priors, before we use the data to do any educating of the model. Here's the code to do this. This is a very simple model, so it's very easy to do the prior predictive simulation; all you have to do is sample values for the variables. We sample some mu values with rnorm — random normal — with mean 178 and standard deviation 20; that's the prior for mu. Then we sample sigma with runif — random uniform — again 10,000 values, between 0 and 50. Then we can simulate some prior heights: again, height is distributed normally — this is all the same information that's in the mathematical version of the model — 10,000 of them, and you just put those vectors in the slots for mu and sigma. Since R is magical, like lots of these scripting languages, when you give it vectors it knows you want that many results in return, and it gives you that. Then we just plot the density, and this is what you get. This is what the model believes. Of course it believes that everything's centered on 178. What you're seeing here is not a normal distribution; you probably recognize it as something else, maybe a t-distribution. It's really about the uncertainty in the standard deviation, so you get these fat tails. What's nice about this: it's maybe not the best prior in the world — there's more biological information you could put into it — but at least we're in the realm of possibilities. What I want to show you, though, is that there are some really, really tall individuals in this prior. Maybe you don't have a sense of centimeters, but some of these individuals are absurdly tall. At least we don't have any negative heights; there are no impossible people, but some really tall people. Let me give you a sense of this: two hundred centimeters is about six and a half feet, for the North Americans in the room. And let me use a different prior to show you why prior predictive simulations are so useful to look at. Imagine that I used a much more conventional kind of prior, a flatter prior. Typical linear regression priors are incredibly flat; people put standard deviations on all the priors in the thousands. This is incredibly common in Bayesian statistics, and those priors, I'm going to argue repeatedly over this course, are bad news, because they create impossible outcomes before your model sees the data. So let me show you an example. Let's take the prior for mu and make the standard deviation a hundred. Now this is a really, really flat Gaussian distribution, centered on my height, but with a standard deviation of a hundred.
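Here's a minimal sketch of the prior predictive simulation just described, with the flatter Normal(178, 100) prior for mu added for comparison. The variable names are my own.

```r
# Prior predictive simulation for the height model.
n <- 1e4
sample_mu    <- rnorm(n, 178, 20)    # prior for mu
sample_sigma <- runif(n, 0, 50)      # prior for sigma
prior_h <- rnorm(n, sample_mu, sample_sigma)
plot(density(prior_h))               # fat tails, but no negative heights

# The flatter prior for mu puts mass on impossible people.
sample_mu_flat <- rnorm(n, 178, 100)
prior_h_flat <- rnorm(n, sample_mu_flat, sample_sigma)
mean(prior_h_flat < 0)               # proportion shorter than a fertilized egg
mean(prior_h_flat > 272)             # proportion taller than the tallest person ever
```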
If you simulate from this, using the code on the previous slide, this is the prior predictive distribution of heights you get. It's still centered on 178, but now there's mass below zero. Zero is the height of a fertilized egg, so we know that everybody in this sample is taller than that. This isn't cheating; I don't have to have seen the data to know this. It's just that I know what height means. Whenever you're doing applied scientific statistics, you can use information like that — you want to use information like that. And then, I had to look this up, but the world's tallest person ever recorded was 272 centimeters tall at time of death. He was still growing at the time of death; he had a very unusual growth condition, he just kept growing. It's a very sad and interesting story; I think there was a movie about it. So that's our natural limit, let's say. We probably don't want a prior that assigns a lot of probability mass above that either. Now in this case, like in most of the examples in the first half of the course, the models are so simple and the relationships between the parameters and the outcome are so simple that you can use really bad priors and get away with it. We're still going to worry about it, and we're still going to do prior predictive simulations, just so we can practice when it's safe. Because once you start working with mixed models, multilevel models, the priors can have a much bigger effect, and then you want to worry about it, and then you use your scientific information to build good ones — and you'll already be pros at prior predictive simulation, so you'll feel good about yourself at that point. Okay, we've got to compute the posterior distribution for this model. This is the second and last time we're going to use grid approximation. This is a two-dimensional problem, so we get a two-dimensional grid, and we're going to calculate over this two-dimensional grid, mu versus sigma. At each combination of mu and sigma on some finite grid that we choose — we choose how fine it is — we calculate the posterior probability. How do you do that? Well, there's a rule: you multiply the probability of the observed heights, conditional on the mu and sigma at that point, times the prior probability of that mu and the prior probability of that sigma.
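Here's a minimal sketch of that grid computation. The book's code is structured similarly; the grid ranges and variable names here are my own choices.

```r
# Two-dimensional grid approximation of the posterior for mu and sigma.
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]

grid <- expand.grid(mu    = seq(150, 160, length.out = 200),
                    sigma = seq(4, 9, length.out = 200))

# log likelihood of all the heights at each (mu, sigma), plus the log priors
grid$LL <- sapply(1:nrow(grid), function(i)
  sum(dnorm(d2$height, grid$mu[i], grid$sigma[i], log = TRUE)))
log_post <- grid$LL +
  dnorm(grid$mu, 178, 20, log = TRUE) +
  dunif(grid$sigma, 0, 50, log = TRUE)
grid$prob <- exp(log_post - max(log_post))   # rescale to avoid underflow

# draw samples from the grid in proportion to posterior probability
rows <- sample(1:nrow(grid), size = 1e4, replace = TRUE, prob = grid$prob)
samples <- grid[rows, c("mu", "sigma")]
```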
The code to do this is just some loops, and it's in the book. I encourage you to run it at home; make a cup of coffee and take a look at it. I don't emphasize this algorithm, because we're never going to do it again. It's already quite onerous with two parameters, and once you go to three or four you don't want to do it this way — there's an example with 206 parameters at the end of chapter 6. You don't have time. Computers are amazing, but combinatorics will get you in the end. Still, I think this is nice for understanding what's going on. To give you an idea, there's a grid here, and you can do this at different grid densities. In the upper left, with just a 50 by 50 grid, you get this very pixelated picture. This is like the crime-scene-investigation trope on TV: it starts out fuzzy like that, then we zoom in, you get extra resolution, and you can read the license plate. At 100 by 100 you start to see the Gaussian hill, and then at 200 by 200 the posterior distribution in this case has a dark area in the center where the most plausible values of mu and sigma lie. Gradually, as you move away from that point, they become increasingly implausible. What do I mean by plausible? The number of ways that you could see these data, assuming that that mu and that sigma are the true values, gets smaller and smaller as you move away from that dark area in the middle. Does that make sense? What we do here, of course, is draw samples, so that we can make summaries, because it's easier to think with samples. Again, in the book there's a line of code to do this. Drawing samples is easy, and once you've got the samples, it's just a data frame; you can make summaries of this distribution. Now we've got fuzzy samples from our hill there in the middle. If you look at cross-sections of this — imagine that two-dimensional plot is a hill and we're looking down on top of it — we can also look at it from the side. If you look at it from one side, you see the contour for mu; that's the shape of the mountain if you're looking from the south. That's the uncertainty around mu, and it gets less plausible as you move away. And the same for sigma: if you look at this mountain from the west, you see another contour. I don't know if you can see it very well here — I'll highlight it a couple of slides later — but this one is not perfectly symmetrical. The posterior distribution for sigma is not a Gaussian distribution. Can you see it? If you don't look at distributions as often as I do: it's skewed a little bit, the right tail is longer than the left tail. This is nearly always true of standard deviation parameters, and this is going to be a fun thing for us later on, I promise you. I have a weird definition of fun. But why should this be the case?
It should be the case because you know something about the standard deviation even before you've seen the data: you know it's positive. That's not true of the mean. Scientifically you know the mean is positive for height, but the model doesn't know that. You did tell it the standard deviation is positive, though, because a negative standard deviation is impossible just by definition. As a consequence, you nearly always have more uncertainty on the upper end. It's easier to figure out how small the thing could be than how big it could be; there will be many more large values of the standard deviation consistent with the same data than there are small values. So you get this skew in variance parameters, or what are often called scale parameters in statistics. That said, it's pretty close to Gaussian, right? It's not bad. And the more data you get, the more Gaussian it gets. This brings us to what we're really going to do for most of the first half of the course, which is use the quadratic approximation. Grid approximation is really invaluable for teaching, I think, and you really should sit down with your cup of coffee and run through the two-dimensional grid approximation and make sure you understand what's going on, because that's all Bayes is doing. But now we're going to do a fancy approximation of it so we can go to higher-dimensional models, and that approximation will be to assert that the posterior distribution is a normal distribution in every parameter. This is often an incredibly good approximation — an embarrassingly good approximation, actually. That said, later on, once you get into generalized linear models, it's often a really bad approximation. But for a plain linear regression, assuming the posterior distribution is Gaussian in every dimension is often fine, even for sigma, strangely. How does this work? Remember I said you need two numbers to describe any Gaussian distribution: its mean and its standard deviation. The posterior distribution is more than one-dimensional, so you need more than one mean. You need a mean for every parameter, you need a standard deviation for every parameter, and then you need the correlations among the parameters. Multidimensional Gaussians have a covariance matrix that you also have to estimate. So you need all these numbers, but this is way fewer numbers than calculating a grid approximation. How do you actually do this quadratic approximation? You just climb the hill. Remember the hill back here: your computer can start at any particular point, and it doesn't know where the peak is, but it knows which direction is uphill, so it can just climb uphill — gradient ascent. Sorry, usually it's descent that the computer is doing, but here we're climbing, so it's ascent. A gradient is just a slope, right? I know there are hikers out here; you know about ridge lines and things that save your feet. The computer knows these things too, so it can just climb up the hill to find the peak. R has really good algorithms for this built in. You can give it a very high-dimensional space, more than two dimensions, and it'll follow the n-dimensional ridge line, as it were, and climb up to the peak. And when it gets to the peak, it just needs to measure the curvature of the peak to know how wide the hill is. That's all it needs to do algorithmically. That's Laplace here on the slide, because this approximation is sometimes called a Laplace approximation; he used it.
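To make the algorithm concrete, here's a minimal sketch of the quadratic (Laplace) approximation done by hand with base R's optim, for the height model from before. The function and variable names are mine; this is roughly what the tool I'm about to show you automates.

```r
# Quadratic approximation by hand: find the posterior mode and its curvature.
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]

# negative log posterior: -(log likelihood + log priors)
neg_log_post <- function(par) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)          # sigma must be positive
  ll <- sum(dnorm(d2$height, mu, sigma, log = TRUE))
  lp <- dnorm(mu, 178, 20, log = TRUE) + dunif(sigma, 0, 50, log = TRUE)
  -(ll + lp)
}

# climb the hill (optim minimizes, which is why we negated) and measure curvature
fit <- optim(c(170, 10), neg_log_post, hessian = TRUE)
fit$par              # the peak: approximate posterior means of mu and sigma
solve(fit$hessian)   # inverse curvature: approximate covariance matrix
```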
This is the approximation of assuming the posterior distribution is Gaussian, which will be a very, very good approximation under a wide range of circumstances, including when we use linear regression models. Note here: if you've done much maximum likelihood estimation, you've done this. The vast majority of maximum likelihood estimates are constructed the same way. They have no priors, but the algorithm works basically identically to the tool I'm going to show you. You could use the tool I'm going to show you to do maximum likelihood; you've just got to leave the priors out. Okay, what is the tool? In the rethinking package I have a tool called quap, for quadratic approximation. The way quap works is you make these little formula lists that imitate the mathematical definitions of the models, and you must write every piece of it. Because I'm annoying, right? No — I may be annoying, but this is for your own good. You need to own every piece of this thing and learn it. Typically when you do statistics on a computer, there's some abbreviated model definition format that only involves the measured variables, and so you can do statistics for years and years and never write down the relationship between the parameters and the data. There are some nodding heads out there. This is a safe space; you can admit it. It's all good. So I wanted to make a teaching tool that had none of those luxuries, and so I did, and that's how quap was born. You're going to have to write every detail of how every parameter multiplies every variable, but then you're going to feel great, because you're going to be super confident about what the model structure is. In this case, for the simplest linear model, we take the first line, h sub i is normal mu sigma, and write it as height — that's the name of the variable in the data frame — tilde dnorm; dnorm is the function in R for the normal density; and then mu and sigma. There's nothing special about those names, mu and sigma — use anything you like, but for communication purposes I'll keep using mu and sigma. Then we have to define those, give them priors: dnorm with 178 and 20 for mu, and dunif from 0 to 50 for sigma. That's it; now the whole model ought to make some sense. These will get more elaborate as we have more variables, but this is the idea. Then you just pass it to quap and give it the data frame. And what does quap do? It translates this into a statement about the log probability of the data at any particular combination of the parameters, and then passes that to the hill-climbing algorithm that's already built into R — optim does all the hard work — and passes the answer back. What is the answer? It's a list of the means and a covariance matrix, and those two things are sufficient to define a Gaussian posterior distribution. Then you can get a summary of it using this precis tool that's also in the rethinking package. I wrote precis because of the summary command in R — you guys know this function, summary; it's like nobody with any aesthetic sense designed its output. You get variables returned with like eight numbers after the decimal point, and there are p-values everywhere. It's abhorrent. So precis is minimalistic and gives you a basic summary of each variable's Gaussian distribution, that is, its mean and standard deviation, and then these 89% compatibility intervals, defined on percentiles, so there's equal weight in each tail.
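Here's a minimal sketch of the quap call and precis summary just described, assuming the adult !Kung data from before. The model name is my own.

```r
# Fit the simple Gaussian height model with quap and summarize it with precis.
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]

m_height <- quap(
  alist(
    height ~ dnorm(mu, sigma),   # h_i ~ Normal(mu, sigma)
    mu ~ dnorm(178, 20),         # prior for the mean
    sigma ~ dunif(0, 50)         # prior for the standard deviation
  ),
  data = d2
)

precis(m_height)   # mean, sd, and 89% compatibility interval for each parameter
```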
Let's look at them graphically, so you can get a better sense of what's going on. We can also extract samples — remember, we want to work with samples, it's easier to think — and this will be a case where I can compare the Gaussian approximation to the exact grid approximation that we did, and see how good it is. extract.samples is a function: you give it a quap model that's been trained on some data, and it'll do the sampling using that Gaussian approximation. It returns a data frame with a column for each variable and as many rows as you like. What are these rows? Well, we'll talk about that in a few slides; just hang on a second. The values in the rows go together, but if we just break them apart now, each column by itself: on the left, plotting mu, the samples are in — yeah, blue, sorry, I was looking for my key, it's in the upper right — so the samples are in blue, and the dashed curve is the actual Gaussian approximation, to show you how it works. You'll see that sigma still has that skew we drew before; the Gaussian approximation is a little bit too symmetric. To understand what this means, we're going to have to draw some lines, and we'll do that in a second; that's the next part. You're like, where's the regression line, Richard? It's coming; we're going to draw some lines. The first thing I want to say, though, is that I think of this tool — actually the whole rethinking package — as a scaffold for you, so you can learn how to do modeling and actually understand what the models are. After this course, or even during it, it's perfectly fine to use packages which use abbreviated model inputs, but you want to make sure you understand what the model is. These tools like quap force you to do lots of really annoying chores over and over and over again, and you're going to hate it, if you're a normal person. But then you'll come to love it — Stockholm syndrome will set in — and you'll appreciate the security of knowing, yeah, I know exactly what that model is assuming, because I wrote it over and over again, because that asshole McElreath made me do it over and over and over again. It's for your own good. But it really is just a scaffold. Eventually you'll be confident enough with these models that it's perfectly fine to go use some abbreviated model input. It's really not okay when you're starting out, though — that's my opinion, at least. I also want you to think of the algorithm, the quadratic approximation, as a scaffold. It's a thing that helps you learn your way into modeling, but you want to graduate beyond it and stop using it at some point, because for generalized linear models — things with non-Gaussian outcomes — it's really hazardous. It can do lots of really silly things, and I'll show you some examples in the new year. Okay, let's add a predictor, because you want lines, right? When you hear regression, you think a line is coming. There have been no lines so far; we've just got a Gaussian distribution of height. Congratulations. Now let's add a line, and that means adding a predictor variable. What is the idea?
The idea is there may be other variables in this data set that, when we learn them, let us make better predictions about the outcome of interest, that is, the height. There's some statistical association between some variable — in this case it's going to be weight — and the variable height. And again, there's good biology that determines this. This is only the adult sample; I've cut off the kids because the relationship is really different for kids. This is the relationship between weight and height in this population, and I think your eyes aren't lying to you: there is a statistical association between these two things. The question is, how would you describe this statistically, such that you could compare different populations, or talk about the relationship across age, or any of the other things you would actually do in a scientific project? So what do we do? Well, we add another variable to the model, and now we have a linear regression in the conventional sense. This model here has all the standard features, and every linear regression is just this with some of the components replicated, so this is your ur-model, as it were. Again, the top line is the likelihood, or the data model: height is normally distributed. Now there's an i on mu — you see that? It used to just be an i on h; now there's an i on mu as well, and that's to say that mu depends upon the person i you're talking about. It's different for each person now. And mu i now has a definition: it's alpha plus beta times x i minus x bar. We're going to spend time on the next slide talking about this, so hang on a second. Then we have priors. Two of these priors you've seen before: our alpha thing is what mu was before — it's going to be our population mean, and its prior is my height again, 178 centimeters — and sigma is the same as before. But now we have this beta thing, which describes the relationship between x and h sub i. So let's take a closer look at the prediction part of this, the real motor. Do you see that mu i has got this expanded definition now? We've taken mu, which used to be a parameter — it's still a variable, but now it's defined deterministically, entirely in terms of other variables that we've made up. You're like, how could this be legal? Well, it's science; everything's legal until you're caught. So it's still a variable, but now the equals sign means it's deterministically defined by the things on the right-hand side, and those things are two new parameters, alpha and beta. Alpha has the same meaning as mu before: it's the population mean. We're going to estimate it as before, and that's why it has the same prior. Does that make sense? And this beta thing is what you would call a slope. You're probably eager to say it's the rate of change in mu for a unit change in x — I know — but it's not a mechanistic relationship. As you can probably see, if beta were 0, then we're just back to the previous model; we're saying there's no relationship. If beta is a positive number, then every time x goes up by 1, mu goes up by beta. So why do we subtract? Why is there this weird thing where it's x i minus x bar? That's so alpha has the meaning of the population mean; otherwise it won't. When x i has the value x bar, what happens to this equation? All that's left is alpha, because then you get x bar minus x bar and that whole term goes away, and beta doesn't matter, and so alpha has the meaning of the population mean. This is called centering your predictor, and you should always do it.
Well, not always; there will be some cases where the meaning of the variable means you don't want to do it. But this should be your default behavior when doing regression: center the thing. It's a really nice thing to do. It makes everything easier to understand, and it makes the priors easier to define. There will be endless examples of this, and there will also be examples where I violate it, for hopefully good reasons, and I can justify it for you. Yes, Brett, you have a question? "Doesn't that imply that it should say 'when x i equals x bar' instead of 'x i equals 0' on the far left?" Oh, did I write that wrong? Yes, you're right, that text is wrong. That 0 should be an x bar. That's just wrong. Alright. Prior predictive distributions: let's do it again. We've got priors, we've got a new model with this new beta thing in there, and it's now predicting lines. So what does the prior distribution have inside of it now? And the answer is a whole lot of lines. Has anybody here ever seen the movie 2001: A Space Odyssey? Right, okay. Get to the end, there's the monolith: my God, it's full of stars. That's the prior: my God, it's full of lines. It's just an infinite number of lines. No? Okay, one person's smiling. In my childhood that was a very important movie; it was the beginning of getting into anthropology for me. It's a great movie. So anyway, moving past my failed joke, we're going to simulate lines from this thing and take a look at what the implications are, and we do it just like before. Let's simulate a hundred lines, not an infinite number. set.seed up there — there's nothing special about that number, it's just so you can repeat my example exactly, and you might want to change it. set.seed just sets the random number algorithm inside R so that you can replicate my examples, and I'm trying to put it in all the examples so you can get my exact pictures back. But there's nothing special about it; I haven't cultivated this number, I just typed something. Then we sample some alphas, you'll recognize that, and we sample some betas. I'm ignoring sigma because we're just going to draw the mean line; we're just getting the prior for mu. Mu doesn't depend upon sigma — the whole scatter of the h's depends upon sigma, but the line does not. Does that make sense? Okay, so what happens? Well, this is what you get with beta defined as normal with mean zero and standard deviation 10. Mean zero means your prior expectation is that there may be no relationship between weight and height. Now, obviously that's scientifically silly, but it's okay — I don't think that's actually so bad, you're letting the data speak — but getting the scatter right is important here, because you'll see in the prior all these lines that have come out of it, and there are lots of really impossibly strong relationships. Again, I've drawn my same reference lines as before. Zero centimeters is the fertilized-egg height — slightly bigger than zero, really, if you measure a fertilized egg it's slightly taller than zero, but close. And the world's tallest person again, 272 centimeters at time of death. Some of these slopes are taking you from impossibly short individuals to individuals approaching twice as tall as the world's tallest person. We want to damp down the enthusiasm of this prior a little bit.
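Here's a minimal sketch of the line simulation just described, using the flat Normal(0, 10) prior for beta. The seed is arbitrary, and the plotting choices are mine.

```r
# Prior predictive simulation of regression lines under the flat slope prior.
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]
xbar <- mean(d2$weight)

set.seed(11)             # any number works; this just makes the plot repeatable
N <- 100
a <- rnorm(N, 178, 20)   # sample intercepts from their prior
b <- rnorm(N, 0, 10)     # sample slopes from their prior

plot(NULL, xlim = range(d2$weight), ylim = c(-100, 400),
     xlab = "weight (kg)", ylab = "height (cm)")
abline(h = 0, lty = 2)     # height of a fertilized egg
abline(h = 272, lty = 1)   # tallest person ever recorded
for (i in 1:N)
  curve(a[i] + b[i] * (x - xbar),
        from = min(d2$weight), to = max(d2$weight),
        add = TRUE, col = col.alpha("black", 0.2))
```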
And again, we've got enough data here, and the model is simple enough, and the relationship between the parameters and the predictions is simple enough, that even with this terrible prior it will all just wash out. You should do this experiment at home. But we practice when it doesn't matter, so that once it does matter, we're good at it. These are safe examples, and we still practice on the safe examples where the priors don't matter that much, because there will come a time in your life when it will matter, and then you want to have some sense of how to deal with it. Okay, so let's do something else. Instead of having this Normal(0, 10) prior on beta, let's use our scientific knowledge. We know that beta is positive, so let's make it positive. How would you do that in a prior? One of the easiest ways is to use a log-normal distribution. What is a log-normal distribution? I know many of you know this because you work with them. A log-normal distribution is a distribution of stuff that's normal when you log it: when you take the logarithm of the numbers, they don't look normal before you log them, but once you log them they look normal. What's nice about this is it means the actual values are all positive; log-normal variables are strictly positive reals, and that's what we want for the slope. We want to assume that the relationship between weight and height is positive — of course it is — that's the thing to begin with, and then we want to get the scatter right as well. So here's my suggestion: beta should be Log-Normal(0, 1). The 0 and 1 are the parameters that describe the distribution of the logarithm. How is this parameterized? I know, I should know this, but I think that's right: that's the mean of the variable after you log it — that has to be true, because otherwise the 0 couldn't be legitimate. So the 0 and 1 are the mean and standard deviation of the normal on the log scale, and the actual variable is the exponentiation of that. What does this look like when we simulate some lines from it? Well, now we're in the possible outcome space. There's still a lot of scatter, and there's the occasional crazy line, because we haven't prevented that: it's unbounded on the top, and the slopes can still be really big if the data demand it. But they can never be negative, and between the weight down here of 30 up to 65, at least we're in the possible human range. So this is a way better sort of prior, and this is the kind of thing you want to practice doing: use your scientific knowledge to get yourself into a possible outcome space before you put the data in. Does that make sense? There will be lots of examples of this as we go on. Okay, I've got a couple of minutes, so I think we've got time to actually run this model and approximate the posterior. I apologize that this slide is just a lot of code, but we're going to focus on the bottom; I just put it all here so you could see how this example gets set up completely, from the top, where we load the data — these data are built into the rethinking package — then I slice out the adults, all the individuals who are 18 or above. Of course, in the Kalahari adulthood starts at 12, but this is our culturally imperialistic definition of adulthood. Then I compute x bar, because we're going to stick it into the model below. You don't have to — you could do this inside the model — but it's more efficient to do it outside. And then I define the quap model.
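Here's a minimal sketch of that full setup: load the data, keep the adults, compute the mean weight, and fit the model with the log-normal slope prior. The model name is my own.

```r
# Full linear regression of height on weight for the adult sample.
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]   # adults only
xbar <- mean(d2$weight)              # computed outside the model

m_line <- quap(
  alist(
    height ~ dnorm(mu, sigma),         # likelihood
    mu <- a + b * (weight - xbar),     # deterministic definition of mu_i
    a ~ dnorm(178, 20),                # prior for the intercept (population mean)
    b ~ dlnorm(0, 1),                  # log-normal prior keeps the slope positive
    sigma ~ dunif(0, 50)               # prior for the scatter
  ),
  data = d2
)

precis(m_line)
```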
I write the list directly into quap, and this is how we're usually going to work, but you could pull it out, define a separate variable that holds that list, and then pass it in if you like. It's up to you; I tend to do it this way. Now let's review the symmetry again between the mathematical definition on the right and the quap code. Let's just focus on the mu line, because that's the new bit. Notice that there's not an equals sign in the alist; there's that arrow that you may know from R. That's the way assignments used to be done, the way that old people like me still do them. I learned R in 1999 — I party like it's 1999, still. Second failed joke. But no, literally, I learned it in 1999, the first year I used R, so I have old-fashioned habits. There's the tidyverse stuff that people do now, which is all very cool, but some of us learned when all you had was flint and tinder, so we do everything in base R. Anyway, you do that assignment with the arrow instead of the equals sign, and then you just type the mathematical expression. And literally, you're writing the log posterior here. The code you enter into quap is going to be used directly in R to evaluate the probability of the data at particular combinations of the parameters, and then it will climb the hill. So you're programming the log posterior, literally, and there's the potential to do crazy stuff here — but you won't, because you're skilled; you will do good stuff. Okay, this is a perfect place to stop; I've run out of time. I'll advance the slide just so we know where we're going to pick up. When you come back next time, we're going to keep working with this model, and we're going to characterize its behavior and talk about how to describe the uncertainty around the regression line as well as its location. Okay, thank you, and I'll see you on Friday.