Welcome to the third lecture of Statistical Rethinking 2023. In the previous lecture last week, I introduced the basic logic of Bayesian inference and, more important than that, the workflow for moving from a clear scientific question to the development of a causal model, and from there to the development of a Bayesian estimator. We're going to stick with that framework for this lecture and all the future lectures, and hopefully that'll make the material easier to grasp. I'm going to start today's lecture with two stories from the history of astronomy. What you're looking at here is a time-lapse set of photographs of the night sky, and the red object that seems to be wandering across your screen is Mars. Mars is one of the planets, and planet in Latin means wanderer: it appears to wander across the sky against the fixed background of stars. All the planets do this zigzag motion that you're seeing here, which is incomprehensible given your knowledge of the solar system as a bunch of ellipses around the Sun, but this is what it looks like from our perspective on the Earth. All the other planets do this as well, Jupiter, Uranus, and so on. But Mars is the most shocking, because its zigzag is the most obvious. What explains this odd wandering motion? Not just that the planets move, but that they reverse course at some point in their motion. The puzzle is made a bit more acute with some modern knowledge. We know now, of course, that exactly the same phenomenon happens from the perspective of an observer on Mars. We know that because we have put observers on Mars, our robots. The Curiosity rover, represented on the right of this slide, sees the Earth as a zigzag wanderer from its perspective. So what is the scientific explanation for this perception that the planets reverse course in their orbits, from the perspective of an observer on any of the other planets? A very successful model for predicting this is also wrong, and it is the geocentric model. For more than a thousand years, this model was capable of predicting this path, the wandering of Mars and the other planets from the perspective of an observer on Earth. The way it made those successful predictions was by imagining that the solar system was structured something like this: with the Earth at the center, which is why it's called geocentric, and the other planets on orbits around the Earth, but on orbits on orbits. I'll say that again: they're on orbits around the Earth, but they're on orbits on orbits. The first dashed orbit around the Earth there is the orbit of Mars, but Mars is actually on a smaller sub-orbit called an epicycle. I'm going to set this little simulated solar system in motion here, and you'll see what happens to the path of Mars. As it goes around, it will zigzag. As it comes to the point where it's closest to the Earth, it will reverse course, and so, from the perspective of an observer on Earth, Mars will make the path that we actually see in the empirical data. With this device, the epicycle, Ptolemy and other ancient astronomers were able to make extremely accurate models of the solar system as we observe it from Earth. But of course the structure of this is completely bonkers. The solar system is not structured with epicycles. The orbits of the planets are elliptical, and they're around the Sun, not around the Earth.
This is a great example from the history of science of how statistical models can achieve arbitrarily accurate predictions but not explain the structure of the system they study. And this is a fundamental paradox and problem that we face when we do statistical modeling. The real explanation, just by way of mention, is that the orbits are embedded within one another. As you see on the right of this slide, Earth's orbit is closer to the Sun, which I have failed to draw, just for clarity, than Mars's is. This means there's a point in the orbits of these planets when they're close to one another and a point when they're further apart, and this creates the appearance of a zigzag. So I've drawn this white bar, which is just the perspective of the observer on Earth looking at Mars. If I start this system in motion, you'll see this projection line, the apparent position of Mars against the night sky, against the stars, as represented on the far left there. That far dashed line is not another orbit; it's our perception of the still night sky where the stars are. It's the curtain, the celestial background. As Earth and Mars continue orbiting, that position on the celestial background is going to zigzag. So I'll start the animation again and you can appreciate what happens: Earth gets close, and then we zigzag, and it goes around. I'll let it go around one more time so you can see the phenomenon again. We now know this is why it happens, we know why it happens for every planet, and we can predict these exact positions from the contemporary solar system model. But the Ptolemaic model, the geocentric model, works just fine for predicting the same phenomenon. What it doesn't do very well is predict other things about the structure of the solar system. So let's fast forward a bit to another moment in the history of astronomy: January 1st, 1801. That was when this fellow, Giuseppe Piazzi, an astronomer at Palermo in Sicily, observed a new comet for the first time. He used very modern equipment for the time, for example this thing that was famous then, the Palermo Circle, which allowed very accurate observation, recording, and tracking of celestial objects. So Giuseppe tracked this new object, this new comet, for a little while, and then he became ill and lost track of it; and when he recovered from his illness, he could not find it again. So a hunt was on to find this object. The hunt was on because just a couple of decades earlier, I think it was, Herschel had found Uranus, which was the first planet ever identified by the use of a telescope, and so everybody wanted to find more planets using telescopes. Giuseppe thought maybe he had found a new planet; maybe it was a comet, he wasn't sure. But now all of the astronomers in Europe wanted to find this object again, because then they could be famous. The object he actually saw was this. Of course, he didn't see it like this; his telescope was not this good. This is a modern NASA image. What he saw looked more like this: a bunch of blurs, because of course the objects are moving and the machinery is not perfectly accurate, with a bright spot reflecting the Sun's light back into his telescope. This is Ceres, a dwarf planet that orbits between Mars and Jupiter. It's a big asteroid, basically, but it qualifies as a dwarf planet; it's large enough.
It's about a quarter of the diameter of the Earth's Moon, I think, or about the size of the state of Texas, for Americans, by comparison. And Ceres has a weird orbit, because it's not in the same plane as the major planets. The first person to figure out this orbit, just a few years later, publishing the results in 1809, was this fellow, Carl Friedrich Gauss, who's famous for doing lots of stuff, probably the greatest mathematician to have ever lived. He was in his 20s at the time and had just finished, basically, defining the field of number theory, but he was looking for something else to do and wanted to dabble in astronomy. So he thought he could find this comet again: he would figure out its orbit using mathematics, he would predict where it would be able to be seen again, and then he would send his notes to Giuseppe, and Giuseppe would check. And this is what he did. He developed a bunch of mathematics, ways to use the few observations of Ceres's position available at the time to figure out its orbit, and then that mathematical model of the orbit could be used to predict where it would appear in the sky next. He published this in 1809, and what's interesting from our perspective is that his argument for how to do the data analysis, how to use the few observations that were available, was essentially Bayesian, and in order to do it he invented what we would now call a linear regression, of a sort. He used a normal error model and least squares estimation, which is one of the most common ways we fit models to data these days. But it was a Bayesian argument; the frequentist approach hadn't really been invented yet, that's a later British thing. He didn't call it Bayesian, because of course people weren't reading Thomas Bayes at the time. The distribution, the error model, that underlies least squares estimation is quite famous now. We often call it the Gaussian distribution, and in the old German money, before the Euro replaced it, the 10-mark note had Gauss's face on it, along with a little Gaussian distribution, which you see there in the inset. This made it handy for cheating on exams, if you happened to have a 10-mark note in your billfold. Why am I telling you these two astronomy stories? Well, I hope they're interesting in and of themselves, because the history of science is worth studying. But they set the stage for teaching you about linear regression, which is a big and extremely powerful family of statistical golems that we're going to develop as Bayesian estimators this week and in the next couple of weeks as well. Linear regression represents both of these stories in important ways. First, it's essentially geocentric, in a metaphorical sense. What I mean by that is that it describes associations among variables, and it makes excellent predictions, at least it can, but it's mechanistically nearly always wrong. The universe is not comprised of perfectly linear, additive relationships among variables; you'll understand why I think that as we move through some examples. Nevertheless, it's often an unreasonably good approximation of those associations. Moreover, the meaning of those associations doesn't depend upon the model itself; it depends upon some external causal model that we project onto it. So we have to keep this distinction clear. The problem with geocentric models is not the model itself. Geocentric models are extremely useful.
We still build planetariums with them, right? In a planetarium, you're projecting light onto a ceiling, and the geocentric model is extremely useful in that context. What's wrong with the geocentric model is believing it's a generative, structural model of the solar system. If you keep that distinction clear, there's nothing wrong with it. Likewise, there's nothing wrong with linear regression, unless you believe it's an accurate mechanistic model of the system you're studying. If you can keep the distinction between the causal model and the statistical model clear, these are extremely powerful golems, and we will make good use of them. The second way linear regression ties to these stories is that it's inherently Gaussian. We're going to build posterior distributions for our estimator, not use least squares estimation directly, but we are going to use the Gaussian error model. What you want to understand about this error model is that it really abstracts away from the details of any particular generative error process. When Gauss derived it, he defined very general conditions for how error creeps into observations, and from those he could derive the Gaussian distribution. That's to be distinguished from the idea that, for example, the hand of the astronomer trembles as they operate the instrument, there's rounding error, there's aberration of light through the atmosphere, and all those little errors add up; you can derive the Gaussian distribution that way as well, but Gauss used a much more general argument about the symmetry of errors, and we're going to rely upon those general arguments too. So this error model is not generative in the same detailed way we were thinking about generative models last week, but it's going to do very good work for us. There are lots of special cases of linear regression: procedures like ANOVA, ANCOVA, and the t-test are built upon the same basic structural statistical model; they're just different summary procedures done with it. Okay, let's talk a little more about Gaussian distributions and where they come from. I hinted that a bunch of little errors get added together, so let's do a thought experiment. Imagine you and a thousand of your closest friends go out to a football pitch, a soccer field, and you line up on the halfway line. Each of you has a coin, and you're going to start flipping it: every time the coin comes up tails, you take a step to the left, and every time it comes up heads, you take a step to the right. So we can start you and your friends going; you're represented by the little dots on the screen, where blue indicates you're to the right of the halfway line and red that you're to the left. You'll see all the individuals jiggling around here: sometimes they move back toward the halfway line, and sometimes they cross over it. Most individuals remain relatively close to the halfway line, but a very small number drift further and further away. Now, take the positions of all these individuals, their distances from the halfway line, and represent them as a distribution. What distribution does this population of you and your friends get attracted to? Let's run the animation again and track that distribution: on the left I'm going to repeat the animation from before, and on the right I'm going to show a histogram of the distances from the halfway line.
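If you want to play with this yourself rather than watch the animation, here's a minimal sketch of the thought experiment in R; the number of coin flips (16) and the unit step size are my choices, not from the lecture.

    # 1000 people each flip a coin 16 times: tails = one step left (-1),
    # heads = one step right (+1). Each person's final position is the sum.
    pos <- replicate(1000, sum(sample(c(-1, 1), size = 16, replace = TRUE)))
    hist(pos, breaks = 20, xlab = "distance from the halfway line")

Increase the number of flips and the histogram only gets more bell-shaped.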
And you'll see, as the individuals wander, the distribution gets wider, but at some point its width stabilizes, with thin tails at the extreme left and right; most individuals stay in the middle. You can probably see that this is a normal distribution, the familiar bell curve of the normal or Gaussian distribution. Why does this happen spontaneously? It happens for the same reason many natural processes attract to Gaussian, normal distributions: the process is adding together small fluctuations. There are many, many more ways for a sequence of coin tosses to put you on or close to the halfway line than there are for a sequence of coin tosses to move you far from it. I'll say that again: there are many, many more ways for a sequence of coin tosses to get you on or close to the halfway line, because heads and tails balance out in many, many different sequences, than there are sequences that move you extremely far to the left or to the right, because that would require a long run of tails or a long run of heads. So this leads us to the idea that there are two arguments for where Gaussian distributions come from and why they're so common in nature and in our data. The first is generative: the argument I just gave, represented by the scenario I just simulated. Processes that add fluctuations together tend towards the normal distribution, and there are lots of processes in nature that add, or approximately add, fluctuations together. Growth is a very common one: if there's a certain amount added to your height or your body mass every year, those are fluctuations, and they're added together over time, so when we look at the sizes of animals of the same age in a population, they tend to have a Gaussian distribution. The second argument is not generative but inferential, or statistical. If our goal is to estimate the mean and variance of some variable, the normal distribution is, in a real information-theoretic sense, the best one to use, because it is least informative. I know these words put together like this sound weird: how is it best to use if it's least informative? What least informative means here is that it contains no information other than a mean and a variance. If you know more about the shape of the distribution of the errors, then you can use something other than a normal distribution; but if all you're willing to say is that there's a mean and a variance and you'd like to know them, the normal distribution is the most spread-out distribution, the one that covers the greatest number of possibilities, because it contains no other information except those two things, the mean and the variance. Later on, in the second half of the course, I'll talk about a perspective called maximum entropy, where this is justified in a stronger sense. But the important lesson to get now is that a variable does not have to be empirically normally distributed in order for the normal error model to be useful statistically, because it's just a machine, a golem, for estimating a mean and a variance, and it's a very good one, because it's the most conservative, least informative choice: it contains the fewest assumptions.
That doesn't mean you can't do better sometimes, if you know more about the process. But there's nothing wrong with using a normal error model in a statistical model for something that is not normally distributed empirically; you will still estimate the correct mean and variance of that variable, I guarantee it, and today I will teach you how to build a golem so you can prove that to yourself. Okay, I want to do three things in the remainder of this lecture; we have some modest goals today. We're going to try to build some geocentric models, and the skills I want you to begin to develop in this lecture are, first, to learn a language for representing models, a standardized statistical notation we can use to represent both generative models and statistical models. Second, I'm going to teach you how to calculate posterior distributions when there's more than one unknown; these unknowns are sometimes called parameters. In the previous lecture last week there was only one unknown, the proportion of water on the globe; now there will be multiple unknowns, because there are typically multiple parameters in a linear regression, estimated simultaneously. And finally, I want to show you, procedurally, how to construct and understand linear models and how to produce posterior predictions from them. We'll get to practice all of these skills in the lectures to come, so you don't have to understand everything right now. Let me pause on that point. This sort of material is pretty difficult, I think, and one of the reasons it's so difficult is that everything comes at you at once: you're dealing with new concepts, and they seem fairly fantastical, and then there's a bunch of code, and you're fighting with software installation on your machine, and unfortunately there's no way to put any of it aside; you have to do all of that at once. This makes the material a bit overwhelming, and I sympathize with that. But the good news is you don't have to understand it all at once. What you do need is some flow. What I mean by flow is that you feel some challenge, you feel the resistance of the water, or the air if you prefer the flying metaphor, but you're moving forward; that's why you feel the resistance. You're not stuck. If you get stuck, that's okay: you can back up, watch the previous lectures, find out what you were stuck on, and then start moving forward again. You don't have to understand everything to keep moving forward, and as you move forward you'll find challenges that help you understand, or help you realize that you didn't understand something from previous lectures. So flow is the goal. Part of flow is that things shouldn't be too hard, because if it's too hard you stop moving; but they also shouldn't be too easy, because if it's too easy you don't feel any resistance, and you want to feel the water on your fins, as it were. So don't stress if you feel resistance; that just means you're thinking. Or, as I often say, if you're feeling confused at any time, it's only because you're paying attention. So what are we going to pay attention to today?
Before I get into the details of the explanation, I want to remind you of the general structure of the workflow we're going to emphasize in this course, the owl-drawing workflow. The first part of this stylized workflow is to state a clear question. The question can be descriptive or causal, but we should have a clear question as we move forward, because otherwise we won't be able to design a statistical model to answer it. The second thing you need to do is sketch your causal assumptions, and I'm encouraging people to use DAGs for this, at least for the examples in the first part of this course, because it's a good way for non-theorists to realize that they actually have a lot of causal knowledge about their scientific systems, and to get it down on paper where you can interrogate it. From there you can take your sketch and define a generative model; it's generative because it's code, and it generates synthetic observations. Then we use the generative model to build an estimator. Remember from last week that the garden of forking data provides our estimator, and the garden of forking data is planted with causal, generative assumptions. Once we have that estimator, we test it with synthetic data from the generative model. And then finally we analyze the real data, and possibly we profit; that profit may be realizing that our model was terrible, and we go back to step two. What we're going to do today is begin with a simple data analysis problem focused on human growth, and move through this same drawing-the-owl progression. When we eventually get to the data in step five, we'll use data from this great book by the anthropologist Nancy Howell: a bunch of life history and demography data on the !Kung San.
This data is contained in the rethinking package; you can load it with data(Howell1). We're going to focus on height and weight today, but there are other variables in the data set that we'll take a look at in the next lecture this week. Our basic question is to describe the association between weight and height. This is important for lots of reasons in human biology, and it's actually of great theoretical interest, because many aspects of human life history are not well explained right now from an evolutionary perspective. So maybe this seems like a boring kind of data, and in some sense it is, but the questions that underlie it are quite exciting. I'd also say we're going to restrict ourselves, for this lecture, to adult weight and height, because adult weight and height are approximately linearly related to one another, so we can start with a simple linear regression; in the next lecture we can deal with more complexities. For the scientific model, we're going to make it causal: any association between weight and height arises causally because height causally influences weight. Taller people have more person in them, and so they weigh more. So here's a little DAG: H influences W, weight. To remind you, this arrow means that weight is some function f of height. There's a big literature on modeling human growth, and you've got a bunch of different options for generative models of this relationship, of how height influences weight. The first would be dynamic models: very detailed biological models of incremental growth in an organism, of how the mass of the organism is a product of its length, how this derives from a particular growth pattern, a life history pattern, and in which the Gaussian variation we observe in adult weight and height arises from fluctuations in the individual increments of growth. We're not going to do this, but it's the background of what you could do if you wanted a really detailed model: you would make it dynamic, you would birth synthetic individuals, grow them up to particular ages, and then look at the distributions of weights and heights at those ages. There are also static models, which is what we're going to do: we imagine the association between these variables, weight and height, within particular age ranges, and we still have Gaussian variation, imagined as the result of this growth history, the accumulated fluctuations in growth, but we don't model it in a dynamic sense. This is all just to make things easier; if you really want to get into human biology and model this stuff, you can go into the details as deep as you want. Okay, so here's our scientific model. I'm going to add one more thing to it before we build a generative model: another variable, this U with a circle around it. The circle indicates, in this class, that the variable is unobserved. We haven't measured this variable U; it's some unobserved influence, or set of influences, on body weight, because of course it would be very naive to think that the only thing influencing body weight is height, or stature. Lots of other things matter as well. This is going to turn out to be conceptually useful, both in this lecture and in future lectures, even if it seems silly right now. So now what we need is a function that simulates body weight, and it takes as inputs height and these unobserved things. In our simulation we get to observe U; it's only in our real data that we haven't observed these things.
So weight is some function of height and other unobserved stuff. Let's build a generative model. For the generative model I want to start really simple, as I said, so you can feel some flow. For adults, at least, we can imagine weight as a proportion of height plus the influence of the other unobserved causes, and so here's our equation: W = beta*H + U, where W is weight, beta is some proportion, H is height, and U is the unobserved stuff. We haven't observed U, so it's essentially a source of observational noise: variation around an expected value of weight for each height, where the expectation is proportional to height with proportion beta. We can write a simple little function for this, in the spirit of what we did in the previous lecture: sim_weight is a function of height H, our parameter b for beta, and a standard deviation. Where is this standard deviation coming from? It's the standard deviation of U, of the unobserved stuff; it's the amount of noise, the width of the error distribution. We simulate the Us inside this function using rnorm, which draws random normal values, one for each individual in the vector H, that's the length of H, with a mean of zero and whatever standard deviation we input; and then we just compute W using the equation we defined earlier. This is the world's simplest generative model of weight from height, but that's where we want to start, and we can get pretty far with it. Our goal will be to estimate beta when we eventually get to the estimator, but right now we're just simulating what data look like for any given value of beta. If we're going to do such a simulation, the first thing you need is heights, so I'm just going to generate 200 random individuals with uniformly distributed heights between 130 and 170 centimeters, just for the sake of the example; our real data may be different. Then we simulate weights from those heights, with a proportionality of 0.5 and a standard deviation of five. Weight is in kilograms here, which gives you an idea of what that standard deviation means; it's a variance of 25. Then we plot them, and these are our synthetic people. This is not a real population, but the idea is that you want the simulation to produce something that fits what you know scientifically about the constraints on human height and weight: you shouldn't be getting individuals who are impossibly light for their height, or impossibly heavy for their height. You can also adjust this simulation to produce biologically unrealistic relationships; just make b too large, for example, or too small.
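Concretely, that simulation looks like this in R; a sketch following the description just given, where sim_weight matches the name used in the lecture and the plotting call is my own rendering.

    # generative model: weight is a proportion b of height, plus noise U
    sim_weight <- function(H, b, sd) {
      U <- rnorm(length(H), 0, sd)  # one unobserved influence per individual
      W <- b * H + U
      return(W)
    }
    H <- runif(200, min = 130, max = 170)  # 200 synthetic adult heights (cm)
    W <- sim_weight(H, b = 0.5, sd = 5)    # weights in kilograms
    plot(W ~ H)                            # our synthetic people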
Okay. Before we move on to develop an estimator, and then test it using the synthetic data, we need a little bit of language about how statisticians conventionally describe these sorts of models. This is going to be very useful, because you're going to use this same kind of model description to program the estimator, so you're learning the notation as you go; and the notation is very conventional, so it's a good communication medium. Conventional statistical notation first lists all the variables in the model, whether the model is generative or statistical, and then for each of those variables defines where it gets its value. Each variable is either a deterministic function of some of the other variables or a distributional function of the other variables, and I'm going to show you what this means. So let's go back to the simulation we just did, where we simulate weight: U is unobserved and has normal error, and W is a deterministic function of H and U together. We can rewrite this in standard statistical notation, as on the bottom of this slide:

W_i = beta*H_i + U_i
U_i ~ Normal(0, sigma)
H_i ~ Uniform(130, 170)

Each of these lines is mirrored by a line of code when we do this on the computer. On the left we have the variables (well, not all of them; there are also beta and sigma) whose values are determined through the relationships stated on the right, the definitions. There are these little subscripts i, which are new to this course; they indicate individuals, in this case, or really individual observations. i is an index, and it takes on values from one to however many individuals are in the sample. An equals sign indicates a deterministic relationship, and the tilde (~) represents a distributional relationship; you can read it as "distributed as". The first line of this statistical notation for our generative model is the equation for expected weight: once we know the values of the variables on its right-hand side, we can assign a value to W_i. But one of those variables, U_i, has a different kind of relationship, a distributional, stochastic relationship: it's not perfectly determined by sigma; rather, sigma determines the distribution that each U_i is drawn from. This is the Gaussian error with standard deviation sigma. And H_i we also simulated; the way I simulated it was with a uniform distribution from 130 to 170. So look at the code again, and you can see how these lines correspond. One thing to note is that the statistical notation is written in roughly the opposite order from the code. When you write code, you have to write the lines in order of execution, because all the variables have to be populated with values by the time you use them in a function. Conventional statistical notation is not like that: in some sense the order is arbitrary, because these are all just simultaneous definitions; they're not executed like code, so you can make the order anything you want, in principle. But it's conventional, and quite useful, to state it in the opposite direction from the code, because then, as you read from top to bottom, you're reading the dependencies below: you start with the most general definition, W_i, which depends upon the most variables, and then you see how the variables it depends upon are defined below it. There's a hierarchy that makes sense, and most people prefer to state these models in the opposite order from the code. I know that can be confusing; I just wanted to point it out. Okay, that's already a lot of conceptual work, plus a bit of history of science. I think this would be a great place to take a break: review the slides so far, mark down anything that's confusing to you, and when you come back, I will still be here. Okay, welcome back. We're going to pick up where we left off: we had defined our generative scientific model, and now we're ready to develop an estimator for the height-influences-weight example.
When we build the estimator, it's going to have some things in it that are not exactly like the generative model. I'm doing it this way because that's a general feature of linear regression: the statistical model is usually not exactly like the generative model itself, because there are things we do with the linear model estimator to make estimation work better and to let it show us problems with the generative model. I'll try to unfold those points for you beginning with this lecture and continuing in the next one, so you'll just have to be patient with me; there's a method to my madness. We want to estimate how the average weight changes with height. Mathematically, that means the average weight conditional on height, which I write as E(W_i | H_i): each individual height has a different average weight, where E means average, expectation. We say the average weight is conditional on height, and we write a linear equation for this conditionality, with some intercept alpha and some slope beta, our proportionality constant from the generative model, times the height:

E(W_i | H_i) = alpha + beta*H_i

Alpha is usually called an intercept because this is the equation for a line, and alpha is the value of the expected weight when H is zero, for an individual with zero height. Now, of course, you know scientifically that an individual who is zero centimeters tall also weighs zero kilograms, so alpha should be zero. But we put it in the model because doing so lets us estimate it, and it's a way of testing for model violations: is our linear model adequate? The slope beta determines the slope of the line connecting weight and height. We're going to build a posterior distribution, because we're good Bayesians, and that's all good Bayesians do: there's only one kind of estimator in Bayesian inference, and it's always the posterior distribution, just driven by different generative models. In this case it looks a bit more complicated, but hang on; I'm going to highlight each of the pieces and explain them. This is the same kind of entity we had in the previous lecture, when we tossed the globe; it just has more unknowns now. On the left, we have the posterior probability of a specific regression line, because a regression line is defined by alpha, beta, and sigma. These are our three unknowns: alpha and beta define the line, which gives the expected weight at each height, and sigma defines the error around it, how much individual people vary around the expectation. These are conditional on the data, the H_i and W_i, the observed variables. So alpha, beta, and sigma are unobserved variables, and we need to develop a posterior distribution for them; H_i and W_i are observed variables, and we don't need a posterior distribution for them, because they're known. Then, on the right: the posterior distribution, as you know, is proportional to the product of the number of ways we could see the observations conditional on an exact line, in this case specific values of alpha, beta, and sigma. Given H_i, how many ways could the generative process produce W_i? This is going to be a Gaussian error distribution; the Gaussian model populates our garden of forking data in this case. And, as I said, the posterior probability is always proportional to that, the number of ways the observations could arise according to our assumptions, times the previous posterior distribution, which is often called the prior; remember, that is the normalized number of ways the previous data could have been produced, given a specific set of unknowns alpha, beta, and sigma.
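Written out in plain notation, the posterior being described is the following, where Z is the normalizing constant discussed next:

    Pr(alpha, beta, sigma | W, H) =
      [ product over i of Normal(W_i | alpha + beta*H_i, sigma) ] * Pr(alpha, beta, sigma) / Z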
And then there's this normalizing constant, which I keep calling Z; that's just a convention in lots of mathematical statistics. We're not going to pay much attention to it: this is not a math-stats course, so we won't spend time doing integrals and figuring out what Z is. It's there just to normalize, so that the left-hand side is a proper probability. Okay. In conventional statistical model notation, W_i is normal with some mean mu_i and some standard deviation sigma, and that mean mu_i is our linear model, alpha plus beta times H_i:

W_i ~ Normal(mu_i, sigma)
mu_i = alpha + beta*H_i

This is the way you'll tend to see statistical models written. You won't see them with the posterior distribution written out, like the top part of this slide; instead you'll see what's on the bottom part, and this is also how we're going to program statistical models in this course, using notation that mimics it. You can read it as: W is distributed normally, with a mean that is a linear function of height H. In the notes I show you how to code this up, and we can do a quick grid approximation of it. There's nothing fancy or especially interesting about the code, and it's all in the book. For just 11 different possible values of the slope, fixing alpha at zero and sigma at five in this example, you can compute a posterior distribution from one simulated observation. All posterior distributions begin with a first observation; remember, there's no minimum sample size in Bayesian inference. It looks like this for one particular simulated observation, where the true generating value is 0.5. And you'll remember from last time that if we make the grid finer and finer, we can consider more and more possibilities. So, as I also show in the notes, we continue with the grid, considering many, many more slopes, and sample three points. On the right we're looking at the relationship between height and weight and letting alpha and beta vary: there are two unknowns, and we need to estimate them simultaneously. What this model really represents in the posterior distribution is lines, and for models like this, inspecting the posterior distribution of individual parameters is often not very useful. It's usually much more useful to look at the functional implications of the parameters together, because the parameters are joint inputs into some function, in this case a line; to know what they mean together, you need to plot the lines. You could say the posterior distribution is full of lines. So on the right of this slide I'm showing you the implications of the posterior distribution of alpha and beta given these three observations, where the width of each line is proportional to its posterior probability. You'll see that the lines that lie between the points are the more plausible ones, but there's still a lot of probability mass spread everywhere, because there are only three observations. As we add more and more observations, the probability mass masses up on fewer lines: you can see 1, 2, and 3, for some reason backwards on this slide, sorry about that, and 10, 20, and 89 in the bottom row. By the time we're up to 89 individuals, in the bottom right, almost all of the posterior probability is on a couple of lines there in the middle.
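The real code is in the book and the notes, but a minimal sketch of that first grid approximation, under the assumptions just stated (one simulated observation, alpha fixed at zero, sigma fixed at five, 11 candidate slopes; the variable names and the particular height are mine), looks like this:

    H1 <- 150                              # one simulated individual's height
    W1 <- sim_weight(H1, b = 0.5, sd = 5)  # true generating slope is 0.5
    b_grid <- seq(0, 1, length.out = 11)   # 11 candidate slopes
    # ways to see W1 under each candidate line, with a flat prior over the grid
    post <- sapply(b_grid, function(b) dnorm(W1, mean = b * H1, sd = 5))
    post <- post / sum(post)               # normalize
    plot(b_grid, post, type = "b", xlab = "slope b", ylab = "posterior probability")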
This is still a discrete grid approximation, just to help you understand it conceptually. When we work with models like this, linear regressions, of course we'll have an infinite grid, a smooth grid, a smooth posterior distribution; I just haven't shown you how to do that yet. When we do, you want to think about what these linear regressions are really doing, so bear with me a moment: I have a couple of animation slides here that I want to draw your attention to, so you can get an organic feel for what these things are doing. On the left of this slide, what I'm showing you is a posterior distribution, or you could say a prior, because there's no data yet; it's what the model believes before it's seen any individuals. On the horizontal axis I have the intercept alpha, and on the vertical the slope, and I've zero-centered these just to make this conceptually easy to think about. This isn't about height and weight necessarily; it's just an anonymous linear regression where the lines are defined by intercepts and slopes. Those concentric rings: it's like you're looking down on the top of the distribution. This is a hill, and in the middle are the most plausible values in the prior, and out towards the edges they get increasingly implausible. The little colored balls bouncing around, rolling around on this hill, tend to spend most of their time near the middle, with the high probability, but sometimes they wander out to the far edge before coming back in. They explore this distribution in proportion to the posterior probabilities of each combination of intercept and slope. Then, on the right, we're looking at the outcome space: some x-axis value on the bottom, the conditioning variable (the predictor variable, some people call it), and on the vertical the outcome variable y, which is conditional on x. I'm showing the lines implied by the positions of the same colored balls on the left, and this dances around. This is n equals zero: there's no data yet to constrain these lines, and they can do just about anything, including completely ridiculous stuff like being almost perfectly vertical. Maybe there are samples where vertical is the right thing, but it's not too common in science. Now I'm going to let this go forward. We'll introduce one data point at a time, and you'll see that the posterior distribution contracts around increasingly plausible values, determined by which observations are made, and this constrains the dance of the lines on the right; they start to settle down, but they keep moving. This is the sense in which the posterior distribution is full of lines: it has to consider an infinite number of them, but that's no problem for calculus, and no problem for our computers. So here we go, start it again. Here comes the first data point; one data point doesn't do a lot of constraining, but there's already information. Then the second and the third, and you'll see the rings on the left contract, and the lines on the right dance less and less. Five, six, seven. That gray bow tie on the right is just meant to guide your eyes; it's not a real thing, just a visual proxy for the range the lines tend to dance in, given the updating to that point. I'll let the animation run through one more time: first data point, second, third. The contraction happens on the left because the probability in the posterior distribution masses up on a smaller range of values for the intercept and the slope as we get more data, and then the implied lines, the real predictive implications of that posterior distribution, tend to occupy a smaller and smaller range.
Okay. You can do all of that with grid approximation; there's actually nothing wrong with grid approximation, it's just that it's a grid. What we're going to do from here through the first half of the course is use another kind of approximation, one that is continuous, so it doesn't need a grid, and is much easier to set up and run: the quadratic approximation. In the quadratic approximation, we approximate the posterior distribution as a multivariate Gaussian distribution. I'll say that again: we approximate the posterior distribution as a multivariate Gaussian distribution. We can do that because posterior distributions very often, and especially in this case, are in fact approximately multivariate Gaussian when the sample is even reasonably large, even a handful of observations in this case. And we're going to use a tool that does this for you, so you don't have to do the detailed algorithmic programming yourself: a tool called quap, which just stands for quadratic approximation. quap is in my rethinking package, and I designed it as a teaching tool; the birth of this course, now more than ten years ago, was the idea of writing a statistical tool where students would have to enter every assumption of the statistical model, so that they would know what the assumptions were and could adjust them and see the consequences. That's what this tool lets you do: you enter a list of deterministic and distributional assumptions that define a statistical model, and quap runs it and finds the quadratic approximation, the Gaussian approximation, of the posterior distribution for that model and the data you pass it. It's called quadratic because the Gaussian distribution is, in a deep sense, quadratic: it's defined only by its mean and its variance, and that's all there is, so in log space the Gaussian distribution is a perfect parabola, a quadratic function. The thing we need here that's new is that in the statistical model, unlike the generative model, you need priors: you need to say what the model believes before there are any observations. In the previous lecture I punted on this, because it wasn't very exciting to talk about what we believe about the proportion of water on the Earth before we've tossed the globe. Now there's more scientific input to think about, and we can have somewhat more productive discussions about priors. When you're designing an estimator and you put the priors in, you could use really, really flat priors, which is quite conventional. I don't tend to do that; I tend to use more concentrated priors, though not priors that build in the answer. Indeed, it's quite the opposite: if anything, these priors lean away from the answer, and that's intentional; at the same time, they constrain the implied observations of the model to scientifically realistic ranges. That's what we'll be doing when we design estimators in this course: designing priors so that, before the model has seen data, it doesn't hallucinate impossible things. This turns out to be really useful for estimation, and I'll explain why, I think, in two weeks' time. Okay, there's more to say about the quadratic approximation and how it works; there's much more in the book on that, if you're interested, take a look, and of course there's lots online. It's a workhorse approach, in both Bayesian and non-Bayesian statistics, to do quadratic or Laplace approximations.
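One way to see the "quadratic" for yourself: the log of a Gaussian density is exactly a parabola. This little check is entirely my own illustration, not lecture code:

    # log density of a standard Gaussian, and the parabola it equals
    curve(dnorm(x, 0, 1, log = TRUE), from = -3, to = 3,
          xlab = "parameter value", ylab = "log density")
    curve(-x^2 / 2 - 0.5 * log(2 * pi), add = TRUE, lty = 2, col = 2)

The two curves lie exactly on top of one another, which is why approximating a log posterior with a parabola is the same as approximating the posterior with a Gaussian.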
Rather than spend lecture time on algorithms for computing posterior distributions, I want to spend it on something much more useful: how to construct a prior predictive distribution. We did posterior predictive distributions at the end of the previous lecture, and the prior predictive distribution is exactly the same thing, because prior distributions are posterior distributions, and posterior distributions are prior distributions; they're the same animal, and there's no essential difference between them. So the procedure is the same: we can push predictions out of the prior even though the model hasn't seen data; we can force the golem to make predictions before it's made any observations. What we want is a set of priors such that the predictions are not crazy. We don't want predictions that mimic our sample, because we haven't looked at our sample yet in any detailed way; that's not what we're doing. We just want boundaries that look, in some sense, like a human on planet Earth. This is why I say a prior should express scientific knowledge, but do so softly. We do it softly because the real generative process in nature is often different from what we've imagined, so we want a statistical model flexible enough to see and learn those violations, so that we can think about the structural flaws in the statistical model. What is scientific knowledge in this case? You are humans, and you know a lot about human biology, whether you're aware of it or not. For example, somebody who is zero centimeters tall is also zero kilograms; that's a strong constraint, so alpha should be close to zero if the linear model of weight and height is a good one. So I'm going to put a prior on alpha that is a normal distribution centered on zero with a very wide variance; the wide variance is the softness, so that if this constraint is wrong, the model will be able to learn that. What about the slope? You know that weight increases on average with height. That isn't to say that every taller person is heavier than every shorter person, only that on average taller people are heavier. This means beta is positive, not negative: taller people are not lighter on average. And weight in kilograms is less than height in centimeters: somebody who is 170 centimeters tall is not, on average, 170 kilograms; they're less than that, so beta is probably less than one. So I'm going to put a uniform prior on beta between zero and one. This is not super informative; lots of things could still happen in that range, but it gets us into the habit of using scientifically reasonable priors so that the prior predictions are not totally crazy; we don't want vertical relationships between height and weight. Then sigma: we could think more about how much variation there is at each height and so on, but I'm just going to assert the minimal scientific information required to even get the model to run, which is that standard deviations are positive. So sigma must be greater than zero, and I'm going to put a uniform prior on it between zero and ten. A standard deviation of 10 means a variance of 100, and there isn't a variance of 100 kilograms-squared around the average weight, so sigma isn't even going to get that large; it will be smaller. Whatever the actual standard deviation is, it must be in this range. So the priors are:

alpha ~ Normal(0, 10)
beta ~ Uniform(0, 1)
sigma ~ Uniform(0, 10)

How do you understand the implications of these priors? You're not meant to just look at these definitions and intuit them like some kind of savant. What you're meant to do is perform a prior predictive simulation and draw the observable implications of the priors in the outcome space; that is, we're going to make some lines, as in the sketch below.
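Here's the prior predictive simulation as a code sketch, following the description that comes next; the axis ranges are my choices for plausible adult heights and weights.

    n <- 1e3
    a <- rnorm(n, 0, 10)  # prior samples for the intercept alpha
    b <- runif(n, 0, 1)   # prior samples for the slope beta
    plot(NULL, xlim = c(130, 170), ylim = c(10, 100),
         xlab = "height (cm)", ylab = "weight (kg)")
    for (j in 1:50) abline(a = a[j], b = b[j], lwd = 2)  # 50 implied lines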
We draw a thousand, that's 1e3, random samples from the priors of alpha and beta: a is normally distributed with a mean of zero and a standard deviation of 10, and b is uniformly distributed between zero and one; we just draw a thousand random pairs of those two things from the prior definitions. Then we plot the implied lines: I set up a null plot, one with no points in it, and put 50 lines in it from the random samples of a and b, to see what they look like. And here are our random regression relationships, what the golem imagines before it's seen any data. This is a bit wild. Many of the lines are too low. The slopes are not bad, so the constraint on beta from zero to one isn't so bad; it's the intercept that's wild. The low lines are impossibly low: an individual who is 170 centimeters tall is not going to average 30 kilograms. And the high ones are impossibly high, off the top of this graph; if you expanded the range of the y-axis, you'd probably also see some lines that were way too steep, which we can't see here because they've been cut off by the y-axis range I chose. This is what prior predictive simulation does for you. You are an expert in your field, you are, and when you see the implications of your assumptions this way, in the outcome space, you often realize that your assumptions are bad; it's very hard to realize that just from looking at the definition of the model. So we're going to do a lot of these predictive simulations, prior and posterior, in order to understand what the model is actually thinking in its tortured little mind. I'm going to leave these priors for now; we'll come back to the issue of priors, and to this particular scientific modeling problem, later in the course, a few times actually. These priors aren't going to do any damage right now: this is a really simple model, and you'll see that we could choose crazy priors and Bayes would still learn the proper relationships among the variables very quickly. Okay, now a short sermon, the first of what will probably be many sermons on priors. As I've said before, I think it's unfortunate that people typically take their first Bayesian statistics course after they've had some other kind of statistics course, so that concepts from the non-Bayesian course interfere with their learning. One thing that often happens in Bayesian courses is that you get the impression that it's everything you did before, plus priors. I'm showing you a linear regression, and if you've already had a course in regression, it looks like the regressions you did before but with priors added, and so you get the impression that Bayes is frequentism plus priors. But that's not true: the whole approach to probability is different in Bayes, and that's why last week I started with the garden of forking data.
I didn't even talk about priors very much last week. Priors are just posterior distributions; it's the posterior distribution that's different in Bayes, and how it's computed. So there are no correct priors, only scientifically justifiable priors, and exactly the same is true of posteriors: there are no correct posteriors, just scientifically justifiable posteriors. What justifies them? The scientific assumptions that motivate the generative model that inspires the estimator. Those things have to be defended, and they can't easily be tested with data, at least not in the context of a single study; we have to justify the structure of the model, and the structure of the model includes the priors, with outside information. In very simple models like linear regressions, the priors don't do much work; they get overwhelmed very quickly, and you'd have to make them crazy strong, which you'd never do by accident, for them to influence your results. A problem, then, is that if you never move past simple models in Bayes, you never see some of the really important ways that priors work to help us do scientific inference. In the second half of this course, you're going to see that in more complicated models priors do a lot, and we'll have to think much more carefully about them. Non-Bayesian approaches to complex models have all the same problems, but they don't have priors to solve them with, so they have other, often mathematically equivalent, ways of sort of smuggling priors into the computation, things like regularization. So this is not a problem unique to Bayes and complex models; it's a feature of complex models, and in Bayes we solve it, or at least some problems, with priors. We're going to practice with priors now, even though they don't matter very much here, because you want to practice when it's easy, so that you know what to do when things are hard. Okay, we're up to step four now: we're ready to validate. We have a generative model that can produce synthetic people, and we have a statistical estimator; now we can use quap to get a posterior distribution. So let's feed synthetic people into our model. I keep harping on this, and I'm going to keep doing it: you've got to test your code like this, you really do; it's a bare-minimum thing. And it's not because you're programming the estimator yourself; in fact, you're not programming the estimator, I wrote quap, remember, and quap does the searching. It doesn't matter what package you're using: you've got some generative process, and you need to write it as code that generates synthetic data, so that you can show that your estimator, whatever structure it has, whatever notation your statistical software requires, could work even in principle. That's a minimum scientific standard: test your statistical model with simulated observations. It's the bare minimum; you can do much more stringent tests than the sorts I'm going to show in the lectures. There's a thing called simulation-based calibration, with a literature online, where you can test very rigorously how well calibrated Bayesian estimators are over the full shape of their posterior distributions. The simplest sort of test, though not a full simulation-based calibration, would use the code below as a scaffold. We simulate a sample of 10 people at the top of the code block, with a data-generating slope of 0.5; you're going to want to vary this across runs and make sure the estimator tracks it. Then we run the model using quap, which takes a formula list that mirrors the mathematical definition of the statistical model.
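A sketch of that scaffold, reusing sim_weight from before; the formula list mirrors the model definition with the priors above, and the seed is my addition for reproducibility.

    library(rethinking)
    set.seed(93)
    H <- runif(10, 130, 170)             # 10 synthetic people
    W <- sim_weight(H, b = 0.5, sd = 5)  # known data-generating slope: vary me
    m_sim <- quap(
      alist(
        W ~ dnorm(mu, sigma),   # W_i ~ Normal(mu_i, sigma)
        mu <- a + b * H,        # mu_i = a + b * H_i
        a ~ dnorm(0, 10),
        b ~ dunif(0, 1),
        sigma ~ dunif(0, 10)
      ),
      data = list(W = W, H = H)
    )
    precis(m_sim)  # compare the summaries to b = 0.5 and sd = 5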
So you run the model using quap, and quap takes a formula list that mirrors the mathematical definition of the statistical model; you see it here. And when you're using quap, there's a function in my package called precis which will give you a quick summary of the marginal posterior distributions of each unknown. That means the mean, standard deviation, and 89% percentile interval for each of the unknowns in the posterior distribution. This is just a quick summary, and it's not sufficient for understanding what the model has learned, but you can already see from here that it's not having any trouble recovering the data-generating slope, it's having more trouble with the intercept, and it's also recovering the standard deviation quite well. Again, when you're testing, a single run like this is not sufficient. You want to go back up to the generating code and change b, the input slope, and make sure that your posterior distribution follows it, and so on. You want to try very large samples and make sure it converges, and try small samples and see that it's characterizing the uncertainty correctly.

Okay, we're almost at the profit stage: we can take Nancy Howell's data and put it into the analysis code we have already written. What's very nice about this drawing-the-owl workflow is that, even though you're going to have to do a lot of work to get through the validation stage, especially for more complicated models later in the course, where there will even be multiple rounds of model development and validation before we can pass on to step five, by the time you get to step five you can put your real sample in and it'll run. And that's a great feeling, an absolutely great feeling. There are a couple of projects I've worked on where I spent literally three months moving between steps three and four, layering in different components of the statistical model and making sure they worked on synthetic data before finally putting the real sample in, and then the first time I put the real sample in, it just ran. It was great. This is a much simpler case, so we can just put the real sample in. We feed the weight vector and height vector from Nancy Howell's data, where weight is measured in kilograms and height is measured in centimeters, directly into the same model that we used to do the validation, and we can pull up the summaries the same way. You see beta is about 0.5, a little bit above it, and you'll see it's estimated quite precisely: the standard deviation is really about 0.03. Sigma is about four. The intercept is definitely not zero, and this suggests there really is something globally nonlinear about the relationship between height and weight, because you can't have negative weights at all. Nevertheless, as I'll show you on the next slide, this approximates the adult relationship quite well, and that intercept is just a weird, non-biological term; we need it to make the line fit.

What I'm showing you on the right of this slide is something called a pairs plot, p-a-i-r-s, and it's called that because it plots pairs of unknowns from the posterior distribution against one another. The diagonal in this grid of plots is just the density, like a continuous histogram, of the posterior distribution of each unknown, of each parameter: a, b, and sigma. And you see that they're Gaussian, nice bell shapes, because that's what quap produces; but if you used another technique, you'd see that they were Gaussian in this case as well.
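Here's a sketch of feeding in the real sample and pulling up those summaries and the pairs plot. The adults-only filter and column names come from the rethinking package's Howell1 dataset; the priors are the same illustrative ones as in the validation sketch above:

```r
# Fit the same model to the real sample (sketch; priors as in the validation code)
library(rethinking)
data(Howell1)
d <- Howell1[Howell1$age >= 18, ]   # adults only: 352 individuals

m_adults <- quap(
    alist(
        W ~ dnorm(mu, sigma),
        mu <- a + b * H,
        a ~ dnorm(0, 10),
        b ~ dunif(0, 1),
        sigma ~ dunif(0, 10)
    ),
    data = list(W = d$weight, H = d$height)
)

precis(m_adults)                    # marginal posterior summaries

# A basic pairs plot of the posterior samples: one scatter per parameter pair
# (the lecture's slide version also shows each marginal density on the diagonal)
post <- extract.samples(m_adults)
pairs(post)
```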
Then, in the upper triangle, the upper right, you're seeing posterior distributions seen from above for each pair, and what I want you to see is that the intercept alpha and the slope beta are very strongly related to one another. They covary along a very tight line; they're almost perfectly negatively correlated with one another. This is how linear regressions often work: there's strong covariation among the different unknowns, the different parameters that determine the shape of the line, and so you can't interpret the intercept and the slope separately in these cases. And once we get to more complicated linear regressions with many predictor variables, many slopes, and interactions, you really can't use a table of coefficients to understand what the model is doing. You have to push out posterior predictions, and we're going to do that quite a lot.

So let's do that in this case. This is what I call, as I just told you, obeying the first law of statistical interpretation: parameters are not independent of one another and cannot always be independently interpreted; they have a lot of variance and covariance with one another. So instead we push out posterior predictions. We do that by extracting samples, and I show you a little bit of the code on the right. Once you fit a quap model, you can use the function extract.samples on the model and get essentially a synthetic data table which contains samples from the posterior distribution, with each column being an unknown, a parameter, and each row being a set of correlated samples that have the right covariance structure for the given posterior distribution. And then you can simulate lines from it; that's what I do here. First, in this little block of code, we extract samples from the model of weight as a function of height for the actual Howell sample. Then I plot the Howell sample on this graph, so each red dot is a person in Nancy Howell's dataset, and then I draw 20 lines randomly sampled from the posterior distribution, and those are in black. You can see that the slopes vary a little bit, but they're mostly just a little bit above a half, and they're in a fairly narrow range. There are over 300 adult individuals in this sample, and that gives a lot of information to estimate the average relationship between weight and height. This is only the average, because you'll notice that most of the points are not on the line: there's a lot of residual variation from the unobserved other factors that influence weight. Sometimes you want to put that on the graph too and see the sort of prediction envelope that comes from sigma, which defines the variation around the expected relationship. I'll show you that in a moment.

What I wanted to show you first is the Bayesian updating perspective on this dataset. Of course, with Bayes you can do it one data point at a time, and you can do that with this dataset as well. If you enter only one individual, that's what you get in the upper left here: the model is still quite unsure; it's just one individual, what do you want from me, right? For five individuals, it's starting to get much more confident with its random lines from the posterior; for 25, even more so. At 50, it's basically right, it's already learned the relationship, but the spread is still wider than it will be with the full sample. And then, repeated on the far right, is the inference for all 352.
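Those random lines from the posterior come from code roughly like this; this is a sketch, where m_adults and d are the fit and data frame from the earlier block:

```r
# Plot the sample and overlay 20 lines drawn from the posterior (sketch)
post <- extract.samples(m_adults)   # one column per parameter, one row per
                                    # correlated sample from the posterior

plot(d$height, d$weight, col = 2,
     xlab = "height (cm)", ylab = "weight (kg)")

# Each line is one plausible (intercept, slope) pair from the posterior
for (i in 1:20)
    abline(a = post$a[i], b = post$b[i], lwd = 1)
```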
But this is important to start thinking about: there is not one true line in a Bayesian analysis. There's no one point from the posterior distribution, no one line, that is the right line or the right answer; it's the whole distribution. So the kind of summary you want to aim to produce is one that shows that variation, and the easiest way to do it is just to draw multiple lines from the posterior and plot them all; then you see the variation right away.

Okay, the posterior is full of lines, as I often say. It's also full of people, the individual points, and the golem will make a prediction about how far out most of the individuals should be from the expected relationship. You can simulate that as well, using sigma, and that's what I do with the code block on the bottom of this slide. That produces the dashed lines, the sort of prediction envelope around the anticipated relationship, the average relationship, which is the solid lines in the middle. This is just the percentile interval from simulating individuals with a helper function called sim, which is part of rethinking, and which I use in the bottom block here. You just give it a fitted quap model and some new data, here some new individuals ranging between 130 and 190 centimeters in height, and it simulates from the whole posterior distribution, the whole posterior predictive simulation that was animated in the previous lecture. It does that behind the scenes, using the model definition in the quap model that you supplied, and then it returns a big data table of those simulations. Those are individual weights that were simulated, and they use all the parameters, including sigma. Then you just take a percentile interval for each height value; that's what the rest of the code does there, applying PI, the percentile interval function, to the posterior predictive simulations of weight, and the lines commands at the bottom of that block of code just draw the dashed lines. If you want to understand code like this, what I recommend is executing one line at a time, inspecting its output, and then moving on to the next line. This is always the way to learn how someone else's code works.
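A sketch of that envelope code, under the same assumptions as before: m_adults is the fitted quap model from earlier, the plot from the previous block is still open, and the height range is the one mentioned above:

```r
# Posterior predictive envelope for individuals (sketch, continuing from above)
H_seq <- seq(from = 130, to = 190, by = 1)    # new heights to predict at, in cm

# sim simulates individual weights from the whole posterior, including sigma
W_sim <- sim(m_adults, data = list(H = H_seq))

# Percentile interval of simulated weight at each height, then dashed lines
W_PI <- apply(W_sim, 2, PI)                   # 2 rows: lower and upper bounds
lines(H_seq, W_PI[1, ], lty = 2)
lines(H_seq, W_PI[2, ], lty = 2)
```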
Okay, that's enough for today, and it's really possibly too much, I apologize. Back up and review, and make some notes about what was confusing; this will provide the foundation for the next lecture, where we're going to stick with the same Howell dataset but take a look at some other variables as well, variables that influence these things, or maybe they don't. The thermometer on your screen is meant to remind me to say that things like linear models, most statistical models, are akin to thermometers, because they're measurement devices. They're ways of measuring something: associations between variables, in the case of linear regression, and in the case of thermometers, something we call temperature. But of course you have to define temperature in a particular way, decide what's being measured, and your interpretation of what the thermometer reads depends upon other issues as well, like who you are and your perception of the temperature, or the conditions outside, how damp it is, and so on. So the meaning of what the thermometer says isn't entirely objective, right? You may know that North Americans are weird because we talk about this thing called wind chill, which is sort of how apparently cold it is; Europeans are starting to adopt this, but it's still not very common here. This is just to say that temperature is a socially constructed object, but an incredibly useful one. The things that statistical models like linear regressions measure, associations, are also socially constructed objects, and their meanings come from the purposes we assign them and our scientific beliefs about the processes that generate the sample. This is not to demote statistics and its power; it's just to be honest and realistic about what it's doing, as we should be honest and realistic about how useful the information we get from a thermometer is.

So, the distinction then, in our linear thermometers, between the generative model, which I'm showing here on the left, and the statistical model: there's some background scientific idea about how the sample is generated, that weight is some function of height and other stuff that we haven't even measured. But since we haven't measured it, we need a statistical model that can cope with that, and so we use a Gaussian error distribution that tries to cover for the stuff we haven't measured. It can do unreasonably well at that in lots of situations, but there are also situations where it can't. The other thing that goes into the statistical model is priors: we need to define an initial state of information, and there's good scientific information that can be added at that stage as well. Okay, I appreciate your attention to this material. We'll continue in the next lecture, building directly upon this, and talk even more about causal inference and linear models. I'll be looking forward to seeing you there.