Welcome back everyone to the third lecture of Statistical Rethinking 2022. In this third lecture, I want to introduce linear regression as a practical method for approximation, and I want to show it to you both from a causal position, thinking forward to how it makes predictions and counterfactual predictions after interventions, and then also as a piece of statistical machinery we can use to find associations in data.

But first let's talk about planets. So this is a picture of the night sky taken in June 2020, and what you see now is a stop-motion animation of Mars moving month to month, from June to February, as it moves through the night sky. And what you'll notice is that, relative to the stars in the background, it's reversed direction. This is a well-known phenomenon in astronomy, called retrograde motion. It's been going on as long as people have been watching, and obviously as long as the solar system has existed. And now it is reversing course again and continuing on its original path through the night sky. The word planet simply means wanderer, and this is because they move relative to the stars, but they also wander in very irregular paths. All the planets do this to a certain extent, but Mars is the most elaborate one because it's closest to us.

Now what accounts for this? It's a general phenomenon that has to do with the way orbits work, and one clue to this fact is that if you were an observer on Mars, you would see the Earth behave the same way. So what you're looking at on this slide on the left is Mars seen from Earth; this is the phenomenon you just saw on the previous slide. And then on the right is the Curiosity rover, which is currently on Mars and can watch the Earth, and the position of the Earth does the same zigzag motion in the sky over the months.

So ancient astronomers wanted to explain this. They didn't know the structure of the solar system, and they had to come up with models that could not only explain this behavior, but also predict it. And I want to draw a contrast between these two things, between explanation and prediction. What ancient astronomers were definitely able to do is predict when Mars and the other planets would be in a certain position in the night sky relative to the constellations, that is, the background stars that are very far away and are essentially fixed from our perspective on Earth.

The real explanation, we now know, since we know the structure of the solar system quite accurately now, is that Earth and Mars are both orbiting the Sun at the center of the solar system. Here on this slide, I've pictured in a cartoonish way the orbits of Earth and Mars, and then on the far left we have a representation of the zodiac curtain, as it were, the background stars. It's not quite far enough away, obviously, but it'll work for illustration purposes. Now I'm going to start these planets on their motion, and I'm going to cast a line between the Earth and Mars to show a viewer's perspective, and we can continue this perspective through Mars all the way on to the far left. There's this perception of where Mars is on the background curtain, the zodiac curtain. Now I'm going to continue this animation. I want you to watch the virtual Mars, as it were, on the far left. That is the perception of a viewer on Earth, where they see Mars against the night sky.
And here's the zigzag, and it happens because the Earth is moving faster around the Sun than Mars, and so when it catches up with Mars, there's this parallax effect, purely a perception, whereby it looks like Mars changes direction against the night sky. I'll let it go around one more time so you can appreciate this. The other planets do this too from our perspective, but Mars, as I said, has the most dramatic swing because it's so close to us; Jupiter and Saturn do it, and the others.

Okay, but you can explain the same observations with a different model, one that has the wrong structure for our actual solar system, and this is the geocentric model. You can construct a model in which Earth is at the center of the solar system, as pictured here. Again, this is a cartoonish, simplified version of the historical model. Mars and the other planets orbit the Earth, but they orbit on orbits, and so what you're seeing here are two circles: there's the first circle around the Earth, and then there's a circle that goes around it. I'm going to start this animation, and you'll see that Mars has an orbit, and that orbit itself orbits the Earth. That's the geocentric model. The red path here is tracing out the perceived path in the sky, and you'll see it reverses as a consequence. This kind of geocentric model uses what's called an epicycle: the second circle, the circle on a circle, is the epicycle. These geocentric epicycle models can make exactly the same predictions about where planets like Mars will appear in the sky, and this is the kind of model that ancient astronomers like Ptolemy came up with. They're extremely accurate models. They have no problem predicting where Mars will be, explaining the observations of an observer on Earth.

This is a great example of prediction without explanation, and prediction without explanation is routine in statistical science. It's a huge hazard, actually. It's possible to create arbitrarily accurate approximations of observations without actually understanding the true state of nature at all. So let me tie up this little moral story. On the one hand, we have geocentrism, these epicycle models. They're descriptively accurate models. They're mechanistically wrong, in almost comedic fashion: there are no orbits on orbits in the true structure of the solar system. But they're based upon a very general method of approximation, the Fourier series: for any continuous path, you can always construct an arbitrarily accurate approximation using circles on circles. So they're known to be wrong in an explanatory sense, and yet they can be incredibly useful for prediction, these sorts of geocentric models.

And I want to analogize this to the topic of this week's lectures, which is linear regression, which is also descriptively accurate. You can describe all manner of associations between observable variables with linear regressions. But it's also mechanistically wrong, in the sense that nature is not composed of a bunch of little straight lines that connect variables to one another. But it's a general method of approximation. So even when the actual generative or causal model of the system is not, strictly speaking, a linear regression, and I'll say what that is as we go along today, you can construct very useful approximations of the relationships among variables with it, and even relate them to generative phenomena before you know the mechanisms of a system.
And that's good news, actually. But you should always keep in mind that you shouldn't take the linear regression too seriously, because it's working at some effective scale of prediction and description, and we have to be cautious about it. So linear regressions are essentially geocentric models. They're incredibly useful, but we have to remember what their mechanistic limits are.

So what are linear regressions? They're little golems, if you remember golems from the first lecture of the course last week, and they model the mean and variance of a variable. And that's it. Typically what we do is make the mean some weighted sum of other variables, so that as those other variables vary, from case to case and observation to observation, the model makes a different prediction for the expectation, or the mean, of some variable of interest, which we usually call the outcome variable. Probably more than 90% of all the models in applied statistics in the sciences are special cases of linear regression. ANOVA and ANCOVA and t-tests and MANOVA are all, in a sense, linear regressions, examples of what's called the general linear model. These special cases are all useful in their proper domains, but they all have the same kind of mechanistic structure, if you will, the same machinery. They're golems with the same machinery, just slightly different expressions. Linear regressions can also be used as generative models at a high level of description, and I'm going to show you examples of that today as well.

Linear regression, and the least squares estimation traditionally used to fit it, is Bayesian in quite a literal sense. In 1809, Carl Friedrich Gauss used what we now recognize as a Bayesian argument, though of course at the time it wasn't thought of that way, because all of statistics was Bayes in 1809, so there were no distinctions to draw. He had a Bayesian argument for using normal error and this method of least squares estimation, which is the most common way that linear regressions are fit. And he had an astronomical purpose in this 1809 book. This is the reason that the normal distribution is named after Gauss; it's often called the Gaussian distribution because he had this essentially Bayesian argument for its construction and application in an estimation procedure. The old German money, West German money from before the Euro, the 10 mark note, actually had Gauss's face on it. You see him here, and the normal distribution in small print. Very handy for cheating on tests in those days.

Okay, so where does the normal distribution come from? The normal distribution is important in linear regression because it counts up all the ways observations can happen. Just like we were talking about last week, the whole business in Bayes is to count up all the ways the observations can happen given a set of assumptions. And so, given some mean expectation and variance, the normal distribution gives you the relative numbers of ways that the data can appear. I want to explain that to you more with a small set of animations here. I want you to imagine that we take everybody who's paying attention to this course at the moment, which I think is a few hundred people, and we line you up on a football field, a soccer field, on the midline. And each of you has a coin, and you're going to flip it in the air and catch it.
If it turns up heads, you're going to take a step to the left. And if it comes up tails, you're going to take a step to the right. We can simulate this out now, and you'll see them scatter. In this animation, each point is a person, the white dashed line is the midfield line, the blue points are just to the right of it, and the red points are just to the left of it. You'll see they're all jiggling around as the process continues, kind of this agitated boiling motion. But they're scattering, and, freezing it at this point, you'll see there are more individuals near the midfield line than there are far out.

But how are they distributed, actually? What's the distribution now of positions on this field? So now I'm going to start the animation again, and on the right-hand side of the slide I'm going to show you the histogram of the distances from the midline. Here we go. On the right, this is a histogram: we're just summing up how many individuals are at each distance from the midline. So there are lots of individuals near the middle dashed line, the white dashed line, and fewer individuals at the ends. But the distribution is a very special one. This is, on average, approximately a Gaussian distribution, a normal distribution. And it arises simply from these little summed-up deviations or fluctuations, the coin tosses. Each individual coin toss is a random fluctuation, and your position on the field is the sum of those fluctuations.

So if you find yourself 10 steps to the left, that means the difference between heads and tails was 10, that there were 10 more heads than there were tails. That's how you ended up 10 steps in a particular direction away from the midfield line. And likewise for every position on this field: it's the difference between heads and tails that explains your position. But for some positions, like the middle position, there are many more ways to end up with a difference of zero than there are ways to end up with a difference of, say, 10. There are many more sequences of tosses that will get you close to the midfield line than there are sequences that will get you very far from it. And that's why more individuals end up staying close to the midfield line than end up migrating very far away from it. One way to think about this intuitively: to end up very far out, like some of the blue and red individuals far on the right or far on the left, you would need to get a long string of heads or a long string of tails. But if instead it's a fair coin and you're getting a mix of heads and tails, then you're going to remain relatively close to the middle. The Gaussian distribution simply arises from that summing process. The heads and tails cancel one another; the fluctuations tend to cancel over time. And there are many, many more ways for the fluctuations to cancel than not. That is where normal distributions come from, and they're very common in nature.

So we should ask the question, why normal? That is, why are there normal distributions? Why are they so normal? There are two ways to think about this, and they're both important to us in the work we do in this course. The first is the generative argument, which I just showed you with that animation. Normal distributions arise quite often in nature because they arise from summed fluctuations.
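If you want to run that simulation yourself, here's a minimal sketch (16 coin flips per person and 1000 people are illustrative choices, not the lecture's exact numbers):

```r
# each person's final position is the sum of 16 coin-flip steps of +/- 1
positions <- replicate(1000, sum(sample(c(-1, 1), size = 16, replace = TRUE)))
hist(positions)  # approximately Gaussian, just from summing fluctuations
```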
When you sum fluctuations, the fluctuations tend to cancel, and then you get symmetric variation around a mean. That Gaussian shape arises as a consequence.

The second argument is statistical, and it's equally important when we think in the statistical framework of doing estimation. If you want to estimate the mean and variance of a variable, that's your goal, and that is quite often the goal: we want to estimate the mean and also have that mean respond to changes, to interventions. The normal distribution is the least informative distribution, and what that means is that the only information in it is that there's a mean and a variance. It has the widest probability dispersion of all distributions with that same mean and variance. This is a kind of argument called maximum entropy, or maxent. We don't need to worry about it that much right now; I'll bring it up again much later in the course, when we talk about other kinds of models, generalized linear models, where there are also maximum entropy justifications.

The takeaway for now is that when we use the normal distribution in linear regressions, it's giving us the relative numbers of ways that each of the distances from the midline, if you will, from the mean, could arise given the assumptions, that is, that mean and that variance. So it's still the garden of forking data. It's still the globe tossing machinery. It's the same idea, but the relative counts are given to us by the normal distribution curve. An important upshot of the statistical argument is this: a variable's frequency distribution does not have to be normal for the normal model to be useful. Statistically speaking, a normal distribution is just a device, a golem, for estimating the mean and variance of a variable. And it'll work regardless of the frequency distribution of the actual variable. The variable can be uniformly distributed, and if your goal is to estimate its mean and variance, then a normal distribution will do it. You will learn the skills in this course to prove that to yourself; you don't have to take my word for it. And so the generative argument and the statistical argument are distinct, and we need to keep them distinct in our minds.

Okay, so let's think about doing some work now and learn some skills. Here are the goals for the two lectures this week that focus on linear regression. In this first lecture, all I really want to get done is to present to you the language that we use to represent statistical models, both for linear regressions and, actually, for all other statistical models. Once you learn this language, you'll be able to understand all the other models in the course. And not only that, this is a standard statistical language, a mathematical statistical notation, that is common in scientific journals and reports and the like. So you're learning a language that makes you a colleague in a large scholarly space. The second goal is to actually learn how to draw the owl, how to calculate bigger posterior distributions. Now there's going to be more than one parameter. The same procedures are going to work, but they're going to be more complicated, and we want to talk through how that functions.
And then finally, you want to see how to actually construct and understand linear models, or at least begin to do that. You should be patient with yourself, because these things always take some time.

Okay, before we get into the work, though, I want to talk a little bit about sort of the emotional attitude, if you will, or corporeal attitude, you should have towards this work. These kinds of technical topics have a lot of details in them, and complexity, and it all comes at you at once, because it's all there to get the work done; you can't omit part of the linear regression and still have it function. So you have to learn it all at once. And you're learning the coding and the concepts and the interpretation, all at the same time. And it's a bit much, actually. But it's not important that you understand it all on the first go. What's important is that you experience what I call flow, that you keep swimming. You just need to understand enough to keep moving forward, so you don't sink. And that is possible, and you shouldn't be stressed if you don't understand everything on the first pass; you can swim backwards, as it were, going back in the lecture and picking up the bits you missed. It's actually quite ordinary in these courses that students will become very competent at doing statistics before they fully understand statistics. I'll say that again. It's quite common that students will become very competent at doing statistics, for all of the purposes that they need professionally, before they fully understand statistics. And that's not a bad thing necessarily. That's good news, I think. So that's the flow. I'm here to help you achieve the flow. You can repeat the lectures multiple times and pick up additional bits of understanding. And the thing about understanding, I really believe, is that it's contextual. The details of your research, how you use statistical models, and the kinds of causal models you're interested in will influence how you understand these methods. And different understandings can be equally valid. So what's important is to focus on drawing the owl first, and the understanding will come in time. Don't panic.

Okay. Let's learn a language for modeling. To start, we're going to talk about a model that you know. It's the model from the previous lecture, the globe tossing model. This was a binomial Bayesian updating model in which we had one observed variable W, the count of water, and that was binomially distributed, as you learned last week. So I'm going to rewrite this model, and I have rewritten it in the center of this slide in standard math stats notation. This is just a re-expression of what you already know about that model, because you did the calculations. The first thing is we have our outcome variable W, and it's on the left of this first line. And then there's this little tilde, and the tilde in this context means "is distributed." So the outcome W is distributed, and then the right-hand side says how it is distributed: it's distributed as a binomial random variable. This is what I like to call the data distribution. In non-Bayesian statistics, you'll hear this called a likelihood function. The mathematical expression of it is the same as it would be in non-Bayesian statistics, but our interpretation of it is different. So it's probably better to get into the habit of not using the term likelihood, and just call it the distribution we assign to the observed variable W.
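For reference, the model on the slide reads:

    W ~ Binomial(N, p)
    p ~ Uniform(0, 1)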
And then inside the parentheses on the far right of this top line, we have N and p. N was the sample size, the number of times we toss the globe, and p is the parameter of interest, the proportion of water on the globe. N and p together tell you the shape of the binomial distribution in this case. Then, on the second line, we have another variable in this model, and that's p, and it's unobserved. It's the thing we're trying to estimate from the data; this is the parameter we estimate. So now p is distributed, there's another tilde, but now it's uniform. Why uniform? Because that was the prior we used in most of the examples last week. And its shape is given by the numbers zero and one, which just says in this case that it's a uniform distribution from zero to one. That's the prior you used last week, and the prior you're using in the homework for last week.

Let's stick with the globe tossing model just for a little bit longer; then we'll leave it behind and get to linear regression. First I want to show you how this mathematical language relates to what you learned last week about Bayesian updating. When you get these kinds of math stats statements of models, they imply a way to do the Bayesian updating, to draw the owl. The way to think about it is that you need to take the variables on the left and insert them back into these functions, right? These functions define probability expressions. So the binomial of W, and then that vertical bar, which is called a pipe, and you read it as "conditional on" or "given." So the binomial distribution of W, conditional on N and p. This is the definition of the probability of W conditional on N and p; it's just that it's binomial. And the same for the uniform: the probability of each individual value of p is given by the uniform distribution for p conditional on 0 and 1. Once we have those probability expressions, you learned last week that the posterior distribution, that is, the probability of each value of p conditional on the data W and N, is proportional to, and that's what that little fish-shaped symbol means, proportional to, the probability of the data times the prior probability of p. So in this case, it's proportional to the binomial probability of W times the uniform probability of p. No matter how complicated a model gets, this is always the case. This is the way that posterior distributions are calculated, by doing this multiplication of the different probability terms. And this is the code that you used to do exactly this, so you can see the correspondence.

Okay, none of that is technically new; it's just a little bit of new notation. Now let's do the really new stuff. We're going to walk through linear regression for the remainder of this lecture, through these five steps in drawing the owl, that is, this scientific workflow. What I want to do, using this workflow, is break it down into steps where each step justifies the next step, so that we don't feel like we're making anything up at any point. Well, in the beginning we have to make up everything, but the later steps are all implied by the early steps. And the data example I want to use comes from this book, pictured on the right, Life Histories of the Dobe !Kung by Nancy Howell.
Nancy Howell is an anthropological demographer who worked in the Kalahari for a couple of decades, I think, and collected really fantastic data sets on growth and nutrition and life history of these native Kalahari peoples, the Dobe !Kung. I have added parts of these data to the rethinking package, which you can load, and we're going to work with a data set called Howell1. There's a Howell2, which has a bunch of other variables in it, but we're going to stick with Howell1 and stay simple for the moment. And we're going to examine just two variables for now. In this lecture and the next lecture we'll look at some other variables, but for now we're just going to look at the relationship between the height of an individual, measured in centimeters, and the weight of an individual, measured in kilograms.

Now, if you're an anthropologist, especially a biological anthropologist, data like these are really interesting, actually, because there's a bunch of stuff about human growth and development that you can get from height and weight graphs. They vary a lot across populations, and events in individuals' lives can have big effects on their height and their weight. So these sorts of data are exciting to people like me. If they're not exciting to you, I want you to keep in mind that simple examples like this are really sufficient to bring out all the complexity of statistical modeling. And so it's nice to have variables that seem simple to you. If it seems simple, good; I don't want to talk you out of that. But this is a real modeling problem: for many fields, it's important to understand how humans grow.

So we're going to start with a basic question about these data, just to describe the association between weight and height. And what I want to convince you of, and I don't think it'll be hard, is that this is not such a simple problem, actually, because you have to think quite carefully just to describe these things. The first thing to notice is that we're going to be working with linear regressions, and it's clear from the scatter plot of individual height against weight, where each of the points on that graph is an individual, that this relationship is not linear. There's no straight line that's going to fit that, at least not on this scale. So to keep things simple for a moment, and we'll deal with all the data in the next lecture, for the remainder of this lecture let's only look at adults. For this population, actually for many human populations, if you only look at adults, the relationship between weight and height is fairly linear, at least the averages are. You can think of those points as being scattered around a straight line. So I'm going to show you how to fit a linear regression to this, and also how to think about these points as being generated causally by some linear-regression-like generative process, by which height influences weight.

So what do I mean by that? This is step two of the workflow: we think about a scientific model. How does height influence weight? I'm going to make a little DAG; remember DAGs, directed acyclic graphs, from the first lecture. H, and then an arrow pointing into W. So that's height, and the arrow pointing into W, weight, means height influences weight, and that influence is causal. What that means is that if we change an individual's height, it will change their weight. But if we change their weight, it won't influence their height. I'll say that again.
What the arrow means is it tells us the consequences of an intervention. If we intervene on an individual's height, it'll change their weight. But if we change their weight, it doesn't change their height. And that's a scientific claim, but I think you can imagine biologically why that's the case. What that little directed acyclic graph, with the two variables and the arrow, implies in terms of modeling is that we need to find some function. I've written it here in the second equation on this slide: W = f(H). That is, weight is some function of height, and we have to find that function. And we're going to use a linear function, to make it look like a linear regression in this case.

So when we make generative models of a biological process like height, we can make those models at different levels. And I want to do something very simple for now, so that I can keep as many of you on board as possible. The really fancy thing to do in this kind of literature is to have a dynamic model, where you have incremental growth of the organism over time. You actually simulate out from birth all the growth increments that the animal undergoes over time, and then both the animal's mass, which is what weight is supposed to measure under constant gravity, and its height, or its length, as we often say in the biological literature, are derived from that growth pattern. And in these models, the Gaussian variation around the mean at each height results from summed fluctuations, just like that simulation I showed you earlier on the black slides, the football field simulation.

We're going to do something a little simpler today, so that we don't have to simulate the whole life history of the organism. These are called static generative models, in which we simply have a function that relates changes in height to changes in weight. There's still causation in this model because, again, if you change weight in the simulation, it doesn't influence height; it's only changes in height that influence weight. But there's no mechanism in there about why it works that way. In contrast, in the dynamic generative model, there's a real mechanism in there about why things work in particular ways. In the static model, the Gaussian variation doesn't happen automatically; it's merely imposed. But it's justified by the same intuition: a growth history sums fluctuations over time, and you end up with a Gaussian pattern of variation. In the dynamic model, you don't have to assume that; you get the Gaussian pattern as an epiphenomenon, as a consequence of summing up growth over a number of years.

Okay, now let's actually meet a linear regression. I know many of you will have had a course in regression, and so you're prepared for this quite well. Those of you who haven't, that's fine. I'm going to teach you every piece of this; you don't need any preparation for this, really. I'm going to assume that you've never seen this stuff before and go forward. All I'm going to assume is that you graduated from high school and therefore you know the equation for a line. And this is the equation for a line, here on this slide. We have some variable y, and y sub i, this index i, is something I want to get you used to. The idea is that there are a bunch of particular instances of this variable y.
There are a bunch of different cases, and the index i is just which one; it takes integer values one, two, three, four, and so on. In the case of the height and weight data, the i's index individuals: i is the row number of the individual in the data table. So this value y is equal to some intercept, here called alpha, plus a slope, here called beta, times an x value. If we plot these y's and x's on a graph, like on the right-hand side of this slide, different choices of alpha and beta give you different lines. I've drawn three imaginary lines here in three different colors, which represent three different combinations of alpha and beta. Then for every given x value, you get a different y value as a consequence. And this is just review; this is all from secondary school. This is just the equation for a line.

What's different about a linear regression is that the equation for the line doesn't tell you where the observed values on the vertical axis, the y values, are. It tells you where the mean values are. I'll say that again: in a linear regression, the equation for the line doesn't tell you where the observed values are; it tells you where the expectation of the observed values will be. So now we have a linear model on the slide. y sub i is some vertical-axis variable, some outcome. In our case, it's going to be body weight, but I'm trying to keep this anonymous for the moment. We say that it's distributed, there's that tilde again, normally, as a normal distribution, with mean or expectation mu sub i. The mu is just a convention that we use, a Greek letter, but you can have anything there you like. And some standard deviation sigma. Then mu sub i defines the line: mu sub i is equal to alpha plus beta times x sub i. The lines on this slide are lines for the expectation, and there's scatter around those lines; the width of that scatter is determined by the standard deviation sigma. So you can read this as: each x value has a different expectation, and that's what a linear model is giving you, expectations. The expectation of y conditional on x is equal to mu in this case.

You get used to this pretty fast. If this is new to you, it's truly weird. If you think it's weird, that's just because you're paying attention. But this is like a geocentric model; that's why I talked about linear models and geocentric models together. You can describe many highly complicated relationships among variables with these little linear models. You really can. And I'll show you that, starting in this lecture and then really in the next lecture.

But let's do a little bit of scientific modeling with it, actually, too. You can simulate synthetic individuals with simple linear models like this. So here's a generative model of how height influences weight. Now I've replaced the y's and x's with W and H, so we can relate it to the height and weight data. Reading the model on this slide, what it says is: the weight of each individual is distributed as a normal distribution with mean mu sub i and standard deviation sigma. And that mean mu sub i is the sum of an intercept alpha plus a slope beta times the height of that same individual i. You can understand this a little better when we do it in code; that helps for most people.
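Here's a minimal sketch of that forward simulation (the next paragraph walks through each of these choices; the variable names are illustrative):

```r
# forward simulation of synthetic individuals (sketch)
alpha <- 0     # intercept: weight when height is zero
beta  <- 0.5   # slope: kilograms per centimeter
sigma <- 5     # standard deviation of the scatter around the expectation
n_individuals <- 100

H  <- runif(n_individuals, 130, 170)   # heights in centimeters
mu <- alpha + beta * H                 # expected weight for each height
W  <- rnorm(n_individuals, mu, sigma)  # sample individual weights

plot(W ~ H, lwd = 2, col = 2)          # our synthetic people
```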
If we're going to do a forward simulation in a generative model, we just pick the values of these parameters: alpha, beta, and sigma. In the R code at the bottom of this slide, I've chosen an alpha of zero, which means that an individual who has zero height also has zero weight. I think we can agree that that's a good assumption, right? I'll say that again: an alpha of zero implies that when an individual has a height of zero, their weight is also zero. Why does it have that implication? Look at the equation for mu, and you'll see that if you replaced H with zero, the only thing left would be alpha. Therefore, the definition of alpha in this model is an individual's weight when their height is zero. So if you have no height, you have no weight. The second parameter, beta, is the slope, and I've just picked 0.5, and then a sigma of five, and we're going to simulate a hundred individuals. In the next line of code, I simulate some heights from a uniform distribution between 130 and 170. Then the model implies a distribution of weights for those heights, and we simulate from that: the first line there calculates mu for each individual as merely the sum of alpha plus beta times their height, and then we sample individual weights using rnorm, which generates a random normal deviate for each individual, using their particular mu and the common sigma. Then we can plot this, and we have our synthetic people on the right. And yes, this looks a lot like the real data, and that's perhaps a hint that you can use a model like this to describe the real data and learn its distribution quite accurately.

Okay, we've already done a lot here, and I want to suggest now that we take a little pause, a traditional pause, like we'll do in many of the lectures. This is a good chance for you to get some coffee, or walk around the house, or go to the park, and come back and finish this later, but also to think about which elements of this were confusing, so you can go back and review before you continue forward. What we're going to do next is take that scientific version of the linear model and turn it into a statistical model. So go take your break, and I'll be here when you get back.

All right, welcome back. Let's continue on with drawing the owl, drawing the linear owl. We're at step three now. Just before the break, we had produced a simple scientific simulation of a linear relationship for how height influences weight. Now let's think about the statistical version of this. To make a linear model statistical, we need a little bit more. What do we need in addition? Well, we need priors. And in the statistical version, we don't know the values of the parameters alpha, beta, and sigma; we have to learn them. So there are going to be distributions for them, the posterior distributions of each of those. These parameters alpha, beta, and sigma are like the proportion of water from last week, but now all three of them are in play, and what they do is jointly define where the model thinks weights will be, given heights.
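For reference, here is the general model that's about to be described, priors included:

    y_i ~ Normal(mu_i, sigma)
    mu_i = alpha + beta * x_i
    alpha ~ Normal(0, 1)
    beta ~ Normal(0, 1)
    sigma ~ Uniform(0, 1)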
On this particular slide, we're not looking at weight and height; we're just looking again at our anonymous x and y, because I want to present the general structure to you. The first two lines of the model on the left are just like what they were before the break, in the scientific version, but now I've added priors for alpha, beta, and sigma. I've made some fairly arbitrary priors here: alpha and beta are both Normal(0, 1), and sigma is Uniform(0, 1). Sigma is uniform because it's got to be positive: it's a standard deviation, and there's no such thing as a negative standard deviation. This is something you'll get used to as we move through the course: there are parameters like sigma that we call scale parameters. I'll say that again: parameters like sigma are scale parameters. What their value does is stretch some distribution; it scales it. And scale parameters are always positive, above zero, so when you give them a prior, you need to use a distribution that is also constrained to be positive. In this case, I've used a Uniform(0, 1), which is not the best choice, and as we move on in the course we'll use something else, but I want to use something that you already saw last week. So I'm using a uniform for now.

Now, what is this animation that's been dancing distractingly on the right while I've been talking? These are lines sampled from this prior distribution. The three bottom lines of the model on the left are the priors, and they define a joint prior distribution, which defines a bunch of lines. So the prior distribution is full of lines. Let me take away the model and show you the prior distribution on the left, just for the intercept and slope. I'm hiding sigma, but all sigma does is create scatter around the line; the lines aren't defined by sigma. The lines are defined jointly by the intercept and the slope. A normally distributed prior for the intercept, with a mean of zero and a standard deviation of one, and a slope with the same distribution, will give you this Gaussian bowl, a bivariate normal distribution, and that's what you're looking down on on the left. Those rings are rings of decreasing plausibility as you move away from the center. So you could think about this as being kind of a bowl that these little points are rolling around in, and they tend to spend most of their time in the high plausibility region near the middle, but occasionally they get out towards the edge, if they've got enough velocity, as it were. We're sampling randomly from this space, and each point then represents a line: the red point is matched to the red line on the right, the cyan point to the cyan line, and the blue to the blue. What you want to think about is that, for a linear regression, the posterior distribution is filled with lines, because the intercepts and slopes go together as points. It's a combination of an intercept and a slope that implies a line, and then the line implies probabilities of seeing data in a particular part of the xy plane.

So how do you actually simulate these lines from the prior? This is an important thing to do, and we're going to do it a lot in the course: if you want to understand the implications of a prior distribution, you need to simulate from it. And we can do that, because we understand these statistical models generatively as well as statistically.
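Here's a minimal sketch of that prior simulation in code (the next paragraph walks through the same steps):

```r
# sample 10 lines from the Normal(0, 1) priors for intercept and slope (sketch)
n <- 10
alpha <- rnorm(n, 0, 1)   # 10 intercepts from the prior
beta  <- rnorm(n, 0, 1)   # 10 slopes from the prior

plot(NULL, xlim = c(-2, 2), ylim = c(-2, 2), xlab = "x", ylab = "y")
for (i in 1:n) abline(a = alpha[i], b = beta[i], lwd = 2, col = 2)
```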
In this case, how would you do it? Well, you just decide how many samples you want to take, in this case 10, and then you sample from the defined prior. We sample 10 values of alpha from a normal distribution with a mean of zero and standard deviation one, and 10 values of beta, same distribution, different values. And then we can plot those lines. In the R code to do this, there's this function abline, to which you give an intercept and a slope and it draws a line; in different scripting languages there will be equally convenient ways to do this. The result is shown on the right, and that was what you were seeing in the animation: these lines are all over the place. They just cover the whole space.

Bayesian updating, just in cartoon form for now, and we'll talk about the mechanics in a bit, works the same way in this business. What I'm showing you is that we progressively add random data points in this space and then update the prior distribution. You see it shrinks. That is, a smaller and smaller region of alpha and beta, the intercept and the slope, remains highly plausible, and it gets more and more concentrated in a particular region. So the little balls appear to lose energy, as it were. They're just as energetic; it's just that their energy is concentrated in a much smaller space. And this is the Bayesian golem learning where the plausible lines are. What I'm showing on the right there, that gray kind of bow tie, is just a visual guide to show you where the high density region of lines is. The posterior distribution doesn't have the bow tie in it; it just has lines. It's like the golem believes a certain set of lines are highly plausible, and the lines it considers highly plausible are in this bow tie, so you can get an idea. The dashed line is just to show you the sort of highest-expectation line in the set. But there are lots of other lines that are really good, too, and the golem has no prejudice about that. It loves all the lines.

Okay, that animation was just meant to give you some intuitive understanding of what's going on. What's really going on here? Let's work with the height and weight data and build up to actual calculation. The structure of the statistical model, as you've seen, is very similar to the generative model, but there are some additional kind of mechanical things we always need to do when we do statistics, relative to forward simulation. The first of them is that it's often really useful to rescale variables in linear regressions, because it makes the golem function better. You don't have to worry about this when you're simulating, because the simulation will always function. But estimation is harder. Think about this: the simulation is a forward task, where data is generated from a causal process that you specify; you're totally in control. Inference involves doing integrals in some way, so that you can get the posterior distributions out, and there's always residual uncertainty in this process. It turns out that there are things you can do in the computer, numerically, to make all of that machinery function better, and one of these things is to rescale variables. So we're going to do that; I'll explain it on the next slide. And then second, of course, we have to think about the priors. We always must think about priors.
And that means we're going to simulate from the prior distribution, to see if the prior distribution makes sense scientifically, given our assumptions about the generative process we're trying to learn about. These two things obviously go together in a very tight way.

Okay, let's think about the first of these, rescaling, just for a second. What we're going to do is rescale height, implicitly, inside the linear regression, so that the meaning of the intercept parameter alpha is the average value of weight when an individual is of average height. I'll say that again: we're going to rescale height so that the meaning of alpha, the intercept, is the expected weight of an individual when they are of average height. The reason it works this way, and you'll see what we've done in the equation on the right, is that instead of multiplying the slope beta just by h sub i, the individual's height, I'm multiplying it by the difference between the individual's height, h sub i, and the average height in the sample, that is, the mean value of the h i. That little line over the h is read as "bar," so h with a line over it, I'm going to say h bar. What this means is that the difference is zero when an individual's height is h bar. So when h sub i is equal to h bar, the slope term vanishes, because it's equal to zero, and then mu sub i equals alpha. And so alpha, as a parameter, means an individual's expected weight when they're at the average height. This turns out to have a lot of numerical advantages in running linear regressions, and it also links into the next thing we're going to do, which is to pick priors.

When the meaning of alpha is the average adult weight, we can get a really good, scientifically informed prior by looking up information that's independent of this sample, that is, other regional samples of human growth, and just find the average weight, and then define some generously spaced prior around that. You can look this up on Wikipedia, for example. In Africa, and Africa's a huge place with a tremendous amount of cultural and biological diversity, the average weight is about 60 kilograms. So I'm going to choose a normally distributed prior for alpha, centered on 60 kilograms, with a really big standard deviation of 10, just to emphasize that this is not a very informative prior. We're putting in some prior knowledge, but the estimate will be allowed to move quite generously.

Now the slope. The thing about slopes in linear regressions is that they have units; they have meaning that comes from their function. In this case, the slope is kilograms per centimeter: whatever the value of beta is, it's how many additional kilograms heavier an individual gets, on average, per centimeter of growth in height. I'm going to start with that prior centered on zero, with a standard deviation of 10. This is a quite conventional kind of slope prior: we're being neutral, we're not saying whether the relationship is positive or negative, and 10 is a very wide dispersion. And then for sigma, a uniform distribution between zero and 10. Keep in mind that a standard deviation of 10 is a variance of 100, so that's a huge spread in weight for any given height. But the model is going to learn sigma, and it's going to learn alpha, and it's going to learn beta. These are just priors. Okay.
What do these priors mean? Let's sample from them. Here's some code to sample regression lines from this prior distribution. I think you'll understand all this code. The magic part is the bottom, where for each of the n, in this case 10, lines we've sampled from the prior distribution (they're not individuals, they're lines), the lines function in R just plots a line. You'll see alpha i plus beta i times a sequence of heights minus the mean height, and that is the equation for mu in the linear regression; you just write it in as code. So if you know the model, you can always simulate from the prior this way, just by using the piece of the model that interests you. In this case, it's the equation for the line.

So what do those lines look like? Well, they look terrible. These are not human lines relating human height to human weight; maybe on some other planet, for some other organism. The obvious problem is that a lot of them go in the wrong direction, because we made the prior for beta symmetric around zero, and that doesn't make any sense at all. So we can fix that; we can use a different distribution, one that's constrained to be positive, because, on average, the relationship between height and weight is positive in a population. Let's use a log normal, not a normal, but a log normal. If you're not familiar with the log normal, that's fine. A log normal distribution is just a distribution such that, if you took the logarithm of it, it would be normal. I'll say that again: a log normal distribution is a distribution such that, if you took the logarithm of all of the values in it, the distribution would be normal. Another way to think about it, in the other direction: if you take a normal distribution and you exponentiate each value, that is, you raise e, Euler's number, to that value, then you'll get this distribution. That's what a log normal is.

To understand log normal distributions, you really need to simulate from them, and that's what I've done here. I'm assigning a log normal with parameters zero and one; that zero and one refer to the normal distribution you'd get if you logged it. I know it's confusing, but we're stuck with the conventions of statistics, and I have to teach you the conventions. But if we simulate from this, we can always understand it. In the chapter in the book, I show you the code to do the simulation; there's a function rlnorm, which samples from log normal distributions. You can see that it's constrained to be positive: all simulated values are above zero. Most of them are quite small, so the expected slope is quite small here; most of the time it's going to be less than one. But really big slopes are possible, just vanishingly rare. Log normals often have very long right tails like this.

Okay, let's redo the sampling from the prior. And now, you see, this looks much better. There are still lots of relationships in the prior that say height isn't strongly related to weight; you see those quite flat lines in there. And then there are some really strong ones, which are probably too strong to be biologically plausible. But I don't want the prior to determine the relationship. I just want it to put some biologically realistic constraints on what's going on, and not allow physically impossible relationships.
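Putting the pieces together, here's a minimal sketch of that prior predictive simulation (the value of Hbar here is an illustrative stand-in for the sample mean height):

```r
# prior predictive simulation with the log normal slope (sketch)
n <- 10
alpha <- rnorm(n, 60, 10)   # average adult weight, from knowledge outside the sample
beta  <- rlnorm(n, 0, 1)    # positive slopes only
Hbar  <- 150                # illustrative; use mean(d$height) with the real data

plot(NULL, xlim = c(130, 170), ylim = c(10, 100),
     xlab = "height (cm)", ylab = "weight (kg)")
Hseq <- seq(from = 130, to = 170, length.out = 50)
for (i in 1:n)
    lines(Hseq, alpha[i] + beta[i] * (Hseq - Hbar), lwd = 2, col = 2)
```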
Okay, a little bit of a sermon on priors now. And this is one of those things about understanding that comes with time. What you don't want to say is that there's a correct prior. Like, what's the right prior? There is no right prior, because the prior is just the state of information before the data arrive. And priors can have lots of different scientific functions, so you need a scientific argument to justify them, or some statistical argument, because they also have statistical functions, and we'll see several as we move through the course. What's important is that you justify the prior with information outside the data. That's what it is: you always justify the prior with information outside the data. In this case, I justified this prior being constrained to be positive because I understand that humans, as they get taller, get heavier on average, and we don't expect there to be any negative relationship here.

It's important to realize that the same is true of every other aspect of the model: you justify the structure of the functions that relate the variables to one another with information other than the sample itself. I'll say that again. You must justify the structure of the model, that is, how it structurally relates the variables to one another, with information other than the sample itself. Using the sample itself to decide the structure of the model is cheating, and it's going to lead to massive false positives, and all kinds of bad things can happen to you. Don't let that happen. You need to use your scientific expertise here. The statistician can't help you, right? General statistical principles will not save you here.

Okay. In these models, simple linear regressions, the priors hardly matter at all, to be perfectly honest. They're not so important, and you could use the world's worst priors and get essentially the same posterior distribution with even a modest sample size; you don't need a lot of data, even. But you need to practice now, because this isn't always true. There are lots of models used routinely in the sciences, like multilevel models, or hierarchical models, where the priors do matter more, and you need to have some understanding of them. It's not just that the priors matter more; they also relate to the data in different ways. When we get to multilevel models, you're going to actually learn the priors from the data itself. I know that sounds weird, because it violates everything we've said about priors, but it will be true: the distinction between priors and data is going to blur a bit in more complicated models. And in machine learning models, like neural networks, the distinction blurs radically. So we practice thinking about priors and their implications now, because that practice will give you a path to flow forward, like the koi in the pond, into more complicated models as you need to.

Okay, let's get back to some real work here. Let's fit this model. How do we do it? I've put the math stats notation up again on the screen, and in the right-hand column here, I've written down what each of these lines implies in terms of probability. Remember, probabilities in this business are the relative numbers of ways that each value could happen. So the top line, the probability of W sub i conditional on the mean mu sub i and sigma: the normal distribution gives us the relative number of ways each weight value could be realized, given a particular mean and standard deviation.
Then there's the equation for mu, which is not a probability statement in and of itself; it's just embedded in the first line. And then the three priors for alpha, beta, and sigma: these are the probabilities that give you the relative plausibility of each value of alpha, of each value of beta, and of each value of sigma, before we see the data. We update them, and then we get a posterior, which is defined as the probability of alpha and beta and sigma, conditional on the data, W and H. And to say something about this: it's all three at once, because there's only one posterior distribution in the model. You don't have different posterior distributions for alpha, beta, and sigma, although sometimes we'll talk like that. What you're really learning is what's called the joint distribution of all of the unknown variables in the model, the joint distribution of all the parameters. And so it's a posterior distribution simultaneously of alpha, beta, and sigma, which means we consider all combinations of alpha, beta, and sigma, and that's where we do our calculations.

All right, so how do we actually do this? We need to consider all the values of alpha, beta, and sigma simultaneously, and that means all their combinations. To think about that, suppose we consider 100 distinct values of alpha, 100 distinct values of beta, and 100 distinct values of sigma. To compute the posterior distribution through grid approximation, that method that I explained last week, we have to consider all the combinations of those values, 100 times 100 times 100, and that implies one million different calculations. What is each calculation? It's the calculation on the slide. You multiply the normally distributed probability of W, given a particular H (which is inside mu), a value of sigma, a value of alpha, and a value of beta, by the prior probabilities of those parameter values, a product of four different probabilities now. You do that a million times, and you'll have a posterior distribution. If you see page 85 in the book, I give you code to do this. It's just looping through all those possibilities, all the different combinations of alpha, beta, and sigma. And you can compute, in principle, any posterior distribution this way.

What does this look like in cartoon form? Let's do the linear regression animation again, but this time let's use the first 30 adults from the Howell1 data set. I want you to see, once I start this animation, that we go from this large prior, which is shown on the left, with three high plausibility lines sampled there in the middle, the intercept and slope. We're going to start this moving and add in each data point one at a time, so you can see the updating. And you won't be surprised at all by what happens. Well, maybe you will. As each point is added, the plausible ranges of intercepts and slopes change and contract. It's already gotten so small you can't see it, so let's zoom in, so you can see it again. At each point, the high plausibility region of joint combinations of intercept and slope, of alpha and beta, shifts around. It's their joint distribution that matters. And you'll see that they're not independent. That is, there's a tilt to the posterior distribution, so that when the intercept is large, the slope is likely to be large, and when the intercept is smaller, the slope is likely to be smaller. That is, it's tilted. As we add more and more points, it contracts more and more, and its tilt changes, and we keep going.
Now we're up to 20 data points. Not much action is happening now, because linear regressions only have three parameters, and so once you get a reasonable sample size, assuming the sample is distributed in a reasonable way, you very rapidly get a high-quality posterior estimate of the parameters. This will change as we make the models more complicated; obviously we'll need more data. But there we are after 30 points. There are many, many more points in this data set, and we're gonna use them all. But I hope that gives you an impression, in an animated sense, of what happens when you do all these calculations. Superimposing the grid now to help you understand: imagine dividing up this intercept-and-slope space into a big grid. This is a very coarse grid; you could make it finer, but this is just for the sake of illustration. Each cell in this grid implies a combination of a value of alpha and a value of beta, and what the updating does is calculate the plausibility of the data, that is the observed weights, given the alpha and beta values at that point. And then the posterior distribution is just the whole collection of those plausibilities. At this particular point, after 30 data points, most of this space looks empty, because most of the space is highly implausible. Most of the values are quite unlikely, and the likely values are concentrated where the rings are and where the points are bouncing around. Okay, in practice we're not gonna fit these models with grid approximation. We could; they're not that complicated yet, and it doesn't take that long. But what we're gonna use instead, throughout the first half of this course, is an approximation of the posterior distribution which is often very high quality: a Gaussian approximation of the posterior distribution. It turns out that under a quite general set of circumstances, posterior distributions are approximately Gaussian in shape. Individual parameters are approximately Gaussian, and jointly the parameters are multivariate Gaussian. So instead of the grid approximation, which achieves its approximation by taking a continuous space and cutting it up into discrete pieces, we're gonna use a Gaussian approximation, which is continuous like the real parameters but imposes the Gaussian shape, and that's often a very high-quality approximation. This is sometimes called a quadratic approximation, and that's what I call it in this course, and there's a function in the rethinking package called quap, which just means quadratic approximation. But you'll also see this called the Laplace approximation, because one of the greatest mathematicians ever to have lived, Laplace, pictured there on a French postage stamp, used it. Laplace arguably did more than anybody else to make Bayesian probability into an inferential science. So Bayesian probability should probably be called Laplacian probability, but it's far too late to rename things now. If you're interested in the mechanics of the quadratic approximation, how it works, you should see page 41 in the book. I'm not gonna spend time in this lecture on it. Okay, let's continue on with our workflow. We're getting towards the end here of drawing the owl, the linear regression owl. We're gonna do steps four and five with the same code, and so I've highlighted them both at the same time. They rely upon the same code, that is, the statistical model code from step three.
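(Before moving on, an aside for the curious about those page-41 mechanics: a quadratic approximation amounts to finding the posterior mode by optimization and then using the curvature there, the Hessian, to define a Gaussian. A rough sketch under the same assumed priors as before, with W, H, and Hbar as the observed data; this is my illustration, not the book's code:)

```r
# Negative log posterior for (alpha, beta, sigma)
neg_log_post <- function(par) {
  a <- par[1]; b <- par[2]; sigma <- par[3]
  if (sigma <= 0 || sigma > 10 || b <= 0) return(Inf)  # prior support
  -(sum(dnorm(W, a + b * (H - Hbar), sigma, log = TRUE)) +
      dnorm(a, 60, 10, log = TRUE) +
      dlnorm(b, 0, 1, log = TRUE))
}
# Optimize to find the mode; hessian = TRUE returns the curvature there
fit <- optim(c(60, 0.5, 5), neg_log_post, hessian = TRUE)
fit$par               # posterior mode = mean of the Gaussian approximation
solve(fit$hessian)    # inverse Hessian = covariance of the approximation
```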
Step four is to validate the model. What I mean by that is we'd like to take data from the scientific model, step two, from the simulation, and put it into the statistical model, to verify that the little golem works, that we've done both pieces of this right. In the scientific model, we know the data-generating process, because we picked the parameter values that generated the data, and so if the statistical model can recover those, within reasonable accuracy (you're never gonna get exactly the same values back), then that's reassuring that the machine is working. That's what I call the bare minimum of simulation-based validation: you test whether the statistical model, having been fed simulated observations, can capture the simulated process that you designed. We need to do this because it's pretty easy to make broken golems. I still do it all the time. Just a moment of inattention and a typo, and things go wrong. So it's worth checking. Linear golems are pretty simple, so they behave in very predictable ways, but there are more complicated kinds of models which may not function as you expect, and we'll talk about that when we get into those domains later in the course. I don't want you to think that everything acts like linear regression, because it doesn't. The really strong test using simulation-based validation is something called simulation-based calibration. I'm not gonna show you an example of that just yet, but if you're interested, that's the phrase you want to Google, and you'll find the literature on how to do it. Simulation-based calibration is how you verify that numerical algorithms are working as expected in the Bayesian context. Okay, so let's actually do this now. Let's get the quadratic approximate posterior. If you're using my rethinking package to go through the course in R, you're going to get used to writing code formulas which look a lot like the math stats formulas. So what I'm showing you here, on the left, is the code equivalent of what's on the right-hand part of the slide, which is our linear regression of weight on height, of the relationship between weight and height, and you'll see that the top line in the code corresponds to the top line in the formula, and so on; all the lines correspond. dnorm is the normal density in R. It's just the function that returns the probability of W for any given mean and standard deviation sigma. The other difference to note is that on the second line, for mu, there's not an equals sign but a left arrow, and this is just a convention in R for assignment, because equals means something different in this context, unfortunately. So use the arrow, don't use equals. And then the priors, I hope, are self-explanatory. Notice that for beta, written as b in the code, the prior is dlnorm, which is log-normal. Okay. And then this is all the code you need. First we're going to do validation. So at the top part of the code on this slide is a simulation, the same kind of simulation code we had before, just compressed. I pick an intercept, in this case 70, which means that when an individual has average height, the expected weight is 70 kilograms, a slope of 0.5, and a standard deviation of 5 for the scatter around that expectation. We simulate 100 individuals, and they're shown on the right-hand part of this slide. And then on the bottom part of the slide, we run quap, which stands for quadratic approximation, and all you need to give quap is the formula list and the data.
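(A sketch of that slide, as I'd reconstruct it; the generating values 70, 0.5, and 5 are from the lecture, but the priors and height range are my illustrative assumptions. Requires the rethinking package.)

```r
library(rethinking)

# Simulate 100 individuals from a known generative process
set.seed(93)
n <- 100
H <- runif(n, 130, 170)                   # heights in cm (assumed range)
Hbar <- mean(H)
W <- rnorm(n, 70 + 0.5 * (H - Hbar), 5)   # alpha = 70, beta = 0.5, sigma = 5

# Fit the statistical model to the simulated data with quap
dat <- list(W = W, H = H, Hbar = Hbar)
m_val <- quap(
  alist(
    W ~ dnorm(mu, sigma),
    mu <- a + b * (H - Hbar),
    a ~ dnorm(60, 10),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 10)
  ),
  data = dat
)
precis(m_val)   # should recover roughly 70, 0.5, and 5
```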
So I've created the object dat, which is just a named list of the variables we need. In this case, these are H for height and W for weight, and then Hbar, the mean height. And we give quap those two things, the formula and the data, and it uses the formula to construct the probability expression exactly as I showed you on previous slides. It knows that each line is a probability statement, and so it can construct the product which the posterior distribution is proportional to, and from that it can construct the quadratic approximation of the posterior. And it usually proceeds very fast. For models run with quap, there's a summary function in rethinking called precis, which just gives you a summary of the model, one of those little coefficient tables that you've seen quite often in statistics. I want to discourage interpreting these precis tables too much, though. Of course, it's fine to look at them, and they give you an idea of the precision with which you've estimated things. In this case, we just want to check that the statistical model is working. We capture alpha with reasonable precision. We get very close to it, because we have 100 individuals, a lot of data, and the same for the slope of 0.5 and sigma of 5. We get very close to these things, and if you look at the intervals, these are 89% compatibility intervals, you'll see that we're covering the true values. To really test that the statistical model works, you'd want to do this a bunch of times for different generating values, and that would be simulation-based calibration. Well, the generating values would be drawn from the prior, actually, but that's not important right now. Okay, now with the real data, the only thing that changes is the data object: the data list you feed in holds the actual variables. So at the top of this slide, I've taken the Howell1 data set and restricted it to individuals who were 18 or older, an arbitrary definition of adulthood, a very culturally specific one, but we'll work with it for now. Then I construct the data list, and the quap code is exactly the same. It's exactly the same. Okay, so I want to introduce something I call the first law of statistical interpretation, and I want to encourage you to obey this law. Like all laws, there are times when you will break it, but it's good guidance, I think. The first law of statistical interpretation is that the parameters are not independent of one another, and that fact means that they carry information about one another. They co-vary in the posterior distribution. High values of the intercept will often be associated with high values of the slope, and vice versa. And what this means is that when they're not independent, you can't interpret them independently. It's also true that they're not independent in a second sense: their effects on prediction are not independent of one another. They all act simultaneously on predictions. You have to pick values for all the parameters in order to get a prediction for somebody's weight in this model, and so just knowing the precision of one parameter doesn't tell you how the model behaves. Now of course, if you get really familiar with a certain kind of statistical model, you'll be able to just read the parameter values to understand what the predictions will look like, but that takes time, and by no means is it even necessary. So what I want to teach you to do instead is something that works for all kinds of models and doesn't require you to be clever.
My general philosophy in life is to never rely on being clever. Instead, I want to rely on being thorough and having a justifiable workflow. So what we're going to do is push posterior predictions out of the model. We did posterior predictions at the end of the previous lecture, in that animation where we were sampling proportions of water and then simulating globe tosses with them. We're going to do the analogous thing now for linear regression. So let me show you how that works. Okay, we're going to plot something called the posterior predictive distribution for the estimates we've gotten here. Step one: we simply plot the data, that is, the sample, and that's what I've done on the right. That's just the scatter plot of adult height and weight in this sample. Step two: we plot the posterior mean. I'm not showing you the code to do this yet; I'll do that next. It's also in the book. The posterior mean: if you take the posterior distribution of the intercept and the slope and take its mean, there's an average line that sits at the center of mass of the posterior distribution. You could say this is the most plausible line. There are lots of other lines that are almost exactly as plausible, but this is the most plausible one. It's not special, though. This is not our answer. Again, the posterior distribution is the answer, and we're going to put the whole posterior distribution up here. Step three: we plot some uncertainty around the mean, and this thin, bow-tie-shaped envelope of uncertainty is what I've plotted in pink. This is the high-probability compatibility region around the posterior mean. It'll give you an idea: in this case, the model is very confident that if you're going to use a line to describe the expected relationship between height and weight, it's in this narrow region, though there's some uncertainty at the ends, as you see. And then step four: there's still additional uncertainty, which is the scatter around the mean. The model has also learned about that, through sigma, and so you can incorporate sigma and produce another uncertainty region, a prediction region, and I've plotted that here in gray. So what are we doing with these predictive distributions? Well, they do two things. First, they give us a way to generate predictions, an envelope of predictions, out of the model. Second, they let us see whether the model has done a good job of learning about the data. You wanna see that there's some sensible relationship between the raw data and the fit model, and that's a second check on whether the golem has worked or not. Now, the correct expectation is not that the model exactly spits out the data. That's something we definitely don't want, and we'll talk more about that in a couple of weeks, but there should be some general correspondence if the machine has worked well. Sometimes it won't, and this is a way to check that. Exactly the same procedure is how we generate predictions as well, and later in the course we'll use it to generate counterfactual causal predictions and other kinds of things. All right, I'm not gonna spend a lot of time walking you through this code. If you wanna spend some quality time with the code that generates this, you should see section 4.4.3 in chapter 4, starting on page 98.
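(For those following along in code, here's a condensed sketch of that section-4.4.3 workflow, which the walkthrough below refers to. The Howell1 data ship with the rethinking package; the priors and object names like m_adults are the same illustrative assumptions as before.)

```r
library(rethinking)
data(Howell1)
d <- Howell1[Howell1$age >= 18, ]   # restrict to adults
dat <- list(W = d$weight, H = d$height, Hbar = mean(d$height))
m_adults <- quap(
  alist(
    W ~ dnorm(mu, sigma),
    mu <- a + b * (H - Hbar),
    a ~ dnorm(60, 10),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 10)
  ),
  data = dat
)

# Step one: plot the raw sample
plot(d$height, d$weight, col = 2, xlab = "height (cm)", ylab = "weight (kg)")

# Steps two and three: posterior distribution of the expectation mu
xseq <- seq(from = 130, to = 190, by = 1)
mu <- link(m_adults, data = list(H = xseq, Hbar = dat$Hbar))
lines(xseq, apply(mu, 2, mean), lwd = 3)       # posterior mean line
shade(apply(mu, 2, PI, prob = 0.99), xseq)     # 99% interval around the mean

# Step four: simulated weights, the expectation plus the sigma scatter
W_sim <- sim(m_adults, data = list(H = xseq, Hbar = dat$Hbar))
shade(apply(W_sim, 2, PI, prob = 0.89), xseq)  # prediction interval
```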
Just very quickly: the first block of code here, to plot the data, is nothing but a scatter plot. Now, to plot the expected line, I'm using this function called link. It's on the second line in the middle block here, and link takes your fit model, finds all the linear equations in it, things like mu, and computes their values for a series of numbers that you feed in, using the whole posterior distribution. So what it returns is a posterior distribution of expected values, and that's what the symbol mu holds here. You should work through section 4.4.3 in the book, work with this code, and see what each of the objects holds at each step, so that you understand what's going on. At every step here, no summary is being done; we're working with the entire posterior distribution until the last line. In this case, it's only when we plot the line that we take a mean. You'll see that in the line that starts with lines on the slide: lines(xseq, apply(mu, 2, mean)). That apply thing in R is a way of applying a function to a vector or a matrix; mu is a matrix in this case, and we're taking the mean of the columns in it, which is what the 2 means. Then the shading, the pink shading: similarly, we're computing a percentile interval, which is something you learned in chapter 3, and that's what the PI function is. This is a 99 percent percentile interval, so that pink region really holds almost all of the posterior distribution; the lines are concentrated in that narrow region. And then finally we put up the prediction interval, using a function called sim. sim simulates weights; it simulates observations. It doesn't just tell you the expectation, that is mu, but spits out synthetic data. It simulates from the statistical model as if it were a generative model, and again, if you read through section 4.4.3, there's a lot more to say about this. If you're using a different scripting language, like Python or Julia, the mechanics will be different, but the job is the same. Okay, let's take a brief one-slide detour, because I want to make a closing point in this lecture. This is a thermometer, a fairly antique one, but it still works; an old-fashioned mercury thermometer, I believe. Thermometers are funny things, because like lots of scientific instruments, they start from some very innocent and quite simple question, like how hot or cold is it, but in order to get a good answer to that question, you have to deal with all kinds of side nonsense. So thermometers are actually quite complicated, and it took people a long time to make good ones. Now we have lots of different ways to make good thermometers, and we take it for granted, but it was a long procedure of coming up with the scales, and then understanding the viscosity of what's inside the glass tube, and the glass tube itself, and all of these things, to get them to function correctly. Statistics is, both fortunately and unfortunately, like this in a sense, and these linear regressions are like flexible linear thermometers. They're very powerful ways of measuring associations between variables, and you'll start to appreciate some of that power as we move forward; in the next lecture I'll show you more of it. But we had a quite innocent question, right? How does height influence weight? Weight is some function of height. And to get an answer to this question, which I think we have gotten in this lecture in a general descriptive sense, that is, we can
make predictions with the model we've gotten about how someone's weight will relate to their height within this population, we had to deal with all kinds of annoying little machinery: these things called parameters, and particular mathematical functions, and picking some priors that we then had to simulate to understand, and so on. There's just no way out of that, I'm afraid. But just as thermometers now seem simple and trivial, once you work with these sorts of things for a little while, they'll also seem simple and trivial. It becomes motor memory for you, and you'll be able to read these statistical models, like the one on the right-hand part of this slide, very quickly, just by skimming them, because you'll start chunking them, because you'll see them as statements about the relationships between things. They describe how the thermometer works, and you design and modify little bits of the thermometer to help you answer your question, where your question is typically going to be framed as some kind of scientific model that specifies relationships, usually causal relationships, between variables. We'll move forward dealing with these thermometers as we go. Okay, next lecture: we're still in week two of the material, so the next lecture will also be about linear models, and we'll inch forward a little bit more in thinking about causal relationships and how these linear models can answer questions about them. But we're also going to look at how linear models can actually express highly nonlinear relationships among things. So I will see you there.