Okay, before I resume with chapter 4, I want to remind you, or say it if it's not a reminder: I'm very conscious that part of the struggle with this material is that we have to get into the weeds of processing numbers. We're going to do a lot of that, increasingly, and today there's going to be a lot of shuffling of matrices of fake data, which are, you know, samples from the posterior distribution. It's easy to get lost in the algorithm, and there's no avoiding the algorithm. It's just a necessary part of the life of science. It's the rich middle part of cooking the cake. But at the same time, I consider it my responsibility to try to pull all of you up from those waters and remember that this has a purpose, and the purpose is to process inferences and understand them. I'm teaching you all of these painful algorithms because they empower you to understand the output of the models. That's the goal. So there's an upfront cost right now of learning this probably totally unfamiliar way of doing statistics, but I'm teaching it to you because it is completely general. You can use it for any kind of model with any kind of data, and it empowers you to interpret what's going on and, hopefully, to build better models as well.

So with that, let me try to deliver on that fairly lofty promise. Linear models: everybody's favorite thing, right? Recall we're doing geocentric statistics. All of these models are wrong, but we hope they predict where Mars is nevertheless; that's the only goal. They're not even wrong, right? They're golems. They don't represent the actual physical thing; they're meant to describe relationships among variables, and those relationships then cast light on the external hypotheses that inspired the modeling exercise.

When we ended on Wednesday, we had just finished fitting our first linear regression. Yay. I know all of you have fit linear regressions before, but hopefully it's new and fresh now. We had put what I'm calling the maximum a posteriori (MAP) line in the prediction space, the scatterplot between height and weight. This line represents the peak of the Gaussian hill in the posterior distribution. Just the peak. It's just one point, and it's of relevance because it's in the middle, so it's interesting. But the whole distribution is of interest, and we need the whole distribution somehow on this graph, so that we can understand how uncertain our inferences are. That's our goal today: to spend all of today basically putting uncertainty on graphs. This is the thing your stats programs do in automated fashion for you, but I want you to understand it, know where those shapes come from, and then be empowered to do it for any kind of model, not just the ones pre-programmed into your statistical software.

The output from precis, to remind you, gives you a first column called the mean. This is the peak of a multi-dimensional Gaussian. In this case it's three-dimensional, so it's a snowball, remember, and those three numbers define the center of the snowball. The standard deviations define how wide it is in each dimension. There's also other stuff not shown in this output, namely the correlations between the different dimensions. I visualize this for you in the book; apologies for not putting it on the slide. So now we want to get all that uncertainty onto this graph, not just those two numbers, alpha and beta, which draw that line there. How do we do that?
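Here's a minimal sketch of where we left off, assuming the rethinking package and its Howell1 data; the model name m4.3 follows the book's label, and the exact priors here are only approximately those used in the text:

```r
library(rethinking)

# Load Nancy Howell's data and keep the adults
data(Howell1)
d <- Howell1
d2 <- d[d$age >= 18, ]

# Fit the linear regression with the quadratic approximation (map)
m4.3 <- map(
    alist(
        height ~ dnorm(mu, sigma),   # Gaussian likelihood
        mu <- a + b * weight,        # linear model of the mean
        a ~ dnorm(156, 100),
        b ~ dnorm(0, 10),
        sigma ~ dunif(0, 50)
    ),
    data = d2
)

# Peak and widths of the multivariate Gaussian posterior
precis(m4.3)
```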
So in the case of a simple linear regression, there are analytical formulas for doing this, because it's a super simple model and it's possible to do the math. Generally there are not: for lots of very interesting statistical models, essentially all GLMs, there are no simple formulas. We'll get to GLMs later, if that's not familiar. So I want to teach you a general method that works for everything that will come later.

Here's the algorithm. First, you fit the model. You don't have to use map to do it, but you fit the model and get some representation of the posterior distribution. We're doing it with map and the quadratic approximation right now; later we'll do it with Markov chains. Second, you sample from the multivariate normal distribution of parameters. When we do it with Markov chains, you won't have to do that step, because samples are all you'll have, and they won't even necessarily be multivariate normal; they'll be multivariate anything. Third, you use these samples. You process them like data, piece by piece, each set of samples. You push them back through the model and force the model to make predictions, assuming those were the parameter values. The collection of predictions is then the posterior distribution of predictions from the model. It tells you what the golem thinks. Does that make enough sense for me to advance the slide? There are examples coming; I want to get the recipe across first.

You saw this at the end of the day on Wednesday. In the rethinking package I've made a function called extract.samples, which does everything on the previous slide for you. You pass it a map model and it samples from the multivariate snowball. It returns an object you can name whatever you want. I encourage you to call it post or posterior just for sanity, but call it pickle if you like. It's a data frame, except these aren't data; they're samples from the posterior. They represent the distribution, and each row has the right correlation structure for that snowball. That's very important. Now we're going to process each row. There are 10,000 rows. I'm only showing you the first five, but there are 10,000 rows in this data table. We're going to process them all, because each implies a prediction. Then we get a distribution of predictions, and we can look at that distribution and learn things about what the model expects.

I encourage you to think about this posterior distribution as being full of lines. Hopefully most of you have seen 2001: A Space Odyssey. Heard of it? Yeah, right: full of stars. This is full of lines. How many of them? An infinite number, just like there are stars in the sky. There are an infinite number of lines in the posterior distribution, but some of them are more plausible than others. When we sample from the posterior distribution, we're sampling lines. We can draw these lines as a way of representing the uncertainty around the MAP line. How confident is the golem about the relationship between height and weight? I'm showing you here what happens when we sample lines. This is just 10 data points: I refit the model with only 10 individuals sampled from the Howell data, train the golem on just those 10, sample, I think it's 20 or 30, lines from the posterior distribution, and plot them up.
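A minimal sketch of that line-sampling exercise, assuming the adult data frame d2 from above; N, dN, and mN are just illustrative names, and col.alpha is a transparency helper in rethinking:

```r
# Refit the model to just the first 10 adults
N <- 10
dN <- d2[1:N, ]
mN <- map(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b * weight,
        a ~ dnorm(156, 100),
        b ~ dnorm(0, 10),
        sigma ~ dunif(0, 50)
    ),
    data = dN
)

# Sample 20 lines from the posterior and plot them over the data
post <- extract.samples(mN, n = 20)
plot(dN$weight, dN$height,
     xlim = range(d2$weight), ylim = range(d2$height),
     xlab = "weight", ylab = "height")
for (i in 1:20)
    abline(a = post$a[i], b = post$b[i], col = col.alpha("black", 0.3))
```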
So: I take 30 samples from the posterior distribution fit to those 10 individuals, and plot the lines implied by the a and b in each sample. Does that make sense? I'm ignoring sigma for now; it'll come later. Oh yes, it will. Let's stay simple for a second. What I want you to get from this is that with only 10 individuals, there's a lot of uncertainty about what the relationship could be. You can visualize this very readily by plotting lines from the posterior distribution, because it's full of lines. They're all plausible to some extent. The center is fine, but there's so much scatter here that you shouldn't bet your house on the MAP line. Make sense? Are you with me?

Let's do some other examples. Let's take 50 individuals, adding 40 more from Nancy Howell's data, and train the model on those instead. Now I do the same exercise: we pull, I forget, I think it's 30 or 50 lines. It says in the text what to do; all the code for this is in the text. You see the scatter has shrunk, because now there's so much more information about the relationship between height and weight. What does "the relationship" mean? The golem is saying: if we assume the relationship is linear, so that every additional kilogram of weight is associated with some fixed and constant additional centimeter increase in height, then these are the plausible lines. The golem can't tell you that it's a line. That's something you put in. That's the large-world thing. You decide: okay, a line; we're going to consider all the lines. And the golem says: okay, I can do all the lines. How many are there? An infinite number? No problem. Infinity is no problem for calculus; it's good at infinities. Infinities actually make math easier quite often. So it considers all the infinite possible lines, then you can sample according to their relative plausibilities, and it shows you the highly plausible ones, which are in a pretty narrow range here. Yeah? Does this make sense? It should be a little weird. I hope, if you're paying attention, it seems a little strange, and that's normal. That burning sensation in your brain is thinking. It is a bit strange, because this is an epistemological device: we imagine this fantasy distribution of plausibilities. But we're interested in lines, and it helps, I hope, to think of the posterior as a bunch of lines whose relative plausibilities we want to rank. That's what the posterior distribution does for us. This is very hard for us to do; it's easy for the computer. What's hard for the computer is making models. You have to tell it what the model is. It won't do that for you.

Okay, let's keep going. 150 individuals now, and retrain it. Again sampling lines from the posterior distribution: it's narrower still. Notice that the uncertainty is greater on the ends, right? There's a lot more certainty about what the relationship is in the center. This is a general feature of linear regressions. And now the full dataset, which is 352 adults, just the adults in Nancy Howell's data, and you can see it's a little bit narrower yet. There's some uncertainty, but we have a good idea where the central tendency is. Now, this is the average, right? There's still prediction scatter. All the individuals don't have to be on that line. Obviously, they're not. But that doesn't mean the model is doing a bad job, because we have not attended to sigma. Poor sigma has been left sitting aside for the moment. Sigma explains the scatter. That's its job.
Alpha and beta just explain the central part, and that's what there's uncertainty about right now. You guys with me? Does this make some sense? Okay. The posterior distribution is full of lines, and those lines have equations. If you want a prediction for the height of any particular individual, you can get one just by plugging the weight into the model. And the thing about predictions at any particular weight x we might choose is, again, that the MAP is not sufficient. You want the uncertainty around the prediction. I want to show you how to get that now.

You choose any particular weight in kilograms and plug it into the model. What do I mean? I mean the model: the actual algebraic expression where the prediction comes from. We want to ask what the model thinks is the average height of an individual with weight x, and we can plug it in. Let's take the example of 50. We take an individual who weighs 50 kilograms and plug them into the model. Again, you use this object post, which contains samples from the posterior distribution, to get a distribution of predictions. One thing that may help you remember this: anything that depends upon the posterior distribution is also a distribution. I'll say that again. Anything which depends upon the posterior distribution is also a distribution. Predictions depend upon the posterior distribution; therefore predictions are distributions. You like distributions? There are distributions in your distributions. Later in the course, we'll have functions in distributions and distributions of functions. It's good times. We'll go slow. We'll just start with one thing now.

Since predictions are generated from parameter values, which live inside the posterior distribution and are therefore uncertain, the predictions have a distribution. How do you calculate that? The answer is: use the model. That's always the answer. Use the model. Just stick it in. What that means is that where alpha and beta appear, you just put in the columns of the data frame that holds the samples: post$a and post$b are those columns. Those of you who are good friends with R, and if you're not yet, you will be soon, know that when you give R vectors in an equation like this, it returns a vector of the same length as the longest input. These columns are 10,000 elements long, so you get an answer that's 10,000 elements long: 10,000 predicted values of mu at 50. Which is what? The expected height for an individual who weighs 50 kilograms. It's a distribution now. Does this make some sense? You're going to do this so much in this course. You'll love it. Absolutely love it. Again, why am I teaching you this? Because it works for any model. The model doesn't have to be linear. It can be any kind of function you like. It doesn't matter. This works for astronomical models. It works for anything you want to do. Child growth curves, which are highly nonlinear models: this works as well. Anything you want to do.

Here I'm showing you the posterior distribution of expected heights for an individual who weighs 50. You can see there's definitely a mean expectation here. It's hard for me to line things up, looking at the slide from the side, but it's a little over 159 centimeters. There's a lot of uncertainty around it, about a unit to the left and a unit to the right; that gets you within two standard deviations, and beyond that it becomes vanishingly small: vanishingly implausible short and super-tall people out on either side. This is what the model thinks. It could be wrong, but this is what the model thinks.
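A minimal sketch of that computation, assuming the model m4.3 from above; dens and PI come with rethinking:

```r
# Samples from the posterior: one row per plausible (a, b, sigma)
post <- extract.samples(m4.3)

# Push the weight 50 kg through the linear model, once per sample.
# R recycles the scalar 50 across the 10,000-element columns.
mu_at_50 <- post$a + post$b * 50

# The distribution of expected heights at 50 kg
dens(mu_at_50, xlab = "mu | weight = 50")
PI(mu_at_50, prob = 0.89)   # 89% percentile interval
```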
And remember, this is just the expectation. This is not the distribution of realized heights. This is the expected height. This is like your best guess: if you could only guess one number, what would it be? This would be the distribution. It gives you an idea of your confidence in that expected value.

Now, to get the uncertainty on the graph, and you'll see where this is going in a second, just bear with me, we want to do this at every weight on the x-axis. Think about every integer weight value along the x-axis of this graph. We repeat this exercise of calculating the posterior distribution of mu at each of those values. At each value, there's a Gaussian hill popping out of the graph, and we want to represent that hill on the graph. That's where we're going to get our uncertainty bands. As you'll see, they will end up representing this scatter: the scatter of lines is recapitulated by that Gaussian distribution at each of the weight values. Because that's all it's doing. The posterior is full of lines, and it's showing you the plausibility of the lines at each point.

Here's how you do it. There is a convenience function in the rethinking package. You're welcome. It means you don't have to do this manually for every value on the x-axis. You can if you want: there's a box in the textbook which shows you how. It's not hard. Actually, I think I have slides coming up where I walk you through it real quick. It'll be the last time I do something like that for you, but I think it's worth demystifying these things. link really just does that. It goes over all the x values you input, does the exercise we did on the previous slide, and puts it all in a return object for you.

Here's how it works. You do have to define the sequence of x-axis values you want to calculate over. Here, weight.seq: you can use the seq function in R to define an arbitrary sequence of numbers. Here I want a sequence from 25 to 70 in increments of one. That's what that means. It just makes a list of weight values that you want posterior prediction distributions for. Then we use link, and link iterates over them, calculates the posterior distribution of mu at each, and returns it to you as a big matrix of samples. I'll explain. You call link, give it the fitted model, and pass it, as a data frame, the thing you prepared up there, the weight sequence. You've got to tell it the name: weight = weight.seq. I know it seems strange, but remember weight is a symbol inside the model, and link needs to know what to replace with the numbers in the list you give it. So again, the naming is arbitrary. You could call these things pickle and TARDIS or anything else you want. That's fine, but you've got to keep track in your own memory of what goes where. You could even call the thing at the top weight, but that would be different from the thing called weight inside the model, and then you'd have a line that says weight = weight. I did not do that to you. You're welcome. But it would be legal.
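A minimal sketch of link in use, again assuming m4.3 from above:

```r
# The weight values we want predictions for: 25, 26, ..., 70
weight.seq <- seq(from = 25, to = 70, by = 1)

# For each weight, compute mu for every posterior sample.
# Result: one column per weight, one row per sample.
mu <- link(m4.3, data = data.frame(weight = weight.seq))
str(mu)   # a 1000-by-46 matrix, with link's default of 1,000 samples
```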
That's why we hate computers, but we need them. There's a codependency that cannot be broken. So the result is a matrix: 1,000 by 46 in this case, because by default link returns 1,000 samples, and there are 46 weights in the list we created, from 25 to 70 in increments of one. I assert that; I leave it as an exercise to the student to prove it's 46 long. For each of those weights, it gives us 1,000 predicted mus. Muses, if you like. So it's a rectangular matrix that's 1,000 by 46: each column is a weight, and each row is a predicted mu. We're going to summarize each column to plot up the results. You with me? Again, we're deep in the weeds of annoying algorithms, but the goal is to project the uncertainty out of this frustrating computer thing and onto the scientific space that we care about. That's how it works. And if I teach it to you this way, then you really understand what's going on, and you can do it for other models you're interested in.

So this is the last time I'll explain a helper function in the course, but I want to do it to demystify things: if you get even moderately clever with R programming, you can write this stuff yourself. You figure out the algorithm once, and then you put it in a function and run it over and over again. That's what the rethinking package is: things I wrote for myself, because I didn't like doing anything twice. That's the lesson of my life. If I do something twice, I write a function for it. All link does is this: it samples from the posterior distribution of the model you give it; it takes a series of predictor weight values, which is an input you give it; and then, for each predictor value in the list, for each sample in the posterior distribution, it computes the linear model, the value of mu. Yeah? That's the thing we did manually on an earlier slide. And then, magic: summarize. Put it in a matrix. The result doesn't have to be a matrix, but that makes it easy to process at the end, as you'll see.

So this is all the code needed to actually do that. You can just do it at the command prompt yourself, and this is all link does. Now, this is for a specific example, and link will do it for an arbitrary map model you pass it, with any number of linear models inside. That part is a little bit fancy, but it's just doing this, and you could write it yourself. So: you know extract.samples. Then we make a function that just has the linear model in it, right? post$a plus post$b times weight, with some argument weight that's passed to it, some number yet to be decided. Could be negative 100 for all the computer knows. It'll still give you an answer: an impossibly short person. Right? Then we define the sequence of weights again, just as before. The last steps require a little more care and explanation, so let me do that on the subsequent slides.

I should pause and ask if there are any questions. Enjoying the ride? Is this okay? Yeah, question, thank you. That's just the x-axis range in this example. That's all it is. Sorry, I should repeat the question for the audio. The question was: why the weight sequence? Why 25 to 70? And that's just the bounds in the data. That's just Nancy Howell's sample.
The lightest individual is 25 kilograms, and the heaviest is 70. That's all. It's the range. But, and it's a great question, you can generate predictions outside that range. You can make the model predict for data that can't be observed. It's often a great way to embarrass your model. So think about it in this case: eventually, the relationship between weight and height is not linear, right? Imagine a 200-kilogram person, and I don't say this to make fun of weight or anything like that, but imagine a 200-kilogram person. They're not that much taller, right? At least not in our species. In cetaceans, maybe height scales linearly with weight, but not in primates. So it can't be linear outside the observed range. But in this data set, for adults, it is. Linear is really good, actually. It's almost exactly linear.

Okay, so let me step through it on this slide. And again, this is the last time in the course I will agonizingly go through the guts of a function for you. This is not really a course in R programming. That said, if you want to know how a function works, you can always ask me, and I will be happy to show you. I wrote the code, so I should be able to tell you, and it'll be a good exercise for me to remember how it works. Some of this code I wrote a decade ago, you understand. I started using R in 1999. Party like it's 1999. Back then, everybody thought you were crazy for using R. It was open source; how could you trust it? The sort of thing my colleagues would tell me. So some of this code is old. It still works.

Okay. First step: post <- extract.samples, and we just look at the data frame we get. We're familiar with this. Samples from the posterior. Good times. Second step: I write a function that takes weight and just computes the linear model. That's all. You don't have to do it this way, but it makes it easy to repeat the calculation. It does it for everything in post, so for any weight you put in, it returns 10,000 predictions: one for every row in the object post. This is what R is great for. It's a programming language designed to process data, so to repeat an operation over a list of data, you write it once and it runs for every element. That's what's nice about it. Then our weight sequence, 25 to 70. We just talked about that. It just defines 25, 26, 27, 28, 29, 30, all the way up to 70. There are 46 values in that.

Now here's the code that's a little weird. R has a family of really convenient functions, the apply family. They all have the word apply at the end, and some incomprehensible letter on the front. They're all great, because what they do is repeat operations over matrices, which is something you want to do a lot. Why do we want to do that here? You've probably never said in your life: you know what I really want to do today? Repeat an operation over a matrix. No, that's not your goal. Your goal is to get an inference. But sometimes that inference demands repeating an operation over a matrix, and that's what we're going to do. In this case, our matrix has predictions in rows and weight values in columns. For each column, each possible weight, we want to calculate statistics: the average, and an interval, some plausibility interval. And we want to put those up on the graph.
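A minimal sketch of those first steps (the samples, the function, the weight sequence), writing link's core by hand for this model:

```r
# Step 1: samples from the posterior
post <- extract.samples(m4.3)

# Step 2: a function that pushes one weight through the linear model,
# returning one predicted mu per posterior sample (10,000 of them)
mu.link <- function(weight) post$a + post$b * weight

# Step 3: the weight values to predict over
weight.seq <- seq(from = 25, to = 70, by = 1)

# e.g. mu.link(50) returns 10,000 predicted mus at 50 kg
```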
The average plus an interval is one very common and useful way to represent the uncertainty of a Gaussian distribution. For example, you can put the average prediction for each weight on the graph. That's the line we already drew; you'll see we'll get the line back out of this. If you take the average of each of these columns, you get the MAP line back, which creates a coherence, a consilience, to the whole thing. And then you get an interval. You could call it a confidence interval if you want, but technically, as Bayesians, we're supposed to call them credible intervals. I don't care. Let's just call it an interval. It's an uncertainty interval, and it could be, say, everything within two standard deviations of the normal distribution.

So, sapply. Well, sorry, I apologize, I skipped a step. We haven't even got the matrix yet. The first thing we do is get our matrix using sapply; then we'll use apply. sapply takes every element in weight.seq and passes it to the function mu.link, which we wrote above, and each call returns a vector. It stores all those vectors in a matrix, and that's the answer we get. So this makes the matrix. What does the S mean? It means simplify. Because the answer is really a list of vectors, and who wants to work with that? It simplifies it into a matrix. That's what the S does.

To see what sapply does: if you have the list of numbers 1 to 10, then sapply of 1:10 with the function z squared just returns the squares of every element in the list. That's all the code on the previous slide is doing. But our function isn't squaring; it's computing the linear model, so it's a little more complicated, and the result is a matrix instead. And things can get as crazy as you like: take the function of z that computes the z-th root of the product of the numbers 1 to z. You may recognize that as the geometric mean. You can apply arbitrarily complex functions over a list of numbers this way. That's handy.

Then we use apply, and this is the thing. Same idea: just like sapply, you repeat an operation over something. Here we take the matrix mu, which we calculated above, and the 2 means columns; it's the second dimension. There's no way to remember this; just use my notes. Sorry. 1 is the first dimension, rows, and 2 is the second dimension, columns. That's just how matrices work. What's the third dimension? Time? No. The third dimension is going back into the screen, like depth. And at the end, you put the function you want to apply over all the columns, in this case mean. So apply of mu, 2, mean takes the mean of every column in mu. One line of code. The first time you do this, it seems ridiculous, but then you'll use it over and over again, because it's a great way to summarize matrices. And that's what you get: a vector of 46 numbers, each the mean prediction for one of the weight values you put in. And this gives us our line back.

So now you're saying: but we already had the line. Yes, I know. But you could have models that aren't linear, and then you really need to do this, because there'll be no other way to get the complexity of the predictions out. Now the last thing we want is intervals. For each weight in the sequence, we want to compute the uncertainty interval. I'm not going to belabor this example for you.
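A minimal sketch of those sapply and apply steps, assuming mu.link and weight.seq from above:

```r
# sapply warm-ups: repeat a function over a list of numbers
sapply(1:10, function(z) z^2)              # squares
sapply(1:10, function(z) prod(1:z)^(1/z))  # geometric mean of 1..z

# Build the prediction matrix: one column per weight, one row per sample
mu <- sapply(weight.seq, mu.link)

# Summarize each column (the 2 means columns): the mean recovers the
# MAP line, PI gives a percentile interval at each weight
mu.mean <- apply(mu, 2, mean)
mu.PI <- apply(mu, 2, PI, prob = 0.89)
```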
It's talked about in the book, but this gives us the confidence interval, or the credible interval, or whatever you want to call it. Sorry, there's some animation here; I'll skip over it. We get a matrix of the intervals, and then we can plot them up. So here's the plotting code. This just makes the scatterplot. Then you draw the line right from mu.mean; that makes the line in the upper right. And then shade is a function in rethinking: you give it an interval at each of the values, and it draws the shading, which is hard to see in the class maybe. Can you guys see it up there? It's super thin, because we know where the line is with high confidence. And that shading is the same region as the scatter of lines earlier. So we've recapitulated that the posterior distribution is full of lines, and the shaded region is the high-plausibility region of lines. This is the little bowtie shape you're maybe accustomed to with linear regressions: when you ask for a plot of uncertainty, it draws a bowtie on the graph. Yeah, it's like bowtie pasta. Yes? Okay, there are some nods of familiarity, and the rest of you are like, why did he make that joke? It's not a joke. It's a technical term. It's the bowtie.

On the top middle of this graph, I'm showing you what happens if you just plot all the predictions as points. You get the same shape; it's just really hard to see. It's ugly, right? But it's the same data. That's why we do the shading: it makes things easier to see. But you could, in principle, plot every element of that prediction matrix, all 46,000 of those numbers, and that's what it looks like, the graph in the top middle. Same bowtie shape as before. It shows the uncertainty.
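A minimal sketch of that plotting step, assuming mu.mean and mu.PI from the summaries above; rangi2 and col.alpha are helper utilities in rethinking:

```r
# Scatterplot of the raw data, MAP line, and bowtie-shaped interval
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)   # the line, recovered from the samples
shade(mu.PI, weight.seq)     # the shaded interval around mu
```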
Yeah, question? Would it be preferable to use the median instead of the mean with a small data set? Good question: is it ever better to use the median to get the line in the middle, instead of summarizing with the mean? Yes. When the posterior is skewed, there's often a really good reason to use the median instead of the mean. That won't happen for these models, because the posterior is Gaussian; in fact, we're assuming it is, and so the mean and the median will be the same value. But as soon as you get a posterior distribution that's some highly skewed hill, and that will happen later in this course, the mean can be very unrepresentative of the typical value, and then the median is a way better thing to report.

Well, the last part of chapter 3 tries to wax poetic about this problem a little. I don't think there's any summary statistic which is always right, because it depends upon the decision goal. At the end of chapter 3, I walk through different circumstances where the mean is optimal and where the median is optimal. So there are actually cases, even with skewed distributions, where the mean is the best number to report. But it depends upon the utility function, that is, how you measure the losses when you make bad predictions. What's the cost? The example at the end of chapter 3 that I talk through, though I don't do the calculation for you, is trying to predict tropical storms and how strong they'll be when they hit a continent.

From the perspective of the meteorologists sitting in some storm bunker, looking at satellite data, deciding what to report to the public, whether to order an evacuation or not, there's a very strong asymmetry in the costs on one side or the other of the error. Even if the posterior distribution has perfectly symmetric error in, say, the wind speed, which I'm told is the best predictor of human fatalities when tropical storms hit land, from the perspective of human lives there's a big asymmetry in which direction the error goes. If you err by predicting the storm will be stronger than it actually turns out to be, you're embarrassed, but people don't die. Maybe they drove two states away and they're mad at you because it cost them a lot of gas money, but at least they're alive. If you err on the other side, you can lose thousands of lives because you didn't order an evacuation. So most meteorological services tend to be pretty trigger-happy with ordering evacuations, because they're managing the risk. Their goal is not to report to the public the mean of the posterior distribution because it's the best unbiased estimate of the wind speed. That's not the question. The question is whether to order an evacuation, and that depends upon costs and benefits. So this is the long-winded answer, but when it comes down to decisions, that's what determines the best summary. In basic science, I think all we want to do is honestly convey to our peers the shape of the posterior distribution, and there the median has a strong advantage whenever distributions are skewed. And show quantiles. If you can't just plot the whole distribution, show quantiles, or draw the profile of the distribution as well.

Okay. We're making great time, actually. This is good. I was worried I was going to end the lecture around here; looking at the slides this morning, I thought, wow, there's a lot of slides, and if I just keep talking over them, I'll run out of time. Anyway, is this clear? I'm going to keep reminding you what the point of all this is. The point is: this is what your computer actually does when it does this in an automated fashion. And by telling you what's going on, I have two goals. The first is to demystify it. You're perfectly capable of doing all of this yourself. There's nothing super fancy about it at all; it's just processing numbers. The second is to empower you to do this with any arbitrary model you might need later in your scientific career.

So let's look back at the posterior distribution being full of lines again, that previous slide. These are, I think, 50 lines sampled from the posterior distribution at different sample sizes from Nancy Howell's data. With 10, you can see the plausible range contains a wide scatter of lines in the upper left, because the model doesn't have a lot to learn from, and so it's not very confident about the average relationship between height and weight. Makes sense? Then we add more data and it gets narrower and narrower, until at the end the scatter of lines sits in a very tight range.
And all the stuff in between that we've done is about taking that scatter and representing it as a shaded region. How do you calculate the shaded region? That's what I just did for you: at each of the x-axis values there's a Gaussian hill, and the width of that hill changes as you move across the x-axis, so you can plot the middle and the width of that Gaussian hill and represent it. If we had a 3D computer screen, you could just plot the hill. 3D print your graph, I guess. We have 3D printers in this building; I should do this. I have no excuse, right? I work in a great institute where we have 3D printers. I don't know if the other departments will like my proposal to 3D print a statistics graph. But, hey.

Okay, so now represented as bowties: in the upper left there's this wide bowtie, very easy to see, and I'm going to oscillate back and forth between it and the scatter of lines. It's the region of highly plausible lines; in this case, I think it's the 89% interval of the mean. Then with 20 it gets narrower, with 50 narrower yet, then 200, then 350, where it gets quite narrow, until it's actually quite hard to see the shaded region, because there's a lot of data and we have high confidence about the average relationship between these two variables. Yes? Makes sense? This works for any model. It can be nonlinear. The prediction space can be highly complicated. The model gets more complicated, but this procedure stays fundamentally the same. This is how you generate predictions. As soon as you have a posterior distribution, you push it back through the model and it makes predictions, and you can use those predictions to interpret what the model thinks. What does it think it has learned from the data? This is how you figure that out.

But what do I mean by prediction intervals? I've been talking about prediction intervals, but really all we've done so far is the expectation, mu, which is deep inside the model. Remember, the likelihood in the model is Gaussian: each height is distributed Gaussianly with some mean and standard deviation. All we've done so far is the posterior distribution of that mean, but there's additional uncertainty for any particular height, because there's a standard deviation, the sigma thing, that we also had to estimate. We need to get sigma on the graph too. You don't always have to put sigma on the graph, but it will help you explain the additional scatter in the data, or at least how the model expects it. Predicted heights themselves are not all going to be on the line. They're going to be scattered around the line. But how close to it? Sigma tells you that.

So we have uncertainty from the posterior, and that contributes to prediction uncertainty. Then there's uncertainty from the process embodied in the likelihood function, which generates additional uncertainty. That isn't a product of the posterior's epistemic uncertainty; it's a product of whatever you think the likelihood is. I think of that as epistemic too, but if you thought of it as a random number generator, there's uncertainty from that: uncertainty from the coin flipping. Think of it this way: with the globe-tossing model, even if you knew p, even if you knew the exact coverage of water on the planet, where would the remaining uncertainty come from? It comes from the likelihood. It comes from the coin-tossing model. I still think of that as epistemic uncertainty, because remember, physics is deterministic.
If we knew enough about the initial conditions of my holding the inflatable globe and tossing it, we could predict with certainty whether my finger would land on water or land. It's just that we lack that information. Nature's not random, at least not at the globe level. At the level of quarks and other things, maybe, although I think not even there. But we're not going there today.

The easy way to do this, if you wanted to do it yourself, and I show you this in the chapter: you've actually already done it in your chapter 3 homework with coin flipping. This is how you generated births, if you did the last homework problem. It was, remind me, I only wrote the book, what is it? Siblings: when the first-born is a girl, what's the sex of her sibling? Because there's a lot of dependency you discovered in that data set. Spoilers: human families manage the sex of their kids a lot with birth order. It's a very famous thing in human populations; there's a lot of management of the sequences of boys and girls in families, and that data set shows it. So you used rbinom to simulate births. You were doing this already. It's just that in rbinom there's no sigma. The sigma, in a sense, is fixed: for all binomial distributions, there's a fixed standard deviation that arises from the number of sides on the coin. But in a Gaussian distribution, you've got this additional scale parameter called sigma, which adjusts the uncertainty that arises from the sampling process. So you could do it with rnorm, and in chapter 4 I show you how. But there's another convenience function in rethinking called sim, which will do this for you for any map model you pass it. It just reads the likelihood out of the model you gave it, generates random numbers from it, and returns those in a matrix.

It works the same way as link. In fact, it calls link first. When you call sim, you use it just like link: you give it a list of weights you want predictions for. The first thing sim does is call link with the same argument and get that matrix. Then, for each of the mus, it generates a random height using sigma, and returns all those heights. So the result looks very much the same, but with a much wider scatter. Let me show you what that looks like when you put it up on the graph.

So now there are two shaded regions: the narrow one in the middle, which we had before, you see that still? And then the uncertainty about actual realized heights, which is the other shaded region, with the wobbliness around the edge. Why is it wobbly? Because it's hard to calculate wide intervals for things; there's this wobbliness of the exact samples at the edge. That's all it is. You can take a larger number of samples from the posterior and smooth the wobbliness out; it just arises from sampling variation.
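A minimal sketch of sim in use, assuming m4.3, weight.seq, mu.mean, and mu.PI from above:

```r
# Simulate actual heights (not just mu) at each weight:
# sim reads the Gaussian likelihood out of the model and draws from it
sim.height <- sim(m4.3, data = list(weight = weight.seq))

# Percentile interval of simulated heights at each weight
height.PI <- apply(sim.height, 2, PI, prob = 0.89)

# Both regions on one graph: narrow band for mu, wide band for heights
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)
shade(mu.PI, weight.seq)       # uncertainty in the expected height
shade(height.PI, weight.seq)   # uncertainty in realized heights
```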
These are 95% intervals, by the way, and this is another chance to remind you that there's nothing special about 95%. It's conventional. It's the tyranny of having five digits on each limb: for some reason we make base-10 number systems and then we're obsessed with fives. There's no scientifically valid argument for using 95% intervals or anything else, especially not as a criterion for career advancement, which unfortunately it is. But not in my department. You know, I joke, but this is deadly serious, right? People go out of their way, even to the extent of faking data, to get within these sorts of intervals. It's bad news. It's a form of scientific superstition, so I want to break the spell. As part of this: 95%, yeah, fine, but let's use something that reminds the reader that there's nothing special about 95%. I'm fond of other things; you can do 50%, 80%, 99%. Throughout the textbook I use prime intervals, as I'm calling them, because it sounds kind of magical. 67%, 89%: what are they? They're prime numbers. So what? Nothing. They're just as good as any other interval. The goal is to communicate the shape of the posterior distribution, not to fixate on the boundary. The boundary is not meaningful. There's nothing magical that happens at it. Nothing happens at all. There's a continuous reduction in plausibility as you move out. There's nothing special about the boundary; we're just trying to show the shape. You'll hear this so many times you'll hate it, unless Stockholm syndrome sets in first, and then you'll love it. But again, remember: shape, not boundaries. And I will say, it's true, I use 89% intervals in papers. Usually it's no problem, but occasionally there's a reviewer who says these should be 95% intervals, and I just ignore it. The editors are very savvy about this now. That's my experience.

Okay. How sim works, just one slide on this. It works just like link, but with one additional step. I'm not going to spend a lot of time on this code; it's in the book, and I just don't want there to be any magic. For each weight, we simulate a height for each sample from the posterior distribution. We've got a posterior distribution for mu and for sigma, and we need to pass both into rnorm. This is where the posterior distribution is full of distributions.

So: get the samples. Again, the weight sequence; here's another way to generate that same sequence from 25 to 70 in increments of one, a shortcut you can use. Then we make a matrix, sim.height, using sapply, where we pass each of the weight values from 25 to 70 to a function in which we generate random normals. That's what rnorm means. How many of them? One for each row of post; that's how many random numbers we get. The mean is our definition of mu, post$a plus post$b times weight, and the standard deviation is post$sigma, because we don't know the standard deviation with certainty either, so we have to get that uncertainty in too. There are uncertainties about uncertainties. You'll get very comfortable with this. It's just how it is. It's uncertainty all the way down. Like in chapter 1, I have this joke that it's turtles all the way down, a story I retold from Hindu cosmology. I do not believe it's turtles all the way down.

And then we get a prediction interval for height by calculating the interval for each column in sim.height, and that's what the apply call does. PI is a function that gives you, literally, a percentile interval, which is an interval with the same probability in each tail. Take the matrix sim.height; for each column (2 means columns), compute the percentile interval. And the default is 89%, because I wrote it. Yeah. Makes sense?
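A minimal sketch of that hand-rolled version of sim, assuming post from extract.samples above:

```r
# Shortcut for seq(from = 25, to = 70, by = 1)
weight.seq <- 25:70

# For each weight, draw one height per posterior sample:
# the mean comes from a and b, the scatter comes from sigma
sim.height <- sapply(weight.seq, function(weight)
    rnorm(
        n = nrow(post),
        mean = post$a + post$b * weight,
        sd = post$sigma
    ))

# 89% percentile interval of simulated heights at each weight
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
```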
If you like 95%, go ahead. I won't frown on you. It's perfectly fine. But 89% is better. It's prime. Welcome to the prime interval club. We should have badges. Okay. We laugh because the alternative is to cry. I make a lot of jokes about pure conventions in science because I'm trying to demystify them. It's fine to break these things. There's no good justification for many of these arbitrary numerical standards, and if someone tries to enforce one, you just have to remind them it's a pure convention. It's like which side of the road you drive on, except with no coordination incentive. It's as if we all had our own private roads and people still cared which side you drove on. That's kind of what the 95% thing is. It's really infuriating, so I joke about it instead.

Okay. The last thing I want to talk about within chapter 4, as a way of transitioning into multivariate linear regression, is polynomial regression, which I actually think is a form of regression you should not use. I think there are always better things to do if you're trying to solve the problem this solves. I'll say more about that as we get toward the end of this lesson. Nevertheless, it's worth teaching, because it's incredibly common, and it's a gateway into understanding that linear models don't actually draw lines. It's this super frustrating thing about statistical jargon: linear models are not linear, not in the prediction space. Linearity has to do with the additivity of the model of the mean. That's what's linear about them. But when you draw predictions in a scatterplot, they don't have to be straight lines. So I want to show you an example of that, and then I'm going to tell you: please don't use this kind of model. You're going to see it all the time, so it's worth understanding, and I'll try to persuade you to find some other curved model to use instead, usually.

Okay, so here's the basic problem to set this up. What we've done so far is just the adult data, which is what I show you on the left, and I hinted earlier today at what happens if we extend the range of the x-axis variable. Say McDonald's comes to the Kalahari. I'll pass on judging whether that's a good idea or not, but I suspect it's bad. Then they get very well fed indeed, and they gain a lot of weight. This population could easily double in average adult body mass as a product of that. Easily. Humans are good at this, very good at storage, and it's adaptive. And now I assert, I can't prove it because I don't have the data, that the relationship between weight and height would not continue to be linear out in this range. Why? Because of the skeleton. That's why. Because of the way our skeleton works. So the linear model is only a good approximation within some finite range of any particular variable. At some point you're going to have to give up on linearity.

The other thing, even in this data set: as soon as we include kids, and that's what I show you on the right, obviously linear isn't going to do. Yeah? Humans don't grow like alligators. If you look at growth charts for alligators, it's not exactly linear, it's log-linear actually, but they grow at basically the same proportional rate their whole lives. Determinate-growth vertebrates like ourselves are not like that at all, and it's harder to explain us with a straight line, right?
We could draw a straight line through it, and I'll show you what that looks like in a little bit, but we want to do better. And everybody's first linear model that isn't actually a line, sorry, I want to tell you what this distinction means, is the polynomial regression, here parabolic regression. Again, this is a geocentric strategy. We're adding epicycles, right? Orbits on orbits. And it can be descriptively accurate to an arbitrary standard: just keep adding polynomial terms. I'll show you how to do that. And please don't do it in your own work, but I'll show you how.

The first-order linear model we've done so far is shown here, and you've seen it before: mu sub i equals alpha plus beta, now with a sub-1 on it, because we're going to get extra betas, times x sub i. This is the model we've been working with, agonizingly, for all of today; on Wednesday we worked only with that model. We can extend this model to draw parabolas in x instead. You may remember, once upon a time in school, you did stuff with parabolas: conic sections. They still teach geometry in school. I'm sorry; that was the part of math I did not like. It's nothing like other math. But enough about me. So this is the equation for a parabola: you just add a squared term, and you get a curve, a particular kind of curve. It's symmetric, and it straightens as you get away from the inflection point. What's appealing about this, when you fit it to data, is that you can get the relationship to taper off. Eventually, since it's a parabola, it will reverse direction, but you just hope that happens outside the range you care about, as individuals dive into being short again off the right-hand side. I'll show you what this looks like as we go. Again, it's a golem. It's geocentric, and there's always some interpretation required.

So here's my warning: polynomial models suck. That's a technical term. But they are very common, so you need to understand them. If you feel tempted to use one, I'd be happy to help you do something better. There are lots of really good alternatives that don't exhibit the weird stuff I'm about to show you. I've published models with polynomials. I learned the hard way. Don't repeat my mistake. There are way better options. I went through this journey, and part of the point of writing this book was so nobody has to learn it the way I did.

Okay, so now we use the whole sample, all of the data in Howell1. We're not going to remove the kids this time; we're going to keep everybody. So now there are 544 individuals in the data set, and I show you the scatterplot over there. The range we worked with before is that vaguely linear cloud in the upper right. Those are the adults. But now we have all these kids to explain as well. And kids are people too, so we should explain them.

It's very useful, especially when working with polynomial models, but in general when running regressions, to standardize your predictor variables. What do I mean? Standardization means you subtract the mean, so that the mean is now zero: calculate the mean of the variable and subtract it from every value. Then divide by the standard deviation, so the standard deviation is now one. You have a new variable with mean zero and standard deviation one. It's standardized. Like a standard normal. Like a z-score; everybody knows z-scores.
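A minimal sketch of standardizing by hand, assuming the full Howell1 data frame d from earlier:

```r
# Standardize weight: mean zero, standard deviation one (a z-score)
d$weight.s <- (d$weight - mean(d$weight)) / sd(d$weight)

# scale(d$weight) does the same thing, but it's worth doing by hand once
```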
So why is this useful? It's easier for the computer to find optima when you scale things this way, especially if you give all the variables the same scale. It doesn't struggle as much. So it's a numerical optimization convenience, but it helps a ton. I field a lot of questions from people whose models don't run, and it's because they haven't standardized their predictors. It's just part of the sorcery of doing calculus with numbers. In analytical math it doesn't matter, but computers don't do real math. They do some kind of sorcery that resembles mathematics: the IEEE standard for floating-point numbers, which are not actual real numbers but a bizarre computer representation of them. So I recommend you do this, and we do it here: we construct a new standardized weight variable from the original by subtracting the mean and dividing by the standard deviation. There's a function in R called scale which does exactly that for you, but I wanted to show you how to do it yourself.

Okay. The parabolic model of height is almost exactly like the linear model of height. The top line is the same. It's exactly the same likelihood; that doesn't change at all. All that changes is the equation for mu: now we add the squared term of x. And what is x squared? It's literally the square of weight. This is where standardization helps a lot, because if you start squaring a number like 70, you get a big number, and that makes things numerically harder to search. Squaring 1 gives you 1, so that's not nearly so bad. That caret, the circumflex on the far right, means raised to the power of 2. Then we have priors. The only thing that's new is that we had to add a prior for beta 2, and we give it the same one as beta 1. And you fit the model. The code for this example is fully worked in the book; I encourage you to run it.
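A minimal sketch of the parabolic fit, assuming the standardized weight from above; m4.5 follows the book's label, and the priors are only approximately those in the text:

```r
# Same Gaussian likelihood; mu gains a squared term
d$weight.s2 <- d$weight.s^2
m4.5 <- map(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b1 * weight.s + b2 * weight.s2,
        a ~ dnorm(178, 100),
        b1 ~ dnorm(0, 10),
        b2 ~ dnorm(0, 10),   # the new prior, same as b1's
        sigma ~ dunif(0, 50)
    ),
    data = d
)
precis(m4.5)
```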
Now let me show you what happens when we draw predictions from the posterior. The posterior distribution is now not full of lines. It's full of parabolas, because that's what the model is. And so we draw parabolas from the model. I'm going to do the same exercise: sample 10, then 20, and so on, individuals, and show you how the model learns. Let's start with 10. What does it think? We draw 10 individuals, and since there are a lot of adults in the sample, just by chance we draw adults. It estimates there's a linear relationship in the range of the adults; it gets that pretty well. But outside the range, it has no idea. The model knows it's allowed to bend out there, and there's no data constraining how it bends, so there's a wild horse tail flying all over the place. This, by the way, is one of the problems with polynomial models: they do crazy stuff near the limits, absolutely bonkers things. This is related to the Runge phenomenon, named after the German mathematician Carl Runge, who wrote a nice treatise about it in the early 20th century. It's famous in engineering, and it's one of the reasons engineers don't like these functions. Try building a bridge where, at the margins of performance, your model swings wildly and unpredictably. Do you want to walk across that bridge? I don't.

Number 20, on that note. 20 samples. Now we've got some kids, and that starts to tame the horse tail. It's bending down, but there's a lot of uncertainty about the exact parabola. With 50, it gets tighter yet. At 100, it's locked down pretty tightly. Notice that it's over-predicting the height of the smallest, lightest individuals in the population, because it's not exactly a parabolic relationship. No surprise there. With 300 individuals, you can really see that effect: the model's really sure where the parabola is, but it's definitely not predicting the smallest kids correctly. But you told it to use a parabola, so it's just telling you: if you're going to force me to use a parabola, here's the best one. Yeah, it's terrible, but it's the best parabola. That's all it does. It's the same with linear models. Finally, with all 544 individuals, there's the best parabola, nailed down pretty tightly. The model's very confident. If you're going to use a parabola, and you shouldn't, but if you're going to, this is the one to use. Make sense?

You can go further down this rabbit hole, and we will in chapter 6, where we'll go way deep down it; it will take us a whole week to go down and back out. But we'll punt on that for now. You can do a cubic, and if you fit the cubic, I'll just show you what happens. First, the linear model: if you fit a line to the whole thing, that's terrible, isn't it? The golem says: you told me to use a line, here's the best line. Told to use a parabola, here's the parabola, and notice it's curving down for the really heavy individuals; they're getting shorter. Probably wrong, though we don't have the data. The cubic can bend more, because it can change direction again, and now it gets the trend for the smallest kids right. But now it's predicting a rapid increase in height outside the range of the data. That's the Runge phenomenon, and it's why polynomials are dangerous. They do weird stuff like that. What would be better? A biological model would be nice, and there are lots of biological growth models that work really, really well. That's what we use in evolutionary anthropology, for example.

Okay, I shall let you guys go. For chapter 4 homework, I suggest doing the three hard problems, 4H1, 4H2, and 4H3. They're all about processing Nancy Howell's data, and they're designed to talk you through prediction exercises using the models fit to those data. I don't think they're particularly hard, even though they're called the hard problems in the chapter. Next week we're going to do multiple regression, chapter 5, which also includes, by the way, categorical data, sometimes called factors, which are extremely common. Psychologists love their factors. I do too. I love factors. They're wonderful. But we're not going to have new tools next week. It's just going to be steady practice with everything you learned this week, even more like science, I hope, than like data processing. Anyway, thank you for your attention. Have a good weekend.