Introduction to linear models today: multiple regression, the same sort of models we have here, but with more stuff in them. So the signals I'm getting from people asking questions about homework are that you guys are, first of all, really good at this, your problems are pretty minor, and you have the anticipated sort of problems. I want to highlight one that's common and always happens in the first couple weeks of this course, and it's a symptom not of any failing on your part, but of the fact that words mean different things in different contexts. Most people here have probably not had a Bayesian stats course before, and unfortunately Bayesian statistics uses words from statistics in general and means different things by them. Confidence interval is a classic example of that. Bayesian confidence intervals should not be interpreted in the same way as non-Bayesian confidence intervals; I mentioned that before. Even innocent words like sampling. You did a bunch of sampling in your homework, you enjoyed it, right? It was fun. And one thing I just want to highlight for you is that your previous education can impede learning, because it puts up an obstacle, and that obstacle is purely historical. It's just a side effect of your development, educationally, up to this point, like the Great Wall of China, which is why it's pictured here, right? It's not actually meant to keep Mongolians out of the country anymore. It's a historical obstacle, and your prior education in frequentist statistics is just a historical obstacle right now to your full embodiment of the Bayesian approach, but you'll work through it. You'll eventually climb over this wall, and so I just want to highlight it for you here. In frequentist statistics, sampling typically means re-sampling the data in imagination. We imagine re-sampling the data, we imagine alternative data sets, and for each of those we get an estimate of some kind, like the sample mean. You collect a very large number of those, you get a distribution called a sampling distribution, and that distribution is used to characterize the uncertainty in the estimate you have from the data you did observe. We don't do this in Bayesian statistics, even though we use the word sampling. So in this class, almost all the sampling we're going to do, and we will do a lot, oh yes, you've just gotten started, there will be a lot of sampling, is just to do integrals. It's a way of getting collections of things so we can calculate areas under a curve; that's almost all the sampling we do. We don't re-sample data. We sample parameter values, and almost all the stuff you're sampling will be that. Occasionally we do sample observations, and you did this in the posterior predictive checks, and that's much closer in spirit to what you've done in prior statistical work. But we're not re-sampling our data; we're imagining, forecasting in a sense, and looking at the implications of the model through that. So it's a way to simulate data, but it's not re-sampling the actual data. So maybe this little sermon wasn't of value to you, and in that case, great. But if it was, you're perfectly normal, because it's just the fault of the pedagogy that these same words are used over and over again. That's why we do mathematical notation.
So I just wanted to highlight for you, this will be one of the obstacles that arises, and it's just a semantic obstacle and you'll get past it. This is where we left off before; this is the first linear regression we're working with, and I'm going to restate the model at the top. We have heights of individuals indexed i, and we say that those observations, those individual heights, are expected to be normally distributed with some mean mu sub i that is conditional on the individual i, and some standard deviation sigma, which is not conditional on individuals. There's no reason, I wanted to say this last time but I forgot, so I'll say it now, there's no reason that has to be true. You can make sigma conditional on i as well. At the end of the course, we're going to look at a class of models called Gaussian processes, where all the action is in the sigma, so to speak: you model the covariance structure among observations. And all the cool stuff happens there. Genetic regression is an example of that kind of thing, where all the action is in the covariance structure. You model the covariation. For most of the course, though, there'll just be little old sigma there, and it's pretty inert and boring, just like that. It's not conditional on anything. The definition of how the mean mu sub i is conditional on the individual i is contained in the second line. This is the linear model part of the regression model. It's the equation for a line with intercept alpha and slope beta times x sub i, where x is the weight of individual i. Then we define priors. I'm going to say a little bit more about priors as we go. And again, I'll bring up this horoscope issue that I brought up before. In this case, we have a broad, effectively flat prior on the intercept alpha, and a slightly conservative prior on the slope beta centered on zero. Why? Because zero means no effect. Now, I wanted to say, and I'll get there, this is still a pretty silly prior. You can easily do better, because you know the relationship between height and weight is positive prior to receiving the data. So if you constrained the estimate of beta to be positive, you would learn faster from the same amount of data. A robot that had a prior for beta that was strictly positive would learn faster. It would converge to the right answer quicker than this ignorant robot that has no idea whether the association between height and weight is positive or not. We're estimating it as precisely as we can, and the question is how quickly you can do that with a finite amount of data. But this is a very typical kind of conservative prior. And there's so much data here that these priors get completely swamped, but I'll get to the question in a second. And then this uniform prior on sigma. You fit this with quadratic approximation of the posterior distribution, applied to this model and these data, with the R code displayed here translating the model lines into that list. Question? How would you do that? It's not complicated. The question was how would I do that, put a positively constrained prior on beta here. You can use a distribution that's constrained to be positive. There are a large number of those, actually, that you can choose, and we'll see examples of them later on. I was going to say something about this on the next slide, actually. The standard deviation sigma must be positive, so that's a parameter that has to have a prior that's strictly positive.
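For concreteness, here is a minimal sketch of the kind of model and fitting code being described, assuming the !Kung height data (Howell1) that ships with the rethinking package; the object names m_hw and d2, and the exact prior values, are illustrative rather than whatever is on the slide.

    # sketch: the height-on-weight regression, fit with quadratic approximation
    library(rethinking)
    data(Howell1)
    d2 <- Howell1[Howell1$age >= 18, ]   # adults only

    m_hw <- map(
        alist(
            height ~ dnorm(mu, sigma),   # likelihood
            mu <- a + b * weight,        # linear model for the mean
            a ~ dnorm(170, 100),         # effectively flat prior on the intercept
            b ~ dnorm(0, 10),            # mildly conservative prior on the slope
            sigma ~ dunif(0, 50)         # scale parameter, constrained positive
        ),
        data = d2
    )
    # to force the slope to be positive, you could instead try something like
    # b ~ dunif(0, 50)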
So you could use a uniform, right? You could say that the prior for beta is uniform between 0 and 50, and then we're saying it can't be negative and it's probably not bigger than 50, because seriously. And that would probably be okay. That's the simplest way. There are a bunch of so-called folded distributions which are constrained to be positive, like a folded Cauchy, or half-Cauchy as it's sometimes called. We're going to use that later on in the course, mainly for sigma, for so-called scale parameters, and those work great in those sorts of contexts. Okay. So yeah, I wanted to give you some, what I call horoscopic advice on regression priors. And again, I say this just to say this is the most solid vague advice I can give you before I know what you're actually doing scientifically. Once you tell me what you're doing scientifically, I can almost certainly do better than this. It's like the horoscope. When you read the horoscope, the person who wrote it doesn't actually know you, and they don't actually do magic with planets. So they give vague advice that fits everybody's life, and that's why it's so satisfying to read horoscopes in the newspaper, right? Because it always speaks to you as a person. These priors, I hope, will speak to you as a person, and nevertheless, you can always do better than this. So for intercepts in linear regressions, the alpha parameters, as I'm going to show you later today, those can swing around very wildly depending upon the slopes and the locations of particular values in your data. It's hard to have a strong prior expectation of where those are. So the most generally useful advice I can give you, unless you do something like standardize all your predictors, which is something I'll show you how to do later today in fact, is that you don't know where it's going to be, so give it an essentially flat prior. And that's why I'm using those priors with a standard deviation of 100. That's essentially a flat line over any reasonable range of parameter values. So alpha ends up having to be an unconstrained estimate. It can swing around wherever. Beta then drives it. It affects where the line is. If you change the slope on a line, you also change the intercept. And the slope is our curiosity. These Gaussian slope priors centered on zero are conservative. You want to scale them so that they rule out really silly things like outrageously strong associations between height and weight. And that will depend upon your domain knowledge. This will make more sense to you when we get to Chapter 6. We call these priors regularizing. It's better to use a conservative prior because it dampens down a phenomenon called overfitting. You're just going to have to be patient until we get to Chapter 6, and then there'll be a ton of examples of that. It will make sense. This is, again, my mission to convince you that flat priors are never the best priors. In this case, there's so much data, we've got over 500 individuals, that even a quite strong prior on this will be completely overwhelmed. I'm not going to demonstrate that in class, but I really do encourage you to do that experiment yourself. How do you do it? You just change the number in the standard deviation of the prior and play around with different ones, and see how small you have to make the standard deviation on beta before it starts changing your estimate of the slope. You're going to have to make it really small, like 0.1, which is really concentrated.
That'll put all the probability mass right over 0, essentially. And then you'll start to nudge it off where it was before, because you have so much data. And as the course goes on, I'll give you more ways to understand these things. When you have small amounts of data, though, that's when you really need the prior to do some work, so that the model doesn't get overexcited by the tiny amount of data present. In Chapter 6, I'll deliver more on the promise of understanding that. Scale parameters like sigma: sigma's a standard deviation. It scales a distribution, right? It stretches it out. And in statistics there are a bunch of these sorts of parameters that are not always standard deviations, but they scale a distribution; they stretch it out horizontally. Scale parameters are nearly always constrained to be positive real numbers, or I should say, non-negative real numbers. It's nearly always harmless to use a uniform prior like we're doing, between zero and some upper bound that's within reason. And that's an easy thing to do. Later on, we're going to want to regularize these as well, make them conservative, to shrink estimates toward zero a little bit. And that's when I'll introduce you to these folded distributions that I mentioned, like the half-Cauchy, or an exponential, which is a really good one too. You may know exponentials, but if you don't, that's okay. I'll show you how that works when we get there. I'll just say this is the vaguest advice I can give you that I think is mostly harmless. Okay. So you fit that model that was two slides back, and here's the precis output for it, just showing you the values that define the marginal posterior distribution of each parameter, alpha, sigma, and beta here: means and standard deviations of them, as well as percentile intervals, the 95% percentile intervals for you. This is just about the last time we'll be able to peer at these estimates and see a direct correspondence with the model predictions. What I've done on the right here is, each blue circle is a data point. This is one of the adults in the sample. There are 300-something of them, right? I forget exactly. And weight is on the horizontal, height on the vertical, and the black line is the MAP line, the maximum a posteriori line. I'm going to show you the other lines in a minute, and we're interested in all of them to different extents, each of the special snowflakes, and we want to pay some attention to them. But this is the most special of the snowflakes, because it's at the top. There are a lot of lines near it that are nearly equally plausible, and we're going to look at those next. But right now you can see a correspondence between these mean values: they define the location of this MAP line. This line has the intercept, the y-intercept, of about 114. And you can't see it on this graph, because that's the value of height when weight is 0, and weight 0 is off the left-hand side of this graph. This is a general problem with intercept parameters. They typically have no biological meaning of any kind, unless you scale your data in some special way. That's just an awkwardness that is normal, and I'm not going to hide it from you. Beta then is this slope, about 0.9, the rise over run of this line. So that's exactly what it is. But it's only the MAP line. And then the standard deviations of these parameters give you the uncertainty around that, and we also want to plot that uncertainty on this graph.
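Here is a rough sketch of that prior-sensitivity experiment, reusing the hypothetical names from the sketch above: refit with a much tighter prior on beta and compare the estimates.

    # same model, but a very concentrated prior on the slope
    m_hw_tight <- map(
        alist(
            height ~ dnorm(mu, sigma),
            mu <- a + b * weight,
            a ~ dnorm(170, 100),
            b ~ dnorm(0, 0.1),      # very small prior standard deviation
            sigma ~ dunif(0, 50)
        ),
        data = d2
    )
    precis(m_hw_tight)   # compare the estimate of b to the original fit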
So that's what we're going to spend most of the time today doing: looking at ways to get the uncertainty onto the graph, so you can visualize your confidence or lack of confidence in exactly what's going on. In this case we're going to have a lot of confidence in where the line is, because there's a lot of data. But rather than use all of it at once, I'll trickle in a few observations at a time, so you can see how the Bayesian model becomes more and more confident as you add more individuals to the sample, and we'll see the envelope of prediction change. So how do we get uncertainty onto the line? Well, unsurprisingly, we're going to sample from the posterior. With these models you could do all of this analytically, and I've taught it that way before, but like I said, I usually lose half the class immediately with that. Not because that half of the class is dumb or anything like that; it's because they're human. So I'm doing it this way both because, as I said before, I can carry everybody along this way, and because once you get to Markov chain Monte Carlo, this is the only way you're going to be able to do it. Everybody has to do it this way, because you only get samples in that case. You have no analytical option. So here's the recipe, and I'm going to walk you through it in code form and show you what the graphs look like. We want to get uncertainty onto that graph. We sample from the posterior, which means we use the MAP estimates, which are the posterior means for each parameter, and the standard deviations, to approximate the posterior. Well, there's also, I forgot to put this on there, also the correlations between the parameters, which I'll show you starting on the next slide. Because these parameters alpha, beta, and sigma also have correlated uncertainty, because the posterior distribution has probabilities for combinations of parameters, and we only get a posterior slice for any particular parameter when we ignore the others and average over the uncertainty in them. Then we construct a quadratic approximation of this multivariate posterior distribution. Since it's Gaussian in every dimension, it's a multivariate Gaussian distribution, and this is an easy mathematical object, trust me. And R is really good at taking samples from these. In the book I give you details about how to do this in the MacGyver way; you don't need to use my convenience functions, you can do this in the raw. And then you use these samples to generate predictions, to integrate over the uncertainty. The last step is the more complicated one. The first two are largely automated for you, and in the last one you've got some freedom of choice, actually, and so that's the part where anxiety grows. And that's perfectly normal, so I'm going to show you a bunch of examples as the course goes on. First thing: the precis function can show you the correlations among the parameters if you use the optional argument corr=TRUE, and what it's showing you here is the correlation matrix among the parameters. This is something that's computed when the fitting is done. These are all the values you need, shown on the screen here, to define the multivariate normal distribution that we're assuming the posterior is. So I want you to see this is a matrix, so each parameter is perfectly correlated with itself. That's comforting, right? And then nothing is correlated with sigma. Sigma has got a zero with everybody except itself, right? It's independent of the others.
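If you want to see that correlation matrix yourself, the call is roughly this, assuming the hypothetical fit m_hw from the earlier sketch:

    precis(m_hw, corr = TRUE)
    # expect a correlation near -1 between a and b,
    # and roughly 0 between sigma and everything else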
This is very typical with simple linear regression: the standard deviation is uncorrelated with where the line is. It's like the line is chosen first. Those of you who had previous training in OLS regression may remember this: you choose the line and then you pick sigma, and you could do it that way, even though here we're doing it all simultaneously. And then the intercept and the slope are really strongly associated. They have a negative correlation that's almost one. So they contain almost the same information. Now you're asking yourself, why is that true? It's nearly always true that slopes and intercepts are correlated, because if you change the slope, you change the intercept of the line, right? That makes them correlated. Nevertheless, in the book I show you that if you standardize the data, you can remove this correlation entirely. And that's worth taking a look at. We'll do some examples of that later. Yeah, question? The question was, is sigma something like the residuals? In non-Bayesian statistics, sigma is often called the residual error variance or the residual standard deviation. So I think if you use R's lm command, at the bottom it gives you a number which will be very close to what you get from sigma, but it's not the same thing, for various technical reasons. I mean, you should expect them never to be exactly the same, because in frequentist statistics you never have probabilities of parameters. So it can't be the same thing. And OLS regression doesn't actually estimate sigma. It has no model of the data based upon it, really; it kind of fits it after the fact. Which is all fine. I mean, that's not to say it doesn't work. It's just a different logical procedure to get it. It's also true that there's this dividing-by-n-minus-one thing that has polluted statistics and is nearly always useless, because it's meant to give you an unbiased estimate of these parameters, which is something that, in Chapter 6, I want to prove to you that you don't want. You don't want unbiased estimates, because bias improves accuracy. And I know you're like, this is madness, who is this guy? And this is because of the pollution of statistical jargon. Bias in statistics is not a bad thing. Somebody, I think it was Fisher, called this thing that was perfectly okay "bias", and it made it sound bad, and then people got attached to the idea of finding unbiased estimators. But unbiased estimators are hardly ever the best estimates of anything. So I got off track on this slide. That said, with even reasonable amounts of data, n minus one and n are basically the same, so if n is even reasonably large, in practice it has very little effect. But that's just something to watch out for. We'll return to this when we get there. It's practically the same, as you'll see. The difference here, of course, is that we get a distribution for sigma. We get its uncertainty. And as the amount of data increases, it gets more and more precise. And when you simulate observations from a linear regression, you care about that uncertainty, because sigma affects the spread around the mean that you anticipate in observations that you're going to see. And we're going to work with that particular issue today. Okay. So we can easily sample from the posterior. We've got all the numbers we need to do it. Let me show you how to visualize the samples from the posterior before I show you how to extract them in this case. So I'm showing you alpha against beta on the top left here.
And you can see the minus 0.99 correlation in the samples. So each point in that graph is a sample from the posterior distribution of those two parameters. And that's marginalizing over sigma, which you can't see in that plot. It's like sigma is the dimension you can't see, the one coming straight out at you. So what this tells you is that if alpha is a larger number, beta must be a smaller number, in order to get an equally plausible line. And there are a large number of approximately equally plausible lines along a thin ridge of combinations of alpha and beta that are negatively correlated with one another. And this is the thing that will start to feel natural in your head: the posterior distribution contains combinations of parameters. So in this case it contains lines and the spread around them. Yeah, question. Is this in part because it's still the case with Bayesian regression that you have to pass through the mean of height and weight, so you are rotating around an axis? So the question was, is it still true in Bayesian regression that you have to pass through the mean of both. It's not true in general, because we have priors, right? But with flat priors, yes, the MAP line will pass through the mean of both the X and the Y variables. It actually will pivot through there, exactly. And that's why you get this correlation. So if you center these variables so that the means of X and Y are zero, then you get the intercept for free. And I know you know this. And so in that case they'll be uncorrelated. And I have an example in the book like that as well. So this is exactly an artifact of the measurement scale; that's one way you could think about it. But if you had really strong priors, you could push the estimates away from that mean point, right? You could always do that. You could choose stupid priors and get an absolutely stupid answer whenever you want, right? Absolutely true. Okay. And then I wanted to show you the lack of correlation between sigma and beta, shown in the upper right. If you drew one for alpha and sigma, it would look the same. It's just a snowball. That's what no correlation looks like, right? There's no association between sigma and beta there. But they're both Gaussian. So you get this Gaussian hill, like a snowball that's been thrown down on the ground. Actually, I have no idea what a snowball would look like if you threw it down on the ground. Probably wouldn't look like that. But let's call that a snowball. It looks like a snowball. Does this make sense? Yeah, question. So they're not perfectly correlated, or...? Well, it's still true that the values in the middle are more plausible. There are more values in the center of that posterior distribution. That's where the MAP is. The MAP is right in the middle, at the means of alpha and beta. That's the MAP estimate, the most plausible line. And then there are less plausible lines as you move away from it. And those are highly correlated, but they're not perfectly the same. Yeah, exactly. They're not completely redundant. You do need them both. Okay. So in the book I show you how to do this. There's a convenience function in the rethinking package called extract.samples. It does exactly what it sounds like. If you have a model you fit with map, it pulls out the MAP values, it pulls out the correlation matrix, and it pulls out the standard deviation of each parameter, and it uses those to construct the multivariate normal distribution. In the notes I show you how to do this in the raw yourself.
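Both routes look roughly like this, with m_hw the hypothetical fit from earlier; the second version is the raw MacGyver way, using the multivariate normal approximation directly.

    # the convenience route
    post <- extract.samples(m_hw)    # data frame with columns a, b, sigma
    head(post)

    # roughly the same thing by hand
    library(MASS)
    post_raw <- mvrnorm(n = 1e4, mu = coef(m_hw), Sigma = vcov(m_hw))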
There's a little box on how to do it if you're curious. It's easy to do. What you get back is a data frame, like a raw data frame you'd work with, where every column is a parameter. They're not data, right? They're parameters. And every row is a sample, a set of sampled parameter values from the posterior distribution. And these columns have the right correlation structure in them, defined by the MAP estimate and the curvature of this multi-dimensional hill in Hilbert space. You know what I'm talking about? And I'm only showing you the first five here, but I think by default we get 10,000, which is usually way more than you need. But it's an adjustable parameter. If you look at the help for extract.samples, you'll see it. Does this make sense so far? These are just like the samples that drove you nuts on your homework, right? But you did those yourself, in the raw, using sample. This is automating it a bit, because it's using the quadratic approximation. But they're still samples. Once we've got them, we can treat them the same way. We can push them back through the model to make predictions. We can sum over ranges of them to answer questions about where the relative plausibility of values lies, things like that. They're samples, again. They're a way of avoiding having to do the calculus directly. Right? So I want to encourage you to think about this. What we've actually sampled are lines. Every row in the samples of the posterior defines a different regression line. And the density of lines of different placements and angles corresponds to the relative plausibilities, in the posterior distribution, of particular regression lines. So think of it this way. What Bayesian models do is: you define the question you're asking through this modeling language, and then the model considers every possible combination of the parameters and assigns them a relative plausibility conditional on the data. So in this case that means it considers every possible regression line in the X and Y plane, and it ranks them by their posterior probability, which is, again, remember, the relative number of ways that that line could generate the observations. Right? So it's still conditional on your model, but it's the garden of forking data still. It's still just that. We're still just counting the relative numbers of ways that different conjectures could produce the observations that we have. And in this case those conjectures are lines with some scatter around them, and there are an infinite number of them, and our machine has ranked them all. You're welcome. Right? A victory for mathematics. But now you have to, like, this is like an interview with the model, right? It's like Interview with the Vampire, but an interview with the model, if anybody remembers that movie. Book, actually, but you remember the movie. Was Tom Cruise in that movie? Something like that. Okay. So, lots of tasty men in that movie, yes. So anyway, this is being recorded for the internet, so I'm going to resist some natural jokes. Some of you know me, you know my sense of humor, and you know especially when Antonio Banderas comes up it's easy to distract me, but okay, it's like that part of the brain has got to turn off now, and we're going to move on here.
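Before moving on, here is a minimal sketch of what plotting a spray of sampled lines can look like in code, anticipating the 10-adult example coming up; dN, mN, and post_N are illustrative names.

    dN <- d2[1:10, ]                        # first 10 adults
    mN <- map(
        alist(
            height ~ dnorm(mu, sigma),
            mu <- a + b * weight,
            a ~ dnorm(170, 100),
            b ~ dnorm(0, 10),
            sigma ~ dunif(0, 50)
        ),
        data = dN
    )
    post_N <- extract.samples(mN, n = 100)  # 100 lines from the posterior
    plot(dN$weight, dN$height, xlab = "weight", ylab = "height")
    for (i in 1:100)
        abline(a = post_N$a[i], b = post_N$b[i], col = col.alpha("black", 0.2))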
All right, so we've got a bunch of lines, and I'm just showing you, again, that we've got these numbers: the a and b values define a line. I'm going to neglect sigma for the moment; remember, that defines the scatter of observations you expect around the line, how scattered the observations are supposed to be around it, because a and b just define the mean, the expected height essentially, but individuals will obviously be scattered around it in nature. And then I'm showing you, for just the first 10 adults in the data, say we fit this model for only 10 individuals. I do this so you can see the scatter in the lines. You'll see when we use all 300-something of them, they're all on top of one another. So, just for the first 10, I plot, I think this was the first 100 lines from this table. So I just take the first 100 rows here, and I use the a and b values to define lines, and I plot them with some transparency so you can see how they overlap. And so you can see the scatter, right, that with only 10 data points the machine is telling you: they're definitely positively associated, but I don't know exactly how much, and I'm not sure where the intercept is either. There are only 10 individuals, after all. What do you want from this? So you get scatter like that, and you can see it. In fact, this is my favorite way of representing uncertainty for Bayesian models: just plotting multiple curves. Readers can see really obviously what's going on. That said, I'm going to show you conventional ways to do it that give you fancy shading and stuff like that. But this is a nice way to do it, and it works for all types of models, even ones that are really quite complicated. We're going to do some distance matrix models at the end of the course, and this will work great for that, because you're sampling distributions from the posterior in that case. So yeah, you've got distributions in your distribution, right? So this is like Pimp My Ride, because, you know. Is that Xzibit? What's that? Yeah. So I'm glad people still remember this. I'm not old yet. Does this make sense so far? Yeah? Right now we've just got lines in the posterior distribution. There's a distribution of lines, and their relative presences in the posterior distribution correspond to the relative numbers of ways they could produce the data, conditional on the model. Now let's add data back in. We started with only the first 10 adults. Let's look at the first 50. If I feed 50 to the model and do the same procedure, I plot the first 100 regression lines. The first 50 individuals are shown up there. The cloud of lines is narrower now. There's still uncertainty, there's still scatter. They differ in their slopes and their intercepts, of course. But it's getting tighter, and you can see how it's getting darker because they're overlapping more. Yeah? The model is learning. And with 150 individuals, which is just about half the data, you can see now it's quite tight, but there's a difference in uncertainty in different regions of the data, and this is something you're used to in linear regression. This is the bowtie uncertainty that arises in them. We're pretty sure, at the average weight, what the average height will be. But we're a lot less sure, at extreme weights low or high, what the average height should be. There's a lot more uncertainty out there. This is a famous thing, and again, David Hinnan brought this up when he talked about pivoting on the mean of both variables.
In this case it is. We're going to pivot through the mean weight and mean height, and the regression line is going to pass somewhere near there almost guaranteed, otherwise it won't be a good fit. But that means that slight changes in the slope, slight uncertainty in the slope, can create big uncertainty at extreme values, because the whole line is pivoting. Because you assumed it was a line. It's probably not a line in reality. Nothing is actually a line. But whatever. And then with all the data, all 352 adults in the sample, you can see that they overlap quite a lot. The only place you can still see some scatter in these 100 lines drawn from the posterior is out at the extremes. Make sense? And this is linear regression as you've always known it. It's just described in a different language, a language in which the estimate and the uncertainty are the same thing: a distribution of the relative plausibilities of different values of the parameters. So we can process the uncertainty this way. So we can think about doing this in general, for any values of weight. Once you have samples from the posterior distribution, you can make the model predict things. And not just for the data you've already seen; that would be like retrodiction, and that's what we do in posterior predictive checks. And that's what you did in your homework that you're going to turn in today. I don't know, a lot of you already have, because there are lots of little stars in SmartSite where you turned things in. But, you know, midnight is your deadline. That's all I'm saying. Midnight is fine. Actually, 1 a.m. is fine too. Whatever. I'm on Alaska time. Whatever. But isn't Alaska an hour off? Yes. But we might want to do other things. And we will. You might want to do some forecasting, or you might want to do experiments in prediction to understand how the model works. And so I'm going to show you this way of generating these predictions in general, for the average, from the posterior samples. And this will simultaneously teach you how to visualize the uncertainty over the data you did see and do what I'm going to call, especially next week, counterfactual prediction, where you consider the behavior of the model for data you haven't seen, which is a very useful way to understand how it works. So, summarized: we still have our samples from the posterior in the symbol post. I'm going to try to behave and always use post for samples from the posterior. If I misbehave, scream at me, please, and I'll fix it. Consistency is a difficult thing, but I'm working towards it. And the way you generate predictions from these models is, well, you know the model, and the model is mathematical, so you can just plug samples back into the formula for the model and make predictions. So if you want to know the mean at a particular weight, the expected height at a particular weight, well, you know the equation, and it's shown at the bottom of this slide. The expected height at a particular weight is mu sub i, which is alpha plus beta times x sub i. So if you plug in a value for x sub i, and then you put in samples for alpha and beta, which you have, you can generate predictions for any combination of them. Now, we want to average over the uncertainty in the posterior distribution. Normally, if you were just going to do a prediction at the MAP, you could just plug in 0.9 here, which I think was the MAP value in this example. And that would be fine.
But to do the uncertainty, you definitely can't just plug in values independently. Because alpha and beta are correlated, remember. So you have to draw correlated pairs. And luckily we've already done that. They're shown at the top of this slide. The alpha and beta columns are almost perfectly negatively correlated, as a consequence. And so if we just plug the alpha and beta values on each row here into an expression like this, we get a bunch of predictions that have the right kind of uncertainty. We get a distribution of expected heights at a weight of, say, 50 in this example. And then you can summarize that distribution. And this is bound to be confusing, because it's weird. It's not how humans interact with the world. So I'm going to show you what this looks like in pictures. If we do that in this case, we store it in the symbol mu at 50, which is the mu at a weight of 50. What is the mu that comes from it? What the posterior distribution tells us is: because parameters have distributions, because they're uncertain, everything you compute with parameters is also uncertain. It must also have a distribution. I'll say that again, because if there's any law in Bayesian inference, it is that since there's uncertainty about parameters, parameters have a distribution, and therefore everything you compute with parameters also has a distribution. Yeah? You can say this to yourself over and over again. I'll have many opportunities to do it. And so the expected height at weight 50 should be a distribution. And you get that automatically from this one line of code, because the posterior distribution is a distribution, and R is great. In this line of code, post dollar sign a means take the a column from the object post. Then we add to that the b column from the object post, times 50. And since post dollar sign a and post dollar sign b are vectors, lists of 10,000 numbers, it just iterates over them, each one. There are 10,000 values in each, so it returns 10,000 answers. It takes the first value for a, it adds to that 50 times the first value for b, and it sets that aside; that's the first answer in mu at 50. It does it for the second, the third. It's done by the time your screen refreshes, right, because your laptop is really fast these days. There'll be models later on where you have to wait like a second, and you'll notice that it's a little bit slow, but it'll be fine. And then I show this distribution at the bottom. It's a density plot, a distribution like any other. It's pretty tight, though, because there's not a lot of uncertainty, because there are 352 data points. There's not a lot of uncertainty about where the mean is, and that's all we're talking about right now, the expected height. This doesn't take into account sigma yet. And the expected individual has a height a little over 159, with a very tight range around that. See that? It's a distribution. In your homework, which is already up on the website, you're going to do things like this for different weights and generate predictions for unobserved individuals, so you can do some mental exercise in this kind of forecasting. So imagine you had all these estimates, and then your field assistant said, oh, you know what? We didn't record the height for this one person, but I weighed them. So after firing this person, then you have to impute the height, right?
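That one line of code, plus a couple of ways to look at the result, is roughly this; post is assumed to come from extract.samples on the fit, and dens and HPDI are helpers in the rethinking package.

    mu_at_50 <- post$a + post$b * 50   # 10,000 expected heights at weight 50
    dens(mu_at_50)                     # plot the distribution
    HPDI(mu_at_50, prob = 0.95)        # summarize it with an interval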
So your model lets you generate a prediction, but that prediction will have uncertainty, and you can get the expected value and the uncertainty from the same calculation, because you have a distribution for the expected height. Make sense? It will make sense when you solve question one on your homework, because that's what you're going to do. Yes? So you see the expectation and the variance about it; if you're doing this imputation for the person, should this include the added uncertainty from sigma about this point, or is this sufficient? Okay, so the question was, when you do this, should you also include the uncertainty that comes from sigma? That's the scatter around this, and this is just the mean. And I think the answer is: it depends. It depends on what you want to know. In your homework, I'm just going to ask for this distribution, basically, and I want you to characterize it with the mean and an interval. But it may be that you care about the standard deviation as well. Absolutely. It depends upon your purpose, I think. Right? So if you have some sort of epidemiological application in mind with this, and there's some stature at which individuals come into risk, you absolutely need to use sigma. Right? Because you need realizations. You need to know what's observable. Right? But that's not what I'm going to ask you to do in your homework. But I could imagine doing that in the case where, okay, you want to know, given these estimates, what proportion of individuals at a certain weight are going to have a stature below this level; then you're going to need sigma, because you're going to need to get samples of actual realized heights. These are samples of average heights at a particular weight. So these are about an aggregation, in a sense. Does that make sense? So, again, it's like horoscopes. Yeah. This is horoscopic advice right now. Okay. So let me show you how to use a nice convenience function in the rethinking package which automates the thing you just saw on the previous slide. And this will work for anything you could fit with map. Well, I shouldn't say anything; that's like famous last words. I'm sure you could come up with something. You guys are clever, and you can always come up with some way to break my software, which I cherish, actually. It's a wonderful, wonderful thing. Last time I taught this, a student, Helen, Helen Shimura, some of you know her, came up with a really interesting bug. I have named it in the code Helen's Corner Case in her honor. So if any of you can come up with really delicious bugs, let me know, and you'll live on in fame in my code. And I'll buy you coffee too. We're hunting bugs here. It's all I can afford. I don't make a lot of money. I'm an anthropologist. I can buy you coffee. But really, if things malfunction and you can figure out a way to make stuff go wrong, let me know. I want to know, because I'm trying to make this as close to bulletproof as possible, and you folks are very clever at shooting bullets into stuff. So I value that. So there's this function I want to show you that automates what I set up before. It can do that because when you fit that model, you told it the model. So R already knows the model. There's no point in retyping it. It's good the first few times to do it yourself so that you know what you're doing, absolutely. And when I teach you something the first time, I'm always going to show you sort of the raw way that things work.
But in your daily life, you want some automation, because you're in a hurry. You're going to get the Nobel Prize, right? You've got to beat someone else to it. So use the convenience code. But here's the general way to think about it. It's the same procedure. Say we want to do that same exercise, but for a range of weights, some sequence of weight values, because we're trying to see what the regression line looks like over some particular range of weights. Or we just have a list of individual weights, and we want to generate a posterior distribution of expected heights for each of those individuals. What you do is you make a vector, a sequence of weight values that you want predictions for. And that's what the code here is doing. Nothing exciting about this. We make a simple weight.seq. We use the sequence function, seq, which just makes a sequence of numbers. It has a bunch of options. In this case, you define its minimum value, which is 25, its maximum, which is 70, and it increments by 1. So this is just a list of numbers from 25 to 70 in increments of 1. Does that make sense? Those of you who have been using R for a while know that you can just do 25 colon 70 and get the same answer, because it's a shortcut. It actually calls this kind of function; it's shorthand for it, the binary operator for making a sequence. Once you've got that, you've got a list of values that you want posterior distributions for. And then we're going to call this function called link. Link refers to something called a link function, which will be special to us later on in the course. There's one here, but it's hidden now. It's called the identity link, which just means you can't see it, sort of. But think of it this way: it creates a link between the parameters and the likelihood. The parameters like alpha and beta, it links them to the things at the top level that produce observables, like the data that we're anticipating. It connects them. And so you give it your fit model, and you give it some data to compute predictions for. And what it returns is, well, in this case there are 46 values in weight.seq, and I guess I was wrong before, there are only a thousand samples that it's going to use in this case. So the answer that's returned in the symbol mu is a matrix with dimension 1,000 by 46. What does this mean? For each of the 46 weight values, we get 1,000 samples. So these are posterior distributions for each weight value, described by 1,000 samples, which is plenty in this case. It's more than enough. So you could pull out each column, right, the second dimension is columns, and that's each of the weights in that sequence. And if you plotted those thousand values, there'd be a distribution, which is the distribution of expected heights at each of those weights. Our goal is going to be to put this on a graph as a line, as a cluster of lines with some shading. And that's what we're going to do in a minute. But this is all that's going on here. You guys with me so far? Or at least you're willing to indulge me to get through the rest of the story, and then you'll figure it out at home or something like that? Yeah. Which is fine. Okay. Let me spend a little bit of time talking about how link actually works. You don't have to master this, but I want to demystify what's going on so that you know. I mean, I absolutely use link.
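As a concrete sketch of that call, assuming the hypothetical fit m_hw from earlier:

    weight.seq <- seq(from = 25, to = 70, by = 1)             # 46 weight values
    mu <- link(m_hw, data = data.frame(weight = weight.seq))
    str(mu)   # a matrix: one row per posterior sample, one column per weight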
But at some point in your life it will break, and then you'll want to do this yourself. And I want to reassure you that it's not hard. link is a tiny, simple function. Don't look at the code, but it's a tiny, simple function. Actually, the code is kind of long, because it's trying to, you know, handle corner cases and stuff like that, and it has to anticipate any crazy thing you might have entered into map. Right? An infinite number of models can be defined with map, so you can imagine I have to deal with a lot of stuff. But the basic template is pretty easy when you know the model. So I'm going to show you here. Here's the top part of the recipe. The first thing a function like link needs to do is sample from the posterior distribution. So it gets a bunch of samples for each parameter, and those samples are correlated across the parameters in the right way. Then we define some series of predictor values. In this case, that predictor variable is weight, so it's a series of values for weight. But whatever model you're working with, it might not be weight. It could be something else. It could be speed. Then for each predictor value, for each sample from the posterior, we compute mu. And you can do this because you know the equation for mu. So you've got some pair of correlated a and b values, right? They come from a row in your posterior samples, and you know the weight value. You just plug them in, get that answer, set it aside. You'll get a thousand of those if you have a thousand samples. Then move on to the next weight value, do a thousand more calculations, and so on. This is why we make the computer do it. But this is all it's doing. It's just looping, right? And R is really good at looping. And then at the end you summarize. And that's up to you. This is where you have to make a choice. The typical choice here, because these distributions are going to be Gaussian, is to use the mean and the standard deviation, or the mean and some interval. And you can describe everything about it from that. This code at the bottom, I'm going to step through it. But this code at the bottom is sufficient to do all the calculations that link does. This is really the minimal link; it's just a few lines. So I want to take the next two minutes, say, and step through this code for you. And this will maybe be the last time I step through R code in any detailed way. Those are famous last words. And again, you're not meant to master this skill, but I want to clarify it. And there will be a time in your life when you need to do this for some crazy custom model, and you will be able to do it, because you'll just replace the part of this where you calculate mu with whatever you need to calculate in your model. Right? There will be some little organ in your model, a kidney so to speak. Mu is like this model's kidney, and the kidney of different models is different. But you know what it is, because you defined the model, and you can always calculate things for given samples from the posterior distribution. So let me step through it real quick. All right. Step one, we sample from the posterior. We store in the symbol post all of these samples from the posterior. extract.samples just pulls out the correlation matrix and the marginal posterior of each parameter, samples from that multivariate posterior, and stores them in a data frame. I'm showing you here this function str; str is the function in R for figuring out how code works. It stands for structure.
It tells you the structure of any object that you have in your system right now. And you guys know that R is like a massive calculator, right? It's a Turing-complete calculator. It can compute anything that's computable. But it really is like a big calculator with a lot of memory. So everything in its memory is called an object. And when you don't know what an object is, you can use str to see what its guts look like, in a scaffolded way. It's a great way to figure out what other people's code is doing, and sometimes your own. So I encourage you, when you're trying to figure out an example in the book, or why your homework is not working, to use str to inspect the intermediate results in your code. So I'm going to go through this slide that way. At every step, we're going to use str to see what's going on in there. So, in this case, it's just a data frame with 10,000 observations. Those observations are samples, right? They're not actually observations, they're just imaginary observations. For three variables, which are not actually variables, they're parameters, right? And you've seen this before, nothing unexpected. Now, I'm going to define a function. This is maybe the tricky part. A function is just some reusable code. You can actually make code into an object in R and then use it as a variable to do things. In fact, you can pass new values into it, substituting them in, and these reusable chunks of code are called functions in programming languages. And you can define one as easily as making a symbol for it, using the assignment operator, and then using the keyword function. For instance, function(weight) means there's one argument, one variable input that can take on different values, and it's going to have the name weight. That name applies inside the code that's to come. And then you just write the linear model, right? And notice the linear model uses post, so it assumes you've already done the first step, so it can find the samples. So, what does this thing return? Well, let me show you. Let's look at the structure of calling mu.link with the value 50. 50 is a weight value. The answer is a vector of 10,000 values. Why? Because there were 10,000 samples in post. For each of those, it computed mu at the weight 50. Yeah, you with me? Then, step 3, we define the weight values to compute predictions for, the ones we actually want. We're going to pass these into mu.link in a minute. Less than a minute. And I use sequence again. And again, if you want to know what's inside weight.seq, you can just do structure on it, str. See, it's 46 values from 25 to 70. You can see at the end the dot, dot, dot. It's like passive-aggressive text messaging, dot, dot, dot. Now, the action. This is the most confusing part. There's a wonderful set of functions in R that have the word apply in them, and they are magic. They do so much great stuff. They're also very confusing. But I'm going to use one here. What apply functions do is take a sequence of values, sometimes a matrix of values, and pass them into a function, collect all of the different answers, and put them in a nice package for you. They take some complicated structure and perform an operation on it; they apply the operation to the input. And those inputs and operations could be complicated. In this case, they're not that bad, but you can see the value of it in any event.
What sapply, which means simplified apply, does is take each of the values in weight.seq and pass it into mu.link. Now the answer from mu.link is a vector of length 10,000. So each answer is 10,000 numbers. So sapply packages them up into a matrix of dimension 10,000 by 46. Again, each row is a sample, and each column is a predictor value that you passed in. This is what link is doing for you, and why you happily don't have to do it yourself. So let me quickly go through sapply in a schematic sense. You don't necessarily have to worry about it this term, because I'm going to keep you busy, but in your future in R you'll solve problems with the apply functions. So I want to give you some sense in general of how they work. sapply is a great one because it simplifies the answer into a nice matrix you can work with, rather than a list. So let's just start: you can make a list of 10 numbers in R with 1 colon 10. That means a sequence of integers from 1 to 10. I just want to show you that to start with. We can then take sapply and use that as an input, this series of integers from 1 to 10, and pass it into any function like this, and it will pass each of those numbers from 1 to 10 in individually, collect the answers, and then make a vector that holds all the answers. sapply is good for this. This is a bit overkill here, but I'm just trying to show you how it works. So what this does is compute the square of each of the numbers. There's this little symbol z. There's nothing special about calling it z. Again, call it pickle if you want, whatever you want. The computer will not care. Just make sure that you use the same name in the code part of your function as you do where you define the argument. So all that does is compute squares, but you can do much fancier things. So this computes the geometric mean of each of those values. The z-th root of the product of z values is the geometric mean of those values, and that's what this does. Not that you want to do this, but I'll show you. It's up to you what it does, but these apply functions are just ways to do some operation on a list or a matrix, and they save a lot of effort. So that's why I use it there, and you'll see it in other people's code, so it's just worth dwelling on for a second here. Okay, let's pick up where we left off. We were on step four: we used sapply, and we've got this big matrix now. We're going to do some summary, and the way I want to encourage you to do it, at least for the moment in this course, for the next several weeks, is to use the apply function to do summaries of these. Here's the way apply works. apply is the more general version of sapply. You pass in mu, the matrix; that's the thing you're going to apply over. The 2 means the second dimension of the input, which means columns. And the columns are these 46 values. And then for each column, you're going to compute the mean. So what this means is: take this matrix, and for each column, give me the mean. I'll say that again. What this line of code at the bottom means is: for the matrix mu, for each column of it, compute the mean. It stores those means in mu dot mean, and then if you look at the result, you see there are 46 values, and each of those is the mean of the posterior distribution of expected heights for each of the weight values that you plugged in before. Okay, I appreciate there are a lot of moving parts here.
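Putting those steps together, here is a minimal do-it-yourself version of the link recipe, plus the plotting that comes next; it assumes m_hw and d2 from the earlier sketches and is an illustration of the procedure, not the actual source of link.

    post <- extract.samples(m_hw)                          # step 1: sample the posterior

    mu.link <- function(weight) post$a + post$b * weight   # step 2: the linear model

    weight.seq <- seq(from = 25, to = 70, by = 1)          # step 3: predictor values

    mu <- sapply(weight.seq, mu.link)                      # step 4: 10,000-by-46 matrix

    mu.mean <- apply(mu, 2, mean)                          # step 5: summarize each column
    mu.HPDI <- apply(mu, 2, HPDI, prob = 0.95)

    # plot raw data, the MAP line, and the shaded interval
    plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
    lines(weight.seq, mu.mean)
    shade(mu.HPDI, weight.seq)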
And you can sit down with a cup of tea later and go through this again, and you'll really get it. But it's important for me to demystify what's going on here, because demystification leads you to distrust your models, which I think is a healthy thing, because they're crazy little machines, and you're always smarter than they are. Yeah, question. Could you explain the link function again? The link function? The basic link function? So the question was, could I explain the link function again. So let me maybe, well, I don't want to go back all the way. What link does is let you pass in a set of predictor values for any predictor variable that's used in your model, and it'll compute the posterior distribution of the linear model at those predictor values. So in this case, the linear model, you called it mu, and mu is defined as the mean of the Gaussian distribution of outcomes. So what link does is compute the distribution of mu at any set of predictor values you like. That's what you pass in. So the idea is, like on your homework, I'm going to ask you to say: I've got a person who has some particular weight, you know, 30 stone, whatever that is. That's pretty heavy, isn't it? Someone of 30 stone, that would be a very large person. So, someone is 30 stone; tell me what this model says about how tall they should be if they were Kalahari foragers. And the model will give you some absurd answer, but you can compute it by plugging that weight into the linear model, except you have to integrate over the uncertainty in the parameter values. And link does that for you, because it computes a mu for every sample from the posterior, and it gives you that collection back. It gives you all of them. By default it does 1,000 of them, but there's an adjustable parameter; you can do as many as you like. Is that a sufficient answer for now? Yeah. Yeah, yeah. You'll get in there and fight with it. Yeah, exactly. If you're like me, you'll need many cups of tea over years to really start to get this stuff. I'm sympathetic to that. Okay. All right, so let me help you visualize this a little bit. So now what we're going to look at is the result. We used link, or we did it the MacGyver way on the previous slide, to get the matrix mu, where every row, and there are 10,000 rows, is a sample from the posterior distribution, and there are 46 columns, and each column is a different weight that was input as a predictor value. So now we can plot these, actually. What I'm showing you here are the first 100 rows from this big matrix, just plotted up in the space of the data, so to speak. The first line of code here, the plot line, is an empty plot. It doesn't plot the data points. I did that just so we had a blank field, so I could show you what's inside of mu. What's inside of mu are a bunch of little points. For each sample from the posterior, it says mu is a particular thing. The problem is that we don't know which parameters are the quote-unquote real ones. We've got a distribution of them. So there are a bunch of points in there. If we plot up all these points, just like when we plotted lines before, you can see the spread. So the columns are, so to speak, going left to right; you see these little bundles of points that have been plotted up. The line down here loops over the first 100 rows in mu, and it just draws points. So for each of the weight values it plots the i-th sample. That's the i-th row in the matrix; that's what this notation means.
Some of you know this already. The comma here, since there's nothing after it, means all of them. So that's all of the weight values, all of the columns of that matrix. Weird thing about R, you get used to it. You can actually write mu[ , ] and that means the whole matrix. There's a little box inside chapter 4 about this, because it's terrible syntax. It really is. It's just a horrible thing to do. There should be a symbol that means "all", so that people can read your code. The machine can always read your code, but that's not the issue. We're not trying to make machines happy. We're trying to make one another happy. At least we should be. Anyway, I can't change the syntax; that's just how it is. When there's nothing after the comma, that means all of the columns. If there were nothing before the comma, that would mean all of the rows. So mu[i, ] means the i-th row of mu, all of its columns. So it's the whole row, and that plots all of those. There are a hundred points at each of the weight values shown there, and you can see it's recapitulating that bow-tie uncertainty we saw before. This is not a great way to display the data, but I'm just trying to expose what link did. What did link do? It computed these values. That's all it did. However, that's a lot, because you can do a lot with this. We're still in the interview with the golem. A nicer way to plot this, especially if you want to do a posterior predictive check, is to compute summaries. So, for example, we can apply the mean function to all the columns of mu, and we get the mean for each of those clouds of points, right? And that defines the MAP line; that's where the MAP line through there comes from. We can also get the highest posterior density interval. You can apply fancier functions for all this too. So this means: for each column, give me the 95%, which is the default, the 95% highest posterior density interval. Then what we do is plot the raw data; now I'm not hiding it. That gives us the plot in the upper right, right? All the little blue points up there. That color is rangi2; you'll see it in the notes. That's just the blue that I like, and rangi means color in Swahili, right? I'll have to apologize for that; there was no way to know it wasn't a reserved word, but I think it was a nice choice. You'll see it in the notes, and it's always just that blue. Choose purple if you want. And the 0.5 means it's transparent by a half. Then I draw the MAP line, which is just x values from the weight sequence and y values from mu.mean, because the mean of the posterior distribution at each particular weight is where the MAP lies. It's the peak. And then there's this function shade, which is a little magical; it's a function in the rethinking package which draws this shaded boundary, which you can barely see in class, unfortunately, because it's a bit washed out. It'll look better in the slides. It takes that uncertainty and makes a shaded interval, the 95% density of the scatter of lines around the MAP, and shades it in. It just computes that polygon; there's a polygon that gets computed, and it does it for you. It'll be easier to see that shading if I show you what it looks like with less data. Yeah, question before I do. Just a clarification question: so mu.mean is the distribution of height means at each particular weight? Well said.
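[Code sketch, not verbatim from the slides: summarizing the mu matrix and drawing the plots described above, using the rethinking helpers named in the lecture (HPDI, shade, col.alpha, rangi2); the adults data frame name d2 is an assumption.]

```r
# First plot: an empty frame, then the cloud of mu values for the first 100 posterior samples.
plot(height ~ weight, data = d2, type = "n")          # blank field, just the axes
for (i in 1:100)
    points(weight.seq, mu[i, ], pch = 16, col = col.alpha(rangi2, 0.1))

# Summaries: a mean and a 95% HPDI for each column (each input weight).
mu.mean <- apply(mu, 2, mean)
mu.HPDI <- apply(mu, 2, HPDI, prob = 0.95)

# Second plot: raw data, MAP line, and the shaded interval around it.
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)          # the MAP line
shade(mu.HPDI, weight.seq)          # shaded 95% region of the lines
```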
Yes, the question was about mu.mean: it's a vector of means, the mean height at each weight in the input vector. Exactly right. How many levels of recursion of abstraction can you hold in your head at once? Exactly. And that's why I'm very sympathetic to any kind of confusion; you need a cup of tea. We'll buy you that cup of tea sometime, anybody. It's a lot of levels of abstraction, but you start to get used to it. It gets a lot easier as time goes on. So you can do like three, four levels of abstraction right now. Later on we'll have more. Multilevel models add another level. It's all good. It's just turtles all the way down, as I always say. But it's true: we have means of distributions of means. Yes, that is exactly what we have. I'm sorry, but it's just how it is. So let me show you; it's easier to see the uncertainty when we use less data, because then you can actually see the shading. Again, all those shaded intervals are doing is summarizing this cloud of lines. They're giving you a visual reference for the cloud of lines. If you like the cloud of lines, by all means plot the cloud of lines. I think it's easier to understand, particularly for newcomers to Bayesian inference; it immediately says: look, there's a bunch of lines in the posterior. It's full of lines, and some of those lines are more plausible than others. Here are a hundred sampled at random, and you can see where they tend to lie. That kind of speaks for itself. The shading is just a pleasing visual representation of that, one that's less cluttered than all of these lines. The lines can make it hard to see the raw data, at least in some cases. So the shading has become quite conventional, and there's nothing wrong with it. So here I'm doing it just with the lines, for different amounts of data. n equals 10 at the top: a lot of scatter. n equals 20: it shrinks quite a lot already. Until we get to 350 at the end, and there's very little uncertainty about where the MAP is, where the mean is. There's still uncertainty about where the heights are; we're going to do that next. Just hang on, we'll get to it. And then you can use shading and you can see the same thing, but now they're nice, pretty, symmetrical bow ties. You see these in graphs all the time, and ggplot will do this for you automatically for lots of functions. There's nothing wrong with that. That's cool. But this is where it comes from; at least, this is one way to justify those intervals. The question was, is the interpretation of the shaded region similar to the interpretation of a standard error? No. I'm going to resist exploring that, because it would take too long. Standard errors are constructed by imagining data you haven't seen, and Bayesian estimates are never conditional on data that has not been observed. These shaded regions are conditional only on the data we have in hand, not on imaginary data that has not been observed. Frequentist estimates are conditional on data that has not been observed. That's a really interesting distinction, and worthy of exploration, but this class isn't about that. There was a previous version of this class where I compared the two and talked about the trade-offs of each and all that stuff. I'm going to beg off, because there's no way I can finish the lecture today if I do it. I'd caution against analogizing any of your knowledge of frequentist statistics to these entities; just learn this in terms of the garden of forking data. What is the interpretation of this?
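[Code sketch, not from the slides: refitting on subsets of the adult data and plotting 100 sampled lines, to reproduce the shrinking cloud just described. The subset sizes and the prior values are assumptions made for illustration.]

```r
for (N in c(10, 20, 50, 350)) {
    dN <- d2[1:N, ]                       # first N adults
    mN <- map(
        alist(
            height ~ dnorm(mu, sigma),
            mu <- a + b * weight,
            a ~ dnorm(178, 100),          # assumed broad intercept prior
            b ~ dnorm(0, 10),             # assumed mildly conservative slope prior
            sigma ~ dunif(0, 50)
        ),
        data = dN)
    post <- extract.samples(mN, n = 100)  # 100 lines from the posterior
    plot(dN$weight, dN$height, col = rangi2,
         xlim = range(d2$weight), ylim = range(d2$height),
         xlab = "weight", ylab = "height", main = paste("N =", N))
    for (i in 1:100)
        abline(a = post$a[i], b = post$b[i], col = col.alpha("black", 0.2))
}
```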
That's the 95% region of density of the lines in the posterior distribution. That's what it is, and that's all it is. Okay? Now, I will say that the connections are really compelling, and there are cases where you can translate Bayesian models into non-Bayesian models and vice versa, and that helps you understand both. But I just want to emphasize one thing and still teach you how to do one thing, right? So let's beg off. I'm not saying that one way is always superior to the other, but they have really different logics, and sometimes it matters. Okay. So, prediction intervals. We need to do this before you go. We've got lots of time, we've got 15 minutes. So, what about predicted heights? We want to get sigma onto this graph. Very often you won't care about sigma, but I want to show you how to do it in case you do. It works very much the same way, except now there's an additional layer of sampling, because we must simulate observations from a Gaussian distribution. That Gaussian distribution is defined by a combination of alpha, beta, and sigma, right? So it's still a sample from the posterior that tells you what the distribution looks like, but then you must draw a random number from it to simulate an individual. So there's one additional thing. rnorm will do this for you, just like you used rbinom in your homework to simulate babies, right? Yes? Yes, okay. It works the same way. This can be automated with a nice little function in rethinking called sim, which simulates new data from your model. In contrast, link does the parameters of the linear model: whatever you call the linear model in your model, that's what link spits out. It spits out values for that. So if you call it mu, it spits out values for mu. In later models, like logistic regression models, you'll have a little thing in the model called p, for the probability of success. It'll be like the probability of a boy in your homework, and in that case link will give you probabilities; it depends on the model type. sim, in these cases, gives you observable things that you might have seen, the things that would have let you estimate those quantities. So in this case, it'll be heights. In logistic regression later, it'll be counts of things, counts of events, like baby boys, things like that. You use sim just like you use link. You still get a matrix as the result; this was a thousand by 46 for the same reasons as before, but now the entries are actual heights. So they're broader; the distributions are more scattered, because there's a standard deviation around the mean. Now let's plot them out. Plotting works the same way. Shade works the same way. There are two shaded intervals on here. You can barely see it. There's the dark one in the middle: that's the uncertainty around where the MAP line is, the uncertainty around where the mean lies. The little tight bow tie, deep in the middle. Those of you who are squinting, I apologize; this will look great on your screens at home. And then there's this really wide shaded interval that almost encloses all the observable data, and that's what you get when you include sigma. So you get the predicted heights, the simulations of heights. Does this make sense? I've drawn the 95% interval of heights here. There's nothing special about that; I encourage you to try different ones. In fact, a nice presentation is often to over-plot multiple intervals, because then you get this contour of darker shading as it goes out, and you can show people where they are.
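[Code sketch, not verbatim from the slides: using sim to get simulated heights and over-plotting both intervals; the object names and the interval width reuse the assumptions from the earlier sketches.]

```r
# Simulated heights: one random draw per posterior sample per input weight.
sim.height <- sim(m, data = list(weight = weight.seq), n = 1e3)
height.HPDI <- apply(sim.height, 2, HPDI, prob = 0.95)

plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)           # MAP line
shade(mu.HPDI, weight.seq)           # narrow bow tie: uncertainty in the mean mu
shade(height.HPDI, weight.seq)       # wide envelope: adds sigma, covers simulated heights
```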
The purpose of this is to communicate the shape of the uncertainty. None of these boundaries is really meaningful. There's nothing special about 95%. It's just convenient to work with. Okay. I'm not going to step through sim in any detail. I just want to say there's a box on this in the chapter where, again, I explain how it works. It works much like link does, except now the function that you sapply across has rnorm inside of it. So, again, levels of recursion in your brain: yes, we have simulated values, distributions of values simulated from distributions that are the means of... yeah, I've already lost it. It's something like that, right? But it's all there. That's exactly how it does it: it samples an individual, for each sample from the posterior distribution, from the implied normal distribution of heights. That's all it does. Okay. Last topic, and we're making comfortable time here, is to ease our transition into multiple regression, where we have more than one predictor variable. There's this transitional topic often called polynomial regression. I want to show you how to do this because it's extremely common, but simultaneously I want to discourage you from doing it in your own research. It's a funny thing, but this will happen more than once in this course: sometimes you have to understand things anyway. In this case, it's also a really good way to see that these linear models don't necessarily make linear predictions. What's linear about them is the relationship between parameters and data, that it's a linear function, not that the shape of the predictions is linear. So this will help me show you that, and also give you some beginning of abstracting about what you can do with these models. Obviously trends are not always linear, and even if trends are linear in a particular range, if you extend that range out far enough it gets ridiculous. So on the left here I'm showing you the regression we did for weight versus height, but imagine weights increasing; say we put a McDonald's in the Kalahari. People would realize it's a great idea; actually I'm all for it. Seriously, it would be great. But they'd get heavier, and I suspect it would not continue as a linear trend: people would get wider rather than taller. It can't be linear forever. It just can't. And then in contrast, if you include kids, which I show you on the right, we plot age against height. Obviously the relationship between age and height is not linear. It is in certain regions; there are growth spurts where it is. But over the whole range of your lifespan we have terminal growth, so it looks something like that. So you need a different kind of model if you want to predict this. Ideally you'd have some biologically inspired model, where you weren't just using a geocentric linear model. That's the first thing to say, and I'll say it over and over again as we go. But in the absence of that, especially in the social sciences, people frequently just use polynomials to do this, because polynomials are very flexible. They're functions of one input predictor variable, and you just keep squaring and cubing and... whatever raising to the fourth is called. And fifth-ing and sixth-ing. I forget. I took Latin in high school, but then I took a lot of drugs in college and forgot it all. Wait, I'm taping this. That's okay, I'm an anthropology professor; it goes without saying that I did a lot of drugs in college. No, but I can't remember.
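[Code sketch, not from the slides: the hand-rolled version of what sim does internally, as just described, which is an sapply over the input weights with rnorm inside; the parameter names a, b, and sigma are assumptions carried over from the earlier sketches.]

```r
post <- extract.samples(m)
sim.height.byhand <- sapply(weight.seq, function(w)
    rnorm(n = length(post$a),             # one simulated height per posterior sample
          mean = post$a + post$b * w,     # the implied mu for this weight
          sd = post$sigma))               # plus the scatter contributed by sigma
```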
Anyway, these are very flexible functions, so they can fit a lot of shapes. The more terms you add, the more times the path of the curve can turn. So you can fit all kinds of crazy stuff with them, and people do. Used with restraint, they're useful. The most common is the parabolic model, because it allows diminishing returns in something; it allows the trend to slow down. I'm going to show you that in this case, and I'm also going to show you a cubic, not just the quadratic model with the parabola. Again, same data, but now we're going to include all of the individuals, not just the adults. We've been working only with adults so far. In the graph in the lower right, the adults are shaded in blue, and then all the ones we haven't been using, the kids under 18, are in black. We're going to include those now. This is not linear. We're going to fit a parabola to this. Same data as before, just more of it: a little over 500 individuals now. The first thing to do here is to realize it's very useful, when fitting models like this, to standardize all the predictor variables before you fit the model. What standardization means is: you take each variable, you subtract its mean from it, and then you divide by the standard deviation of the original variable. There's no loss of information here; neither operation is destructive of information. You're not cheating. It doesn't change the answer you get, and it won't change the parameter estimates either, but it makes it easier to fit the model. Your computer will get fewer hiccups. This is a really good thing to do by default, and I encourage you to do it. I give you the equation at the bottom to standardize; you will do this a lot. It's a really healthy thing to do. In the case of these polynomial models, sometimes the model won't fit at all if you don't do this. Is this a similar thing to centering? Yes. Centering is the first part, removing the mean, and that's actually most of the help. The centering gives you most of it, but it's also good that, when you divide by the standard deviation, you make everything have this common kind of scale, which affects the search range when optim is trying to climb the hill. It gives it a nice graded hill it can climb. If the scale of your data is really, really broad, it might not detect any slope, and it can be really hard to search the space. So it's a good thing to do, but centering, in my experience, does most of the work. We're going to come back to centering in the interactions chapter, where it's pretty useful, especially for interactions. We'll get back to that. It's a good question. So you'll want to do this: when you practice this at home, do the standardization and you'll get some idea about it. It changes the units on your x-axis, but it doesn't change the answer. I think this one would work both ways, but it's a good healthy habit to get into, to standardize. It's just an artifact of the machine: how you fit the model is part of the model, and if you want the right answer, standardize. So, here's the parabolic model. The only thing that's different is that we've got an extra parameter, beta sub two. These aren't really slopes anymore; they're the parameters that define the path of this curve. And the second variable, so to speak, is just the first one squared. This is the formula for a parabola. You will remember parabolas from your nightmares of high school: endless homework problems that made no sense.
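[Code sketch, not verbatim from the slides: standardizing weight and building the squared term for the parabolic model. Howell1 is the data set shipped with the rethinking package; the new column names weight.s and weight.s2 are assumptions made for illustration.]

```r
library(rethinking)
data(Howell1)
d <- Howell1                                  # all individuals this time, kids included

# Standardize: subtract the mean, then divide by the standard deviation.
d$weight.s  <- (d$weight - mean(d$weight)) / sd(d$weight)
d$weight.s2 <- d$weight.s^2                   # the "second variable": the first one squared
```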
Then we define priors as before: again a broad prior for the intercept, because we have no idea where it's going to end up, and then these very mildly conservative priors for the two coefficients, call them slopes if you like, but they're not really slopes anymore, and then the uniform prior on the scale, on sigma. This goes into map exactly the same way. Right up there we put in standardized weight and standardized weight squared. It works great. By the way, in map, when you write the code, you're coding in R; it's just R code, and map executes the code you type. So you could do all kinds of naughty stuff up there if you want. You could put print statements in there, and then every time it executes, it will print something. Don't do this, because it will go really slowly, but you could. It really does just trust the user to do stuff up there, so you can enter arbitrary functions of any kind in the map code. It just runs it and climbs the hill; that's what it does. So this fits, no problem. You can plot this over the data exactly as in the examples we did previously today, and in this case here's your parabola with the shading. I've done the shading there for simulated heights. It's very hard to see in class, I apologize, because of the lighting; at home when you look at this you'll see the shaded envelope. It misses the heights down here, it kind of goes under the curve in the middle, and then in the mass of data in the middle it does a good job and encloses the individuals there. It's the same idea. Now the posterior distribution contains parabolas, and when you sample from the posterior distribution, you're sampling a parabola. So you could plot 100 parabolas on there and visualize the uncertainty in the same way. Same thing. The posterior distribution can contain anything you can model; later on we'll put matrices inside of it, we'll sample matrices from the posterior distribution. You'll love it. Does this make sense so far? Yeah. So let me give you an idea again, since I did this before, and in my experience this helps people a lot to get the idea that the posterior distribution contains parabolas. This is madness, right? What does this mean? Let me show you what it means. It's easier to see when we start with very little data and gradually educate the model. So let's start with only the first 10 individuals in the data set, just going in order. They all turn out to be adults; the adults are first in this data set, just because of the way Nancy Howell entered them. So with the first 10 individuals we estimate parabolas, and you notice that it's pretty sure in that certain range where the data are, and it has no idea on the other side. This is one of the things I love about Bayesian inference: there's a kind of caution to it. In ranges where you don't have data, the lines just flail around, and it has no idea. So you can get these very broad regions of prediction in combination with regions where it's confident, and you get that for free, just by defining the model. Now the model may be a bad one; in this case I think it is, but as you'll see in your homework, the last homework problem does a way better job with this. Add in 10 more individuals, and we've got some kids now. Now it knows where to curve down. It knows that you don't get taller when you're lighter, right? It didn't know that in the first part, because it's just a golem; what does it know? But now it has some kids, and so it's pretty sure the curve should come down. And then we can add in the first 50 individuals.
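[Code sketch, not verbatim from the slides: fitting the parabolic model with map and plotting it over the data. The specific prior values are assumptions in the spirit of "broad intercept, mildly conservative coefficients, uniform sigma". The cubic model discussed next just adds d$weight.s3 <- d$weight.s^3 and a b3 * weight.s3 term to the linear model line.]

```r
m.parabola <- map(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b1 * weight.s + b2 * weight.s2,
        a ~ dnorm(178, 100),       # broad, effectively flat intercept prior (assumed values)
        b1 ~ dnorm(0, 10),         # mildly conservative, centered on zero
        b2 ~ dnorm(0, 10),
        sigma ~ dunif(0, 50)
    ),
    data = d)

# Predictions over the standardized weight range, using link and sim exactly as before.
weight.seq <- seq(from = -2.2, to = 2, length.out = 30)
pred.dat <- list(weight.s = weight.seq, weight.s2 = weight.seq^2)
mu <- link(m.parabola, data = pred.dat)
mu.mean <- apply(mu, 2, mean)
mu.HPDI <- apply(mu, 2, HPDI, prob = 0.95)
sim.height <- sim(m.parabola, data = pred.dat)
height.HPDI <- apply(sim.height, 2, HPDI, prob = 0.95)

plot(height ~ weight.s, data = d, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)          # the parabola at the posterior mean
shade(mu.HPDI, weight.seq)          # uncertainty in the mean
shade(height.HPDI, weight.seq)      # envelope for simulated heights
```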
It gets tighter yet with the first 100, then the first 300, and then all 544 individuals, and we've got a very tight set of parabolas. That's the mean, the expected height at each weight value, and again, if you put sigma on this, which is on the previous slide, you get the envelope of prediction around it. Makes sense? And all the operations of using link and sim are the same for this model as they were for the linear model, because it's fundamentally the same thing. It's just a different set of assumptions inside, but it's the same sort of creature, the same sort of device that you're building. Yeah, question. The question was, does the number of kids matter? Yes, it does. One kid will start to pull it down, but it won't be so confident about how far down it should go, or whether maybe it should go up, right? It updates as you add each one, and if you want to do these sorts of experiments, you can trickle them in one at a time and keep updating it. This model fits really rapidly, so you could do that; it wouldn't be a problem. Okay, I'm not going to go through the cubic model in any detail, just show you how you'd specify it. Every additional term you add to the polynomial lets the path of the curve bend one more time, so you can fit pretty complicated things. So we can imagine doing that: we can make a cubic model. Add it in, and now we have a weight-cubed term at the end with another coefficient, b3. This also fits, no problem. It also makes no sense, but it does fit the data very well. When I say it doesn't make any sense: what do those beta coefficients mean biologically? They don't mean anything, right? They're just coefficients, so you can't interpret them in any meaningful way. That's one of the reasons not to use these. But that said, it's really good at describing your data, as you can see over here. This is really nice. This is the cubic function, right? It gets this little upturn up here, where the really heavy individuals are a little bit taller. That's a cohort effect, by the way, in this population; they're not all the same. You don't have ages on here; age is another slice that we have, and we did the full demographic analysis. But this fits even better than the parabola, because it gets to turn again, whereas the parabola must go down at some point. And you can keep going if you like, more and more. When we get to chapter 6, we'll revisit these polynomials and I'll show you some really absurd things they can do. There'll be a point to it other than embarrassing polynomials; it'll make sense when we get there. Does this make sense? I'm very pleased with myself, because I'm right on time. There are 30 seconds to go and I'm at the "what happens next week" slide, the first time this has ever happened. I had to make jokes about doing drugs in college, but they got me to be exactly on time. So, for next week: your homework is already up online. Please turn in your homework by midnight, or around there; I'll put up the solution set tomorrow morning. Take a look at the homework for next week, and next week when you come in on Tuesday, we'll start multiple regression. Multiple regression has no new theoretical concepts; we'll just add more parameters. Instead I'm going to show you some tricks for interpreting coefficients, and the plotting gets more interesting, because once you have more than one data dimension to deal with, there are more ways to plot predictions and consider them. So that's next week. It'll be practice for the most part. The hard initial climb of El Capitan is over. We're going to take a little rest next week and do some cooking on the stove, and then you'll
do the next pitch off to the left, where you hopefully don't fall off and die. Okay, thank you all. I'll see you next week.