99% would cast a little bit of a different light on it. So forget statistics now, and think about a different kind of model, the model of the motion of the planets. Most of you will know that in the history of science, there is this now discarded family of models of the solar system, or really of the universe. This is a model of the whole universe, the celestial spheres. This is the geocentric family of models. Geocentric means Earth-centered, but the Earth was not exactly at the center of the geocentric model. Instead, there was this thing, what's it called? It's called the equant. And the Earth sits opposite the equant, and both are offset from the center. And so the other planets are orbiting fixed points on an orbit around a center that the Earth is offset from. So it's not really geocentric, it's crazier than that. It's really crazy. These models, from a modern perspective, look ridiculous. The amazing thing about them is that they're really good models, really, really good models. Claudius Ptolemy didn't invent the geocentric perspective, but he arguably perfected it. And his model of the positions of the planets was used for a very long time, for centuries. And it remains accurate. You could use it to find the position of Jupiter and Mars and other planets in the night sky with extraordinary accuracy. It works very well. What does it not work well for? It doesn't work well for launching probes to Mars. You would likely miss Mars. I don't know what you would hit. Probably nothing, because there's just a whole lot of nothing out there. But if you want to spot it in the sky, as if there were a celestial sphere, it's a really good model. It works extremely well. And in fact, planetariums use versions of the geocentric model, because they need a fixed observer that everything is rotating around, because you're not going to move the building. That's how planetariums work. And so planetariums need this model, and it is used to construct planetariums today. And Claudius Ptolemy had hit upon something which turns out to be a general method of approximation called a Fourier series. The French mathematician Joseph Fourier discovered that you can take any continuous repeating function, a cyclical function, and approximate it with a series of sines and cosines. This fact is so amazing that when he first claimed it, the French Academy denied it could be true. And then he proved it, which is what's nice about math. You can actually prove things. And then they're like, oh, OK. Any continuous cyclical function can be approximated with sines and cosines. And that's what the geocentric model does. It approximates orbits, which are cyclical continuous functions, with sines and cosines. And that's what the little circles upon circles, the epicycles, are: they're ways of embedding sines and cosines. It's a series, and it's not infinite. It's finite, so it's only approximate. But it works really, really well. And if you add enough epicycles, you can make it accurate to any degree of precision you like. So it's a great model. We make fun of geocentrism because, well, it's wrong. But it has this feature of lots of other models, which are also wrong, which is that it works really well. This is a frustrating thing about applied statistical inference: just because your model makes good predictions, that does not mean it's right. And most of the models we use in applied statistics aren't even wrong.
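To make that Fourier claim concrete, here is a minimal R sketch, not from the lecture, that approximates a cyclical function, a square wave, with a finite sum of sines; the square wave and the number of terms are just illustrative choices, and each added term plays the role of another epicycle.

# Hypothetical illustration: partial Fourier series of a square wave.
x <- seq(0, 2 * pi, length.out = 500)
square <- ifelse(sin(x) >= 0, 1, -1)              # the target cyclical function

fourier_approx <- function(x, K) {
  # sum of the first K odd sine harmonics of the square wave
  out <- rep(0, length(x))
  for (k in seq(1, K, by = 2)) out <- out + (4 / (pi * k)) * sin(k * x)
  out
}

plot(x, square, type = "l")
lines(x, fourier_approx(x, 3))     # a few terms: a rough approximation
lines(x, fourier_approx(x, 25))    # more terms, more "epicycles": much closer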
And this is the category that linear models fall into. They're not good mechanistic models of any kind of natural process. We use linear regression and ANOVA, but they're just descriptions of how the mean is conditional on something else we know. They're like geocentric models. They're useful for describing populations and samples. But they don't model any kind of mechanism. And if you try to use them to launch a probe to Mars, well, you're likely to create space junk, right? So this is the comparison I want you to take. Geocentrism on the left is descriptively accurate, as accurate as you want it to be, just keep adding epicycles. Keep adding circles on circles. It's mechanistically wrong. I hope you agree. I think occasional defenders of it pop up from time to time, just like flat-Earthers and things like that. It's a very general method of approximation, a Fourier series, and of course we know it to be wrong. So we can use it wisely to do things like build planetariums without embarrassment, right? Just use it in an applied sense. Regression I want you to take the same stance on. It's descriptively accurate. It's mechanistically wrong, right? Nobody thinks that most of the phenomena you study are actually pure straight lines. It's a general method of approximation, but unlike the geocentric model, it's taken way too seriously. People make the mistake of thinking you can make powerful causal inferences from a descriptive ANOVA. So we're gonna use linear models a lot. I think they're very useful, just like the geocentric model, but I want you to learn a different way of interpreting them. So let's continue a little bit with history here, but take a step sideways a little bit in the history. And think about another figure, Carl Friedrich Gauss, well-known in this part of the world, right? He was born in Braunschweig, Brunswick I guess we say in English. But he did most of his work in Göttingen, I believe, and he used to be on German money, as in my picture here, until the euro. And so you used to be able to get the formula for the Gaussian density off the 10 mark note. You have to look really close, but there it is, the formula for the Gaussian density. It's fantastic, great. Gauss is arguably the best mathematician who has ever lived. He made major contributions in every area of mathematics, but he was really an applied scientist, and the math he developed was aimed at solving problems in astronomy and physics and many other areas. Certainly arguably the greatest mathematician who has ever lived, by a long shot. And he was one of the inventors of linear regression. Like the other things he did, he invented it to solve a problem, in his late 20s, I believe, the problem of predicting the return of a comet. He got famous real fast when he predicted where and when the comet would return. So I want you to think of linear regressions, these little Gaussian golems, as being little simple statistical robots that model the mean and variance of some measure that we now call normally or Gaussian distributed. And what I want to talk about in the first part of today's lecture is where the so-called Gaussian distribution comes from, and how Gauss thought about it, as a model of error. These models treat the mean of some observation or set of observations as conditional on some combination of predictor variables, other things we might know about each case.
And conventionally, we assume there's some constant variance or scatter around the Gaussian distribution. So we'll have lots of pictures of this as we go. This is kind of the summary slide version of it. You may already know all of this. So this is Gauss's 1809 work. I think the work was done at the very end of the 1700s, but it came out in 1809. And he builds up an argument for using the normal error model to predict the positions of celestial bodies. And we would now recognize the argument as Bayesian. But of course, at the time, no one used that phrase. Nobody knew about Bayes. That was later. And Bayes didn't contribute anything to this. It was called inverse probability. And it's just the way everybody did probabilistic inference at the time. And this is one of the first applied uses of this sort of Gaussian error model. Gauss didn't call it that. He was pretty arrogant, though, so he might have. He was a pretty self-satisfied individual, I think. So why normal or Gaussian distributions? Why are they so common in statistics? I think there are three ways to justify them. The first is that they're just really, really easy to work with. They have lots of nice mathematical properties. This is purely pragmatic. It's not the justification we're gonna lean on, but it is a real reason. You can do a lot of convenient things with them because they're so easy to calculate with. Second is that they truly are really common in nature. Nature seems to like bell curves. And I'm trying to give you an intuition today for why that happens. I think we know. This is one of those mysteries of nature where I think the answer is known very well, why there are bell curves all over the place. And then third, from an epistemological perspective, think about the information you or your golem has. It's among the most conservative of assumptions you can make. If all you wanna say about some collection of measurements is that they have a mean and a variance, then the Gaussian distribution is the most conservative. That is, it spreads probability out over the greatest area of any distribution at all. There is no other distribution which spreads probability as flatly as the Gaussian for a given mean and variance. So that makes it conservative. Anything else would be tighter and would lead to greater risk of mistakes. It would be a less conservative set of inferences. That's the epistemological perspective. So let me try to give you the ontological understanding of why nature likes normal distributions. Let's generate a normal distribution. So imagine a football pitch. Say I take you guys out to the nearest football pitch. I'm sure there's one here. There's always one here. I line you up on the center line here. Midfield line, what's that called? Somebody who actually plays football, tell me what that's called. And each of you is gonna have a coin, and you're gonna flip it simultaneously. And if it comes up heads, you're gonna take a step to the left, the left as it shows on the slide, and if it comes up tails, you're gonna take a step to the right. So everybody flips their coins and they take a step. Now the line is all staggered, and we're gonna keep doing this repeatedly. And so the line begins to scatter about. And some individuals, after some number of tosses, are further away from the midline, but some of you zigzag back and might even cross the midline a couple times as we keep doing this.
And after each coin flip, we record the positions, and the distances from the midline form a distribution of distances from the midline. And we want to know the properties of that distribution. And what is gonna happen, you probably see the foreshadowing here, is that a Gaussian distribution arises inevitably. It's impossible to stop it, in fact, from happening. So let's do the simulation. This is in the book. I'll give you the code to do this in the book as well. So if you go look at the notes, you'll see it in more detail. I apologize that in the lecture, I don't have time to step through all the calculation details all the time. So we're gonna imagine this experiment where the horizontal here is time, the number of coin flips we've done. And then the vertical is the position, the distance from the midline, with zero as the midline. And as we start, we're all on the midline, and then we begin to spread out. Each of the gray lines is a different simulation, or sorry, a different individual. And the dark line is meant to show just one particular individual. So you can pick out the zigzag pattern that you might take as you flip coins and go through this. So let's take a pit stop after four coin flips and look at the distribution. We're just gonna take all those positions at the ends of those gray lines and we're gonna make a histogram out of them. And what does it look like? That's looking kind of bell-y. It's not quite a bell curve yet. The tails are too thin, right? We need some tail on that. And so let's do four more, right? Keep everybody flipping. All right, you notice some individuals really get out there and then they come back. Some individuals will leave the field if you have enough individuals, right? Eventually, people will just leave the field. But now this is a pretty good bell curve, right? This is almost statistically indistinguishable. It's converging to a Gaussian distribution. The convergence is very rapid. So even after only eight flips. So what does this do for you? It means that after eight flips, if you were to guess the position of some anonymous individual on the field, you could assign credibility to the guess of any particular distance using the Gaussian formula, and that's probably the best you could do, right? And then so on out to 16. And by the time we get to 16, here I superimpose an actual Gaussian on it so you can see it happens pretty rapidly. It doesn't take very long at all for this to happen. So this is an example of why nature is filled with bell curves. But what process is actually doing it? The process is addition. I know it's sort of weird to say something like that. But processes that add fluctuations together produce normal distributions. That's what it is. It doesn't matter what the underlying distribution of fluctuations is, for the most part, as long as it has finite variance, which most things do. When you add those fluctuations together, big fluctuations will eventually be canceled by a string of small fluctuations, and so on. And all that is preserved from addition is the mean and the variance of the underlying process. And so nature ends up with bell curves whenever it adds things together. And nature adds things together whenever order doesn't matter. Whenever you combine things and the order you combine them in doesn't matter for the aggregate, that's addition. That's sort of where addition comes from. Addition is defined as the arithmetic operation that has that property. And so you can think about a bunch of natural processes as being like addition.
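Here is a minimal sketch of that simulation in R. The book's own version in chapter 4 uses uniform step sizes rather than fixed coin-flip steps, but the result is the same; the numbers of people and flips here are just example values.

# Each of 1000 people flips a coin 16 times: heads is a step left (-1),
# tails is a step right (+1). Record everyone's final distance from the midline.
set.seed(1)
n_people <- 1000
n_flips  <- 16
pos <- replicate(n_people, sum(sample(c(-1, 1), n_flips, replace = TRUE)))

hist(pos)              # histogram of distances from the midline
plot(density(pos))     # already looks like a bell curve after 16 flips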
Products of small deviations are approximately additive. I show this in the book. So that counts too. You get things that are indistinguishable from Gaussian because they're basically addition. And the logarithms of products are exactly addition, just on a transformed scale. So that counts as well. This is where you get log-normals. Lots of things are log-normal. That means they'd be normal if you just measured them the right way, right? On the right, I show you, this is Francis Galton, a cousin of Darwin's. Also a big name in the history of Bayesian statistics. He built a mechanical machine in 1894. He called it a bean machine because it was loaded with beans. Dried beans, I assume. British beans, if you know English breakfast, right? Those beans. And the beans fall from the top through these sort of binary gates, if you will. They fall to the left or right, just like the football players on the field. And they fall into these bins at the bottom, and it forms a Gaussian distribution. And he used this to do mechanical tests of Bayesian inference theorems. Galton was one of the first people to do what we recognize now as the modern Gaussian-Gaussian Bayesian regression model, where you have a Gaussian likelihood and a Gaussian prior. He built machines to simulate it, like this bean machine. I'm sure there's a museum in London that has this machine, because there's a museum in London with everything. Dragons, everything, a museum in London. So let me try to summarize why normal. There are two general perspectives. The first is, you might say, the ontological perspective. Mechanically, nature makes bell curves whenever it adds or approximately adds things together. This dampens fluctuations. Fluctuations get added and they damp one another, as long as there's variance in them. The really big ones get compensated by the really small ones, and so on. And eventually you'll have a long enough string. No matter how big a fluctuation is, there'll be a long enough string of small ones to cancel it. And that's why you get bell curves. And symmetry arises from the addition process, which is an amazing thing. There's a central limit theorem about this. So damped fluctuations end up being Gaussian. And you can think about it from an information perspective. The only information that remains about the underlying generative process, when all you have is the bell curve, is the mean and variance. And so that's why we use them to model things, the mean and the variance. That's what we've got. There's an upside and a downside to this. The upside is you can do a lot of useful work on the conditional, the dependence of the mean and the variance on things you do know, without having to know the underlying generative process. So imagine if to model human height, you needed to know the genetics of human height. That would be horrible. There would be no progress in studying child nutrition if you had to understand the genetics of it to make any progress. But you don't. You can just measure the phenotype and model it as Gaussian, because it is. At least once it gets sufficiently above zero, it's Gaussian. It can't go below zero, so it can't be exactly Gaussian. So that's the upside. The downside is you can't infer the process. You can't study just the bell curve and figure out the genetics of height. Because there's nothing about the bell curve that tells you. A huge number of processes will produce bell curves. An endless menagerie, anything. As long as there's addition, you get a bell curve. So you can't go backwards to the process.
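A quick sketch, along the lines of the example in the book, of both claims: small multiplicative fluctuations are approximately Gaussian on the natural scale, and large ones are Gaussian on the log scale; the fluctuation sizes here are just example values.

set.seed(1)
# product of 12 small multiplicative fluctuations (each between 0% and 10% growth)
small <- replicate(1e4, prod(1 + runif(12, 0, 0.1)))
# product of 12 big multiplicative fluctuations (each between 0% and 50% growth)
big   <- replicate(1e4, prod(1 + runif(12, 0, 0.5)))

plot(density(small))       # approximately Gaussian
plot(density(log(big)))    # Gaussian on the log scale: a log-normal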
From the epistemological perspective, if all you know are the mean and variance of something, the most conservative distribution you could use is the bell curve. Anything else would be betting on some narrower range of outcomes. Later on, when we get to chapter nine, we'll spend a lot of time talking about maximum entropy, which is where this perspective comes from. The least surprising distribution is Gaussian. In fact, if all you're gonna say about a set of measures is that they have a finite variance, Gaussian is your most conservative assumption. And the Gaussian arises from solving for the most conservative distribution. You get exactly the Gaussian formula. And this is the perspective that's in this book, picture on the right, Edwin Jaynes, Probability Theory: The Logic of Science. So you might say, to paraphrase Jaynes, nature really likes maximum entropy distributions. There are other ones. It's not just the Gaussian. And when we get to other distributions later, like the binomial, we'll also explain them in these terms. Okay, linear models. So linear models are a big family of procedures that are often taught as separate special procedures. And I assume most of you will know one or more of these. You can think of it as the general linear model. There are t-tests, and there's univariate regression, multiple regression, there's ANOVA, ANCOVA, MANOVA, MANCOVA. I forgot what some of them are. And so on, yada, yada, yada. That's an option in SPSS now, I believe. All of these are the same basic model building strategy. You've got a set of Gaussian outcomes, and you're gonna model their dependence on other things you know, using additive functions. They're really all the same strategy. So we're gonna focus on the strategy and not the individual procedures, because then you can build anything you need for the particular dataset you have in hand. To do this, we're gonna need some language for building models and describing them. And you got a glimpse of this last week. So what we demand from a language for modeling, for building golems, is that it needs to answer some questions for us, from an applied perspective. If you're a theoretician, you have different questions, but we'll only brush up against theory, in uncomfortable ways, in this course. So the first is, what are the outcomes? What are the things we're trying to explain as conditional on other things? We often call those outcome variables. Second, how are they constrained? And this helps us choose what's called the likelihood function, the distribution for them. Third, what are the predictors, if any, that is, the other things we know that would help us predict the outcomes. And then, how do these predictors relate to the likelihood? And finally, what are the priors? And we're gonna have a language that specifies all these things in a uniform way, no matter what the answers are. And that's what I wanna teach you guys. And it's the standard model notation in applied statistics everywhere. So it's worth learning. And this is what it looks like. It's beautiful, isn't it? So let's go back to the globe tossing model. There we had tossed the inflatable globe a number of times, nine times, and we had observed a certain number of water results. And we wrote down this model. This is the model we calculated the posterior distribution for. So let me take a tour through it and show you how it answers the questions on the previous slide.
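For reference, the model on that slide, reconstructed from the description that follows, in the notation the book uses, with N_w the count of waters and N = 9 the number of tosses:

N_w ~ Binomial(N, p)
p ~ Uniform(0, 1)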
First thing is the outcome variable is N sub w, which is the count of waters. This is the outcome. Yeah, you with me? This little tilde means "is distributed as". You can read it as "is distributed as". And that's an epistemological "distributed as". It's not a claim about what it has to look like in nature. Now in this case, it's a pretty good claim about what it has to look like, but it could just be epistemological. The likelihood function is named right there. In this case, it's the binomial, the coin tossing distribution. And then we have things that N sub w depends upon. The binomial can take an infinite number of shapes, and its exact shape depends upon the things inside it that it's conditional on. And we often call those parameters, because in the case of the likelihood function, the arguments are parameters of the likelihood function. Some of them we get to observe. Like N here, we know what it is. It's nine. And that was part of our experimental design. And then we have a parameter that we need to estimate because we can't observe it, and that's the proportion of water on the globe. And we give it a prior. In this case, a uniform between zero and one. Remember, it's not the greatest prior. It's kind of a dumb prior, but it's sufficient for teaching. And that's the prior distribution. Okay. So you could say this in plain English as: the count N sub w is distributed binomially with sample size N and probability p. The prior for p is uniform between zero and one. That's the globe tossing model. And there's sufficient information there to specify an actual computational analysis. This describes what the model is. There are other details we'll get to later about how exactly you do the updating, which might be different. But this is a language for describing sort of the brain of the golem. Right there, that's the whole brain of the golem. So now let's take some data and do linear modeling instead of globe tossing with it. And this is the data set that's used through most of chapter four. It's boring data in some ways. It's not boring for me. This is really exciting data given what I do. The sort of thing we're interested in is how humans grow up. And so the quotidian data of just how tall and heavy individuals are at different ages is really exciting data for me. It has lots of implications for human evolution and growth and lots of stuff. It's awesome data. So it may seem boring to you, but then you're boring. It's great data to work with, and height is normally distributed as soon as it gets sufficiently above zero. As is weight, for that matter. And so we're gonna model height as it's related to weight. How does knowing weight help us predict height? I haven't said anything about causation yet. So these data are in the rethinking package that comes with the book, in a data frame called Howell1. These are data from Nancy Howell's work on the Dobe !Kung. She collected all these data by doing interviews, life history and reproductive interviews with women. And these data were collected, I wanna say the late 70s is when most of these interviews were done. No, it must have been late 60s actually. It must have been late 60s. But she did a bunch in the 70s as well, I guess. And all these data are up online now. Credit to Nancy that she just uploaded all this stuff. It's up in spreadsheets now. So let's carry on. The first model of the distribution of height, just thinking about the height variable for a moment, is we're gonna model the outcome here and call it h sub i.
This is: height is distributed normally with a mean mu and a standard deviation sigma. So let's walk through this. You wanna read this statement as: the height h sub i of an individual i. The i there indexes an individual, because there are a bunch of individuals and they have different heights. And the i is an index, so it lets you keep track. This may seem superfluous right now. It doesn't do any work. It's just there to annoy you, right? Just another italic character on the slide. Why? We're gonna need it later, because it's gonna get peppered throughout the rest of the model specification so you can keep the correspondence between an individual's height and other things you know about that individual together. You don't wanna predict my height based upon your weight. You wanna predict my height based upon my weight. That's why you need i. So the height of individual i is distributed normally with mean mu and standard deviation sigma. And where do mu and sigma come from? They're just symbols, and they're the mean and standard deviation. There's this convention in applied stats of using Greek letters for things you can't observe. I don't know why, but that's a convention. And I will try to stick to that convention, but you will notice me sinning against it many times in the course. So again, h sub i is the outcome. Tilde means it's distributed as. The likelihood is normal. There's a mean mu and standard deviation sigma, and those are things we need to figure out from the data. We need priors for them as well. And I'm gonna put some priors in here which are not too awful. And this is a good chance to start thinking about what awful means or not. There's so much data here that you'd have to use really, really awful priors for them to have any impact. But it's worth thinking about, before we've seen the data, what the priors imply. So we're gonna think about that for a second. These priors mean: before you've got the sample, what's a reasonable expectation for your machine about the distribution of heights? Well, you do know some things, right? You might say, you don't know anything, why don't we just put in flat priors? That would be very bad. Why? Because that means you could have negative heights. And flat on the variance would mean it could be infinite. Do you think heights are infinitely variable? No, you don't. So you do know things, right? And you don't have to put all that information in. You could make the priors weakly informative so that you're doing better than stupid, right? That's all we ask for. You can quote me on that. And so here's an example. These priors are really diffuse. They don't contain all the information we have about human height. You're human, you know a lot about human height. But they're better than predicting negative heights, right? They don't predict impossible people. Well, they do still leave too much probability mass on impossible people. So first, the mean: we're going to say we don't know where the mean is, but we're going to put it around 178 centimeters with a standard deviation of 20. That's very conservative. It's not a shrew, right? It's not an elephant. It's about hominin-sized. Pretty good. And then the variation within the population, we're not sure, so very conservatively, this big range from 0 to 50. So let's see what the implications are of these priors. You can simulate observations from the prior to figure this out. This is called the prior predictive.
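Here is a minimal sketch in R of that prior predictive simulation, using the priors just described; the number of simulated draws is arbitrary.

set.seed(1)
sample_mu    <- rnorm(1e4, 178, 20)     # draw means from the prior for mu
sample_sigma <- runif(1e4, 0, 50)       # draw standard deviations from the prior for sigma
prior_h      <- rnorm(1e4, sample_mu, sample_sigma)   # simulate heights from those draws

plot(density(prior_h))   # what the golem expects about height before seeing any data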
I hinted at this last week with the globe tossing model, but now you can see it. This example is in the book, and I step through the explanation of the code in the book. So forgive me if I go quickly over this, but it's discussed in the book. We sample means from the prior. We sample sigmas from the prior. And then we simulate heights to get the prior predictive distribution of heights, before we've updated on the actual Howell data. And then I plot the density at the bottom. All right, you wanna think about: so here's 100 centimeters, that's 3.3 feet. I don't know if there are any people who are that short. At least adults, and this is only adults, I've taken the kids out of these data. And 200 centimeters is 6.5 feet. That's basketball players, right? So this is a very big range of possible hominins, right? But notice that there are no negative height individuals, which is what would happen if you had a flat prior, because basically you're programming your regression to think that negative heights are possible. And then it's gotta average over the possibility that individuals might have negative heights, which is a bad idea, I argue. Does this make some sense? So you can do this for any model, and this is the way you understand what priors imply: you force the model to make predictions before it's seen the data, and then you see what it thinks before it sees the data. And often, we'll see this later in the course, often it thinks really ridiculous things. And this helps you understand non-Bayesian procedures, which nearly always use the equivalent of flat priors. And they expect impossible things prior to seeing the data, which is why they depend upon large sample sizes, so the data can overwrite those expectations. Okay, we're gonna estimate mu and sigma. We're aiming for the posterior distribution, as always. The posterior distribution is the relative counts of all the ways the data can happen, according to each combination of parameters. I'll say that again. The posterior distribution is the relative counts of all the ways the data could happen. Which data? The data we have, not some other data, only the data we have. Conditional on the different combinations of parameter values. That's what it is. So for each combination of parameter values, and there's an infinite number of them, but that's no problem for your computer, it doesn't care, infinity's just as easy as three, it's gonna give you the relative number of ways you could get the data, the heights in the Howell data set, conditional on that combination. That's what the posterior distribution is. Yeah, you'll get comfortable with this. But that's all it is. It's that unglamorous. But it's still why we need the computer, because we can't do the counting ourselves. We can do all the other stuff, but we can't do the counting. So how do we get that distribution? This is the last example of grid approximation in the book, because already in two dimensions it's a bit monstrous. So I encourage you to look at chapter four, where I have all the code for doing the grid approximation to compute this posterior distribution. Take a look at it, run it once on your computer, and then leave it aside forever. You will never do this again. Grid approximation is a great teaching tool because it helps you understand, it takes all the magic out of it. You're just counting stuff. So how does grid approximation work here? You just loop over all the combinations of mu and sigma within some range that you pick, where you think there might be non-zero counts.
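Here's a sketch of that loop, along the lines of the chapter four code; the grid ranges and resolution are choices you make, and the data frame names (Howell1, d2) follow the book's example.

library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]     # adults only

mu_grid    <- seq(150, 160, length.out = 200)
sigma_grid <- seq(4, 9, length.out = 200)
post <- expand.grid(mu = mu_grid, sigma = sigma_grid)

# log count of ways the observed heights could arise, for each (mu, sigma) combination
post$LL <- sapply(1:nrow(post), function(i)
  sum(dnorm(d2$height, post$mu[i], post$sigma[i], log = TRUE)))

# add the log priors, then convert back to relative plausibility
post$prod <- post$LL + dnorm(post$mu, 178, 20, log = TRUE) + dunif(post$sigma, 0, 50, log = TRUE)
post$prob <- exp(post$prod - max(post$prod))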
And then for each of them, you just count up the relative number of ways the data could arise. And that comes from the normal distribution; that's what the normal density gives you. It's like the soccer players on the field: the relative expectation of the number of players at that particular position. And then if you take samples from that and plot it out, you get the cloud, the snowball on the right there, which shows you that there are some highly plausible values of mu and sigma in the middle there, and then decreasing plausibility as you move away in the scatter. So this posterior distribution is now two-dimensional. We're looking down on it in the slide. But you can also take what's called the margins of the distribution. From any particular parameter's perspective, you can imagine looking at the side of this hill. And then you get profiles. And that's what I show you in the right-hand column of this. So for mu, it's as if you were standing at the bottom here and looking up at this hill, and that's the contour you'd see. That's the marginal distribution of mu. And what it means is you're averaging over sigma. And then the other direction, stand on the left-hand side of this plot and look at the hill, and that's sigma. It's the marginal distribution of sigma. Margin means edge, right? But in statistics it means averaging. When you average over the other stuff, what you get is called a marginal. I tell you this vocabulary not to glamorize it, it's horrible vocabulary, right? But you're just gonna hear it a lot, so there's no avoiding it. Marginal distributions. What we're going to do for the next several weeks is instead quadratic approximation. If you start adding more parameters to these models, which we will, grid approximation becomes impossible, really. Not literally impossible, but you will be dead before your stats finish, right? The combinatoric explosion of the grid, the size of the grid, is really awful. And so even the fastest computers aren't gonna finish before the sun expands and swallows our planet. So we do something else. We use various approximations, and that's what we'll do for the rest of the course. The approximations that are gonna work for us are the so-called quadratic approximation, or Laplace approximation, that's him on this nice French postage stamp, and Markov chain Monte Carlo, which is another family of approximations, which are in fact better, I think. But it's nice, starting out, for you guys not to have to fight with Markov chains just yet. That'll come. We're gonna spend time on the quadratic approximation as a scaffold, because you don't have to fight with it as much. Okay, this will give us an approximation. We assume that the posterior distribution is Gaussian, and we approximate it according to that assumption. We don't have to believe this assumption. In fact, it's important not to, because you wanna recognize the cases where the approximation is bad. But often, as I'll show you, it's a really good approximation. You can estimate with two numbers then. If it's a Gaussian distribution, you need exactly two numbers, and no more, to tell you its whole shape. That's what's nice about the Gaussian. So you need its mean and its variance. That's all we need. So we're gonna get those, and it turns out that calculus can give you those numbers. And we're gonna do it through numerical calculus, or rather your computer will.
And then once you get the peak of the posterior, which is the mean, right, in the middle, because the mean of a Gaussian is at its peak, that peak is, in Latin, called the maximum a posteriori, or MAP. And then the standard deviation carries the same information as the variance. There are lots of algorithms for doing this. R has a very capable optimization engine called optim, which is what we're gonna lean on. There's a tool in my package that will do this for you. You just have to use the language to program the golem. And then the golem has the motor to do the optimization. The interesting thing about the quadratic approximation is that you've already done it lots of times, even if you've never done Bayesian stats. If you've done maximum likelihood estimation, this is what you're doing. If you have flat priors, the only information about where the peak is and what its variance is comes from the likelihood function, and then that's maximum likelihood estimation. It's the same thing. If your prior is not flat, then you're doing Bayesian inference. Now, not everybody would agree with that statement. That's a cartoon version that's sufficient for today. There's lots of nuance to be had on both sides of that. They're not exactly the same. Interpretations vary, but as far as the computer cares, it's the identical procedure; the interpretations are different. Okay, so there's a tool in my package called map, for maximum a posteriori. And you give it instructions in the form of this modeling language. So here's the model we saw before on the right, in the text version. And then on the left, you have to create something called a formula list. That's what flist means, a formula list. And it's stored inside a kind of object in R called an alist. So what is an alist? An alist is a list that can hold stuff and won't try to interpret what's in it. So you can put all kinds of garbage in there, and it'll just say, okay. Don't use list. If you use list, without the a in front, it'll try to interpret what's inside, and then things will happen that will be bad for you. So just use alist. That's computers for you. What are you gonna do, right? I should say, I saw a poll yesterday that R is the least hated programming language, according to some online survey. Which surprised me. I was like, what? There must be sampling bias. No, R is great. It just has things called alist that are different from list in super important ways, and there's no way to remember which is which. Just use alist. So the correspondence here is that we have a variable named height, then a tilde, then dnorm, which is the function for the normal distribution in R. And then you can create some symbols, mu and sigma. The words there are arbitrary. Use whatever you like. But you might as well use mu and sigma. And then mu is also normally distributed, and you put in a mean of 178 for the prior and a standard deviation of 20. And then sigma: dunif is the uniform distribution in R, between zero and 50. Yeah, you see the correspondence? You get better at this after you do a few of them. The nice thing about this, and I know it's a huge pain compared to choosing a menu option in SPSS, is that it forces you to learn all the assumptions. So what do you do with this formula list? You pass it to the function map, and you give it the data, and then it finds the maximum a posteriori estimates for all the parameters. In this case, there are two. And you can summarize the result.
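Here's a minimal sketch of that first model in code, assuming the adults-only data frame d2 from before; in newer versions of the rethinking package the same function is called quap.

flist <- alist(
  height ~ dnorm(mu, sigma),    # likelihood: heights are normal with mean mu, sd sigma
  mu ~ dnorm(178, 20),          # prior for the mean
  sigma ~ dunif(0, 50)          # prior for the standard deviation
)

m4.1 <- map(flist, data = d2)   # find the maximum a posteriori estimates
precis(m4.1)                    # lean summary: means, sds, 89% intervals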
So the answer gets stored in this object m4.1, model 4.1. And precis is my summary function. It's sort of a no-nonsense, very lean summary. It doesn't have p-values. That's why I wrote it. Get them out of my sight, kind of thing. It just gives you the mean and the standard deviation of each marginal in the posterior, and then an 89% credibility interval. Why 89%? Because it's prime. No, because 95% is an abomination, right? There's no reason to use it, except that we have five fingers on each hand. That's the only reason. So you will not see me using it, except when I slip up and accidentally do it through unconscious bias. So here I want to show you, at the bottom, how good the approximation is. The dashed curve, apologies, this doesn't look so good in the room, the dashed curve is the approximation you get from taking those means up there and those standard deviations and drawing perfect Gaussian curves from them. And the dark curve, it looks better in the book, so look in the book, the dark curve is the grid approximation that we computed to arbitrary precision, to show you that for mu, it really is Gaussian. This is a Gaussian posterior. In fact, theory can tell you it had to be. For sigma, it's not exactly Gaussian. There's a little more probability mass on the up end and less on the down end. Again, theory can tell us that that was supposed to happen, but it's really good. It's quite good. And often the Gaussian approximation is amazing. Why? Because counting is adding. And so it's the central limit theorem again to the rescue. Okay. Scaffolds. This is what I've been saying all along: you wanna think of map as a scaffold. You're not gonna keep using this. We're gonna use it in the course because it forces you to learn all the assumptions of the model you're specifying. And that's really good for you. It'll help you develop a lot of confidence in interpreting the output of the model. You won't have nagging questions about what the model is actually doing, which is what most push-button software does to you. Because it makes things easy right now, and you suffer later. What I'm doing is making you suffer now so that it'll be easier on you later. Yeah? You'll thank me later. Stockholm syndrome will set in. And you'll say thank you. No, this is really good for you, trust me. This is what you wanna do. You wanna know how the golem works, because eventually the golem will misbehave, and you need to diagnose its behavior, and the answer is in the assumptions. Specifying the model fully this way forces you to know what's going on. This also means that it's incredibly flexible. It's not restricted even just to linear models. It can do a whole bunch of non-linear models and factor analysis and all kinds of stuff, because they're just models. You'll see how to just write them in those formula lists. Quadratic approximation is not really a very good way to approximate the posterior, though; Markov chain Monte Carlo is a much better way to do this, more capable especially as models get more complicated. So about halfway through the course we're gonna transition away from map and start using Markov chains, but the formula list will stay the same even after that transition. So you learn this form of specifying these models, and it's independent of the way you fit them. And that's the real value of it. You want to abstract your knowledge of the golem's brain away from the golem's motor, right? Its stomach, or whatever it is that does the counting. Sorry, the golem metaphor has its limits.
So scaffolds are great in education, and it's just an irony of a lot of things in scientific education that you have to do it the hard way first before you can responsibly do it the easy way. So I do this without apology, but with explanation. That's why. All right, so let's add a predictor variable now, because it's a little bit more interesting. Height and weight are related, at least in humans. And so we're going to be interested in predicting individuals' heights with weight. This relationship, by the way, does vary a lot across human populations, right? And the correspondence with age is incredibly important for understanding development. We're just looking at the adults right now. How does weight describe height? Obviously weight doesn't cause height, right? You're not tall because you're heavy. So let's keep that in mind. These models are not causal models, they're just descriptive. How do we get weight into this model? Here's the standard linear regression. You've probably seen this before, but maybe not in this form. We're gonna make a linear model of the mean mu. The top line is the same as before. This is the likelihood. The height of each individual is normally distributed with mean mu sub i and a standard deviation sigma, which doesn't have an i on it. So what that means right away, the assumption is written right there, stamped on the golem's brain: the standard deviation doesn't depend upon the individual. So our prediction error for any particular height, the normal distribution around the expectation at any particular weight, is constant. That may not be true. You could make sigma depend on i, that's fine. You can do it, it's not breaking the law at all. Mu sub i, then, is a line, the equation for a line: alpha plus beta times x sub i. Alpha is just some parameter. We'll talk about what it is in a second. Beta again is just some parameter. And x sub i is data; x sub i is gonna be the other data we have on individual i, in this case their weight. And then we need priors for these things. I decided here to use some really bad priors, to give you an idea that they don't matter because there's so much data here, you'll see. Say we take alpha and we set it so that the mean is 178 and the standard deviation is 100. Why is this a bad prior? Well, in combination with the others, because this will happily predict negative heights. It's a really big standard deviation. So your golem now thinks that negative height individuals are possible. And since it's a dumb robot, it will happily carry out the calculation, and it will count all that stuff up. You know better than it does. Beta is the relationship between weight and the outcome. I'll have a lot more to say about this. And we give it a normal prior. I'll have a lot to say about priors like this as we go. This is essentially flat over any interesting range, but centered on zero. And then our uniform prior from before. So let's walk through the language version of this. So the mean mu sub i is on two lines now. And that's just to say we're defining mu as conditional on other stuff. It's not a parameter anymore. It won't be in the posterior distribution. It's something you calculate from other stuff that's in the posterior distribution. Yeah, before it was a thing we had to count. Now it's a thing we compute from other stuff that we've counted. And it's the mean on row i, which is individual i. And it depends upon these other things, like the weight on row i. And then these symbols: alpha. What does alpha mean?
Well, implicitly, it's the mean when the weight is zero. Obviously that doesn't make sense. There are no zero weight people. Yeah, even at conception, you weigh more than zero. But nevertheless, that's the definition of alpha. It's often called the intercept. It's the value of the mean when the predictors are all zero. Sort of annoying, but that's just what it means. It's just needed to define the line, but it won't be on the graph. And then beta is the change in the mean for a unit change in x sub i. We often call this a slope. This is probably familiar to you guys, yeah? That's okay, I'm just gonna reteach everything. I'll work on the assumption that you know nothing. But I know you know a lot, so don't take it as disrespect. I know you guys have done a lot of this before, but I just wanna reteach it with my language. So it's often called the slope: every time x gets bigger by one, mu gets bigger by beta. I'll say that again. Every time x gets bigger by one, mu gets bigger by beta, yeah? So here's the thing about priors on linear regression. There's a set of conventions that you often see in textbooks, including mine, which I would like to call the horoscopic advice on regression priors. Why horoscopic? Being an astrologer, I assume, and I'm not an astrologer, but being an astrologer, I assume, is a very difficult job, because what you're forced to do is predict a person's life from their birthday, right? Maybe also where they were born, I guess. Is that cheating in astrology? I don't know. But that seems hard, right? I mean, there's not a lot of information to go on. You know, as a scientist, I'd like to know some more stuff about people if I was gonna try to give them useful advice about their life. This is what statistics is like. Scientists come to me all the time, and they're like, so I did this study, here's what the data looked like, what should I do? And I'm like, well, let me get you a horoscope. The details matter in science, right? I don't have to tell you guys that. But when you're writing a stats book or teaching a stats course, you can't focus on the details of each particular case, because then you're not providing generalizable advice. What I mean by the details of each case is: if you have information about the meanings of parameters, then you can do better than the horoscopic advice on this slide. What is the horoscopic advice on this slide? If you knew nothing else, and what you were modeling were just random numbers, then for the intercept: look, the intercept has gotta move pretty freely, because you wanna let the slope move, and you're interested in the slope. And if the slope changes, the intercept has gotta change, because that's what lines are like, right? You can't hold the intercept in one place and move the slope. The line would break. So you wanna let the intercept swing pretty freely. So give it a loose prior so it can swing. And as long as you put information on beta, then it doesn't matter so much what the prior on alpha is. And this is horoscopic advice, because if the intercept does mean something and you have information you can use to fix it, you should fix it, right? So if, for example, you know there's a real growth process you're modeling, and that means you know that at weight zero, individuals have height zero, that's information you wanna put in the model, right? I'm working on a paper like that right now, where we fix the intercept because we know that when an individual is one cell, they're as tall as one cell, right?
We know the weight of one cell, and you can make a model like that. You wanna do that. You don't wanna fit the intercept freely, right? You wanna fix it, because the information tells you to. But if it's a horoscopic situation, where you don't know what's going on, then this is okay advice. For beta, we're gonna spend a lot of time later, a whole week in chapter six, talking about a problem called overfitting. And what overfitting means is that golems, statistical models, have a tendency to learn too much from the sample. That's what overfitting is. They're trying to optimize their fit to the sample, and you wanna make them skeptical, program them to be skeptical of the sample, put some other information in there so that they don't overfit. The tendency is that they exaggerate effect sizes when they overfit the sample. So you wanna put conservative priors on the slope so that they don't overfit. And this is a thing: if you're using standard methods like t-tests and ANOVA out of the box, I guarantee you they're overfitting the sample, and they're exaggerating effect sizes in the literature. Guarantee it, because that's a property of statistical models. We guard against this through something called regularization. And in this case, the advice would be: put priors on these slopes that are centered on zero and have some finite standard deviation. That's the skepticism. Big effects are rare. And how rare? Well, that depends upon your literature. Again, I'm being an astrologer here, so I can't tell you exactly what value is good for you, but flat is bad. It's easy to do better than flat. Give it a finite variance; infinite variance is bad news. You can quote me on that. For scale parameters like sigma, uniform is gonna be fine in these cases. Sometimes you have information about them, and you can also overfit the scale parameters. So later on we're gonna use this thing called the Cauchy distribution, and also exponentials, as ways to regularize scales. But don't worry about that right now. For now we'll let it be uniform. We can have fun with Cauchy in a few weeks. And above all, if you're worried, check the prior predictive. Simulate data out of the prior. We'll have examples of this as we go. And then you can see how goofy the prior predictions of your machine are, right? Okay, let's get on with this material. Oh, am I going too fast? Is it okay? Yeah, there's some nodding. It's hard to elicit feedback. You guys need clickers, so you can upregulate my speed, the talking speed or something like that. We should try that. So, let's look at how we program this model. At the top, again, I repeat on the left the formal model definition. On the right I give you the code version of each line as it's gonna go into the formula list. So the height of each individual is normally distributed: height, tilde, dnorm(mu, sigma), just like before. Mu, you just write mu, and then there's this arrow, which if you've used R at all, you know means assignment in R. For reasons that are lost in the mists of time, some programmer decided they would do that. Originally an underscore was the assignment operator in R. Yeah, R is a nightmare. But it's the least hated programming language, from that poll yesterday. So this means all the others are worse. And then alpha plus beta times weight. You literally just type it in. And this will be interpreted by R. You are coding your model literally when you type this formula list.
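Here's a sketch of that formula list typed into map, again using d2; the Normal(0, 10) prior on beta follows the book's choice, since the lecture only says it should be essentially flat over any interesting range and centered on zero, and the model name m4.3 is just a label.

m4.3 <- map(
  alist(
    height ~ dnorm(mu, sigma),     # likelihood
    mu <- a + b * weight,          # linear model for the mean
    a ~ dnorm(178, 100),           # deliberately sloppy prior on the intercept
    b ~ dnorm(0, 10),              # prior on the slope, centered on zero
    sigma ~ dunif(0, 50)           # prior on the scatter
  ),
  data = d2
)
precis(m4.3)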
And map will just execute the model you write there. So, you know, be careful. If you exponentiate instead of multiply, it will carry out your instructions. And then we write the priors exactly the same way. And then the formula list just goes in down here. You can type the formula list directly into map. You don't have to make a separate list object. It's up to you, whatever helps you manage your sanity. I tend to do it this way, in one block of code. And again, as I said on the previous slide, these priors are terrible. I'm using them because they have no impact, actually. Even though they're terrible priors, they wash out, because the prior enters once and the data enter n times. And so the data overwhelm the prior. Your priors have to be very powerful to overwhelm a large dataset. Nevertheless, a lot of us work with small samples, because, well look, I'm an anthropologist. You can't just get more data on traditional societies whenever you want. You go to war, as they say, with the data you have, not the data you wish you had. And that's how it is. So, what's the analysis you get out of this? Again, looking at the precis result, these are the marginals. What precis does is it makes a table of the marginals, those profiles from looking at the hill from various sides. This is a little complicated because it's a three-dimensional hill. So I haven't plotted it for you, apologies. I don't know about the rest of you, but I can't see 3D graphs. I look at them and I'm 3D blind. Some people are color blind and I'm 3D blind. I can't see what's going on. Maybe you could rotate it or something. So apologies that I haven't done that. But beyond the numbers for these Gaussian profiles of this three-dimensional posterior distribution, we can see the impact by plotting it in the prediction space, that is, the weight by height space. Let's just start with what's the line, so to speak. So if we just look at the peak of this Gaussian, it's a three-dimensional Gaussian hill. What does that mean? It's a sphere, actually. And we want to know where the center of this sphere is. It's a big sphere of hazy probability, and it's densest at the middle. Is this working at all for anybody? Yeah. It's a big snowball of probability. And literally, it's three-dimensional, so it's a sphere. A Gaussian distribution is a fuzzy sphere in 3D space. It gets Gaussianly thin as it moves out to the tails. So it's densest at the middle. And OK, what's at the middle? What are the values at the middle? The middle, well, alpha and beta at the middle at least, not sigma, defines a line. And where is that line? That's what I've drawn for you on the right. So the open circles are the actual data points in Nancy Howell's data. And that line is the so-called map line, the maximum a posteriori line. It's the same as the maximum likelihood line in this case, because there's so much data and the priors are so weak. You with me? Yeah. OK, but the posterior distribution, as I said, is a big Gaussian snowball. So we want to get all that information on there. Otherwise, we're overconfident. Literally, we're taking just that central part, but there's a lot of uncertainty still about where things are. So we want to get the uncertainty onto the graph so that we're not overconfident. No matter how certain or uncertain you are in the analysis, there will always be a peak. So if you're only plotting the peak, you're throwing away most of the information. So how do we do this? Again, there are lots of ways to do it.
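Before we deal with the uncertainty, here is a quick sketch of drawing that map line over the raw data, assuming the m4.3 fit from above and assuming coef() returns the posterior point estimates, as it does for rethinking fits.

plot(height ~ weight, data = d2, col = "gray")   # the open circles: the raw Howell data
a_map <- coef(m4.3)["a"]
b_map <- coef(m4.3)["b"]
abline(a = a_map, b = b_map)                     # the maximum a posteriori line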
In this case, there are actually formulas you could use to draw the uncertainty, because it's such a simple model. But I don't want to teach you special case formulas. I want to teach you something that works for any kind of model. This will work for anything. Just take samples from the posterior distribution, and then we can process each of those samples to draw things on the graph. This will work for anything. We imagine that snowball, and then we sample points from it. We sample them in proportion to how dense it is in each region. And then we get combinations of parameter values, three numbers at a time. We dip into the snowball, and we get an alpha, and a beta, and a sigma. Then we dip in another place. And our probability of dipping into any particular location is proportional to the density in that area. Yeah? I should have brought the globe again, so you get an idea of it. But again, algorithmically: one, we use map, the means and standard deviations, to approximate the posterior. So we've got this Gaussian snowball from that. Two, we sample. The Gaussian snowball is a multivariate normal distribution of the parameters, and there's a device in R for doing this very easily. It's easy to sample from Gaussian snowballs of any dimension. And then three, we use those samples to generate predictions. And what that does is it lets us integrate over the uncertainty. This is a way of doing integral calculus without knowing you're doing it. And that's why I want to teach it to you, because it lets you be your own applied statistician. You're not captive to someone who likes integral calculus. So the thing to say about this, before I let you guys go and show you the last slide, is that sampling here is purely epistemological. It's a device that lets us do calculus. It's not a physical metaphor. Sampling in frequentist statistics is how you construct uncertainty. The uncertainty in Bayesian statistics doesn't come from imagining repeat samples. The only data in Bayesian inference are the data you have. You're not imagining any other data sets. There's just this data set. Sampling is just a device we use in the computer to characterize the uncertainty in a distribution, because it's convenient to work with. But it's not a physical assumption. The posterior distribution doesn't physically exist, right? It's a purely epistemological thing that exists in your computer. You can't physically sample from it. This is just a way of doing numerical integration. That's all it is. It's just a way to do calculus that makes it easy. In my experience, people are really good at processing samples from a posterior distribution. They're very bad at integrals. No offense to anybody here who happens to be amazing at integrals. But that tends to be the norm. So, the last thing I wanna show you before I let you guys go: this is how you sample from the posterior distribution. In chapter three, I showed you how to do this more mechanically. But we're gonna automate this for most of the course. There's a function that works with map in the rethinking package called extract.samples. You just give it the fitted model, and it represents the posterior distribution as a big data frame. It's just a big data table now, where each column is a parameter and each row is a set of correlated samples from it. They're samples from the Gaussian snowball. And as many as you want, I think the default's 10,000. But go crazy, whatever you like.
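And here's a minimal sketch of that, plus a preview of Friday's plot: sample from the snowball with extract.samples and draw a hundred of the plausible lines; col.alpha is a small helper in the rethinking package for transparent colors.

post <- extract.samples(m4.3, n = 1e4)   # data frame: one column per parameter, one row per sample
head(post)                                # each row is one (a, b, sigma) dipped from the snowball

plot(height ~ weight, data = d2, col = "gray")
for (i in 1:100) {
  abline(a = post$a[i], b = post$b[i], col = col.alpha("black", 0.05))   # 100 plausible lines
}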
And you can get an arbitrarily precise approximation from this, and then you process each row to generate predictions. Each one of these rows specifies a predicted line. What is the posterior distribution full of in this model? It's full of lines, right? It's a big cloud of lines. So, the snowball was weird. Now think about a big hyperspace of possible regression lines. But they're all in there. They're all in there. And I'll stop here and let you guys go. We'll pick up with this slide on Friday. We're gonna draw all these lines. Well, not all of them, just like a hundred of them. And I'm gonna show you that this lets you visualize the uncertainty. How confident is the model about where the true relationship is? We can see it by plotting a bunch of these lines, so we can visualize the scatter. So, when you come back on Friday, we'll pick up right here, and we'll go from here to other fun stuff too. And we'll finish out chapter four. Okay, thank you for your attention.