I know you guys are making good progress because the emails I get are mainly pretty minor things so far. Of course the true sufferers may not be emailing me, but that'll work itself out. Let me get back into the material in my typical indirect fashion. So this is the horoscope of Prince Iskandar, who was the grandson of Tamerlane, or Timur, who conquered most of what we now call the Near East. He was a descendant of Mongols and all that stuff. So I bring this up because often when I teach stats courses I feel like it's an impossible task, like doing horoscopes. In a sense there's this impossible task where a bunch of people with particular criteria are asking me to give general, vague advice to all of you at once, and yet you want it to be useful in your particular instances. So this is like newspaper horoscopes, right? They can only be appealing because they're so vague that they're useless. That's the only way they can be true, right? And that's why horoscopes are sort of like that. Apologies to those who believe in astrology; it doesn't work. But that's sort of how the newspaper, now Internet, horoscope works, because that's where horoscopes are now, on the Internet. They can be appealing and remain credible because they're so vague. And often stats courses degrade into that problem as well. And I always feel this tension when I teach statistics, that it's almost this impossible thing. I've got to teach you some general methods, and then there are all these details that are going to matter for each of you in particular in your little life, or your big life, sorry, the problems you have, because everybody has a big life, way bigger than any of us can feel. And so I want to say this at the start, and that'll help you calibrate some of my commentaries as we go along. I keep reminding you that I'm giving general advice, but you will know things about your particular study systems and your science and your questions that may merit violating any particular advice I give you.
And you should trust yourself and not me in that case. I mean, I'm happy to help you when you have your particular problems later and you come to me. I do a lot of consulting in office hours about that sort of thing and I find it very rewarding. But usually it involves trying to dissuade you from doing something you've seen somebody else do. That makes some sense. So we're going to resist the horoscope, even though we have to start with the general, you know, casting of the bones and figuring out when you were born and where Mercury was and stuff like that. So this week we're going to be doing linear regression, which is truly the vaguest sort of model that we could start with. But we'll learn a lot from it. In your particular cases, you'll end up doing something better. Before we get into that, we've just got a few slides that'll help you in interpreting your homework. I can tell from the emails I've received that you guys have pushed through this just fine. We had just gotten to what's often called posterior predictive checking. These little machines called statistical models can malfunction, and even when they function correctly, they may reveal themselves to be nonsensical by the answers they provide. So you have to do some sort of criticism at the end. And often this will mean plotting the implied predictions of the model. And we'll do a ton of that in this course. One of the main struggles that students find in this course is all the plotting. There's going to be a bunch of plotting, and every model needs to be plotted slightly differently, because they're different. But you get good at it after a while. And it's something you really need to do to make sense of your statistical projects. So the simplest sort of check, just a check on function right now, and we'll do more sophisticated things later, is often called a posterior predictive check, or simply a predictive check.
It is posterior because we're going to use the uncertainty embodied in the posterior distribution to simulate implied data from the model. So the process of Bayesian updating, or conditioning on the data, sort of pushes the data into the model and constructs a posterior distribution. Now we push the posterior distribution back through the model and it makes data. And we're not lying, because we're going to tell people we did that, right? We're not going to report it as real data. So there's nothing naughty about it. It's a way to check that you understand what the model implies. And often, once you try this, you realize you don't like your model, or at least speaking for myself. And sometimes it reveals that the machine did something wrong. The machines are not sophisticated like you. They don't know when they fall down on their face, right? They just report answers. And so we're going to start by learning how to do this. Let me give you the conceptual version of it. So here we have a humble posterior distribution for the globe tossing data. You can probably see it in your nightmares by now, right? This particular shape. And so along the bottom, we have the different proportions or probabilities of water on the globe. And I'm going to isolate three different parameter values, three different probabilities of water, labeled here A, B, and C. And each of them implies a different ensemble of predictions that are possible given simulated globe tossing. So for example, if the true probability of water on the globe were at A there, which is a little bit below 0.5, let's call that 0.4, or I'll have it up here in a second: it's 0.38. If that were the true value, and we had a globe where that was true, and we tossed it a bunch of times, nine times each, there's uncertainty about what would happen, because you wouldn't always see the expectation, because the globe tosses are still, quote unquote, random. There's uncertainty about what will happen.
So we get different counts of water observed, simulating data as if the true water coverage were 0.38. And the pie graph there is meant to show that 0.38 is blue, is water, right? Most of the earth is land under this wrong globe. Does this make sense so far? You with me? So you get, I think I simulated over 10,000 different simulated sets of tossing the globe nine times, the count of water. It has a minimum of zero, where we got land every time. It has a maximum of nine, where we got water every time. Both of those extreme cases are highly unlikely. Mainly you're going to get something that is close to the expectation, 0.38 of nine tosses. But there's lots of scatter around it, because we only tossed it nine times. Makes sense? That's the implication of this particular parameter value, this particular conjecture. We can do that for B as well. At B it's 0.64. Now there's more water than land. This is closer to reality. Reality, by the way, is like 0.71. Just about 70% of the earth is water. It depends upon the time of day. Always increasing, too. So we simulate now, and you'll see that the simulated distribution of observations has shifted to the right, because now the simulations assume there was more water under them. You with me? These distributions of simulated data are often called sampling distributions. They are distributions that arise from particular assumptions about the conduct of sampling data. This is all still in the small world. It's all still in the land of assumption. The real world has not intervened yet in this. And then the last case, let's think about C, a really extreme case. If it's 0.89, now you see the sampling distribution is pushed up against the maximum. It's no longer even symmetric. Because in this case, most of the time you expect to get 7, 8, or 9, really 8 or 9 waters, because 90% of the globe is water. It makes sense.
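One way to see what one of these fixed conjectures implies is to simulate it directly. Here is a minimal sketch in Python (the lecture's code is in R, where `rbinom` plays the role that numpy's `binomial` plays here; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling distribution of water counts for one fixed conjecture:
# toss the globe 9 times per simulated dataset, 10,000 datasets.
def sampling_distribution(p, n_tosses=9, n_sims=10_000):
    return rng.binomial(n=n_tosses, p=p, size=n_sims)

counts_A = sampling_distribution(0.38)   # conjecture A from the slide
# Counts are integers from 0 to 9; the expectation is 9 * 0.38 = 3.42,
# so most simulations land near 3 or 4, with scatter around that.
```

Repeating this with `p=0.64` or `p=0.89` shifts the whole distribution to the right, exactly as in the B and C panels.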
Now, of course, we don't know the true value, but we do have a posterior distribution over the possible values for the proportion of water coverage on the globe. And so if we want to get a sampling distribution that contains that uncertainty, we want to use the samples from the posterior distribution. And for each one, we can generate a sampling distribution, and then we can mix them all together. Because we simulated for each sample from the posterior distribution, we made one of these sampling distributions like this. When we mix them all together, the simulated data will be present in the proper way, given the relative plausibilities contained in the posterior distribution. If that doesn't make immediate sense, you're normal. This means you're human. So there'll be a bunch of examples. And as is typical, when you do your homework, you really get a chance to wrap your brain around this. So when we merge these together, we get a more squashed distribution of implied data, because we're not sure what the actual value for p is. But we do have, given the data we've seen so far and the model that we're assuming, some information about which values are more or less plausible. And the extreme values near 0 and 1 are highly implausible, given the data we've seen. In fact, 0 and 1 are strictly impossible given the data we've seen, because we've seen at least one water and one land. Lots of values in between are more plausible. And the most plausible ones give us 6 out of 9, because that's the data we saw. The thick black bar I put there on the far right graph indicates the actual observed data. And it is central to this distribution of simulated data that we see here. So one way you can think about this is that the model reproduces the actual observation with very high likelihood. It's right in the middle. But it's also highly uncertain about what the future data will be. Because we haven't had a lot of data yet, there's still a lot of uncertainty in the posterior prediction.
So this is a calibration. This is a way to visualize what the model expects. If you made your little golem predict the future, this is its conjecture, and it embodies all the uncertainty that's still in the posterior here. And if you used instead, I'll get to your question in a second, thank you. If you used instead only one of the values, it would be anti-conservative. It would throw away all that work that you went through to get the posterior distribution in the first place. And you would end up being overconfident in some dastardly way that would make the world explode. Right? Question? Where did A, B, and C come from? Creative inspiration. I've got the curve and I just picked A, B, and C. The actual merger on the right uses the infinite number of them. It uses all of them. And I'm going to show you how to do this in the code on the next slide. So this is a good question. I should repeat for my computer what the question was. The question was, how did I choose A, B, and C? I chose them just for the sake of example, because they were basically evenly spaced and on different sides of the posterior mode. But the merged distribution on the far right actually uses every value of p, weighted by its posterior probability. In other words, we say we integrate over the uncertainty in the posterior distribution, which is just a fancy word for averaging. Weighted averaging. When we say integrate over in probability theory, we nearly always just mean a weighted average. So you guys have already got samples. I showed you on Thursday, and you've already done some exercises on this on your own, how to draw samples from a posterior distribution. This makes this integration task a lot easier than it might be otherwise. Because now, for each sample, you run one of these simulations. And so if we had 10,000 samples from the posterior distribution, we can feed it into rbinom. rbinom is the random binomial function. It simulates binomial draws.
Each of these simulations will have size equals nine tosses of the globe. And what is spit out, or emitted, by rbinom and stored in the symbol w in the line of code here are the counts of observed water. They're integers from zero to nine. So we have one simulation for every draw from the posterior distribution. We have 10,000 of them. And so these simulations at the end integrate over the uncertainty in the posterior distribution, right? So we get a conservative prediction or forecast. Does this make some sense? Yeah, question? So this line of code basically skips over the middle thing. Yeah, it does the middle column invisibly for you. Implicitly. And then makes the merged thing at the end. Exactly right. The question was, so this code skips over the middle column in the previous figure. Yes, is the answer. Yeah. Another question: in this homework the numbers happen to match, we have 10,000 samples and draw 10,000 times, but what if you simulate 10,000 times when your samples vector only holds 1,000 values? Okay, so that's my translation of the question, for posterity in my computer. So in this particular case, samples contains 10,000 values, because that's what we decided to draw on previous slides. And 1e4 is 10,000. We're going to do 10,000 simulations to match. So we get one simulation for each sample. What if they don't match? Then what happens? Then what R does is it recycles the vector. So whichever one is shorter gets recycled in order. So if you only had 1,000 samples and you did 10,000 simulations, it would use the vector of 1,000 ten times, starting over at the beginning again. You don't want to do that. You want to make them match. Otherwise, you're going to get some weird correlated simulations or something. You'll end up with bad estimates. Your computers have a ton of memory, at least for these simple models.
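The whole posterior predictive simulation can be sketched in Python as well (assuming the lecture's data of 6 waters in 9 tosses and a grid-approximate posterior; the R original is the one-liner `w <- rbinom(1e4, size=9, prob=samples)`):

```python
import numpy as np

rng = np.random.default_rng(1)

# Grid-approximate posterior for the globe-tossing data: 6 waters in 9 tosses.
p_grid = np.linspace(0, 1, 1000)
prior = np.ones_like(p_grid)
likelihood = p_grid**6 * (1 - p_grid)**3    # binomial kernel; constants cancel
posterior = likelihood * prior
posterior /= posterior.sum()

# One binomial simulation per posterior sample: 10,000 samples of p,
# then 10,000 simulated counts of water out of 9 tosses.
samples = rng.choice(p_grid, size=10_000, p=posterior)
w = rng.binomial(n=9, p=samples)

# Unlike R, numpy does not silently recycle a short vector here: the
# length of `samples` fixes the number of simulations, so they match.
```

The resulting `w` is the merged distribution from the far-right panel, with the simulations for the middle column done implicitly.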
So when you go crazy, you could do a million. No exaggeration, no problem. Toward the end of the course, we'll have models with a few thousand parameters in them, just for kicks. And in that case, we'll do this. The good news is, as you'll see today, posterior distributions are often quite normally distributed, normal shaped. And in that case, you don't need a lot of samples to describe them, because you can get a really good estimate of the mean and standard deviation with a small number of samples, like a thousand. And we're not launching space shuttles here, right? So we're not going to fetishize precision where there really is none. That's a good question. There was another question back there? No? No? Okay. Tom had a question. No, you're done. Okay. Don't be shy about asking questions. The point of being in class is so you can interrupt me. Right? Otherwise, I would lecture my computer at home and upload it and sleep all day. That's how it would go. Okay, so later on, actually today, this will get a little bit harder, but the strategy will remain the same, and the rethinking package contains convenience functions I'll introduce you to, maybe today, if not today, on Thursday, that automate a lot of this. But the first time I introduce it, I'm going to help you understand what it's doing. So when we are looking at our... Good question. So the question was, little computer, when we're doing this, we're visualizing the uncertainty in these graphs, but quantitatively, how should we describe it? I'm being shy about that right now, because I don't think there's a general answer to that. This actually anticipates my next slide. It depends upon what you want to do, your scientific purpose. There's a kind of tradition of doing chi-square tests to figure out if the observed data are comfortably within the simulated envelope there. In this case, you can tell just by eyeballing it. You don't need to do a chi-square test here.
And I don't really have anything against a chi-square goodness of fit test in this context, as long as you treat it informally. The problem is, of course, you'll need some threshold to decide to accept or reject, and the threshold will ultimately be arbitrary without a real cost-benefit analysis. So that's why I'm being a little bit vague. And then this comes back to the horoscope, the prince's standard, right? In general, I can't give you good advice about that. I'm going to try to resist casting a horoscope for you. In the context of particular data analysis examples through the course, I'll be able to say something better. And when you're referring to that value? Yeah, like a 5% or something. But I think the general procedure makes some sense, because you're trying to get a measure of how extreme the observed data are. That can be quite useful as a calibration and as a form of communication. Just quoting the quantile, how far out it is, would be one way to summarize where it is, I think. The threshold then is a different thing; it's like masquerading as decision theory, I think. And I'll have something more to say about that later. Okay. All right, I see people scratching their chins, which is fine, but I'm going to keep looking at you, because I think you're maybe about to ask a question, right? It's a teacher instinct that develops over time. Okay, so this brings me to where I was going to say something about this. This process of generating the posterior predictive distribution and then asking where the observed data are in it is a good way to figure out if the model did its job right. The observed data should be in there comfortably, and you want to get some calibration on that. But then there's often this question: quantitatively, what decisions should I make based on that? And I think it's hard to give general advice. We're in horoscope land.
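Quoting the quantile of the observed count within the posterior predictive simulations, with no accept/reject threshold attached, might look like this (my own sketch, reusing the globe example of 6 waters in 9 tosses):

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior predictive counts for the globe example (6 waters in 9 tosses).
p_grid = np.linspace(0, 1, 1000)
posterior = p_grid**6 * (1 - p_grid)**3
posterior /= posterior.sum()
samples = rng.choice(p_grid, size=10_000, p=posterior)
w = rng.binomial(n=9, p=samples)

observed = 6
# Fraction of simulated datasets with a count at or below the observed one.
quantile = np.mean(w <= observed)
# Values very near 0 or 1 would flag the observation as extreme under the
# model; here it sits comfortably in the middle of the simulations.
```

This summarizes where the data sit; deciding what to do about an extreme quantile is the separate, decision-theoretic question the lecture is being deliberately vague about.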
In fact, I think there's universally no best way, because scientific purposes are very diverse. And no single criterion is always justifiable, like the 5%, right? And I want to say, I say this in the notes too, Fisher, Ronald Fisher, is often blamed for this 5% thing, because it was in one of his 1920s books, but he had the most casual justification of it in there. It was like, yeah, it seems convenient right now to use 5%, because that's a z-score of about two. And it seems harmless, you know, and at the time it was. Then, you know, it's clear scientists took that 5% and used it as a ritual to wash their hands of guilt, lots of stuff. But I don't think it's Fisher's fault that the 5% caught on so strongly. It's clear, I think, that it's just a convention, right? The only objective thing about 5% is that everybody uses it, right? That doesn't mean it's good. In general, I think we need our imagination, and often that's what posterior predictive checks are for. All models do something silly. This one's so simple that it's hard to use it as a good example, but we'll have good examples later. All models make bad predictions for some subset of the observations. And so these posterior predictive checks can be a way to spur our imagination, to think about a better process model for the data. And then through multiple cycles of modeling and empirical investigation, you can make models that are better and better. So this has this kind of iterative effect, where it helps us find flaws with the model, and then we can try to theorize ways to improve those things. Now, we do have to be careful, and I'll emphasize this as we go through the course. We have to be careful not to chase noise, because no model will ever make perfect predictions. In fact, there are lots of really interesting phenomena in nature which cannot be predicted; when we get the right model, what the model tells us is that we can't predict these things. Like the weather two weeks out. No, good luck, right?
Unless there's a hurricane two weeks out. But there are lots of phenomena in nature which are highly indeterminate. Births: predict the sex of someone's next child, right? We have the right model, and the right model tells us you can't do it, with very high probability, right? So that's the world, folks, and it's a wonderful one, worthy of writing poetry about. Anyway, I have this quote from Jaynes, who I think says it quite well. Jaynes was a physicist, an American physicist. He was also an officer in the Navy. This is him looking handsome as a young man. And he did a lot in Bayesian inference, and my sort of philosophy of Bayesian inference follows heavily on Jaynes. Those of you who've read some Jaynes will recognize that. I will say, though, Jaynes was a very pugnacious person. So you have to be careful when you read him. He's very doctrinaire, and I try to be considerably less doctrinaire than he was. Okay, let me give you the pit stop here and we'll get into new material. The program so far: we're making models, because we take as one purpose in science to make predictive models of natural phenomena. These models help us forecast and see what's going to happen, as well as understand things that have already happened. We make the model go by conditioning on data, and in doing that, we derive some approximation of the posterior distribution. That distribution gives us the relative plausibilities, conditional on this model and these data, of the different conjectures, the different adjustable bits of the model that could have generated the data. We usually call those little bits parameters. Then we use this posterior distribution to describe our uncertainty. It could be very peaked. We'll get some peaked distributions today. It'll make you feel good. And sometimes it'll be very wide, but either way, there's this kind of safety device built into using it, because the width of the posterior distribution embodies your uncertainty.
We can then add more data later and improve upon that. Then we need to check the model. I've only showed you one way so far. We'll get more examples as we go. A little bit to remind you of the philosophy: the inference here is in the language of probability, and this is deeply frustrating. Your model will only talk to you in probabilities. It gives you distributions, and it doesn't speak human. A lot of what we do in this course will be generating implied predictions from these posterior distributions. We'll start that today. The best parameter value is not really the focus. The whole distribution is the quote-unquote estimate. It's deduced, so it's not really an estimate of something. It's a logical consequence of your assumptions and the data. Of course, even the best value may be terrible. Models take themselves for granted, but you should never take your model for granted. We'll have an example of that, probably not today, but on Thursday, where the model is really, really confident because there's a lot of data, but it's a terrible model. The model can't see that, but you will be able to. I'll show you an example on Thursday if I can stay on time. Let's get into linear regression. Let me introduce this by asserting that linear regression is the geocentric model of statistics. Let me try to unpack that. I don't mean that as an insult, because geocentrism is awesome. It's just wrong. That's its only problem. Claudius Ptolemy, really an incredible intellect, a member of this lineage, the Ptolemies, who I think were the Greeks that Alexander gave Egypt to. I think this is the history; someone here may know it better than me. The Ptolemies built the library of Alexandria, and they'd take scrolls from boats and copy them in the library. Then some barbarians burned them all later, at least most of them, but that's history.
Ptolemy invented this model, or rather improved upon a model that he had received through modification and descent, but he did a lot of work: a model for predicting the positions of the planets in the heavens, and some stars as well. We now call this the geocentric model, or the Ptolemaic model of the solar system, and the physical analogy in this model is that the earth is in the middle and everything goes around it. That's sort of how we perceive it when we stand and look at the sky. The thing about this model, I can set it into motion here, and what this model is, if you know the physical structure of the solar system, it's absolutely goofy. And it's extra goofy because it achieves its predictions by using this device called epicycles, which are orbits on orbits. You can vaguely see them. This will be way better when you watch it at home. The colored circles are orbiting one another, and there are little planets on the outer circles, and they keep spinning around, and this is not how the solar system actually works, right? But it turns out that this is a really accurate model. If all you want to do is spot Mars in the sky, this works incredibly well. Over time it goes wrong and you have to refit it. That's true, like every 150 years or so it slowly gets out of whack, but once it's reset it's really accurate. It's perfectly good for amateur astronomy. You want to find Venus or Mars, it works great. If you want to get a probe to Mars, you're going to miss, because it's got the wrong model of where things are. But for just spotting something, finding it in the sky, it works great. And in particular, it's able to predict, and this is what it was constructed to do, the retrograde motion of planets in the sky. Mars will be trucking along in the sky and then it goes backwards. That's why they're called wanderers. Planet is Greek for wanderer. That is now explained by the fact that we are moving too.
But at the time they had to invent some mathematical device to get this thing to go backwards, and that's when it loops back. So Mars on my screen is about to do it again. So that creates this retrograde motion. This is a fantastic mathematical achievement, this model. It works incredibly well. It can predict the current positions of the planets. And not only that, but it's an example of a Fourier series, which those of you who have some engineering exposure in particular, or a particular kind of math background, will know: the Fourier series is a general way to take any kind of cyclical function and represent it with an infinite series that you can truncate. So the epicycles here are giving you this periodic function. So not only is this accurate, but whatever the structure of the solar system, as long as the planets are on orbits, which are cyclical functions, you can always describe it with a geocentric model. Exactly. You can get arbitrary precision by adding more and more little circles on circles. Now that way lies madness, no doubt, and we'll return to that next week. But this thing works really well. It's a fascinating thing. So let me make my analogy now that I started this with. Linear regression is the geocentric model of statistics. It's an approximation that can be constructed to an arbitrary degree of precision, but it only describes what is going on. It never actually explains it in any kind of satisfying way. And that makes it incredibly useful, as long as you're cautious about what you do with it. And that's the way I want to teach linear regression. There's nothing to be ashamed of in using linear regression. It's quite good, but it doesn't get at nature's nuts and bolts. It doesn't see the machinery. It's a way of constructing general approximations for associations among variables. That's what it's good for. And it's really good at that. But you can't take it too seriously, right?
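The Fourier point can be made concrete with a toy example of my own (not from the lecture): approximating a square wave by adding more and more sinusoidal terms, each one playing the role of an epicycle. More circles on circles, better fit:

```python
import numpy as np

def fourier_square(t, n_terms):
    # Truncated Fourier series of a square wave: the first n_terms odd
    # harmonics, each term an "epicycle" layered on the previous ones.
    total = np.zeros_like(t)
    for k in range(n_terms):
        n = 2 * k + 1
        total += (4 / np.pi) * np.sin(n * t) / n
    return total

# Evaluate away from the jumps, where the true square wave equals 1.
t = np.linspace(0.3, np.pi - 0.3, 200)
err_few = np.max(np.abs(1.0 - fourier_square(t, 3)))
err_many = np.max(np.abs(1.0 - fourier_square(t, 50)))
# err_many is much smaller than err_few: adding terms buys arbitrary
# precision, without the series explaining anything about the mechanism.
```

That is the analogy in miniature: the approximation describes the cyclical motion as accurately as you like, but nothing in it corresponds to the physical structure generating the motion.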
Okay, so with linear regression, I want to start with Gauss, pictured here on the old German 10 mark note. It's a great thing about European money that they put intellectuals and scientists and such on it, right? So the British ten pound note had Darwin on it. And in Germany you had Gauss, who's probably the greatest mathematician who ever lived, many, many mathematicians think. And right on there you have the Gaussian distribution function. So this used to be, when I was a school kid in Germany, people used to cheat, right? Because you had a 10 mark note with the formula right there. Linear regression is a family of simple statistical machines, or golems, that model the mean and variance of some measure using additive combinations of other things you've measured. We'll nail this down in particular examples as we go. And it assumes that across all values of these other things you've measured, the variance is constant. And you'll see that when we learn to write the model in a particular way. Now I want to say that Gauss is responsible for this, because in this particular manuscript, which came out in 1809, he had a Bayesian argument for what we now call Gaussian, he didn't call it that, normal error in least squares estimation. He invented least squares estimation to solve an astronomical problem. He was trying to forecast when a comet was going to come back around. In fact, he got famous in his 20s for this, you know, predicting this, and he developed linear regression because he had to solve this problem. He was a smart guy. But it's a fully Bayesian argument, it really is. And then, you know, this was 1809, and Fisher is the early 1900s. So this is just to caution you: there are lots of different ways to justify the same statistical procedure. And the original justification of least squares estimation was Bayesian, and that's how we're going to think about it. But that is not the way you probably first learned it, right?
Well, you probably first learned it as: do this or you fail, right? But we're going to try to do better than that. So let me give you some motivation before we get into code. There'll be some software carpentry today. We'll start to do a lot of that in this course. But let me give you some motivation about why the normal distribution is so useful and so common. And this will also help you understand, later in the course, why there are often really good reasons not to use it as a foundation for your modeling. So this is the good old Gaussian distribution. It's extremely common in statistics. I think there are three major justifications people use. And different people like different ones. The first is that it's just really convenient. It's easy to do math with a normal distribution. It is. Compared to others, right? The second is it's fairly common in nature. And I would say that no Gaussian distribution ever exactly exists in nature, but nature produces collections of measurements which aggregate towards approximately Gaussian all the time, and quite rapidly. And I want to give you some intuition about why that's true in the next series of slides. And then third, a little bit more cryptically, but we'll unpack this over multiple weeks: it's the most logical assumption given a certain state of information that you start with. Which is to say, if all you're willing to say about a collection of measurements is their mean and their variance, then it is illogical to describe those measurements with anything except the Gaussian distribution. I'll unpack that as we go along. Now, of course, if for example the measurements are skewed, then you need another parameter to describe the skew, right? But if you're not willing to actually measure the skew, then you use a Gaussian distribution. And we'll unpack this. When we get to chapter six, we'll spend a bunch of time on that. Okay, so think about a soccer field.
And we can go out to the soccer field here on campus, and I can have a bunch of you line up on the line in the middle of the field, whatever that's called. So if you like soccer, tell me what that line is called. Midfield line. Yeah, all right, thank you. I grew up in Germany, but you know, all people play there is soccer. Now I'm resisting invading Poland jokes, but I guess I just made one. So, all right. Now, imagine each of you has got a coin, and on the count of three, each person is going to toss their coin, and if it comes up tails, they're going to take a step to the left, and if it comes up heads, a step to the right. So we can do that simulation, and everybody jitters a little bit. And we can do this for multiple rounds. Next toss, some people move further out, some people move back towards the center line. Yet again, some move further out, some back to the center line, all these little binary movements. Now at the end, we can do this, say, a hundred times with a bunch of people on the midfield line, and then we collect the distances from the line in the middle, both positive and negative: how far to the left you are and how far to the right you are. And we can think about that distribution of distances and what it aggregates to. So let me show you a simulation of this. Along the bottom is the step number, that is, the number of coin flips that have been done and the number of accumulated steps that have happened. Initially everybody's at the same point, position zero, as shown in this graph, and after four coin flips, we've got a cloud, shown here by all the little gray trails. Each trail is a person that's wandering around. And I'm simulating, I don't know, like a thousand here, because it's a really big soccer field, something like that, or a bunch of really small people. I don't know. And the solid one is just to help you trace a particular person. That's you, say.
You're the protagonist in this story. Everybody else has got their own path, and you're going through yours. But there's a scatter, and you notice that the envelope is increasing. We can take all the values there at that vertical slice of four and plot them out as a distribution. And it's noisy. It doesn't look like a normal distribution quite yet; the tails aren't thick enough. But it is roughly symmetrical. We keep the experiment going, out to eight now, and you see that it's still increasing. Now it's looking more Gaussian. The tails are starting to have that little flare that the bell curve has. And by the time we get out to 16, at the end of this particular experiment on this slide, it's statistically indistinguishable from a Gaussian, unless you use a really uptight statistical test of some kind. How does this happen? You get the Gaussian distribution from natural mechanisms like this all the time. This is how it works. If you add up a bunch of things, in this case the steps of each individual, that collection of sums aggregates towards a Gaussian distribution. And that aggregation may take a long time, depending upon the distribution of things you're adding. In this case, it's just little steps left and right. The distribution of that could be really weird and skewed; you'll still end up with a Gaussian distribution eventually. And the reason, the casual reason, this is hard to understand, but the casual reason is that these little steps are like fluctuations. If you add fluctuations together, they dampen one another. So, imagine you get a bunch of steps to the right. Eventually, you'll get enough coin flips to the left to cancel all those steps to the right. So after enough steps that you add together, the most likely thing is that you're back on the center line. And it's really a perverse and weird thing about the universe.
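The soccer-field experiment is easy to simulate. The course does its simulations in R, but here is a minimal sketch of the same idea in plain Python (the function name and the seed are just for illustration): sum up coin-flip steps and watch the fluctuations cancel.

```python
import random
import statistics

random.seed(1)

def final_position(n_flips):
    # one person: sum of coin flips, +1 for a step right, -1 for a step left
    return sum(random.choice((-1, 1)) for _ in range(n_flips))

# 1000 people start on the midfield line and flip 16 times each
positions = [final_position(16) for _ in range(1000)]

center = statistics.mean(positions)   # fluctuations cancel: near 0
spread = statistics.stdev(positions)  # grows like sqrt(n_flips): near 4
```

Plot a histogram of `positions` and you get the approximately Gaussian envelope from the slide: most people end up back near the center line, and the spread grows with the number of flips.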
And I also think benign, because a whole lot of science would be impossible without the central limit theorem; that's what we're talking about here. So when you add things together, the fluctuations dampen one another, and so the aggregation of those sums tends to approach a symmetrical curve. It's a physical, generative phenomenon that makes a lot of stuff possible in the world. At the same time, the consequence of this is that the underlying distribution is erased. So many things end up Gaussian that you can't look at a Gaussian distribution and see what generated it, unless you have a lot of other kinds of data. So I use an example in the book of talking about height. Human height is approximately normally distributed, only approximately. There's an excess of really short and really tall people in the human population because of epistasis; I know a little bit about this. Nevertheless, it's pretty much approximately Gaussian. That doesn't tell you anything, though, about the architecture of human development. Given the fact that height is approximately Gaussian, you can imagine an effectively infinite number of ways to generate a person. So it doesn't work backwards. Lots of processes generate Gaussians, but given a Gaussian, you can't then infer what generated it. All you know is stuff got added together. The notes have some simulations for you to explore, to prove this to yourself, that I hope you will enjoy with a glass of wine or something; they may get more entertaining that way. Here's a great example: Francis Galton, in 1894, built a mechanical device called the bean machine, because beans fall down from the top here and bounce off these little obstacles on the way. This is like on The Price Is Right. There was a particular game; anybody remember The Price Is Right, from Bob Barker? Something like the Plinko machine. And that was a Gaussian distribution generator as well.
All these little binary moves are like the steps on the soccer field, and you get this approximately Gaussian distribution of beans in the bins at the bottom as well. So Galton was interested in using this to explore the normal distribution. And Galton did a lot to establish linear regression as a workhorse in demography and other fields. So think about processes that produce normal distributions as things that add things together. And natural processes add things together quite a lot, actually. Genetics does this: you have approximately independent additive effects of a bunch of loci, so summation, aggregation on a larger scale. It turns out that products of small deviations are also approximately additive, so those will give you normal distributions as well. And logarithms of products: when you take a logarithm of a product, it's a sum, for mathematical reasons that some of you remember from grade school, right? So this is why normal distributions are unreasonably common and often are reasonably good to use. We're not going to work much with the mathematical form of these density functions in this class. You can always look them up if you need them, right? There's this thing called the Internet, and if you put in "Gaussian distribution function," you'll get this right away. You don't need to memorize it, although you will effectively if you use it a few times. Like all probability density functions, it has a structure that is meaningful, that you want to learn, and that will help you memorize it as well. So in this case, I've translated this into vaguely English form, although the grammar here is questionable, right? We're going to use this mainly as a likelihood function: the probability of some data x conditional on parameters. The first part of this, with the pi in it, is the standardizer. That's just the thing that makes the area under the curve equal one. You solve for it. But it doesn't determine the shape.
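Those last two claims are easy to check numerically. A small sketch in Python (the course uses R; the ranges chosen for the random factors here are arbitrary): the log of a product is exactly a sum, and a product of terms that each deviate only slightly from 1 is approximately 1 plus the sum of the deviations, because the cross terms are tiny.

```python
import math
import random

random.seed(2)

# log of a product is exactly the sum of the logs
factors = [random.uniform(0.5, 2.0) for _ in range(10)]
lhs = math.log(math.prod(factors))
rhs = sum(math.log(f) for f in factors)

# products of *small* deviations are approximately additive:
# (1+e1)(1+e2)... ~ 1 + e1 + e2 + ... because cross terms like e1*e2 are tiny
eps = [random.uniform(-0.01, 0.01) for _ in range(10)]
product = math.prod(1 + e for e in eps)
additive = 1 + sum(eps)
```

So small multiplicative effects aggregate toward a Gaussian just like additive ones, and large multiplicative effects aggregate toward a Gaussian on the log scale.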
It determines the height of the thing. It's a multiplier; it determines the height of the curve. All the action is inside the exponent here, and the x minus mu part is where it is. Mu is a location parameter; it's the mean. The exponent there is a 2. It's a constant, but if you made it a 3, you'd get a different shape. The bell curve comes from squaring the distance from the mean, right? The deviation from the mean. That's what creates the bell curve. So you may recognize this as a parabola, right? You've got that difference squared; that's a parabolic function, from polynomial theory. If you don't remember that, that's cool. You're awesome. Don't worry about it; you have better things in your brain. But if you exponentiate a parabola, you get a bell curve. And that's where bell curves come from: they're exponentiated parabolas. So one way you can think about this is that the log of a Gaussian distribution is a parabola. Why log? Because the log undoes the exponentiation. Why do I tell you this? Because it arises from first principles: if you do that fluctuation exercise and do the math, you can derive this, and there are lots of proofs of the central limit theorem from many different origin points. Okay. The main thing is that you can standardize this in terms of the sigma, the standard deviation, which is how wide the distribution is. About 95% of the probability mass in a normal distribution is within two standard deviations below and above the mean. This is, by no accident, approximately the same as leaving 5% out in the tails, 2.5% on each side. So this is the one origin story of the 5% convention. But there's nothing special about it. This is a way to calibrate: about two thirds of the probability is within one standard deviation up and down, and 95% of it within two. It's just a way to help you think about the distribution.
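To make the "exponentiated parabola" point concrete, here is the density written out directly, with the standardizer and the exponent separated, checked against the standard library's implementation. This is a sketch in Python rather than the course's R (where `dnorm` plays this role); the function name is just for illustration.

```python
import math
from statistics import NormalDist

def gauss_pdf(x, mu, sigma):
    # all the action: a parabola in x, exponentiated into a bell curve
    parabola = -((x - mu) ** 2) / (2 * sigma ** 2)
    # the part with pi only standardizes: it makes the area under the curve 1
    normalizer = 1 / math.sqrt(2 * math.pi * sigma ** 2)
    return normalizer * math.exp(parabola)

ref = NormalDist(mu=0.0, sigma=1.0)
ours = gauss_pdf(1.0, 0.0, 1.0)   # matches ref.pdf(1.0)

# roughly 95% of the mass sits within two standard deviations of the mean
mass_2sd = ref.cdf(2.0) - ref.cdf(-2.0)
```

Take the log of `gauss_pdf` and the exponential disappears, leaving the parabola plus a constant, which is exactly the "log Gaussian is a parabola" observation above.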
So let me reiterate these two more confusing justifications of the Gaussian distribution. The first one was that it's easy to do math with it. That's not a great justification, though. If you're trying to justify why you did what you did to the president, and you say, well, it was easy to do the math that way, Mr. President, then you're in a black ops site pretty fast. So there are two other, better justifications. First is the ontological one. Under many situations, there are lots of generative processes in nature which produce approximately Gaussian distributions, because they add lots of little influences together. Little influences cancel one another, and so the aggregation of those sums ends up being approximately bell-curved. So even if you don't know the generative process, this isn't a bad bet to start with. That's ontological because it's about how things come into being; remember, that's ontology. The other justification is epistemological. The epistemological justification is: okay, all I'm willing to say about this collection of measurements is their mean and variance. The distribution where all you say about it is its mean and variance is the Gaussian. In fact, it's uniquely the best on a criterion that we'll talk about a lot in chapter 6. That criterion is called maximum entropy. The Gaussian distribution is the distribution, for any given mean and variance, which has the largest information entropy. One way to think about that is that it's the distribution that can be realized the greatest number of ways. What ways? The ways through the garden of forking data. I'll be very rigorous about that, well, marginally rigorous about that, in chapter 6 when we get there and talk about maximum entropy again. But just to foreground it for you: maximum entropy is just the thing you've already done with the models. It really is going to cast it in a different light when we get there. These two sorts of justifications can live well together, too. You don't have to choose one. OK.
The fact is, regardless of how you justify it, people use models like this a lot, so we need to understand them, and they are useful. They're geocentric, right? You shouldn't be embarrassed to use a geocentric model. You could find Mars with it, using geocentric calculations. So there are lots of examples. The general linear models in statistics are all, in a sense, just Gaussian models of outcomes, with different kinds of functions inside of them which adjust the mean and the variance as a function of predictors. So t-tests, simple regression, multiple regression, ANOVA, all the OVAs, right? ANOVA, ANCOVA, MANOVA, MANCOVA. Biostats people are in pain from these things. You had to take a course on this campus on all the OVAs, right? And people came out of it with post-traumatic stress: endless tables of sums of squares, right? Some of you have done this, and we're not going to do any of that, because it's useless; we're going to learn it in a different way. We'll write the models down so you can see the relationships among them and interpret them and generate predictions from them as well. I shouldn't say it's useless; it's just about the least useful way to approach the models. It's traditionally done because of this book called Biometry, which scarred generations of biologists. Anyway, you don't need to copy that book. Okay. So we want a language for modeling. There are lots of alternative ways to construct this, but I want to give you what I think is a useful, robust way to approach general problems. We have some questions to answer in data analysis: what are the outcomes that I'm interested in modeling as a function of other things? That's often what it is, and most scientists are interested in causal statements, meaning that there's something that causes something else, and these questions embody those things. But it could also just be descriptive, like in the geocentric model.
We've got some measurements, and we want to use those to predict other things. So there are some measurements that we call outcomes. Those are the things that the model will have on the left-hand side, as you'll see; the things that it will make predictions for. We make some assumptions about how these outcomes are generated. This is your data story. This gives us a likelihood function. Then we make some decision about which variables, if any, are the predictor variables, and these variables we will stick into the likelihood function through clever devices. Then we make choices inside there about how to relate them; you'll see how the parameters enter into these general models. And for each parameter we have to choose a prior, the information state of the machine before it's seen the data. It's not your information state; it's the machine's information state. You don't have a prior distribution; no normal human being does. So let's bring this down to a precise example by revisiting the globe-tossing model from last week. You've seen this model-based notation before. We're going to use this throughout the course, because if I teach you this modeling language, you can read a whole bunch of things. This is the convention in the field of statistics, and it can be used to notate a very large number of models. It doesn't have everything in it, because there are computational details that are sublimated out, but that's what makes this notation useful. It's like a map to the model structure and its assumptions. So that's how we'll read it. Let me give you a little bit of a crib sheet to it. The outcome in this model was N_W, which is the observed number of waters that you saw. That happened to be six at the time. And the tilde means you want to read that as "is distributed." Then binomial is the name of the likelihood function we used in this model, and it's a function of two values, which are often called parameters, N and P.
N was also data in that case, because we do know how many times we tossed the globe, which was good; otherwise the inference problem would have been hard. And P was unknown, so we wanted to estimate P. We're asking a question about it, so we assigned the machine some initial state of relative plausibilities of every possible value of P, and we call that a prior. In this case we made it uniform. So P is distributed uniformly, and uniform is the prior distribution. If you want to read this in plain English, you could read it as: the count N_W is distributed binomially with sample size N and probability P. The prior for P is assumed to be uniform between zero and one. This program is your machine. Does this make some sense? You've certainly seen models in this notational convention before. It has a lot of advantages, and we're going to do it with linear regression. So to get to the linear regression example, first let's get some data to work with, so you can think about it in the context of a data example instead of total abstraction; we'll get more abstract as we draw the lens out later, probably on Thursday. So this is in the rethinking package: Howell1. These are data that come from this book, Life Histories of the Dobe !Kung, by Nancy Howell, and there were a lot of reproductive life-history interviews done with !Kung women, so all their kids and their kids' weights and all this stuff. There's lots of biometric data as well. It's about growth and nutrition and human life-history theory. These are heights and weights; we'll use age as well in some of these analyses.
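For reference, the globe-tossing posterior described above can be computed with a one-dimensional grid in a few lines. The course does this in R; here is the same calculation sketched in Python, assuming (as in the book's example) N = 9 tosses with 6 waters, since the transcript only mentions the six.

```python
import math

# grid approximation for the globe-tossing posterior:
# N_W ~ Binomial(N, p), with a uniform prior on p
N, N_W = 9, 6
grid = [i / 1000 for i in range(1001)]            # candidate values of p
prior = [1.0 for _ in grid]                        # uniform prior
lik = [math.comb(N, N_W) * p**N_W * (1 - p)**(N - N_W) for p in grid]
unstd = [l * pr for l, pr in zip(lik, prior)]
total = sum(unstd)
posterior = [u / total for u in unstd]             # normalize to sum to 1

# with a flat prior, the posterior peaks at the observed proportion 6/9
p_peak = grid[max(range(len(grid)), key=lambda i: posterior[i])]
```

Every possible value of p gets a relative plausibility, conditional on the model and the data; that's all the posterior is.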
So I use this; it's not the most exciting data in the world, and it's not a very penetrating scientific question to ask here, although you'll start to see some questions about transitions and growth rates in the data as we go through. But it's the simple sort of thing where we can look at it, and height is approximately normal. It'll also show you one thing, which is that height obviously can't be exactly normally distributed in a mathematical sense, because normal distributions are legitimate between negative infinity and positive infinity, and you never measured anything with a ruler that big. So you have to draw the boundary somewhere. This will emerge naturally as we go through. On this slide on the right, I've just plotted the empirical distribution of adult heights in the sample, so I've taken kids out of the data, and in the notes I show you how to do that. And then the first thing we do is define the likelihood: we're going to model the height of each individual i. So the little i under the H here refers to an individual; it's an individual observation, distributed as normal with some mean mu and some standard deviation sigma. So it defines the distribution. Yeah, question. Yes, yes. The question was, I refer to "notes," and is that synonymous with "book"? Yes, they're the same thing. Sorry. I'll publish it as a colored book; right now it's notes, I mean, whatever. Yeah, sorry about that. I've been teaching this class for so many years now. They started off as these really embryonic and terrible notes, and then through the peer pressure of people like you in the class they have gotten better. So it's all thanks to students just telling me, like, could you make some sense, Richard? And over the years I have gotten better at making some sense. They're still notes to me, but they're a book to you. Anyway, so again, to help you practice reading these things, we could read this as: the height H sub i of an individual i is distributed normally with mean mu and standard deviation sigma.
It defines a distributional assumption. Remember, this is geocentric. We're not saying anything about how the individual differences in height arise; we're assigning a common distribution to them all. We're just going to start with that. So we're going to be estimating the mean and the standard deviation of adult heights in this community; that's effectively what we'll get. So H sub i is the outcome; again, tilde means "is distributed"; the normal distribution is the likelihood; mu is the mean of the normal distribution; sigma is the standard deviation. You can use whatever labels you want there. They don't have to be the little Greek letters mu and sigma, but that's conventional, and I need to teach you the conventions in the world. Is that a question? No? Okay. Sorry, it's a teacher tic. It's like when you yawn, I'm going to think you're asking a question. So now we need priors, so the machine can get going. All Bayesian statistical models need some initial information state. This is what we call the prior. In this case we're going to stick with our vague priors, but I want to show you how you can visualize the priors. When you're starting out, you're trying to understand priors. Remember, they're distributions, so you can plot them just like any other distribution. In fact, you can sample from them, if you like sampling, and then you can answer questions about them. So in this case I'm going to center the prior for the mean on what I know is approximately the population mean, with some big standard deviation. Remember, this is for the mean. It's not for the population; this is the prior for the location of the mean. So a standard deviation of ten means: I don't know, it's somewhere around typical adult height. And then a uniform prior for the standard deviation, between 0 and 50. If we visualize these two, they look like this. So if you think about calibrating it, the mean could be anywhere between 140 centimeters and 170, which is a big range. That's like worldwide height variation.
And then this puts no probability above 50. If that causes a malfunction, you know how to detect that, and we'll see. But it assigns equal weight to every value below 50; it's not going to be that big. Now, you have a prior on two parameters now, so this should be a little bit confusing, if you're paying attention. I see some confused brows, and I thank you for your attention. It should be a little confusing, because they both affect the likelihood. They both affect predictions, but we've independently assigned them initial information states. So how do we see what they jointly imply? Prior to seeing any data, what does this little machine think about the distribution of adult heights? Well, you can sample and see. So let me show you that real quick. We can take samples from the prior distribution for the mean mu. That's what happens on the first line of code here. We use rnorm, which is random normal numbers. We get 10,000 of them; that's what the 1e4 is. With mean 156 and standard deviation 10, because that was the prior we chose. We could use other priors, and I encourage you to play around with this and see what you get. We can take samples for sigma using runif, which is random uniform: 10,000 of them, between 0 and 50. Then we can generate random heights from the prior. Remember, this is what the machine thinks before it's seen any data, and what it thinks is really dumb, but we want to see what it thinks. What it expects about the distribution of heights we get by just plugging the samples into rnorm again, but now we're simulating 10,000 heights. It's the top level of the model; it's like we crawled up from the bottom of the model, simulating data out of it. And then I just plot the density of the simulations, and you can see it's this odd-looking distribution. It's approximately t-distributed, if you know something about sampling theory.
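That prior predictive simulation is two lines of sampling plus one line of pushing the samples back through the likelihood. In the course it's done in R with `rnorm` and `runif`; here is the same sketch in Python, using the priors described above (mean 156, standard deviation 10; sigma uniform on 0 to 50):

```python
import random
import statistics

random.seed(3)
n = 10_000

# samples from the priors: mu ~ Normal(156, 10), sigma ~ Uniform(0, 50)
sample_mu = [random.gauss(156, 10) for _ in range(n)]
sample_sigma = [random.uniform(0, 50) for _ in range(n)]

# plug the samples back into the likelihood to simulate heights:
# this is the prior predictive distribution, the machine before any data
prior_h = [random.gauss(m, s) for m, s in zip(sample_mu, sample_sigma)]

center = statistics.mean(prior_h)   # near 156
spread = statistics.stdev(prior_h)  # much wider than either prior alone
```

Plot the density of `prior_h` and you see the thick-tailed, roughly t-shaped distribution discussed next: uncertainty about sigma fattens the tails.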
t distributions are normal distributions where you're uncertain about the standard deviation, so you get these thick tails. This isn't exactly a t, because there's uncertainty about both the mean and the standard deviation, but it has thicker tails than a normal distribution, because the machine doesn't know. And so it represents its uncertainty by saying, you know, the tails could be pretty thick: it's not impossible that you have a really, really short person or a really, really tall person. It's not an empirical prediction. It's an epistemological thing. It's what the machine sort of expects, calibrated for its uncertainty. Does that make a little bit of sense? I can tell from the furrowed brows you're paying attention again, and I thank you for that. You have to be patient with yourself with this stuff. This is sort of what the machine sees, and by doing little simulations like this, you can decide whether the prior makes any sense given the scientific context you're in. When you have domain knowledge, you can use good priors for initial machine states. If you can't, in this class I will teach you the horoscope priors, which are the generally super vague, mainly-not-going-to-make-you-play-in-traffic priors. That'll help you do regression modeling, but you can nearly always do better when you know something about the system you work with. So now we want to condition on data, because we have a lot of data. Nancy Howell did a lot of work out there in the Kalahari, did a lot of interviews, and uploaded it all onto the Internet for us to use. So we're going to use her data and do some updating. Again, the aim is to get the posterior distribution. Now it has two dimensions, because there are two parameters, so it's a joint distribution in both directions. What does this mean conceptually?
What this means is: last week, for the globe-tossing data, the posterior distribution assigned, to every possible value of p, the proportion of water on the globe, a relative plausibility, conditional on the model and the data. Yeah, remember that? Now we've got two dimensions, so now for every combination of mu and sigma we must assign a relative plausibility. And there are a lot of combinations. If you have an infinite number of mus and an infinite number of sigmas, then you've got an infinity-squared number of combinations. But that's no problem; we can do that. We've got continuous dimensions, and we're going to lay grids across both to get this motivated. This will be the last grid approximation example, just to show you there's no hocus pocus, and then we'll move to approximations, in particular the quadratic approximation, using MAP estimation. But we're going to do the grid approximation first so there's no sorcery about this, no superstition about what's going on. You can always fall back on grid approximation, if you've got time to wait for your computer to finish; you'll see. In the book I give you the grid approximation code. I'm not going to step through it in class, because there's really just not time, but it's in the book and I encourage you to play with it. At least run it, even if you don't understand it. If you want to understand it, please harass me. It won't be harassment; I would be happy to explain it. I geek out on things like this. It's fun, but the details there are about computing, not conceptual insight, so I don't want to focus on it. But if you're curious, it's all there, and I do a lot of explanation in the notes, the book, about it. What we see is we get these samples. I've done the grid approximation and drawn samples from it to help you visualize it, and it's a cloud now, looking top down. It's like a hill. One dimension on the horizontal here is the mean, and then sigma, and you notice that it's pretty tight. It's a range.
It's basically between 154 and 156; that's the range on the bottom. Yeah. The question was: each little cell in this grid is some combination of mu and sigma. Yes. And then on this graph, where there's lots of blue, that means the posterior probability is high, and where it's white, no samples were drawn in the 10,000 samples that I got. So it has a very big peak, this mountain, sort of in the middle there, and this thing is Gaussian in both directions, so it's a nice gentle hill that you could climb without special shoes and gear. It's not El Capitan; it's a nice mountain. If we look at this mountain from either side and see its profile, we get what are called marginal distributions for each parameter, and that's what I'm showing on the right-hand side of the slide. You're going to look mainly at marginal distributions when you do Bayesian statistics. Marginal means it averages over the uncertainty in all the other parameters. It really is just like standing, for mu, down here where the word mu is on the left-hand part of this slide, and looking at the hill this way. Then you'd see that shape on the top, because you can't see sigma; you just see the outline of the hill from the mu direction. And then if you walk around to the other side and look at it this way, you end up marginalizing over mu, and you can't see mu, because its dimension is blocked out, and now you see the profile of uncertainty for sigma. Does that make some sense? Through the exercises, this really starts to make sense. It's like motor memory. I always make this joke in this class, so I'll do it again: I've watched a lot of Jackie Chan movies. I think I've seen them all, although it's hard to be sure, and yet I cannot do kung fu. You can watch Jackie Chan all day long and get no better at kung fu, in fact. And stats is a lot like that, in the sense that you can watch it all day long and get no better at it.
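The two-dimensional grid and its marginals can be sketched in a few lines. The book's version is in R and uses Howell's actual data; this Python stand-in simulates heights instead (the grid bounds, sample size, and parameter values here are illustrative, not the real posterior), but the mechanics are the same: compute the posterior at every (mu, sigma) cell, then sum over one dimension to look at the hill from the side.

```python
import math
import random

random.seed(4)
# stand-in data: simulated adult heights, in place of the Howell sample
heights = [random.gauss(154.6, 7.7) for _ in range(50)]

mus = [150 + 0.2 * i for i in range(51)]    # grid for mu: 150 .. 160
sigmas = [4 + 0.2 * i for i in range(41)]   # grid for sigma: 4 .. 12

def log_lik(mu, sigma):
    # Gaussian log-likelihood of all the heights at this (mu, sigma) cell
    return sum(-0.5 * ((h - mu) / sigma) ** 2 - math.log(sigma)
               for h in heights)

# unnormalized log posterior (flat priors) at every grid combination,
# shifted by the max before exponentiating to avoid underflow
logpost = {(m, s): log_lik(m, s) for m in mus for s in sigmas}
mx = max(logpost.values())
post = {k: math.exp(v - mx) for k, v in logpost.items()}
z = sum(post.values())
post = {k: v / z for k, v in post.items()}

# marginal for mu: look at the hill from the mu side, i.e. sum over sigma
marginal_mu = {m: sum(post[(m, s)] for s in sigmas) for m in mus}
mu_peak = max(marginal_mu, key=marginal_mu.get)
```

With flat priors the marginal for mu peaks at the sample mean, which is exactly the "profile of the hill" picture on the slide.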
You've got to go in there and get your motor memory going. It's normally brain memory, but there's something about embodying the knowledge and getting comfortable with it as you go. You have to actually do it. Lots of intellectual tasks, I think, are like that; there's an athleticism metaphor which is quite accurate. Okay, so we're going to appeal to quadratic approximation instead of doing grid approximation. Grid approximation works great; nothing wrong with it; it can give you an arbitrarily good approximation of the posterior distribution. But as you'll see, if you look at the code in the book, there are a lot of combinations of mu and sigma, and the finer you make the grid, the more combinations you have to look at. Now imagine adding a third parameter. If you have two parameters and you want to look at 10 values of each, then it's 10 squared. If you had a third parameter with 10 values of each, it's 10 cubed. So pretty soon the number of combinations of parameter values you have to make a calculation for is really big, and you need to publish your dissertation before your computer finishes. So we have to do something else, and we will in this class use models that have thousands of parameters, because that's no problem for Markov chain Monte Carlo and the other things that we're going to use. So we need something other than brute-force grid approximation, even though at its root that's the intellectually honest way to do things, right? So for the first half of the course, as I said, we're going to use the quadratic approximation, which is to say we're going to describe the posterior distribution with the peak of each marginal distribution, called the maximum a posteriori: where's the peak of this multi-dimensional hill? Right now it's only two dimensions, so it actually has a peak in three dimensions. Later it's like a hyper-peak, but it's out there. We don't ask what it looks like, but it's there, right out there in Hilbert space, as we call it. And the
standard deviation, which gives us the width in each dimension. Since it's assumed to be Gaussian in every dimension, you can describe the multivariate Gaussian probability distribution with just a vector of means and a vector of standard deviations, plus covariances, which we'll also need to get; we'll get to those as we go through the lecture. So mechanistically, this is just hill climbing, and you could just write it that way. You could write a little robot, if you're fancy scripters, and I know some of you are: a little robot in your computer that just starts at some combination of mu and sigma, computes the posterior at that point, and then computes the posterior at the little points next to it, and then it climbs uphill, and it just does that over and over again. It's the near-sighted mountaineer. It's like evolution: it climbs uphill, doesn't know where it's going, it's going to extinction, but it's getting fitter the whole time. And it climbs all the way up to the top, and then it gets to the top and it's like, okay, it's flat here, I can't improve the posterior probability by going in any direction, so I'm at the top; let me measure the curvature under my feet, and that gives it the standard deviation. And then it's done. Well, it's also got to calculate the covariance between the two, which is something we'll get to in a moment. Your R is really good at this. It has an engine called optim, which has a bunch of different algorithms for doing hill climbing, doing optimization, and all the rethinking package does is appeal to optim to do this, but it packages it up to make it a little easier on you. So what you do to use this function map, which is in the rethinking package, is you make a list which is your model statements. I'll show you an example here. I call it flist, for formula list, and this is a kind of variable in R called an alist. An alist does not evaluate its contents, so you can put all kinds of nonsense inside of it and R will never detect it; that's what lets this work, by the way. So
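The near-sighted mountaineer described above really is just a few lines of code. In the course, R's `optim` does this for you underneath `map`; here is a deliberately naive Python sketch of the climbing itself, on simulated stand-in data (the starting point, step sizes, and data are all illustrative):

```python
import math
import random

random.seed(5)
# stand-in data: simulated adult heights in place of the real sample
heights = [random.gauss(154.6, 7.7) for _ in range(200)]

def log_post(mu, sigma):
    # log posterior up to a constant, with flat priors: the log-likelihood
    if sigma <= 0:
        return -math.inf
    return sum(-0.5 * ((h - mu) / sigma) ** 2 - math.log(sigma)
               for h in heights)

# the near-sighted mountaineer: check the neighboring points, step uphill,
# and when no neighbor is higher, look closer (shrink the step)
mu, sigma, step = 140.0, 20.0, 1.0
while step > 1e-4:
    here = log_post(mu, sigma)
    neighbors = [(mu + step, sigma), (mu - step, sigma),
                 (mu, sigma + step), (mu, sigma - step)]
    best = max(neighbors, key=lambda p: log_post(*p))
    if log_post(*best) > here:
        mu, sigma = best      # uphill move
    else:
        step /= 2             # flat in every direction at this scale
# (mu, sigma) is now approximately the maximum a posteriori estimate
```

Real optimizers are much smarter about step direction and size, and they also estimate the curvature at the peak, which is what gives the quadratic approximation its standard deviations and covariances.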
you do have to police yourself, although map will tell you when you screw up. So we're just restating the model, and I'll show you the correspondence on the next slide. Then you pass this formula list to the function map, and map finds the MAP and returns the quadratic approximate posterior distribution for this thing. You'll see how to work with this, and we're going to use map for the first half of the course before we switch to MCMC. Even when we switch to MCMC, we're going to use the same kind of input formulas; the models will be expressed the same way, but the engine underneath will look different. So you'll get used to this, and you won't have to learn a new kind of input language as we go. And you also have to tell it what data to use; you see there, data equals d2. So let me show you the correspondence in our code. There is a variable, height, in the data table in Howell1, the data you get out of the package; it's just a list of heights. And dnorm is the density function for a normal distribution in R. The tilde means "distributed as." And then you make these labels, mu and sigma, and you can put anything there you want. The joke I think I used in the book, or in some former version of the book, is Pickle and TARDIS, or whatever you like, depending upon your popular-culture references. But you should care, because your colleagues have to read the same thing, and mu and sigma are perfectly good, because they're cues that you're doing a linear regression. Then you define your priors the same way. Priors look like likelihoods in these statements, because they're just assumptions about probability distributions; they map parameters onto distributions or other parameters, same thing in there. So you can see the correspondence. After you run it, in a couple slides you'll see the output from map, stored in some symbol, here m4.1. This is the convention I'll use in the book: we're in chapter 4, this is the first model, and m means model. That's the
In a previous version of the course notes, the models were all called little m and I was the only person who could make sense of it; then people aggressed against me, and I'm getting better, I'm trying. So there's this summary function in the rethinking package called precis, which is French for abstract, or something like that; basically it means a precise description of what's going on. It was a word that was not already used by any other package, and besides, Laplace was French and he was sort of the father of Bayesian inference, so this is homage to Laplace. You just give it your fit model and it gives you the summary of the quadratic approximation, the typical kind of statistical summary you get. What I want to convince you of in this course, however, is that these summary tables are terrible, terrible, terrible. I mean, not mine in particular; mine might be especially bad. But I think it's really hard to understand models from tables of summaries of estimates, really hard. For a model like this you can get away with it, because this model is incredibly simple; it's just about the simplest serious statistical model we're going to do in the course. Next week the tables will be nearly useless, and I'm going to try to give you some examples of that. I see this in print all the time: papers where all you've got is a table of coefficients, and that is insufficient to reconstruct predictions from the model. That's what I'm going to try to convince you of. It is hard to understand a model from tables of coefficients, very hard: hard to understand interaction effects, lots of other important things that go on. So I'm going to try to persuade you. You can look at the precis output, there's no harm in it, but it's never sufficient to understand what the implications are. In this case we get a posterior distribution: the mean, the MAP, the peak of the marginal posterior
distribution for mu, which is shown on the bottom left of this slide, is at 154.6. Don't fetishize precision; it's somewhere around there, with a standard deviation of 0.41. And I show you that distribution on the bottom: the blue is the grid approximation, the samples from the grid approximation, and the dashed part is the quadratic approximation, just constructed by plotting a normal distribution with mean 154.6 and standard deviation 0.41. You can see it did a pretty good job; the quadratic approximation works really well in this model. Then for sigma, the same thing: a mean at 7.7 and a standard deviation of around 0.3. In this case you can see there's some mismatch between the grid approximation calculation, which is better because it didn't make assumptions about the normality of the posterior distribution, and the quadratic approximation, which doesn't quite hit it. The posterior distribution for sigma is skewed; it has a longer tail to the right. This will nearly always be true, though with a lot of data the skew will be very small, so later in the course it'll be fine. To emphasize this to you, here's an example where I take 20 heights, only 20 heights, and update the priors using only those 20. So now there's way less certainty, and a lot more skew in sigma, as you can see there on the right. I'm showing you the whole posterior on the top; it's like a snowball that was thrown from down here and kind of exploded up, so there's more uncertainty towards large values. And that's because, to put it casually, sigma can't be less than zero; standard deviations must be positive, so there's always more uncertainty about how big the standard deviation is than about how small it is, and that's how you get this skew. When we get to Markov chain Monte Carlo we won't have to use this compromise, and there's a little box in chapter 4 where I show you how to patch this up with something called a log link, if you're curious. But for the examples we use in this course, we will pay scant attention to sigma, as most people do.
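That right skew is easy to reproduce with a grid. A Python sketch (the slides themselves come from R), fixing mu at the sample mean for simplicity and using a flat prior over the grid, with only 20 simulated heights: the long right tail pulls the posterior mean for sigma above its mode.

```python
import numpy as np

rng = np.random.default_rng(7)
heights = rng.normal(154.6, 7.7, size=20)    # simulated small sample
mu = heights.mean()                          # fix mu for a 1-D picture

sigma_grid = np.linspace(3.0, 20.0, 2000)
log_lik = np.array([np.sum(-0.5 * ((heights - mu) / s) ** 2 - np.log(s))
                    for s in sigma_grid])
post = np.exp(log_lik - log_lik.max())       # subtract the max for stability
post /= post.sum()                           # normalize over the grid

sigma_mode = sigma_grid[post.argmax()]
sigma_mean = float(np.sum(sigma_grid * post))
# right skew: the posterior mean sits above the mode
```

A symmetric quadratic approximation centered on the mode cannot capture that asymmetry, which is exactly the mismatch visible in the slide.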
So you won't feel any violence being done, but you should keep in mind that this is an approximation, and if you have trouble with it, let me know and I'll help you fix it up. Let me say: map is a scaffold. It's just about the least convenient way to fit a linear regression that I could think of. Actually, no, I could think of less convenient ways; I shouldn't say that out loud. It's not really convenient, but the reason I use it in this course is not because I'm mean, but rather because when you're learning this stuff, you want to do it in a way that forces you to state every assumption of the model as you go. There is a little tool in R called lm, for linear model, that will fit linear regressions in one line, with way less input than this, and give you almost exactly the same quadratic approximation for the posterior. It's fine to use lm, and in fact at the end of chapter 4, actually I think it's the end of chapter 5, I re-explain how to fit linear regressions in R using lm and explain the correspondence between the two, and that's fine. But if you start with lm, you probably never really learn what's going on; that's why I do it this way. The other reason to focus on a tool like map is that later on, especially when you get to chapter 6, though you'll start to see this next week with chapter 5, I keep saying flat priors are never the best priors. There's a very important reason for that: flat priors get too excited. If your machine starts with "I have no clue, it could be any parameter value, including really, really silly ones", it overreacts to the data. So I'm going to show you in chapter 6 that you nearly always do better by having conservative priors, called regularizing priors. And this is not a uniquely Bayesian perspective; it's also a dominant tradition in non-Bayesian statistics, called regularization, and it's dominant because you make better predictions when you regularize. We'll postpone what exactly that
means. But lm you cannot regularize; it has no conventional way to regularize, and with map you can. Yeah, question? [Student] About priors: are there some situations in which using a prior versus not using one makes a really big difference? So the question was: are there situations in which using a prior versus not makes a really big difference. Oh yes, lots. I mean, priors could be anything, so you could choose a really goofy prior and get a really weird answer, and that would obviously make a difference. With a Bayesian model you have to use a prior; your choice is just whether it's flat or not. And absolutely it can make a difference. As we go through examples, I hope to convince you that flat priors aren't the best, because you always know something about reasonable values of the parameters before you begin. Especially with Markov chain Monte Carlo, just to get the thing to run, you've got to use that information a little bit; you've got to tell it that 2 million is not a plausible value, or it will take samples out at 2 million occasionally and cause you grief. So you need to do things like that. I'm not directly answering your question because, again, like the horoscope problem, I'm in the vague general case, and I've just got to tell you that your moon is in Mercury. That makes no sense in astrology; your house is in something, Mars is in your house, sounds like a rap album. Anyway, you know what I mean, so I beg your indulgence; we'll have examples of priors and some of the differences. And there's no reason for you to use only one prior: you should use sensitivity analysis. A prior is an assumption, just like the likelihood, just like the linear models we're going to get to before I let you go, and those are all just assumptions. So if you don't feel you can strongly justify any particular assumption, you should vary it and see if it makes a big difference in the conclusions. You do that with priors, you do that with likelihoods, you do that with linear models.
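How much the prior matters depends a lot on how much data you have. A Python sketch of that, using the conjugate normal-normal update with sigma treated as known so the posterior mean of mu has a closed form; all the numbers here are made up for illustration.

```python
# Conjugate normal-normal update: with sigma known, the posterior mean of
# mu is a precision-weighted average of the prior mean and the data mean.
def posterior_mean_mu(ybar, n, sigma, prior_mean, prior_sd):
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    return ((prior_precision * prior_mean + data_precision * ybar)
            / (prior_precision + data_precision))

ybar, sigma = 154.6, 7.7                     # made-up sample mean, known sd

# a nearly flat prior vs a fairly tight prior centered in the wrong place
flat_small  = posterior_mean_mu(ybar, n=10,   sigma=sigma, prior_mean=178.0, prior_sd=100.0)
tight_small = posterior_mean_mu(ybar, n=10,   sigma=sigma, prior_mean=178.0, prior_sd=2.0)
flat_big    = posterior_mean_mu(ybar, n=1000, sigma=sigma, prior_mean=178.0, prior_sd=100.0)
tight_big   = posterior_mean_mu(ybar, n=1000, sigma=sigma, prior_mean=178.0, prior_sd=2.0)
```

With 10 observations the two priors disagree by more than ten centimeters; with 1000 they agree to within about half a centimeter, which is the sense in which lots of data overwhelms even a fairly tight prior.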
Don't vary the data and see if it makes a difference. Well, actually, that could be a good idea too, but if varying the data makes no difference to the conclusions, then probably the data isn't doing anything in the model anyway. So I just wanted to say: map is a tool, and you'll graduate away from it almost certainly at some point, but I've made it as a teaching tool; it was coded up entirely in the context of this course, for students, and I think in past years either students develop Stockholm syndrome during this course or they find the tools useful. So how do we get a predictor in here? Regression usually implies that there's something associated with something else, and we don't have that yet. We just have a Gaussian model of the distribution of adult heights in the Kalahari San from the 1960s; that's all we have. So what about the relationship between weight and height? Not a thrilling scientific question, but it'll help you see how to build a simple model that measures the association between these two variables, and gives you a distribution of relative plausibilities for all the strengths of association that you allow. So we'll construct that. Here's the bivariate scatter plot for weight and height; we're going to make a model of these two things. There's going to be some new stuff here, so bear with me and I'll go through it step by step. This is the classic linear model. At the top level we still just have a Gaussian likelihood. The only thing that's different now is a little subscript i, and I put it on mu. You see that? That means the mean depends upon the individual now; it's going to be a function of some feature of the individual, whereas before, every individual in a sense had the same mean. That is, every individual was treated, from an epistemological perspective, the same way; there were no individual features of them that we could use to improve prediction. Now we're going to do that, and so we put the little
i on mu, so we can make mu a function of something about individual i. You with me? Does that make sense? The next line defines that function. We make mu sub i a deterministic function of another variable that we have. We create two more parameters, alpha and beta, and the data in this case is x sub i, which will be the weight of individual i. What are alpha and beta? Well, they're things that describe the shape of this function, and that's all they are. They're inventions that you put into the machine to ask questions. Alpha answers the question: when weight is zero, what should I guess about height? You can see how the function implies that if you set x to zero, then mu sub i equals alpha, and so alpha answers the question: when weight is zero, what is height? Now, it's a bit weird, because nobody's weight is zero; when weight is zero, you're not even a cell, you're a fertilized egg, a zygote. That's an important thing to say about this model: linear models are always goofy if you push them far enough. Weight is not going to go to zero in the data, so you don't have to worry about it, but in principle it could. And beta is the rate of change: for every unit change in weight, in x, it's the change in mu, the change in the mean. With those two questions we've defined a line that relates x to the mean of height. Not the individual heights, because the heights still have this uncertainty distribution; each height is a function of both the mean and the standard deviation. But here we have a deterministic model of the mean, and linear regressions are models of the mean; they don't model the standard deviation, they just estimate it. It's whatever's left over. It's often called error, which is not a term I like: each of you is a precious snowflake, and your height being different from others doesn't make it an error. Sorry, I'm an anthropologist; there are no errors. So then we define priors. Now we have three parameters, so we define a prior for each.
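Those two questions can be read straight off the function mu_i = alpha + beta * x_i. A tiny Python sketch, with purely hypothetical values of alpha and beta, not estimates from the data:

```python
def mu(x, alpha, beta):
    # the deterministic part of the model: the mean of height at weight x
    return alpha + beta * x

alpha, beta = 113.9, 0.9      # hypothetical intercept and slope

# alpha: what is the expected height when weight is zero?
at_zero = mu(0.0, alpha, beta)
# beta: how much does the expected height change per unit change in weight?
per_unit = mu(46.0, alpha, beta) - mu(45.0, alpha, beta)
```

Setting x to zero returns alpha exactly, and the difference over any one-unit step in x is beta exactly; that is all the two parameters are.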
There's a really, effectively flat prior on alpha: a standard deviation of a hundred is effectively flat, a super wide Gaussian distribution, incredibly wide. There's a practically uninformative prior on beta, centered on zero; zero would mean there's no relationship between weight and height, since it would cancel the variable x out of the linear model. And a standard deviation of 10 makes it pretty wide; it gives it a variance of a hundred, so it's still pretty wide. Then the uniform distribution we used before for sigma. I encourage you to explore these priors: alter them and re-run the model and see what impact they have. In this case there's so much data, as you'll see, that you can use really tight priors and they get overwhelmed; there's so much data here. But you have to experiment with that yourself to understand it. So think about what's going on, and go through the anatomy of the linear model again. Where the action is, mu sub i, is the mean on row i, and it's defined by this function down at the bottom: alpha plus beta x sub i. And alpha and beta, do they have some true value? Now, these are devices, alpha and beta, so they don't exist independently in the world; how can they have true values? They're true conditional on using this model to measure height: there are particular values of alpha and beta which would give you the best predictions, or describe the sample in the best way, depending upon your purpose. That's how they're defined as being true or not, but they don't exist objectively in the world; they're parameters. Even a parameter like the speed of light doesn't actually exist in the world: light has speed, but that doesn't mean speed is a property of the world. Now you're like, dude. Yes, dude. But that's my philosophy of natural history. So x sub i is the weight on row i; you understand how that works. Alpha answers the question: what is the mean when x equals zero? We often call this the intercept, because in the equation for a line that you learned in
secondary school, or middle school, actually, I forget when people learn lines, that's how you learned what the intercept meant: when x is zero, it's the value of y. And beta is the change in the mean for a unit change in x; we usually call this a slope, because it describes how tilted the line is. I'm going to graph this in a moment. So chances are this is like deja vu, guys: you've learned linear models before, and this is some weird way of looking at the same stuff that was easy before, and now it seems hard, right? And if that's the case, then you're welcome; I've achieved my objective, because I think these models actually can be pretty hard. They're simple statistical models, but even the simplest statistical models can be very confusing. That's a message I want to get across as we go. Alright, I've only got a couple of minutes, so let me try to get through this: show you how to fit this model using map, and then when you come back on Thursday we'll plot predictions from it, so you can really understand the model and learn how to dissect it. So here's a restatement of the model, and by each of the math statements, the definitions of the assumptions in the model, I've put the corresponding code that goes in the formula list in the R code. I think the correspondence is fairly simple. The only thing to really note is the equals sign that defines the linear model in the math on the left: you don't use an equals sign in R, because equals is this very special thing in computer code that actually assigns things to memory, and there's no getting away from that. So instead we use the R convention of the assignment operator, the arrow, the left arrow, and that will work great. What's great about this, too, is that it's a convention in lots of software packages: Bayesian model fitting software like BUGS and Stan use exactly the same convention, from R, that's where they got it. So you learn this convention of using the silly little left-pointing arrow and it'll last you your whole life, maybe. And then you define the
priors the same way; you put the math into the R code the same way. I just want to show you here: you don't have to make the formula list its own little thing, you can just embed it in the call to map like this, and in fact most of the time you probably will, because if you're like me, you're lazy. You can do it either way, but if you make it a separate thing, you can reuse it and fit it to different data. It works the same way. Notice that there are commas at the end of each line, because it's a list, and lists in R have commas that separate the entries. Pass it the data and you're ready to go. Something to say at this point that I meant to say, and this will be my last thirty seconds of today's lecture: map is a hill climber. It optimizes. It has this multi-dimensional topography it's got to climb, which is the posterior distribution; it's got to find the peak and then measure the curvature at the peak. So the question is: where does it start climbing? It starts climbing at random locations it picks from the priors; that's the way it's set up. If you define a prior for a parameter, it samples some random value from each of those priors, and that's where the nearsighted mountaineer starts climbing. Often you can do better than that, so if you're having trouble getting it to climb well, you can pass an optional list of starting values, and there's an example in the book of how to do that. Sometimes, it's true, if you have a really, really flat prior, the random value you pull could be way out in no man's land, where there's basically no incline, and then it can't climb, because it looks flat in every direction, and it just panics. Then R gives you some ugly warning message that you can't interpret, and you send me an email about a non-finite finite-difference error, and I'll be like, ah yes, I'm familiar with this error; I advise a start value or a tighter prior, and then it'll work great. So when that inevitably happens to you, let me know and we'll fix it up. This is just part of doing computational statistics.
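One simplified picture of how that failure happens (a Python sketch with made-up heights; R's optim fails in its own way, but the geometry is the same): far enough out in a tail, the unlogged posterior density underflows to exactly zero, so every nearby point looks identical, there is no incline to climb, and taking the log of that zero is one way non-finite values show up.

```python
import math

heights = [152.0, 160.5, 148.2, 171.3]       # made-up data

def posterior_density(mu, sigma=8.0):
    # unnormalized posterior on the raw (not log) probability scale
    return math.exp(sum(-0.5 * ((h - mu) / sigma) ** 2 for h in heights))

# a sensible start: the surface has a visible slope to follow
near_a, near_b = posterior_density(150.0), posterior_density(150.1)

# an absurd start, mu around two million: the density underflows to
# exactly 0.0 at every nearby point, so every direction looks flat
far_a, far_b = posterior_density(2e6), posterior_density(2e6 + 0.1)
```

Working on the log scale, as the fitting machinery does, pushes this failure much farther out, but with a flat enough prior and a bad enough random start you can still land somewhere hopeless, which is why a start value or a tighter prior is the fix.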
How you fit the model is part of the model, because fitting it in different ways entails different kinds of mistakes and different approximations. This isn't the traditional way we think about it: we think about the model as this thing that lives in the platonic world of mathematical perfection, like the definition at the top of this slide. But when you use it to do stuff, you have to fit it some way, and the way you fit it entails different compromises and different kinds of hazards. So how you fit the model can have an effect. We work hard to remove those kinds of errors, but you need to keep it in mind. With that hopefully uplifting message, I'll let you guys go, and I'll see you on Thursday. Thank you.