Okay, welcome everybody. What I want to do today is give you a conceptual introduction to Bayesian inference and data analysis. By conceptual, I mean we're not going to do calculations. My goal is to help you understand the philosophy and the outline of the procedures so you understand the purpose of doing this. And if you take one message home from all of this, it is that Bayesian inference is a very humble approach to inference, and it's nothing more than counting assumptions. And I want to show you what that means and how that's connected to scientific models, and that's really all I hope to achieve today. But that will take us two hours, I think, and it's still a massive compression, because the truth is doing science is very difficult, right? But I hope to have you leave today with something interesting. So this is probably most people's impression of what statistics means, right, a flow chart like this. You've got some data. Where did the data come from? Who knows? It fell from the sky. Your supervisor gave it to you. You found it on the internet. Anyway, the data arrives. Statistics, as it's normally taught, begins after the data are produced, right? And then you think about statistics for the first time, and there's some flow chart, and you ask a series of confusing questions and answer them, and then you end up with some test that you will find in SPSS's menus, and then you will execute this test and it will tell you what's true, right? Now all of this is wrong, of course, and that's not controversial to say. Statisticians hate this. But this is something that the sciences have generated, because it satisfies the demands that scientists have to produce papers. So what I want to do is back out of this and think about what philosophy would help us decide what we should do instead. And that's a complicated problem, and a lot of this is embedded in history.
So why do we use these toolboxes of tests? And the reason is because there are lots of legitimate scientific contexts in which they're useful. The tests that are the workhorse of statistics largely come out of the Fisherian and Neyman-Pearson tradition, which is very much focused on agricultural trials: large sets of replicates, trying to figure out which fertilizer and which wheat variety, which kind of barley you should grow. A lot of this was focused on beer production as well. And so a lot of statistics were worked out there, things like the T-test. The T-test was invented to test Guinness beer. That's not a lie. And in those circumstances, what do you have? You have large numbers of replicate units, you have experimental control, and things like T-tests are really legitimate and powerful tools. Analysis of variance was invented here, at Rothamsted Agricultural Research Station, one of the longest continuously running scientific research stations in the world, in Rothamsted, England. And R.A. Fisher, who was a famous biologist and statistician, worked there. He had, for a single person, probably the greatest impact on statistical practice, and that came from ANOVA and the T-test and a lot of those things. But in many other scientific contexts, those tools don't make any sense at all. So, here's an example. Actually, I should ask: does someone know what this is? Can you guess? You know what it is. It's a planet that moves in a certain way, and when we move, we see that happening. Yes, so this is Mars. See, right? This is Mars. And from an observer on Earth, on multiple nights, you plot where Mars is. So this is a composite photo of multiple nights, the position of Mars in the night sky. From our perspective, it does a loop, right? And planet is just the Latin word for wanderer. And so this is why they're called planets, because they wander, unlike the stars, which stay fixed.
They're not really fixed, right? They just move really slowly. The planets, though, wander. And so explaining this path of motion is, well, one of the famous achievements of scientific practice: trying to understand what really goes on, with a long history of constructing scientific models to predict this. T-tests don't get you anywhere with this, right? What you need is some scientific model which makes predictions about the path of this object in the sky. And to do that, you need some bigger model, a 3D model of the solar system, that you project onto the celestial sphere that encases us. And this is a big problem. And if you study this at all, you know what the solution is, and it's complicated. But there's no population of Marses, if I may invent a plural. There's no population of Marses that you're going to use to think about this like a T-test, in the way it's derived, or ANOVA. You've got one observation, and the question is what explains it, right? It's a completely different framing. So to do scientific inference, the question would be: what's a foundational approach, a philosophy, that can handle both of these contexts, and other contexts? To continue problematizing this topic a bit: machines can learn now. They can even read. Here's a robot reading The New York Times. Yeah, The International New York Times, which I don't recommend, by the way; it will give you biased judgments. But these robots can read. Now, do they understand what they're reading? No. But they can read, and they can recite it back, and they can summarize it. They can do lots of things. This robot is doing inference, but it's not doing T-tests, and it's not doing other things. It's learning. And it's learning in a feeble way compared to even an animal like a squirrel or a starling, which regularly learn fuzzy statistical relationships that they need to survive in their environments all the time.
Those animals aren't doing science. They're not collecting data and then running ANOVAs. How are they learning? And what can we learn from those procedures as well? So the goal is to have a foundation for understanding inference in general. What is the theory of learning that can encompass all these things and help us understand what's different and what's the same about them? And can we use that to then improve how we do science? So the metaphor that I always use in teaching is: instead of thinking about statistical models or computer programs as robots, let's think about them as golems. First I'll say why not robots. Well, robots are good at stuff, except for walking, which they're terrible at, right? This is the irony: they're good at counting, bad at walking. We're the reverse. Yeah. But robot sounds like a sophisticated piece of technology that would be good. The golem isn't going to sound like that to you. A golem is a monstrous automaton. There's a myth in Jewish folklore: take a bunch of clay, say some magic words, and you can make an artificial slave that can defend you and do hard work for you. And there are various golem legends, the most famous of which is the Golem of Prague. And if you've ever visited Prague, you've seen the little gift shops, right, with golems and things like that. If you haven't visited Prague, do so, and then you'll see the Golem of Prague stuff. And there's this story about a rabbi in the 1500s in Prague who constructed the Golem to defend the Jewish population of Prague against anti-Semitic violence. In this story, though, he ends up decommissioning it, because the Golem is unthinking; it just obeys commands. And so, like in lots of stories like this, if you're not very careful with your commands, you end up causing harm. The Golem doesn't understand anything at all. It just has its programming, and it executes it. It's good at some things, it's incredibly powerful, but it has no wisdom.
And so in the end, they end up decommissioning it, because, well, part of the story is you're not supposed to play with the power of creation. It's very, very dangerous. When we do stats, we're playing with the power of creation. We're making these very sophisticated learning algorithms that we may not understand. And if we're not very careful with them, and we don't make the effort to understand how they work, then you can wreck Prague. This is the metaphor. So let me extend this silly analogy to its climax. On the one hand, we have the Golem. It's made of clay. It's animated by truth; that's literally the word in Hebrew that's written on its brow that brings it to life. It's very powerful, much more powerful than the person who made it. But it's blind to intent, and so it's very easy to misuse. And of course, it's fictional, right? I think it probably never existed. But it's a very compelling story. Statistical models: there's a lot of similarity. They're made of silicon, at least in the sense that computers are made of silicon, right? Yeah, it's a bit weird, but bear with me. They're also animated by truth. We'd like to know the true causes of things in the world, and we want statistical models to help reveal those causes. We hope they're powerful, and if we're good, they will be. They're also blind to their creator's intent, and so statistical models will behave in ways that are unanticipated. The better we understand them, the better able we will be to avoid those unintended consequences. So, easy to misuse. And I don't want to say that models are false. There's this saying from a famous statistician of the last century, George Box, that all models are wrong, but some are useful. I don't like that saying very much. And I get that if you're a beginner, it's a very nice saying, right? But I worry about that saying, because it almost seems like we'll stop criticizing the model because it's unrealistic. But should we criticize models because they're unrealistic?
That's a category error. Instead, I want to say that models are not even false. It's a category error to even debate the truth or falsity of a model. They're constructions. It's sort of like saying: if I go over to my friend's house and he's building a table, and he's using the wrong tool to construct the table, like he's trying to nail it together with a screwdriver or something, it isn't that the screwdriver is false. It's just the wrong tool. Models are tools. And it doesn't make sense to say that they're wrong or not, right? This is a complicated thing, and so it's orienting you towards the idea of selecting the right model to help you learn, in a way. But they're not false things. They're constructions. They're bits of technology that we built. Hammers are not true or false, right? Screwdrivers are not true or false. You make a choice to use a different tool at a different time. These are bits of technology. So what is the Bayesian stuff, then? Bayesian data analysis is Bayesian inference applied to the scientific analysis of data. And it's a very simple and humble approach. All it means is we're going to use probability, which I'll define, to describe uncertainty, that is, our lack of clarity and our imprecision. Again, I'll define this in a very crisp way on the slides to come. One useful way to think about this: if you know basic binary logic, truth tables, constructions of true and false, Bayesian probability theory just extends that to continuous plausibilities. So now things can be somewhat plausible instead of just true or false, right? You can have a claim that's 100% plausible, and that's a claim that's true, 50% plausible, and so on. So it's a logical extension. If you extend ordinary binary logic to continuous plausibility, the only thing you can end up with is probability theory.
It's the only correct, proper extension of it. It's computationally difficult, even though it's very old. Bayesian inference is older than frequentist inference. It goes back to, well, Gauss. Here in this room, people have probably heard of Gauss. He used to be on German money, right? But then we got this Euro money, and now he's gone. I think I'll show the 10-mark note later on in the lecture. Gauss used Bayesian inference to derive linear regression in 1809. Linear regression was originally a Bayesian procedure. It's in his 1809 book, the one about predicting the return of a comet, which got him really famous when he was like 25 years old or something like that. He was kind of smart. And on the continent, German and French mathematicians used probability theory (it wasn't called Bayesian at the time, because that's all there was; it was just probability theory) to do all kinds of scientific inference. A lot of it in astronomy, but in other areas as well: ballistics and many other things. And they worked on relatively simple problems, because they didn't have computers. They did what they could analytically. For the contemporary sorts of statistical models we hope to fit, you need some fancy technique to avoid doing the mathematics. And this is a technique you may have heard of called Markov chain Monte Carlo, and I'll say a little bit about this later. It's just a way to get the computer to do the math for you. That's all it is; it's just a trick for that. So, I said it's older. Bayes is older than frequentism. And in England, around a certain period of time, it became quite controversial, because the frequentist approach was taken over by some very powerful people, including Sir Ronald Fisher, who was an important theoretical biologist. I mentioned him on the previous slide, from Rothamsted Research Station. So in one of Fisher's very often cited and highly influential books on how to do statistics, in which he talks about ANOVA and so on.
In the preface, all he says about Bayesian analysis is that it must be wholly rejected. He doesn't explain why. Now, elsewhere he had explained why he thought it should be wholly rejected, but here he just dismisses it out of hand like this. He's also the one responsible, as far as we can tell, for calling it Bayesian. Before that, it was just called probability theory. He's the one who called it Bayesian stuff, after a certain scholar named Bayes. Okay, so what is Bayesian data analysis? It is nothing more than counting the implications of assumptions. You can put it like this. We count all the ways the data we did see (by the data, I mean the observable variables; we take measurements, and anything we've observed is an observed variable) could have arisen according to our assumptions. And then we rank those different ways relative to one another. There'll be examples of this to come; you're not supposed to understand this from just this slide. Then the assumptions under which the data can arise in more ways are more likely to be true. They're more plausible. That's all Bayes is. And this sounds like an incredibly silly thing, but it turns out to be really, really powerful. And this is why the title of this talk is that Bayesian inference is just counting, because it really is. It's just counting. I'm going to give you examples of this, where we're going to make some assumptions about the processes that could produce data, and then I'll show you how they're counted. There are going to be no calculations; we're just going to count. And that's all Bayesian inference is. And so when you run a fancy Bayesian model in your computer with a Markov chain and all that, it's just using a very indirect technique to do counting. Counting over infinite sets, which sounds terrible. How do you count infinite things? Well, we can do it. This is what calculus is for, right? Calculus is a way of doing things like that.
So yeah, let me say this again, and then we'll go into actually running some examples in a few slides. We commit to this view: we've got a series of plausible processes that could have produced the data, and for each one, we ask, how many different ways could it have produced the data that we see? And then we rank those relative counts of ways, and those are plausibilities. Probability is just plausibility done this way. It's just counting. Everybody agrees with that, by the way, even frequentists: probability theory is just counting. But there are some differences. So, I don't want to say much about the standard frequentist view, because I don't want to pick an argument with it. This is, in my opinion, not the important thing. The important problems with statistical practice are not that people aren't Bayesian. I'll get to what I think the important problems are later. But this isn't the big deal. Still, it's nice to know what the difference is, because sometimes it's very hard to be a frequentist and very much easier to be a Bayesian, and vice versa. So the frequentist view is that probability is defined objectively. It is a limiting frequency of an observable variable, right? You've probably heard this before. Yeah. And the Bayesian view is that it's not. A frequency and a probability are different things. Even when a probability has the same value as a frequency, that doesn't mean it is a frequency. It's just a counted relative plausibility, given your assumptions. It doesn't have to equal the observable thing. So why would this matter? In the frequentist view, uncertainty arises from sampling variation. And in the Bayesian view, it doesn't. In the Bayesian view, uncertainty arises from your internal uncertainty, or the uncertainty of the machine. That is, there are still plausible ways that other processes could have produced the data.
But it's not because you have multiple samples and there's variation among them, and that generates uncertainty. In an agricultural trial, you won't notice the difference between these approaches, because they end up numerically almost identical. But there are scientific cases where they behave very differently. So consider this image. This is Saturn. I manufactured this by taking a real picture of Saturn and blurring it. But this is how Galileo, when he looked through one of those early telescopes, saw Saturn. We know this because he's got notebooks where he drew this, and it's basically a blob with two little blobs on the sides. And he says, this is strange. Saturn has blobs on it. What is this? Then he starts sending off hurried letters to all of his colleagues: I saw some blobs on Saturn. There are orbs on orbs. Now we know that there are rings around it. So it's this question: you see an image like this, you've got a primitive telescope, what's the real image? This is an image resolution problem. Like on these crime scene investigation sorts of shows, where they do magical stuff with imagery: those are Bayesian algorithms. Things like that exist, and they're Bayesian algorithms. There's no sampling variation here. It doesn't matter how many times Galileo looks through the telescope; he's going to see the same thing, right? So the frequentist view doesn't even get you started on this. You need a model of generative processes that could produce this image. There's a range of underlying images that, given the scattering of light, would produce this view, right? And different underlying images could produce very similar views, and so they have different plausibilities. And the Bayesian calculus does that. It lets you go from a hypothesis about what the true image is to how close to this view it would get, given some knowledge about how optics works in this case. Okay. Yeah.
So the summary, maybe, that you can take home, walk out of here with, is: in the Bayesian view, probability is always epistemological. It's not an objective fact about the world. It's the internal uncertainty of the machine. So it doesn't depend upon sampling variation. It's in the machine, or in the golem. Yeah. Or, at the bottom of this slide, think about coin flips. We often say that coin flips are random. You flip a coin, and it's random whether heads or tails lands up. That's a fine description; I'm not going to complain about that. But the coin is not random. The physics is perfectly deterministic. There's no debate about that, right? A physicist will tell you why: it's a system with high angular momentum, but it's chaotic, in the sense that you would need incredibly precise measurements of the initial flick to be able to predict what happens. And so a coin flip is just practically impossible to predict. That's all it means. But it's not an inherently random process. We have physical models that tell us why we can't predict the coin flip. It's a deterministic system. Does that make sense? Hopefully I'm blowing your minds a little bit here, right? So one way to say this is: the coin's not random. We are. Right? We're ignorant of the initial state and the angular velocity and all the other things you would need to plug into the physics model to know whether heads or tails comes up. But the coin's not random at all. The randomness describes our uncertainty. And Bayes takes that view and applies it to everything, because it's true. The world is deterministic by almost all scientific models. Yeah? There are these debates in quantum physics about whether God plays dice, right? But there's no experimental result in physics which is not consistent with a deterministic universe. And anyway, we don't live at the quantum scale, right? So we're not worried about that.
We're social scientists, I think, all of us. Yeah. And so this description of uncertainty being in you is accurate for the sort of work that we do. Does this make some sense? You all seem to be following me. Yeah? You're nodding, at least, which is encouraging. I appreciate it. Okay. Let's go through an animated example now of doing the counting, so I can help you see what goes on. This is the simplest example I could come up with. I like to use this metaphor called the Garden of Forking Data. I take this, of course, from the famous short story by Borges, The Garden of Forking Paths. I don't know how many of you know this. It's a great story where there are all these branching paths, and different things can happen. It's like life, right? You make choices, and different futures open up. You can think about life as a series of branching destinies that are contingent upon your past choices. And data analysis is a bit like that. In Bayesian data analysis, we're mapping out all those branching possibilities given some assumed truth, and that's what we're going to count. We're going to count all the paths that could lead to the event we have realized. There will be different true states of the world that could produce the state that we're in right now, and we're going to count all the weird branching paths, through different choices, that could do it. So let me show you how this works. The future is branching paths, the data are events, and we want to know how many of the possible paths could produce the events we've realized. So let's think about a simple thought experiment. In statistics, we like to draw things out of bags or urns. This is why: it's easy to teach with. So let's think of it that way. We have a bag, pictured there on the left, and it has four marbles in it. I tell you that this is true; it's not a lie. The marbles come in two colors, white and blue, no other colors. So there are four marbles, and the possibilities are that each of them is white or blue.
And now the question is: what are the contents of the bag? So we're going to draw marbles with replacement, one at a time. And each time you draw a marble, you get some information about the contents of the bag, right? But you draw a marble and then you put it back, so you could get the same marble again, right? So you could go white, white; it could be the same marble, or it could have been two different white marbles. And you have to deal with all the possible branching paths, given the true state of the bag, that could produce this. So we're going to draw this out now. And this is all Bayesian inference is. So the hypotheses, the conjectures about what's in the bag: we can list them exhaustively, because there are only four marbles. They could all be white, one could be blue, two could be blue, three could be blue, or they could all be blue. Agreed? That has to be it. Yeah, easy. Science is easy. We're done. Right. Now we're going to draw three marbles. I'm going to say we've drawn them one at a time, putting each back before drawing the next. We draw the first one: it's blue. We put it back in. Reach in, draw the second one: it's white. Put it back in, draw the third: it's blue. Given these observations now, which of these conjectures is most plausible? And we can figure this out using Bayesian inference, right? That's what we're going to do, using probability theory. So here's what we do. We're going to make the garden of forking data, as I call it. We're going to think about the first marble drawn from the bag. Four things could have happened on that first draw, assuming that the bag contains one blue and three white. We're just going to assume that for a second. It's not that we're claiming it's true. We're going to say: if that were the state of the bag, what could have happened on the first draw? There are four paths into the garden of data from that initial node. One is blue and three are white. Right?
These are different paths, because there are three different white marbles. They all look the same to you; they're just a white marble, but they're actually different marbles. They're special snowflakes, every single one of them. Yeah? This makes some sense. We're just on the first draw. Now we grow the garden, because for each of those things that could have happened, there are four things that could happen on top of it on the second draw. This is the garden of forking data. These are the paths into the future, right? So if the first one had been blue, then the second draw is not contingent on it, because we put the marble back. Right? So we could have drawn it again, in which case we get blue a second time, or we could get one of the three white marbles. The same for all the other events that could have happened, because these are independent draws of the marble. If they weren't independent, the garden would look different, right? If we didn't put the marble back, then you'd shrink these; the garden's paths would get narrower and narrower as we went. Does that make some sense, though? You're just drawing out all the possibilities. And then we do the final one, the last marble that we've seen so far, and you get a really big garden now. Lots of possibilities, right? So this is all the things that could happen. Now, what we're going to do with this: given these data, we can eliminate some of these paths, because they're inconsistent with the observations. But before we collected observations, this was what was possible. There are many, many different paths, right? It's four times four times four possible things that could happen in three draws from the bag. And those are the possible sequences of events, all the paths of branching data. Does it make some sense? Yeah? I drew this with a computer program, by the way, because doing this by hand would have been no joke, right? So let's eliminate some paths, then. What actually happened is a tiny subset of this.
Over here at the first thing that happens, down here at the bottom, we draw a blue marble. So we know we're on the left-hand branch into the garden of data. Then we draw one of these three white marbles. We don't know which, but we know it's one of those paths. So there are three paths. Yeah, question? I'm following, but is it all empirical, or are you just making inferences? Well, the data are empirical. And then there are facts that you agree are true, like that there are only four marbles in the bag and they have to be either blue or white. That's not empirical. You have to trust that I haven't lied to you; you can't verify that. So you have to make some assumptions, but the data are real, and we're using the evidence, the data, to decide which of the theories is most consistent with it. And that's the whole point. So we're saying there is a range of hypotheses, and they have empirical implications. That's what we've drawn. We've drawn the empirical implications of the conjecture that there's one blue marble. We've only done one theory so far. And what we've found is that there are three different ways that you could observe this fact, that we get a blue, a white, and a blue marble, if this were the true state of the bag. If the bag has one blue marble in it, there are three ways we could have seen this data. And we're going to compare this to the other possibilities in a second, and then maybe that'll help your query. So is it empirical? Yes. Is it completely empirical? Nothing is. All learning depends upon assumptions. There's no escaping theory. Without theory, you can't learn. And so this is a theory. It has empirical implications. We learn because we confront the theory with its empirical implications, and there is no other form of learning. And if you are interested in, like your example about Mars, about the planets, and you know nothing about it? Well, then you'd have to study it. Yeah. You'd have to do some science.
So you make assumptions before collecting the data. Yes. Yeah. Well, you know, you have a theory, and the theory has assumptions in it. Yeah. And then the theory makes implications about how the data should look if the theory were correct. And that's what we're counting now: all of the ways that the theory could have produced this observation. Some theories are going to have more ways to produce the observation than others, and that's what's going to differentiate them. Yeah. So in the Mars case, some theories get closer to the true path across the sky than others. They make better predictions of where it is. So let's think about the other conjectures. Let's list them again: five possibilities of what's inside the bag. You can't look in the bag, right? You haven't looked. We've just got these three samples. And we're going to count up the ways each of these could produce the sequence of data we've seen. We just did the second one, actually, and it was three. That's why the three appears in this column. There are three paths through the garden of forking data that are consistent with this possibility, this theory. What about the first one? Yeah, we can just eliminate the first one, very good, because obviously it's impossible, right? There are zero paths. We could draw the whole garden, right? It would just be a bunch of white marbles all the way: four branches of white marbles going all the way out. All white. But since we've seen blue marbles, yeah, it's inconsistent with the facts. That's easy, right? Bayesian inference is just that, but for much more complicated situations, and it all works the same way. Let's do the others, though. So what about number three? Well, we could draw that garden. I'll just do it in this third of the screen, because I'm going to do the other ones in the other parts of the screen in a second. Same idea, though. Here's the origin of the garden.
This is the one we drew before, where there's one blue marble. I'm just repeating this garden, but now crammed into the corner. Yeah, so we've got one blue, three white at each split. And we go up to these three paths. If the bag contains one blue marble and three white marbles, there are three ways we could have seen the data. We can't tell them apart, but there are three ways you could have seen it. Let's compare it to the next one. Imagine there are two blue marbles in the bag and two white marbles. We draw this garden, going down on the screen: two blue, two white at each split. Same total number of possible data sets, as it were, that you could see. But now, when we count up the ways, you see there are eight ways we could get two blue and one white in three draws from the bag. There are more ways that this could produce the data. This makes some sense? Third one: three blue marbles, one white marble. Draw the garden the same way, four branches each time, but with three blue marbles and one white each time. Count up the blue, white, blue sequences, the blue, white, blue paths: there are now nine. So this is the most consistent with the data. At least, it has the most ways to produce the actual observation. Does that make sense? Exciting, right? This is rocket science. This is all Bayes is. It's just counting the implications of assumptions. That's all it is. Probability theory is just counting. Amazingly, it works. So, to try to summarize here: when we do statistics, we don't draw these things. For a tiny data set like this, it's already a huge number of possibilities, right? Like 64 options. You don't want to start drawing your data this way, right? You'd need huge pieces of paper. It would be madness. Luckily, mathematics compresses all this. And there are simple rules for probability statements that let you do these calculations without having to draw this all out. This is combinatorics, right?
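As an aside (this is my own illustration, not part of the lecture, and the function name is made up), the counts just described, zero, three, eight, and nine, can be checked by brute force: enumerate every path through the garden, that is, every ordered sequence of three draws with replacement, and keep only the paths that match the observed blue, white, blue sequence.

```python
from itertools import product

def count_ways(n_blue, observed, bag_size=4):
    # One conjecture about the bag: n_blue blue marbles, the rest white.
    bag = ["blue"] * n_blue + ["white"] * (bag_size - n_blue)
    # Every path through the garden: all bag_size ** len(observed) ordered
    # sequences of individual marbles, drawn with replacement. Duplicate
    # colors are distinct marbles, so they count as distinct paths.
    paths = product(bag, repeat=len(observed))
    # Keep only the paths whose colors match what we actually saw.
    return sum(1 for path in paths if path == tuple(observed))

observed = ("blue", "white", "blue")
counts = [count_ways(n, observed) for n in range(5)]
print(counts)  # [0, 3, 8, 9, 0]
```

The conjecture with three blue marbles has the most matching paths, nine, which is the same ranking as on the slide.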
There are combinations and permutations and things like that; all that mathematics is for dealing with sequences like this. And we can just use multiplication, which in this case turns out to be the solution. So under the first conjecture, the all-white bag: on the first draw, there are zero ways to get a blue marble. On the second draw, there would be four ways to get a white marble, and then zero ways again to get a blue marble. So zero times four times zero is zero. Anything times zero is zero. It's impossible; this can't be true. Second conjecture, though: we draw a blue marble first, and there's one way according to this conjecture to get that, because there's one blue marble in the bag. There are three ways to get a white marble on the second draw, and one way to get a blue, so there are three paths all the way through, and so on for the others. So this is what you actually do when you learn statistics and you learn the laws of probability. You learn this product rule, and the product rule comes from this kind of counting. It's just counting. Good. You don't have to understand every detail of this right now. I just want to show you there's a conceptual link, and that all the laws of probability are really just about counting, counting sets. So this is where we get the zero, three, eight, nine. I didn't show you the all-blue case, but I'm sure you can do it. It's also zero, because we saw a white marble. Okay. In this view, if you get more data, you can just update, right? There's not some frozen stage where you have to make the inference. More data can come in in a stream, like it does for a squirrel foraging on the lawn, right? And it updates its beliefs. That's how learning works in organisms and robots. We hope, in robots. So now we've got those counts, which I call on this side the previous counts. Now we draw a fourth marble and it turns out to be blue. And I ask you to update. You don't have to do it all over again.
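The product-rule counting just described can be sketched in a few lines of code. This is not from the lecture; it's a minimal Python illustration (the course materials use R), and the function and variable names are mine. Each conjecture is a number of blue marbles in a four-marble bag, and the ways to produce the observed sequence blue, white, blue multiply across draws.

```python
# Count paths through the garden of forking data for each conjecture
# about the bag's contents, given the observed draws (with replacement).

conjectures = [0, 1, 2, 3, 4]       # number of blue marbles in a 4-marble bag
data = ["blue", "white", "blue"]    # the observed sequence of draws

def count_ways(n_blue, draws, bag_size=4):
    """Multiply, draw by draw, the number of marbles matching each observation."""
    ways = 1
    for draw in draws:
        ways *= n_blue if draw == "blue" else (bag_size - n_blue)
    return ways

ways = [count_ways(b, data) for b in conjectures]
print(ways)  # [0, 3, 8, 9, 0]
```

The zeros at the ends are the all-white and all-blue bags, which cannot produce the mixed sequence at all.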
You can just multiply, right? You don't have to start over again at the beginning. You can just take the previous counts and multiply them by the number of ways you could have gotten a blue marble given each conjecture. Obviously the all-white and all-blue bags we can just ignore, right? But it's useful to think about this first column, where we write down the number of ways you could have drawn a blue marble. That's zero, one, two, three, and four, which is just the number of blue marbles in each conjecture. And then we multiply those by the previous counts, and that updating gives you the new counts. It's multiplication because of these branches in the tree. So it's zero, three, 16, 27, and zero now. Most plausible is the bag with three blue and one white. But you look at these numbers and you're like, well, what do these numbers mean to me? What does three mean? What does 16 mean? Nothing, on their own. The only thing that matters is their relative size. The relative sizes contain information about the relative plausibilities of these conjectures. And the relative plausibilities aren't that different yet. You probably don't think it's this one, but it hasn't been eliminated, right? There's no evidence that it's impossible. Sorry, I'm pointing at the slide and I'm recording this: there's no evidence that the bag with only one blue marble in it is impossible, right? It still has some plausibility. You could have seen this data. It's not even a vanishingly tiny probability. It could easily happen, right? If I gave each of you a bag and you had to draw four marbles, one of you would probably get this result with a bag that had only one blue marble in it. It's not that improbable. But if you're going to bet, you would bet on one of these two. And if you're going to bet on only one of them, I can guess which one you would bet on. Unless you're trying to lose money, right?
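That updating step, multiplying the previous counts by the ways each conjecture can produce the new observation, can be written out directly. A small Python sketch (the variable names are mine, not from the lecture):

```python
# Update the path counts after a fourth draw that turns out blue:
# multiply previous counts by the ways each bag can produce a blue marble.

prior_ways = [0, 3, 8, 9, 0]       # counts after seeing blue, white, blue
ways_to_blue = [0, 1, 2, 3, 4]     # blue marbles in each conjectured bag

new_ways = [p * w for p, w in zip(prior_ways, ways_to_blue)]
print(new_ways)  # [0, 3, 16, 27, 0]
```

The bag with three blue marbles leads, but the one-blue bag still has nonzero plausibility, exactly as described above.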
And that's how it is with these things. The relative differences give you some idea of how much you should bet. Don't bet, by the way, it's bad. No betting. Okay. This approach also lets you use other information. If there's any information which you can summarize as the relative numbers of ways that the data could arise, then you can combine it with the other information. So you can use all forms of data in the same estimate. As a toy example, say someone tells you about the factory that makes these bags. They're like marble gift bags, you know, you buy them at a gift store, where people buy bath bombs and stuff like that. In the factory that makes them, the blue marbles are rare. They're more expensive, because they have some fancy blue dye in them maybe, and they make fewer of them. But every bag contains at least one. I give you only that information now. Can you use that? The answer is of course you can, because it constrains the possibilities. So in particular, this information gets summarized by your informant at the factory as the relative production counts of the different kinds of bags. There are no all-white bags, because that would make customers upset; their process ensures there's always at least one blue one. But there's a ratio of three to two to one of the bags with one blue marble, two blue marbles, or three blue marbles. So three out of every six bags, half the bags, have one blue marble in them. Yes? [Audience: But you're only saying that every bag contains at least one blue marble. Couldn't a bag contain four blue marbles as well?] Yes. Sorry, maybe I left something out there. Blue marbles are rare, and every bag contains at least one blue and one white marble. I should have written that up there. Thank you, I didn't notice this on my slide. Every bag contains at least one blue and one white marble, because otherwise people are upset. They want a mix of colors.
But blue marbles are rarer than white marbles, in this proportion. Does that make sense? We could assume any numbers here, any information from our spy in the factory. The point is to combine them with what we already know. We've already done all that other calculation, going through the garden of forking data. And here now I have this column called prior ways. It's prior because it's our previous calculation that we've already done; it came before. That's what prior means. And those counts were zero, three, 16, 27, and zero. Now we have this factory count information, and we can just multiply, because it's numbers of ways, right? This is the multiplication rule in probability, the product rule. So now it's zero, nine, 32, 27, and zero. Now the even bag, if you will, the bag with two blue and two white, is the most plausible. It's pulled into the lead, but there's barely any difference between those two still. So it has made a difference. I show you this because it's often one of the advertised strengths of Bayesian inference: it naturally accommodates different kinds of information in the same calculation. Whereas in typical statistical procedures, this is very difficult to do. Very, very difficult. Yeah. Good. You're at least willing to let me proceed a little bit. This stuff is, by the way, impossible to fully understand on the first encounter. It's complicated, and I want you to adopt the attitude that it's like learning a language. It's much easier than that, but it's like learning a language in the sense that you can understand a lot before you're really fluent, right? And you can make use of it. You take an intro course in French or something and you can go and order some food, and embarrass yourself, things like that. It's very useful to do. The fact that you're not perfect doesn't mean you can't use it. Right?
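Folding in the factory information is the same multiplication again. A hypothetical Python sketch using the 3:2:1 production ratios from the example (names are mine):

```python
# Combine prior path counts with the factory's relative production counts.
# All-white and all-blue bags are never produced, so they get zero.

prior_ways = [0, 3, 16, 27, 0]      # counts after the four marble draws
factory_counts = [0, 3, 2, 1, 0]    # relative production of each bag type

posterior_ways = [p * f for p, f in zip(prior_ways, factory_counts)]
print(posterior_ways)  # [0, 9, 32, 27, 0]
```

The two-blue bag pulls into the lead, just as in the lecture, because the factory makes more of them.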
So you have to be patient with yourself and accept the idea that understanding comes in pieces, and then some consilience forms over time. That's just how scientific skills develop. Okay. So let's convert these to plausibilities, because these absolute counts don't actually have meaning; it's only the relative values. And if you insisted on having these counts in the actual combinatorics of a dataset, they would be in the billions very, very quickly. You can see these numbers grow very fast. With a dataset of only four observations, you've already got numbers approaching 100. If you had a dataset with tens of thousands of data points, which is not that unusual these days in science, you'd have counts in the billions, trillions. Huge, huge numbers. And you don't want to write those things down. So probability theory is normalized, as we say: it works with normalized counts. We take those counts and we just divide each one by the sum of all of them, and that's called normalization. Now the maximum value is one and all the numbers sum to one. And those normalized relative counts are called plausibilities or probabilities. So walking through our example again, in the table at the bottom left are the different possible contents of the bag. Then I have this column with the P at the top, which is just a label, a name for each of the conjectures. It's the proportion of the bag that is blue marbles. That's what P means. No, it's not the P value. There are going to be no P values in this talk. P values are blasphemy in Bayesian statistics. So P is the proportion of the bag that is blue marbles. For the first one it's zero; for the second one, it's 0.25, a quarter of the bag is blue; then half, three quarters, and all of it. Understood? It's just a label, a number that describes the hypothesis. And then there are our counts that we've gotten, ways to produce the data.
This is just from the prior counts from the beginning of the example. If we sum up the ways-to-produce-the-data column and divide each number by that sum, we'll have these numbers on the far right, these plausibilities. And those are probabilities. Probabilities are just normalized counts of the ways the data could happen according to each theory. Now, the way in statistics you would usually frame this is that P is a parameter that we'd like to estimate: what proportion of the bag is blue marbles? There's a range of possibilities, and the bounds on that number are an assumption that we build into the analysis. What values could it theoretically take? And then we use the data to estimate that value. So we're estimating P, the proportion of the bag that's blue. That's the usual way statistics is phrased. But you just have a range of conjectures. It could be an infinite number of conjectures, because you could entertain every value of P between zero and one, continuously. [Audience: That's impossible with a bag that only has four marbles in it, though.] Right, but if it was a giant, an infinite bag of marbles, and you reached in and pulled things out, then the proportion of blue marbles could be anything. I know the infinite bag is impossible, but in math it's very easy. And then P could be anything between zero and one. It could be 0.17, or 0.17137, right? Those are all possibilities. And then you're just trying to estimate as precisely as possible what it is. That's what you usually call a parameter estimate, or an effect size if you're doing a comparison between treatments. It's the same idea. There's a range of hypotheses about what the difference between the treatments is. Those are conjectures. Given each conjecture, how many ways could it produce the data you've seen? And that's the way all stats is done in this framework. Does that make some sense?
So far there's no difference from the frequentist view, except that we got to use prior information. That's the only thing that's different so far. Frequentist probability is the same as this; it's just counting. Okay. So in R, how many of you have used R at all? Yeah, all of you. Super, fantastic. It's the world's best calculator, right? You can do all kinds of stuff with it. And if you want to do this kind of calculation in R, you can just make a variable called ways and give it the list 3, 8, 9. I left off the zeros, because we know what's going to happen there, but you could put them in; it won't make any difference. You then divide ways by the sum of ways, which I show in the code at the bottom here. You can do weird stuff like this in R. And that converts it to probabilities, and then you get the probabilities here. Does that make sense? Most of the time, the probability functions in R are continuously renormalizing so that you stay in this probability space. That's what probability functions do, so that we don't get these exploding counts. All right. So let me try to summarize this a little bit. Plausibility is called probability, in applied probability at least. It's just a set of non-negative real numbers that sum to 1. That's all probability is. And it comes from normalized counting. Probability theory is just a convenient way to count big sets of things and to think about their relative amounts. It's just shortcuts for counting. And that's why at the start of this I asserted that Bayesian inference is just counting. That's all it is. But it's counting really complicated spaces, and we need computers to do it, because people are bad at counting. Computers are good at it. People are good at walking; computers are really bad at it. You make a robot walk upstairs, you win an award, right? Meanwhile, babies who are one year old can walk upstairs. It's a nice symbiotic relationship between us and computers.
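The R calculation the lecture describes is dividing the vector of ways by its sum. The same normalization in Python, as a sketch:

```python
# Normalize path counts into probabilities: divide each by the total.
ways = [3, 8, 9]    # zeros dropped, as in the lecture; including them changes nothing
total = sum(ways)
probs = [w / total for w in ways]
print(probs)  # [0.15, 0.4, 0.45]
```

The numbers now sum to one, which is all that "probability" adds on top of the counting.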
So they're good at things that we're not. That's why we made them; it's no accident that they're good at things we're not. That's what we built them to be good at. Does this make some sense? It just has to make some sense. Question? [Audience: I'm still wondering about the empirical side of it. Say the possible compositions are four children: one of them helps, two of them help, three of them help, and so on. But I don't wait to observe the action of helping; I just calculate these plausibilities without the event. Is that correct?] No, no, no. You've got the data, and now you're asking what the calculation means. What it means is: having seen the data, what proportion of children in the population help? So it's like you've got some bag of children, right? In this metaphor, there's a bag of children. Some of them are helpers and some of them are nasty. And you draw a child out, you observe their behavior, and you write that down. We've observed three kids. The first one helped, the second one didn't, and then the third one helped. And now we're going to estimate the proportion of children in the bag who are helpers. So this is after the experiment. Yes, of course, we're doing statistics on the data. But, and we'll get to this later, you can plan the experiment using the theory. In fact, you should, because you want to know if the experiment can discriminate what's actually going on. This is how you decide how many children you need to draw out of the bag. Does that make sense? Good, that's a good question. Other questions? All right. Okay. Building a model. Let's think about this in a constructive way, and I'll give you another animated example of how this works that looks more like a conventional statistical analysis. Because we don't draw gardens of forking data.
In applied statistics, there is this narrative that we use most of the time, but not always. The first part is we design a model. That's a story of how the data could arise. Maybe we don't have the data yet, but we anticipate its form. We know the possibilities. We've asserted that kids either help or don't help, or that marbles come in blue and white. Maybe we haven't done the experiment yet, but we know the structure of it. We know what the data will look like. Or maybe we already have the data. Maybe we downloaded it, or a colleague gave it to us, or somebody quit and handed us their PhD dissertation; these things happen. And now we're going to analyze the data and get a publication. So the first thing is to use your scientific knowledge of the topic, of the discipline, to tell a story about how the data could arise. There will be multiple stories, and the question is how to use the data to distinguish among them. That data story helps you design the statistical model. The statistical model should embody the assumptions of the data-generating process. What's the causal hypothesis for how these data get produced? That sounds fine in the abstract. In specific examples, you have to think hard, and you have to know your scientific discipline. This is often why statisticians are not very useful for scientists. I may say that. I know my colleagues are listening, but it's true. It's because the field of statistics has designed techniques ignorant of the application. So it creates this vague, horoscopic kind of advice about how, in general, you want to do this and that. But in any particular scientific domain, given your scientific expertise, you should use your expertise to design the statistical analysis. Because the statistical model should embody scientific assumptions, your causal hypotheses about how the data are born. And that's why we call these things data-generating models, or, as I like to say, a data story.
Say you want me to tell you a story about how these marbles came to be. Well, there was a bag. This weird guy came in, and he gave us a bag, and he said there are some blue marbles and white marbles in this bag. And then he drew three out, put them on the table, and left. That's how the data came to be, right? And in that, you've got enough to build a model, because you saw the weird guy draw the marbles out and then put them back in. So you know he was sampling with replacement, and the other relevant facts. Does that make sense? More seriously, you have some developmental hypothesis about child behavior. There are scientific theories about how that behavior develops, and they can be more or less compatible with the evidence. That's what we do. Then you condition on the data. This is counting the ways through the paths. Now the data definitely exists. Updating means you take your original plausibilities, by which you rank the hypotheses, maybe you think they're all equally plausible, which you listed prior to seeing the data. And then you condition on the data, meaning you constrain your beliefs based upon what's now impossible given the data you've seen, and what's less plausible, given that there are fewer ways for certain conjectures to produce the data than others. Makes sense? That's what we just did in an animated sense. And then you evaluate the model, which we have not done yet. You critique it. You think, okay, there's a certain conjecture that's most consistent with the data, but you know it's still pretty terrible. We're still not predicting the events very well. We still don't understand why some kids help and some kids don't. What additional scientific assumptions do we need to make, and then test with data, to make better predictions about how kids develop? There's nothing in the blue marble example to do better, but in real science there typically is.
So the best model may still be terrible, and that's why this third stage is necessary. Remember the golem of Prague: it's still a golem. It's not false or true, it's just a good tool or not. And this evaluation stage is when you realize that hammering a table together with a screwdriver is a bad idea, right? And then you try to find another tool. Models can fail in really spectacular ways, and Bayesian updating won't tell you that. It'll just update. It takes the model as given and tells you the number of ways that model could produce the data, but it won't evaluate on its own whether the model is good or not. That's up to you as a scientist. So you have to step out of the mathematics at some point and do criticism. Let's do another example that's closer to the sort of estimation problems you might see. Let's think about a globe. Sometimes when I teach this, I have this inflatable globe in my office; if you ever visit me, you'll see it. I use it for teaching and I throw it at audiences. No one has ever been injured, I assure you. So I throw this inflatable globe into the audience, and usually someone panics and it bounces off their head, and then the person behind them catches it. And then I say, okay, you holding the globe, tell me where your right index finger is. Is it over water or over land? And they'll say, nervously, like, what do I win? And I say, nothing, throw it at somebody else. And we throw it around, counting: water, land, water, water, water, land, for some number of times. In this example, nine times, nine tosses into the audience. First there was a water, then a land, three waters, a land, a water, a land, a water. We stop and say, okay, let's write that down. That's data, and we're going to use that data now to estimate the proportion of the globe that's covered in water. Yeah? Now you may know the answer to this question, what proportion of the earth is covered in water.
Notice we're asking the question: what proportion of this globe is covered in water? That's not the same question, but that's part of the lesson; this is how measurement works, right? Because there are other events. There's an air valve on the top of this globe, and that's neither water nor land, and other sorts of things. And sometimes people land on a coastline and they're not sure how to answer me, right? And you have to make assumptions. But those issues aside, that's measurement, and those issues are different in every field. Let's leave them aside and assume that the globe is a good representation, and it is a pretty good representation, of the geography of the earth. How should we construct a statistical estimate from this sequence of data for the proportion of the globe covered in water? Do any of you know the true proportion? Yeah, that's right. It's almost exactly, it's a little over 70%. We'll say 70; it's about 70% water. This is mostly a water world, right? Viewed from space, there's a whole half of the globe that's the Pacific Ocean, which is almost entirely water. So let's go through that three-part construction sequence that I had two slides ago and think about how we construct an analysis from this. Okay, you with me? This is the mission. The first stage is to design: we're going to tell the data story. The data story was, again, weird guy comes in, throws a globe at the audience. You can tell this data story, and from it you get facts that help you design the statistical model. Relevant facts in this case: the different samples are independent, at least approximately independent of one another. When you throw it to somebody else, the thing spins chaotically, it's caught in some awkward way, it bounces off a couple of people, gets caught. The samples aren't correlated, right? You don't get water, water, water, water just because the first one was water. They're mixed up.
And then, assuming some true proportion of water P, again, I'm using P just to mean a proportion, when you toss the globe, the chance any individual toss comes up water is that value. That's an assumption about how the tossing process works. I'll say it again: assuming there is some true proportion of water, we don't know it yet, but we're going to label it P for proportion, then when anybody catches the globe, the chance that their index finger is over water is P on any individual trial. So now we're going to have a sequence, and that'll let us say what the probability of the whole sequence is. Does that make some sense? And you've done stats problems like this, yes? Very good. So how do we do this? I will show you. Indeed, we will have an infinite number of conjectures, and it turns out that's no harder than a finite number of conjectures, because of math. There's an infinite number of possibilities, right? And we can rank them all. This is what's great: your computer can do this, no problem. This is what calculus is for. You have an infinite set, but you can still rank all of its members. We're going to do calculus, but you're not even going to notice. It'll happen, and you'll be like, wow, I just did calculus, and you'll feel fresh for the rest of the day. Yeah? That's a great question, because this is the trick that always worries people. But actually, infinite sets are often easier in mathematics than finite sets. There's lots of awkward stuff that comes from finite sets, and I won't go down that road, but there are lots of fun things about it. Okay, so this leads us to our data story. You toss the globe, there's a probability P of water and a probability one minus P of land, and there are no other events. That's an assumption; we're asserting that, but probably we have consensus on it. We might debate these assumptions later. And each toss is assumed to be independent.
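Under these assumptions, independence and a fixed probability P of water on each toss, the probability of an entire sequence is just a product over tosses. A minimal Python sketch (not from the lecture; the function name is mine):

```python
# Probability of an observed toss sequence, assuming independent tosses
# with probability p of water ("W") and 1 - p of land ("L").

def sequence_prob(p, tosses):
    prob = 1.0
    for t in tosses:
        prob *= p if t == "W" else (1 - p)
    return prob

# The nine tosses from the lecture: W L W W W L W L W
print(sequence_prob(0.7, list("WLWWWLWLW")))
```

With six waters and three lands, this is p^6 (1 - p)^3, which is exactly the kernel of the binomial formula that shows up next.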
You could make a model where they're not independent. That's called a serial autocorrelation model, where adjacent tosses have some additional similarity, and we make models like that all the time, but this is supposed to be a simple example, so we'll leave stuff like that out. Now, you can translate this data story into a series of probability statements, which are just ways to count the paths through the garden of forking data, now for an infinite number of conjectures, right? It's like a bag that contains an infinite number of possibilities, but the steps are the same. There's just a mathematical procedure for compressing all that, and it's called probability theory, the laws of probability. We're not going to go into the details of that here, but I think Robert sent you a link to my book, a PDF of my book. This is chapter two of the book, and in there I go through and show you the R code to do this stuff and how all the construction works. But I leave it out of the lecture, because the concept is what really matters, and I want you to come away with some conceptual understanding of what's going on. Okay. So, Bayesian updating now is the conditioning part. This is the counting part that I showed you for the bags of marbles. It's called Bayesian updating because you've got some initial set of plausibilities. The machine needs some initial information state about how plausible the different conjectures are. You can make them all equally plausible, or you could have pre-existing scientific knowledge that some of them are silly. In this case, that's true, right? You know that the earth is more than half water, because you went to school. You know that's a fact, and so any hypothesis less than a half is already ruled out for you. And you could start there.
To make this simple, we're not going to even use that knowledge. We'll just make them all equal, but the procedure works exactly the same regardless. So you've got some prior information state, which is called the prior, and Bayesian updating updates that to the posterior. These just mean before and after. Before and after what? The data. That's all. Does that make sense? Now, it turns out that because of the way Bayesian updating works, there are many models where the time ordering doesn't matter, and you'll end up with the same belief no matter what order the data arrived in. That's true in this case as well. That sequence of tosses could be reshuffled, and you'd end up with the same inference at the end, because the order doesn't matter. There are models where that's not true, because you have autocorrelation in the tosses, and then the order has information in it, but that's not true here. And all that comes from the data story. Okay, so you program your golem with a prior information state. Here, there's an infinite set of conjectures labeled P, all the real numbers between zero and one, and we need to assign each of those some prior plausibility. Again, in math you can do this: you assign a distribution on that set P where they're all the same. That would be a uniform prior, which amounts to the statement that they're all equally plausible. Or you could set it so that everything below a half is impossible, assign it zero, and everything above it some other number, it doesn't matter which number, because it's all relative, so assign them all one. That works too.
Then you condition, you do the updating. Again, if you look at chapter two of the book, I show you the code for this and we walk through it. Conceptually, it's just counting paths through the garden of forking data, but it's done with combinatorics. There's a formula derived in the text that takes that data story from the globe tossing and produces a formula you've probably seen before: the binomial sampling formula. It's the coin tossing formula; you've seen it before. Okay, and then you get a posterior. This is a new confidence in each value of P, and it's conditional on the data. Does that make some sense? I know, very exciting, right? Let's see this in cartoon form; it's a little easier to understand that way. The infinite set of possibilities is on the horizontal axis at the bottom, everything between zero and one. We'll consider them all, and now we're going to define a prior. I'm just going to make them all the same, and that's this dashed horizontal line, which says that before our golem has seen one toss of the globe, it's programmed to believe every possible proportion of water is equally plausible. It's a dumb golem, right? You're smarter than it, because you went to school and the golem didn't. Make sense? Then we update. We see the first toss, and what I'm going to do in the next series of images is present one toss at a time, and we're going to let the machine update, and you're going to see how this line changes into a curve that represents different plausibility rankings for the different values of P. All of that comes from probability theory; it's completely deterministic given the data. So the first thing is we see one W. See, I put the whole sample at the top of the image, this W, L, W, and so on, with everything grayed out except the first one, because the machine has only seen that first water right now. So this is N equals one, the sample size, N equals one.
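The binomial updating can be approximated on a grid of candidate values of P, which is roughly how chapter two of the book does it in R. Here is a Python sketch of that idea for the nine tosses (six waters, three lands); the grid size of 101 points is an arbitrary choice of mine, not from the lecture:

```python
# Grid approximation of the globe-tossing posterior: binomial likelihood
# times a flat prior, normalized so the plausibilities sum to one.
from math import comb

grid = [i / 100 for i in range(101)]     # conjectured proportions of water
prior = [1.0] * len(grid)                # uniform prior: all equally plausible

# Binomial probability of 6 waters in 9 tosses for each conjectured p
likelihood = [comb(9, 6) * p**6 * (1 - p)**3 for p in grid]

unstd = [l * pr for l, pr in zip(likelihood, prior)]
total = sum(unstd)
posterior = [u / total for u in unstd]

# The most plausible value sits near the sample proportion 6/9, about 0.67
best = grid[posterior.index(max(posterior))]
print(best)
```

Replacing the flat prior with zeros below 0.5 and ones above it gives the "went to school" prior mentioned earlier, with no other change to the procedure.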
The prior was this horizontal dashed line; the posterior now is this diagonal solid line. That may seem weird, but this is the only mathematically legitimate posterior you can reach; it comes from the laws of probability. If you think about it: we've seen one water, and that increases the plausibility that there's more water, so you put more plausibility up on the high end. And we've eliminated only one hypothesis, which is zero. There's zero probability of P equals zero now. That's the only thing that's been eliminated, but the lower values are less plausible, because there are fewer paths through the garden of forking data on which you get a water if water is scarce. If water is common, it's very easy to see water, and that's why the plausibility tilts this way. There's an exact calculation which enforces the fact that this has to be a straight line. That won't be true very soon, but it's true right now with one observation and this prior; if the prior were different, it wouldn't be a straight line. Okay, now I'm going to put this up here and we're going to do all of the tosses. There are going to be nine of these, we'll go through them all, and I'll show you how the machine sees things. Okay, second toss. Now we see a land, and we get a symmetric hill. There's just as much evidence for water as for land, the same amount of evidence for each. You started out with equal prior plausibility for all the values, so you get a perfectly symmetric plausibility. The most plausible state of the world right now, for this golem, is that half of the globe is covered in water, but notice it's not very sure of that. There's lots of plausibility across a wide range of proportions on both sides. Yeah? Now we get another water, the third toss, and the hill shifts, right?
So in each of these, the previous posterior becomes the prior, and I make it a dashed line, and then the solid one is the new posterior; you see that in the animation across. So now we get another water, and it shifts over, because it's more plausible that there's more water now on the third toss, but it's still very vague. This golem is not very confident about any of the plausibilities. You have no significant result, right? I'm not doing significance testing here; there's no hypothesis being tested. Next three: n equals four, n equals five, n equals six. We get another water, and it moves to the right again. Notice here it's getting taller. Why? Because it's concentrating, right? It's getting tight around a narrow range of values. There's still the same amount of area under the hill in every one of these pictures; it's just getting concentrated in a narrow region, so the curve has to get taller. Yeah? Does that make sense? So the sample size in a Bayesian analysis doesn't have the special character that it often has in a frequentist analysis, because it's embodied in the shape of the posterior distribution already. The posterior distribution summarizes everything you've learned from the data, including the sample size, and then you don't need to do anything extra with the sample size, like calculate degrees of freedom or any of that other stuff you do in a frequentist analysis. It's already there in the shape of the posterior distribution. And then, yeah, with five, we see another water; again it shifts up and to the right. But then on the sixth sample we get another land, and oh, you get a course correction, right? It jerks to the left again. Yeah? In a very specific way. Last three. I think you've got the theme, right? You know how this works.
In the last three, what you'll notice is the same jiggling going on, but the jiggles are getting smaller and smaller, because each new observation contributes relatively less information. Yeah? The machine is getting more and more sure about what's going on, and so it's not as influenced by each individual data point anymore. And this is the sample size effect that you're familiar with from many statistical procedures. So by n equals nine, there's still a lot of vagueness, right? You'd need more data before you could publish this. But it's clustering around higher values; 0.7 is somewhere around... I'm sorry, I'm touching my slide. People won't be able to see this later, but I'm pointing at about 0.7. There's lots of plausibility there. There's a homework problem I often assign when I teach this class for credit, where they take this example and see how many tosses of the globe you need to get it to contract narrowly around the true value. The answer is pretty big, actually, but it depends upon how much precision you need. Yeah. I mean, we didn't really know the proportion of the earth that's covered in water until we had satellites, right? It's a pretty hard problem to get it exactly right. But you can get close to 0.7 with a small amount of data. Okay, that's the conditioning step. A quick rundown to summarize it. In this particular case, the order of the data is irrelevant. We could have presented the W's and L's to the machine in any order and gotten the same final picture at the end, but that's only because the data story we told assumes that the tosses are independent of one another. I've said this before: that's an assumption we make. The other thing I want you to take away from this is this dance with the dashed lines and the solid lines. Every posterior becomes the prior for the next observation, right? Now, when you do Bayesian data analysis, you don't have to present the data one observation at a time to the model.
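The dance of dashed and solid lines, where each posterior becomes the next prior, can be sketched like this (hypothetical Python, assuming the nine tosses were W, L, W, W, W, L, W, L, W, six waters, as in the slides); it also shows why the order is irrelevant under independence:

```python
import numpy as np

p_grid = np.linspace(0, 1, 101)
tosses = ["W", "L", "W", "W", "W", "L", "W", "L", "W"]

# Sequential: each posterior becomes the prior for the next toss.
post_seq = np.ones_like(p_grid)        # flat prior before any data
for obs in tosses:
    like = p_grid if obs == "W" else (1 - p_grid)
    post_seq = post_seq * like         # update one observation at a time
post_seq /= post_seq.sum()

# Batch: condition on all nine tosses at once via the binomial likelihood.
w, n = tosses.count("W"), len(tosses)
post_batch = p_grid**w * (1 - p_grid)**(n - w)
post_batch /= post_batch.sum()

print(np.allclose(post_seq, post_batch))  # same posterior either way
```

Because each toss multiplies in its own likelihood, shuffling the tosses only reorders the multiplication, which cannot change the product.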
The machine does it all at once. But you could present them one at a time and it works. An important realization from this is that there is no minimum sample size required for a Bayesian analysis. It's just that when you have very little data, you will not have much confidence with which to distinguish the hypotheses. You can have one observation as the minimum for a Bayesian analysis; it's just that you'll draw almost no conclusions, because you'll basically get the prior back. Well, it depends upon the model and how you evaluate it, actually. Sometimes one observation can eliminate a hypothesis, and then it's really good, right? This is the famous thing with Einstein and gravity and light, right? Things like that. One observation can sometimes be very decisive, if you have a good model. Make sense? Yeah? Okay. Evaluation. It's hard to say anything generally useful about how you evaluate models, because it really depends upon the domain and the scientific question in your scientific field. But it's important to say that you're not done when you do this. The Bayesian machinery is what I call a small world phenomenon. It's perfectly logical, it's about epistemology, but you have to check it against the facts. The model may have unexpected behavior. The theory that comes out on top could be the best relative to the others and still be terrible, and it takes your scientific judgment to figure this out. So this is what I call, on this slide, golem supervision. Bayesian inference just answers a logical question. These golems just answer the questions you ask them. The question could be bad, and often you don't realize it's a bad question until you get the answer, right? It's like in folk tales, right? You find a genie and you ask it a question and you're like, oh damn, that was a bad question, and now I will suffer. This is how these things work when programming computers.
You often realize it was the wrong question only when you get the answer. And that's what the supervision step is. From the golem's perspective, it did its job and you should pay it now. Thank you. It did exactly what you asked; you just asked a bad question. So we might ask if it malfunctioned. Sometimes that happens as well. But even in the absence of malfunction, does its answer make any sense to you, given what you wanted an answer to? The answer will make sense relative to the internal state of the machine, but it needs to make sense to us to be of any use in science. So then you often loop back to a redesign of the model, because that's how you ask a different question. You redesign the model, and so on. And I know, since you are all actively engaged in science, you could tell stories of this, of realizing models were terrible after you ran the experiment. This happens to all of us. It's part of the manic depressive cycle of investigation, right? You think you've got it, and then you don't, and then you've got it, and then you don't, and it's just normal, and it never stops. Okay. So if we're going to really build these models, let me get a little closer to how statistical models are presented in scientific journals. There's this script by which we go from the hypothesis to attaching it to the data. If we're going to actually build an arbitrary statistical model based upon our scientific assumptions, the first thing you do is list all the variables. What are variables? Well, we're going to come back to the globe tossing example in the next few slides and I'll show you what the variables are in that case; I think you probably have a guess. Then you define the generative relations among those variables. That is, if you knew any one of them, how does that help you know the others? Those are the generative relations among variables, and those generative relations depend upon the science. That's where they come from. Statistics does not determine them.
But you've got to build them in statistical language. Then there's this question mark part, which is where you do the estimation and stuff, and then you profit: you publish the paper. This is an old meme, but maybe somebody knows it. One person. Very good. All my memes are old; there are no fresh memes here. Yes? You have a problem with this sentence, that hypotheses are not models. Uh-huh. Why? Because typically there will be a huge number of statistical models that match any one hypothesis. Hypotheses are not specific enough to tell you all of the steps that generate data. Usually a scientific field will have some vague hypothesis, like something increases with something else. When you make a statistical model, you've got to say exactly what that increase looks like, and the background theory often doesn't say. So that's why. The background theory is what? Well, I don't know; you'd have to give me an example, and there are a huge number of possibilities. Isn't it biasing when you mention these terms? No, no. I'm talking about science, not statistical frameworks. You've got a scientific hypothesis that something happens, that one thing causes something else, and those hypotheses nearly always are vague. They don't specify mathematical functions. When you explain your scientific hypotheses to your colleagues, do you write down equations? Probably not. Maybe you do, but then you have a model. The model comes from a hypothesis, but it's not the same as a hypothesis, because there will be an infinite number of models which are consistent with the hypothesis. An infinite number. This is famous in biology: say you have a distribution of the frequencies of alleles in a population of organisms, and you have some hypothesis, for example about how much selection explains the distribution of alleles in the species. There are literally an infinite number of specific models that are consistent with the hypothesis that selection matters.
There are also an infinite number of models consistent with the hypothesis that selection doesn't matter. So this is a famous debate; in chapter one of my book there's a whole section about it that will give you some more background. This is the norm, I believe, in science. Hypotheses are a giant bag of potential statistical models, each of which must specify exactly the functional relationships among variables, and we don't usually engage at that level until we do the statistics. Or if you're a theorist, then you do the theory. So in cognitive psychology there will be a huge range of cognitive models which are all instantiations of the same background hypothesis, like reinforcement learning. There are a thousand models of reinforcement learning; I'm probably underestimating there. But they're all reinforcement learning, which is a hypothesis. And then you get into sub-hypotheses, and for each of those, again, there will be a bunch of mathematical models which are consistent with it. Did that help? I think it's common for people to identify the word hypothesis with model, but in the pragmatics of scientific communication, hypotheses as written down are almost never models. They're vague causal relationships that must then be specified before you can predict data. Okay. Once you've got the model, you input a prior. I say joint prior because it must cover all those variables: you need prior plausibilities for the values all the variables could take. And then you show the model the data and it deduces the posterior. Okay. In the case of the globe tossing example, the joint model looks like this. This is what statistical models look like in statistics, whether you're Bayesian or not; they always look like this. So we've got some variables here, and you can guess what they are. W is the count of water that we've observed. It's a number between 0 and 9. Before we do the experiment we know it'll be a number between 0 and 9. Why?
Because that's what our data story tells us: it has to be. It can take no other values, because it's a count of the number of times someone looked where their finger was and said water, and that's between 0 and 9 because we're going to toss the globe 9 times. Make sense? So that's a variable. When we see it, after we've done the experiment, we call it data. But it's just a variable that could be observed or not; it depends upon the experiment whether it is observed. N is also data: it's how many times you toss the globe. Yeah? If you did a different experiment, it would be a number other than 9. But it's an input into the model. The model doesn't know it; you have to tell it. P is the variable that's the thing we're trying to estimate. Its true value is what we'd like to know. We don't know it and we can't observe it; we have to infer it from other things. This is typical in the sciences: there's some question about a value in nature, and it can't be directly observed, so we have to measure it indirectly. In fact, most of the time in science you have to do statistics just to measure things. This is what effect sizes are. Statistics lets you estimate differences, the effects of causal interventions. The consequences of causal interventions can't be directly observed, because causation is epistemological. Everybody here has read their Kant, right? Causation is a belief; it's an assumption, always. Causation can never be observed in nature. It arises from scientific assumptions which constrain the possibilities. P is the state of the world we'd like to estimate. We know things about it: we know it's a number between 0 and 1, because it's a proportion. So we still know something about it. Then you have to say: we're going to observe W, we're going to observe N, we're not going to observe P, so we need to assign the machine some initial plausibilities.
These are the conjectures, the possible contents of the bag, and in this case the distribution is exactly as it sounds: it's a flat line. It's uniform. This is not required; it could be highly non-uniform if you had good reason to make it so. This is just notation. No matter how complicated the stats model, there will be a series of statements like this which just define the generative story. W is generated as a binomial sampling process, that's what binomial means, with N trials and probability P of success on each. Yeah? And P has uniform plausibility, before we see the data, between 0 and 1. After we see the data, the distribution of P will change, and that's the information we get; it will no longer be uniform. So Bayesian models, very importantly, are generative, which means they can be run in both directions. What does that mean? If you run a generative model forward, it's a simulation: you can ask the machine, if I ran this experiment, what would the data look like? That's a generative model. Does that make sense? Bayesian models are always generative. Non-Bayesian models might be generative. This is one of the nice things about the Bayesian approach: you can run things. You can do power analysis with them. You can run them forward. You can do experimental design, study design, because given the assumptions of the generative process you can produce data that lets you imagine what the experiment would look like and whether it could distinguish the different hypotheses. That's what I call running the model forward, as time flows; that's the reason I chose the word forward. You go forward in time and data comes out. Now the backwards direction: the data are here, and we'd like to go backwards to the process that produced them. Now we run the model in reverse. In the forward direction you input the parameters: you choose a value of P. In the reverse direction you don't input P; you input the W and the N and you run it backwards.
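A sketch of what running it backwards produces (hypothetical Python, flat prior, grid approximation): you input the observed W and N, and you get back relative plausibilities over P.

```python
import numpy as np
from math import comb

def posterior_for_p(w, n, grid_size=1001):
    """Reverse direction: given data (w, n), return plausibilities over P."""
    p = np.linspace(0, 1, grid_size)
    post = comb(n, w) * p**w * (1 - p)**(n - w)  # binomial likelihood, flat prior
    return p, post / post.sum()                  # normalized plausibilities

p, post = posterior_for_p(w=6, n=9)
print(p[np.argmax(post)])  # most plausible proportion, near w/n = 2/3
```

Note the function name and the choice of six waters in nine tosses are mine, for illustration; the machinery is the same binomial model defined above.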
This is the Bayesian updating, and then you get a distribution for P. It won't tell you the exact value, but it gives you the relative plausibilities conditional on the data. These are the two directions. The second, reverse direction is statistical inference. In the physical sciences they often call this the inverse problem: it's the inverse of the forward direction in time. If you want to generate data, that's doing the experiment; that's the forward problem. Yeah. Causal processes have implications in time going forward, and if we have hypotheses about those implications, we can use them to make inferences about which causal process produced the data. These are the two directions. Does this make some sense? This is very standard terminology. This is how scientific inference works, except that Bayesian models are always generative and not all models are. There are lots of statistical models, for example in economics, which are not generative; they don't predict distributions of data at all. Even though they make useful estimates, they don't predict distributions. Here's our joint model again. If you want to think about what the forward simulation means for this model: you set the sample size N to nine tosses, then we sample from the prior distribution of P, 10,000 values between zero and one from the uniform distribution. P is now a big bag of values that are present in proportion to the prior plausibilities, and then for each of them we sample a W using the binomial, and we end up with 10,000 W's. I just use a table to summarize them, and as you'd imagine, since all the values of P are equally plausible to start with, over these 10,000 imaginary experiments you get the whole smear of counts of W between zero and nine, because this is what the model thinks. It says: they're all equally plausible, so if you ask me to simulate, I'm going to return all of these in about equal proportion. If you change that initial distribution of P, you get differences.
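That forward simulation can be sketched in a few lines (hypothetical Python; the lecture's own version is in the book, in R):

```python
import numpy as np

rng = np.random.default_rng(2718)
n_tosses = 9

# Sample 10,000 values of P from the uniform prior ...
p_samples = rng.uniform(0, 1, size=10_000)
# ... and for each one, simulate an experiment of nine tosses.
w_samples = rng.binomial(n=n_tosses, p=p_samples)

# Tabulate: with a flat prior, every count of W from 0 to 9 shows up
# in roughly equal proportion across the imaginary experiments.
counts = np.bincount(w_samples, minlength=n_tosses + 1)
print(counts)
```

Run the same three steps with samples drawn from the posterior instead of the uniform prior and you get predictions from the trained model; the procedure is identical.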
So you can do this after you educate the model as well. After you update, you can repeat this, and it tells you what the model now expects, and this is how you generate predictions from statistical models. Once you've trained the model and built the posterior with real data, if you wanted to predict the next event, this is the procedure to do it. Make sense? It's always the same. Okay, so running this thing in reverse in the computer: the exact calculations can be done a lot of different ways, and I'm not going to step through these, but I wanted to give you an idea of some common ones, although there are many others. No matter how you do it, it's still Bayesian inference; it doesn't matter what the exact calculations are. Some of them are only approximations, but they're very useful approximations. Usually when you start an introductory Bayesian course you'd see an analytical approach. I don't do that, because I think it's almost useless to learn the analytical approach. It's good for conceptual understanding, but you can almost never use it in any reasonably complicated model; nobody, no mathematician on the planet, can do the integrals that would be required to do the updating. But the computer can do it numerically, and that's what we do instead. One way you can do it numerically is this technique called grid approximation, where you segment all your hypotheses up into a grid, and then for each of them you just do the counting. It's an approximation because there's actually an infinite number of hypotheses, but only a finite number of points on the grid. But if you make the grid really tight, it's good enough, at least for science. Yeah. And in the second chapter of my book I show you how to do the grid approximation for the water example, so you can see how it works. It's an easy set of calculations; it's just counting. But you can't do it for any reasonably complicated model, because the number of things you'd have to evaluate
and count explodes combinatorially, and you just can't do it. You've got to finish a dissertation at some point, so you can't keep computing. Then there's the quadratic approximation, sometimes called a Laplace approximation. This is an approximation which says that the posterior distribution will be approximately Gaussian in shape, normal in shape. It's an approximation, and it's often really accurate. Most of non-Bayesian statistics makes a very similar assumption: it asserts that the uncertainty around an estimator is Gaussian. It's an assumption, and it's often not true. So when you do a Gaussian quadratic approximation in a Bayesian context, you're getting estimates that are very similar to typical maximum likelihood results in non-Bayesian statistics; it's a very similar sort of procedure. Most of the Bayesian work that people do, though, uses Markov chain Monte Carlo, which makes no assumption about the shape of the posterior. It takes longer to run, but other people write these algorithms for you, and you just push buttons now and it goes. You do have to learn how to supervise these things, though, and the whole second half of my book is learning how to do that, how to be a responsible Markov chain Monte Carlo operator, because these things are jet engines and you don't want to stand in front of the wrong vent. They're perfectly safe, mainstream tools; people use them all the time, and there's nothing exotic about them. In biology, every grad student learns how to use Markov chains. It's just ordinary; there's nothing bizarre about them at all, just a way to do counting. Okay, one thing about predictive checks before I shift into the last half hour of this. When we evaluate the model, again, it's hard to give advice, because good advice will depend upon knowing your background problem, the problems you're worried about, the particular kinds of mistakes. All of that comes into how you check the predictions of your model and worry whether it
works. But you really want to do this forward simulation after you've updated the posterior, to see what the model now thinks about the world, and you might then realize it has ridiculous beliefs, and that leads you to revise the analysis. So I often call these predictive checks. This tradition comes from this handsome fellow here in uniform. This is Edwin Jaynes. He was an American naval officer, as you can probably guess, but also a physicist who made many important contributions to Bayesian statistics during his career, and he was very big on the idea that you have to respect models and check them: trust but verify. That's this predictive checking style. It's not like a significance test exactly, but it has the same spirit: you're checking whether the model is consistent with the evidence, with the data. That's what a predictive check is like: is the non-null model we've estimated from the data consistent with the data? This is something that's not often done in non-Bayesian statistics, but it's a standard part of the workflow in Bayesian statistics: given what the model has learned, does it make any sense at all? But there's no universal best way to do it; you have to use your judgment, and there's no way to justify any threshold like 5%. Of course there's no justification in non-Bayesian statistics either for the 5% threshold for statistical significance. Why 5%? Because there was a bony fish that crawled out of the ocean in the Devonian and it had 5 rays on its arms. That's the true story, and so 5 is a cognitive attractor, and this is why Fisher said about 5%. He was trying to stop agricultural scientists from destroying the food supply of England, and he picked 5, but it's just because the bony fish had 5 rays, and that's why mammals have 5 bones in their feet and hands, and that's all it is. If it had had 7, we'd have a base-14 counting system instead of base 10, and we'd have a 7% threshold, and we'd still call it 5. But anyway, let's not go off on my science fiction short story. Is this the same,
related to the alpha value? This is a deep conversation. It's related, but they're not always the same. The alpha value is Neyman-Pearson; Fisher is a whole different theory, which, I say, I don't want to open up that can right now. But when you talk about alpha and type 1 error rates and calibrating error rates, that's Neyman and Pearson, who were rivals of Fisher. Oh, they hated one another; it's a miracle they didn't have a duel with pistols. And Fisher's significance testing is different, but yeah, it's for the sake of not doing too much harm; it's about the same principle, yeah, the alpha value being 5%. But in Fisherian statistics, p-values are continuous; there isn't just a threshold, and the exact value of the p-value is informative, so we don't have an alpha value. Well, you have errors of different types, but no, we're not doing significance testing, because it's a bad idea. Because we can predict everything? Well, we may not be able to predict anything; it's because rejecting a null hypothesis is a useless ritual. That's why we're not doing it. We want to build a substantive scientific model, and that's what this procedure is about. We have substantive conjectures, and we're trying to estimate which of those causal processes produces the data. Significance testing doesn't tell you what caused the data; it tells you what didn't. But then we can go further with causation; we're going to talk about causation later. Significance testing won't tell you what causes what. It absolutely will not; it just measures the strength of an association. Yeah. So this criticism I'm giving, by the way, is not a Bayesian one. There are lots of frequentist statisticians who don't like significance testing. In fact, the American Statistical Association basically publishes a joint manifesto every year saying: please, everybody, stop doing this. But scientists don't listen. I know some of you have seen these statements, but it's like Groundhog Day: every year we have to do this over and over again, statisticians saying to
scientists: what you're doing with significance testing is illogical, stop doing it. It is not a Bayesian critique; it's absolutely not a Bayesian critique. Okay, but you asked about 5%: 5% is arbitrary, yeah, there's no good reason to adopt it. Okay, let's come back to Mars, everybody's favorite planet after Earth; if you like Earth more, Mars is still great. This is an important story, because there's no sampling variation. You've got this path, and you need a model that can predict it, but you can't use the frequentist device of saying that our uncertainty arises from variation across trials; that doesn't work at all here. The Bayesian formula works fine, because you've got some initial expectations and you update them with data. Your predictions get tighter and tighter if the model can be trained correctly, or you can completely reject models, because in the posterior check they can't possibly predict the path. That's how scientific inference works. Eventually, with Kepler, we figured out why you get this loop-de-loop in the sky: because we're orbiting the sun and so is Mars, and so there's this parallax effect with our relative motions which makes Mars look like it's going backwards. It's not actually going backwards; the planets are all going around in the same direction, but at different speeds. Before that, there was this clever fellow Claudius Ptolemy, actually lots of clever people like him, and he published one of the most famous compendia of astronomy, which had a bunch of very successful mathematical models to predict the positions of the planets in the sky. They're completely unrealistic, but they work incredibly well. So this is the Ptolemaic, or geocentric, model of the solar system. You're all familiar with this. This is a really cool thing: it works perfectly. There's no problem using this to predict where Mars will be; it makes perfect predictions, absolutely no problem at all. When Copernicus said, let's put the sun at the middle, his model made no better predictions than Ptolemy's. They were empirically indistinguishable; they
predicted exactly the same data. This is why people didn't think Copernicus was right: it's just an arbitrarily different model. Yeah, the sun's at the middle now, and you've just upset the pope, thanks. Anyway, at the beginning of chapter 4 of my book I recount some of that history and talk about how we can distinguish these models, and also at the beginning of chapter 7. The relevant thing today is: how did they do it? Orbits on orbits, things called epicycles. And Earth, in this great geocentric model, isn't even at the middle; it's offset. The other planets are orbiting some point between the Earth and an imaginary point in space called the equant, and it's just the music of the spheres. But it works: it makes perfect predictions. It was trained on the observations; they had the data, and they found a set of mathematical functions which could almost perfectly predict the position of Mars and the other planets. But it had this crazy system in it. This system turns out, in modern language, to be something called a Fourier series, written here on the left of the slide. A Fourier series is a way of approximating any periodic function by decomposing it into a series of circles; you can approximate any continuous repetitive path by embedding circles within circles. And this is related to the Fourier transform, a workhorse of applied mathematics; you can do lots of really cool things with it. But it's not a claim about causation. It's just a description, a mathematical technique, a really good one, for approximating to any arbitrary precision some periodic path. And it's amazing Ptolemy discovered this. He didn't know it was a Fourier series, but he discovered it, and they did the trigonometry. It's a real achievement. People make fun of geocentrism, but none of us are sophisticated enough to make this model. This is a real scientific achievement, a really amazing thing. And it also demonstrates that statistical models don't contain any information about causation in
them. They're just prediction engines. Machine learning is about prediction; it's not about cause. And here's the big distinction: if you want to use this model as a model of the solar system to do some work, to do some causal intervention in the solar system, what would that mean? This is a bad model. You're going to miss Mars; you're not going to get anywhere near it. So it matters when you do an intervention, but in the absence of an intervention, the predictive model without any causes in it is fine. Does that make sense? And so this is why inferring cause is an additional step. You don't get it from doing statistics. The field of statistics is not a field about causal inference; it's a field about describing associations among variables, and the cause is up to us as scientists. We bring the causal information in; it comes from the scientific background. We design experiments that can distinguish causes, but all of that is an interpretation laid on top of the model. The model will be perfectly happy either way, always. Okay, so all sorts of models, like regression, are like this: they're essentially geocentric. Regression is the geocentric model of statistics. It's just a big descriptive engine for making predictions, and in a regression none of the symbols means cause. If you want them to mean a causal effect of a variable, you need additional assumptions outside the regression model, and I'm going to show you some of those in the next slides. But this is a very important point that I think is often left out. Causal inference is a big topic in applied statistics, it absolutely is, but statistical models don't have it in them. Okay, I wanted to get Gauss in here. He invented linear regression to describe the motions of planets; that's actually what he did with it. And these are very powerful machines; I'm not trying to say they're bad. These linear golems can do lots of really sophisticated things. They're all based on the Gaussian distribution. Gauss didn't name it after himself, he just
said, here's a distribution of errors, and people later named it after him. So I wanted to give you a quick glimpse into why we use normal distributions so much as error distributions in nature, and where that assumption comes from. And I want to be clear that it's an ignorance assumption: it's not a claim about how things are going to turn out; it's a claim that however things turn out, this calibrates our uncertainty in the right way. Normal distributions arise spontaneously through aggregating processes in nature all the time. They're completely unremarkable, which is why we use them so much, but they represent states of ignorance; they're like a half-informed state of ignorance. I can't spend much time on this. This is Gauss's 1809 derivation of linear regression, by the way, and it's Bayesian. The word Bayes doesn't appear here, of course, because everybody was Bayesian at the time; it's just probability theory. Inverse probability is what they called it. So why normal? Let me give you an example. Imagine we all go out to a football pitch and line up on the midfield line, like this. Each of you has a coin. We all have coins in our pockets, because we live in Germany: you need a Euro at all times, it's part of your survival kit. So you flip it, and if it comes up heads, the proud eagle or whatever, then you take a step to the left, and if it comes up the other side, you take a step to the right. And we repeat this process multiple times, and our positions are going to drift. Some of us get the eagle, some of us don't, and then we drift a little bit more, and our distances from the midfield line scatter more and more. Now, we do this some number of times, we measure our positions, and the question is: what's the distribution of the distances from the midfield line? And I assert that after a few coin tosses they will be approximately normal. They have to be, because the process of your steps, the distance you move, is adding fluctuations that are generated
by the coin toss, and by adding fluctuations you end up with normal distributions, and that's it. There's a deep information-theoretic explanation for why that has to be so, which I'm not going to repeat here, but it's in my book, so I'll say that if you're curious about it. And this has been known for a really long time. So here's an animated version of this story, the football pitch. We start out at the midfield position, zero there on the top. After four coin tosses I summarize the distribution of distances of everyone on the field at the bottom, and you see it's looking kind of Gaussian already, but the tails aren't thick enough. After eight, now it's pretty Gaussian, and after sixteen it's almost exactly Gaussian. This happens spontaneously all the time, it just comes from dampening oscillations. When you add oscillations together in natural systems, they dampen one another, so that the most plausible value is in the middle, because the most likely thing that happens is the oscillations cancel one another in the middle again, and huge numbers of natural processes produce things like this. So here's a quick movie, this is maybe too bright to see. People make physical machines to simulate this. This is a big board with pegs, and they pour a bunch of marbles down it, and then they collect in the bottom in a Gaussian distribution, because they bounce to the left and the right, and then those deviations add, like the coin flips on the football pitch, and the Gaussian forms. So if we're ignorant about the exact path that the marble takes but we want to guess where it's going to be, we should guess it'll be in a Gaussian position, that is, more plausible in the middle and falling off in plausibility exactly in this particular form. Does it make some sense?
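The football-pitch story is easy to simulate (this is my own sketch, not code from the talk; the number of people and flips are made up for illustration):

```python
# Football-pitch random walk: each person sums n_flips independent +/-1 steps.
# All the numbers here are illustrative, not from the talk.
import random
import statistics

random.seed(1)

def final_positions(n_people=10_000, n_flips=16):
    """Distance of each person from the midfield line after n_flips coin tosses."""
    return [sum(random.choice((-1, 1)) for _ in range(n_flips))
            for _ in range(n_people)]

positions = final_positions()

# The sum of many small independent fluctuations is approximately
# Normal(0, sqrt(n_flips)) -- here sqrt(16) = 4.
print(statistics.mean(positions))   # close to 0
print(statistics.stdev(positions))  # close to 4
```

The histogram of `positions` is the bell curve from the animation: no single person's path is predictable, but the aggregate shape is.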
That's why we use the Gaussian. It's not a claim about where things will be exactly, it's a calibration of our ignorance about where things will be, but it's an educated guess. I'm going to rush through this because I want to get to causal inference. So linear models, they're geocentric, they're incredibly powerful and useful. When we teach Bayesian statistics we don't tend to teach all the little special cases of regression, we just call everything regression, or a linear model. So there's a huge range of specialized procedures and tests like t-tests, simple regression, multiple regression, ANOVA, ANCOVA, MANCOVA, are these familiar at all? They're all the same thing, they're all linear models, and so I just like to teach the linear model, because then you get the full power and you can mix and match and do what you need, given your scientific purpose. It's about learning the framework instead of the individual little tools, and it's this construction perspective. And I just wanted to show you, in the Bayesian perspective, what a linear regression means. It means that there are an infinite number of plausible lines that connect two variables together, and we want to rank them all by their relative plausibility, and that's what Bayesian regression does. There's an infinite number of possible lines before you see the data. So here in the upper left of this slide we introduced the first 10 data points. These data points are people, and we're estimating the slope that connects weight to height in an adult population, and, you know, it's positive.
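What "ranking all the plausible lines" means can be sketched with a grid approximation (my own toy code with invented simulated data, not the lecture's actual model): score a grid of candidate slopes by how plausible each is given the data, under a Gaussian error model.

```python
# Grid-approximation posterior for the slope of a line (toy sketch; the data,
# true slope, and noise level are all invented for illustration).
import math
import random

random.seed(2)

TRUE_SLOPE, SIGMA = 0.9, 5.0
weight = [random.uniform(-15, 15) for _ in range(50)]                 # centered weight (kg)
height = [TRUE_SLOPE * w + random.gauss(0, SIGMA) for w in weight]    # centered height (cm)

def log_likelihood(slope):
    # Gaussian error model: height ~ Normal(slope * weight, SIGMA)
    return sum(-0.5 * ((h - slope * w) / SIGMA) ** 2
               for w, h in zip(weight, height))

# Candidate lines: a grid of slopes with a flat prior over the grid.
grid = [i / 100 for i in range(201)]          # slopes 0.00 .. 2.00
log_post = [log_likelihood(s) for s in grid]
peak = max(log_post)
unnorm = [math.exp(lp - peak) for lp in log_post]
total = sum(unnorm)
post = [p / total for p in unnorm]            # normalized plausibility of each line

mode = grid[post.index(max(post))]
print(mode)   # near TRUE_SLOPE = 0.9
```

With only a few data points `post` is spread over many slopes (the scattered fan of lines); as data accumulate it concentrates, which is the contraction shown on the slides.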
So when we've seen 10 individuals, the lines I'm drawing here are from the posterior distribution. Now the posterior distribution contains lines, lines have a slope and an intercept, so you can describe where the line is, and they're very scattered, because the model is still not sure, because you've only given it 10 people, and it's like, yeah, this is the best I can do right now, all of these lines are plausible. We give it 10 more, it starts to contract, because now a lot of those lines are less plausible, and it starts to contract around the data. After 50 you can see the scatter, there's more uncertainty on the ends because of the way lines tilt. After 100 it's getting quite confident over here. After 350 the plausible lines are all tightly bunched up around a single best line, which is the line you get in a frequentist analysis, that best-fit line is in the middle. The uncertainty in the Bayesian analysis is this bow-tie shape of uncertainty, that's all the lines around it that are roughly equally plausible to one another. And so depending on the model, the posterior distribution can contain a really complicated functional shape, and what the posterior distribution is doing is ranking them all relative to one another, how plausible are they given the data, but it can be really complicated, it can be a bunch of curves, all kinds of things. Quick example, you can do curvilinear things. So this is a March temperature trend, a historical temperature trend for Japan on the top, from the year 200 to the year 2000, and on the bottom is the first day of the cherry blossoms. This is recorded every year because culturally this is very important in Japan, people have picnics and all kinds of stuff when the cherries blossom. And so we wanted to estimate the trend of the cherry blossom date so we could compare it to the temperature trend. There's a linear model in here, actually, that's used to construct this fluctuating trend, but it's highly uncertain, you can see this gray region I've drawn, this is the
posterior plausible center part of that trend at each year, but you'll notice it's wide, because the data are finite and they're highly variable, right. So this is still a posterior estimate, but it's a highly wiggly thing, that's the technical term in statistics, it has high wiggliness. Okay, I think I'm on schedule actually to do this. Okay, regression, right, it's geocentric, it doesn't have causation in it, and this is where we need to supervise it. As a golem, it's an oracle, but it doesn't have your interests in mind. You have to be very careful what questions you ask it. It will answer your question correctly, but the answer may be nearly useless and mislead you. So one of the things that we worry about in regression models, and this would include ANOVAs and everything else, when you start putting predictor variables into them, and treatments and factors and whatever it is, is confounding. What do you need to control for? And there's this tradition that evolves where people think, well, let's be safe, let's put everything in. And is that a good idea? So think about this. Read scientific papers, and people say we controlled for age and socioeconomic status and gender and a bunch of other stuff, and then you're supposed to say, oh well, those can't possibly confound it, it must be safe now, this is a causal effect. This is a bad idea. This is what I call causal salad. Causal salad is: I want to make a causal inference, so I've got all these variables, let me toss them together and do some statistics, and then I'll call the resulting parameter values causal. This does not work. You can prove logically and mathematically that this is bad, and I just want to give you some conceptual examples to take away. This is not a uniquely Bayesian thing at all, everything I'm going to say here applies equally to all perspectives on doing statistics, it's a problem for everybody. It governs everything else, it's the most important thing. The biggest problem of statistical practice is not whether
you're Bayesian or frequentist, it's that people are never taught how to legitimately claim causation, and they just run models and claim the parameters are causal. So adding variables can create confounds just as well as it can remove them. I'm going to show you a case where it removes them, and then show you a case where it creates them, and then I'll let you go home and be happy. Okay, an example I think you'll be able to intuit, and this is the way to introduce you to causal diagrams. What you're seeing here is something called a directed acyclic graph. Don't worry about what that means, it's just a terrible name for a causal diagram. This is causal because the arrows indicate causation. There are three variables here: age, height, and a math score for students. And say you're interested in the hypothesis of whether taller people are better at math. That's why the question mark is there, we don't know that, we want to estimate the causal force on this. So this is like structural equation modeling, which I know some of you have certainly seen. Structural equation modeling comes from the same origin here, from a biologist named Sewall Wright. Age influences height and math ability, we're pretty sure, and so as a result age is a confound. If you wanted to estimate the causal influence of height on math, age gets in the way of this, because it's going to create a correlation between height and math ability even in the absence of any causal relationship between them. I assert that taller people are not inherently better at math, but on average, in a population of children, height and math ability will be correlated, because as they grow and they study, it creates an association, but that association is not causal. Yeah? No one wants to argue that point with me?
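This age/height/math diagram can be simulated in a few lines (a hypothetical population with invented coefficients, in the spirit of the talk): age drives both variables, with no arrow between height and math, and yet the two correlate.

```python
# Hypothetical population: age -> height and age -> math score, with NO arrow
# between height and math. All coefficients are made up for illustration.
import math
import random
import statistics

random.seed(3)

n = 5_000
age    = [random.randint(7, 10) for _ in range(n)]
height = [100 + 6 * a + random.gauss(0, 5) for a in age]   # cm, grows with age
math_s = [10 * a + random.gauss(0, 8) for a in age]        # score, improves with age

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Marginally, height and math look strongly related (the confound at work)...
print(corr(height, math_s))   # large and positive

# ...but stratifying on age (conditioning) makes the association vanish.
eight = [i for i in range(n) if age[i] == 8]
print(corr([height[i] for i in eight], [math_s[i] for i in eight]))  # near 0
```

The second correlation is what the within-age scatter plots on the slides show: once age is held fixed, height tells you nothing about math.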
Okay good, sorry, I choose examples that are intuitive. We normally call this a confound. It's something that causes both of the other two things, which means if we just measure the association between height and math, it's confounded by age, and we have to somehow remove the age effect to estimate the causal effect of height on math. And this is where adding something to a regression model works, and I want to show you why it works and give you an idea, because then we'll also be able to explain why this doesn't always work. So what we'd say is that math is independent of height, conditioning on age. And conditioning means: for each age we throw away all the other kids, we just look at all the kids with the same age, and then we assess the association between height and math score. That's what controlling for age means. It means stratifying on the variable, stratifying the population by each unique value of age, or if they're not all the same age, you look at similar ages. That's what conditioning on means, and that's what you do when you add a variable to the right-hand side of a regression or an ANOVA. You're conditioning, you're stratifying the other estimates by that thing, it's a way of conditioning. So I simulated this population, and we've got, I think this is like 1,000 kids or something, I forget how many it is, 5,000 kids: math score, age, and height. The way to read these, this is called a pairs plot, the way to read this is that the bottom axis is what's below it. So this is age on the horizontal axis, and in the top middle plot the vertical axis is math score, so their math scores are improving as they age, yeah, because they're learning, yeah, education works. And this graph in the upper right is math score on the vertical, and height is the horizontal thing. So taller people are better at math. But does height cause better math scores? Of course not, because this is simulated data, and age is causing better math scores and age is causing height, but that generates a
correlation between the two, and that's the confound effect. And also, yeah, age and height are correlated, because people grow, well, kids grow, I'm done growing, I'm shrinking. But what we're going to do now is stratify by age. So we're going to take this plot up here and, at each age, look at the correlation between height and math ability. That's what conditioning means, and I can show you: we just look at four different ages here, age equals seven, eight, nine and ten, a scatter plot between height and math, and you see there's no pattern once we condition on age. That's why stats is helpful for making causal estimates, but you have to intuit what might be a confound and put it in the model. This is what leads to causal salad. You see this example and you think, oh, I should always do this, I should always put everything in the model, because then I can remove these confounds. No. Let me show you the opposite problem now, and I'll leave you with the happy story of the opposite problem. Let's think of a different kind of causal relationship. Think about a light, like one of the lamps in this room that's on, and it's caused by two things, at least approximately: the presence of electricity in the building, which I have labeled here power, which makes it possible for the lights to turn on, and there is a switch on the wall. Both of these things have to be on, as it were, for the light to turn on. That's why there's an arrow from each in this causal diagram pointing to light. Does it make sense why the diagram's like this? Power doesn't cause the switch, right, the switch can have any state, on or off, regardless of the power. If the power is out, the light can't be on, right, and it doesn't matter what the switch is. If the switch is off, the light is off. The switch doesn't cause the power, power doesn't cause the switch, there's no arrows between them, they don't cause one another at all. Make sense, right? This is not a circuit diagram of the room, right, that's not what it is, it's about what causes what
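The light diagram can be checked with a quick simulation (my own sketch; the 50/50 probabilities are arbitrary): power and switch are flipped independently, the light is their AND, and conditioning on the light manufactures an association between them.

```python
# Collider sketch: power and switch are independent coin flips,
# light = power AND switch. The probabilities are arbitrary choices.
import random

random.seed(4)

n = 100_000
power  = [random.random() < 0.5 for _ in range(n)]
switch = [random.random() < 0.5 for _ in range(n)]
light  = [p and s for p, s in zip(power, switch)]

def p_switch_given(power_state, light_state):
    """P(switch on) among buildings with the given power and light states."""
    rows = [s for p, s, l in zip(power, switch, light)
            if p == power_state and l == light_state]
    return sum(rows) / len(rows)

# Among buildings where the light is OFF, power suddenly predicts the switch:
print(p_switch_given(power_state=True,  light_state=False))  # exactly 0.0
print(p_switch_given(power_state=False, light_state=False))  # about 0.5
```

Marginally the two variables are independent, but within a slice of the collider they are not, which is exactly the trap of "controlling for" the light.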
and the power doesn't move the switch on the wall, and moving the switch on the wall doesn't make power flow into the building, it just makes it flow to the light, yeah? Does it make sense? Now think about if the power is out, right. If you wanted to know the state of the light, you need to know the value of both of these variables. That's what this causal diagram says: light is a function of the presence of electricity and the state of the switch on the wall that controls the light. That's what a causal diagram needs, and that's all it needs, perfect common sense. And so when the light is on, you know something about these variables, because you're a person, right. So if the light is on, you know both of those things are true: there's power and the switch is on, yeah. And this is the sense in which, causally speaking, the switch is independent of the power, they have nothing to do with one another. If the power is on, that doesn't constrain the lights to be on, you can get up and turn them off, yeah, I'm always telling my son this, you can turn the lights off. That would be fine, but the switch is still there, and you can still flip it with no power. Yes, of course you can, you can move it, it moves on the wall, try it, it moves, it pivots, yeah, in my house it just won't do anything. But then you take the light out? Also no, the light is still up there, we haven't disassembled the room, the lamp is on the ceiling, the switch is on the wall, right. The switch here means the state of the switch: is it up, on, or off. Whether that causes the light to turn on depends upon the presence of power in the city, yeah, that's all the diagram says, right. So in a population of buildings there's no causal relationship between the power and the switch, they're independent of one another, right. But, and here's the point I want to make, once you know the state of the light, they're not independent. They're causally independent still, but they're statistically associated, and this is
why it's bad to just throw things into a regression model. So let me walk through this. Let's think about why knowing the light gives you information about the other variables. It's because it's jointly caused by them. So if the light is on and the power is on, what's the switch? On. Got it, right? It's easy, this is science, causal inference, right? You have inferred this. The power doesn't cause the switch, though, but they're statistically associated in a population as a consequence of this. So once you condition on the light, knowing power tells you the position of the switch, and that's the statistical association. In a statistical model, conditioning on the light creates a statistical association between these two. It means that power gives you information about the switch once you learn whether the light is on, but it's purely statistical. If you don't know the causal diagram, you won't know that this is a confound, you won't know that it's not a causal relationship, and this is why just adding variables like light to a model can screw up your inference. Let me do one more example: if the light's off and the power is on, then you guess the switch is off, right. So, statistical association: even though there's no causal relationship between the power and the switch, you can predict the state of the switch if you know the other two, yeah. So light is not the cause of the switch, but if you add it to the model, it'll tell you that power predicts the switch, which misleads you. Now of course you would never do this in this case, because you understand electricity and the switch on the wall and all that, but if you just start doing causal salad with variables, you can easily be tricked. This is a special effect in statistics, this is known as collider bias. The light is called a collider, because two arrows enter, they collide on the variable. Let me show you another example that isn't totally silly. This is a real example from the
literature on happiness, which is a very fun literature. People study how to be happy, so there's a big literature on it, right, I want to know how to be happy, science can help me, this would be great. So this is a published example. Say you're interested in investigating the relationship between age and happiness. You want to know whether people get happier as they age. It's probably non-linear, this is probably some curvilinear relationship. It's plausible that marriage is a confound, because that affects people's happiness, and we're not going to say which direction, but it might have an effect. I should be careful what I say here, this isn't being recorded. Suppose the true causal relationship is that age doesn't influence happiness at all. This is a thought experiment. You're going to regress happiness on age, but you're asking, should I control for marriage, because the relationship might be different among married people and non-married people? Sounds reasonable. But suppose the true state of the world is that there's no relationship between age and happiness, people are born at a certain happiness and stay that way their whole life, or they drift around, but it has nothing to do with age. But age and happiness affect marriage. Why would that be true? I think this is very plausibly true: if you live longer, you have more opportunities to get married, so older people are more likely to be married than younger people. Age causes marriage, it does, I know it sounds weird, but from the variables' perspective it does, yeah. There's also divorce and remarriage and all these other dynamic processes that are fun, but leaving that out of the model for the moment: happiness also causes marriage, people don't want to marry sad people, so all other things being equal, happiness is a predictor of being married. In my book, I think it's chapter 6, I wrote a simulation of this, and I show you that if you do a regression where you're predicting happiness with both of these, you end up concluding that people get sadder with
age, and this is because it's a collider. So happiness is independent of age, that's an assumption of the model, but once you condition on marriage it's not, it's dependent on age conditional on being married, because there's this finding-out effect. So again, let's think about it like the light switch. If I know that someone's married, and I know that they're young, then they're probably happier than average, so I find out happiness by knowing age. And vice versa, if someone's married and they're really old, they're probably less happy on average. There are frowns in the audience, I'm sorry, but it's not causal, it's a statistical artifact of conditioning on a collider, which is a very bad thing to do. How do you know if something's a collider? You need theory. It's not in the model, it's outside the statistical model. Okay, our time is coming to an end, so let me try to summarize a bit. This is my: why not just add everything? So remember, causal salad is just my playful term for this very common procedure in the applied sciences, not in statistics proper, of just conditioning on everything that you've measured in hopes of avoiding being confounded. That works for benign sorts of confounds, like age creating a spurious correlation between height and math ability. It will not work for colliders. Colliders exist, this is a real threat, there are many famous examples of artifact relationships because people conditioned on colliders. It happens all the time, it's not even exotic. There are all sorts of other things that I haven't given you examples for too, although there are examples of these in my book, things like conditioning on post-treatment variables. If you run experiments, you're not safe from this. So a post-treatment variable is something that happens after the treatment is applied, and sometimes people put these as controls in their models. This is very dangerous, you can end up concluding that the treatment has no effect when it actually does. Why?
Because the treatment is mediated through a post-treatment variable, and so if you put in the post-treatment variable, it explains away the treatment. After you know the post-treatment variable, you don't learn anything extra from the treatment. It's a very risky thing to do to condition on post-treatment variables. Sometimes it's safe, though, it depends upon the details of the causal model. Likewise, there are pre-treatment variables, things that happen before the treatment. These can be colliders, and then you don't want to put them in the model either, but you might want to condition on them, because they might be confounds. Are we having fun yet? There are like two chapters in my book which are all about these problems. Question: are you calling colliders biases? Sometimes it's a bias, sometimes it's not. It's a bias when you condition on it, and if you're not aware it's a collider, then it creates a bias in the estimate of the causal effect. So, is this a Bayesian concept? This is the terminology in the field of causal inference, standard terminology in the field of causal inference. It's neither Bayesian nor not, although most causal inference stuff is Bayesian, but it doesn't have to be, this is independent of that. There is good news: you can make causal inference in observational systems. That's good, because that's mostly what I do. I'm an anthropologist, we go off to the field for years, learn foreign languages, get parasites, that's what we do professionally, I can tell you stories. Our systems are observational. When we do experiments, what we're really doing is just measuring things, they're just ways to do controlled measurements, but they're not really treatments being applied at random, and we still want to do causal inference. So we need strong theory to do it, and we use these causal diagrams to govern our inferences. And there are a lot of resources on how to do this, so I'm going to leave you with just a couple of suggestions. Obviously there's my own book, which
you have a copy of, there's a lot of causal diagrams in it, to introduce you to this in the context of doing estimation as well. There's also this great book, Causal Inference in Statistics: A Primer, by Pearl, Glymour, and Jewell, which is a very gentle introduction to causal inference. It's the pre-statistical considerations that we've done at the end of this presentation today. It's meant to be introductory, it just has very basic statistics, which I think all of you already know, linear models, t-tests, those sorts of things, and it shows you when they're confounded and when you can decide how to remove the confounding, or if it's even possible to do so. So let's just return to the headline here on this slide. Both of these things, Bayesian inference and causal inference, are really just counting the implications of assumptions. That's all we've got in scientific inference: this dance between making assumptions, seeing how well they predict reality, and then adjusting the assumptions. We need some middle ground to count up the implications of our assumptions so we can do comparisons against reality, and that middle ground is probability theory. Both of these frameworks, they're just applied probability theory. Thank you for your indulgence, and I hope some of this was useful.