All right, hi everyone, I'm Richard, and for the next 10 weeks — well, not the next 10 weeks exactly, but 10 weeks of instruction interrupted by the winter holidays — I will deliver 10 weeks on Bayesian statistics. Before we talk about statistics, I want to talk about science. I think, like me, most of you are here because you're scientists or researchers of another sort, and statistics is just some terrible thing you have to do to produce inferences. But you have no — let me just speak for myself — I have no innate affection for statistics as a subject. It's just something we have to do to make inferences, and I'm really passionate about making inferences about the natural world. What got me into science is ecology and trying to understand nature in all of its beautiful forms, and in particular why this nature and not some other nature — why these birds and these flowers, or why do people find birds and flowers so appealing to look at — questions like that, and to seek natural answers to these questions. I'm an anthropologist, and the building you're in is an institute devoted to the study of evolutionary anthropology, which is the application of evolutionary theory to understanding, basically, why people exist, why they live in the sorts of societies they do, why they arose when they did, and where we are going. This is what I study, and I think that regardless of what all of you study, what it will have in common with evolutionary anthropology is that the phenomena are incredibly complicated. They involve many different time scales and many different measurement problems, and they're scientifically interesting. There are a tremendous number of problems to solve at all different scales, and that's why it takes a large industry of people to work on it and make progress.

And so we're trying to explain things like, well, history: why is it that human societies are the scale they are — and great spectacular night-time shows from space. That's the border of India and Pakistan, by the way, visible from space. All of these phenomena, whether you study the structure of the brain or, like me, you're interested in human societies and the evolution of organisms, are complicated processes that are not really reducible to individual statistical tests.

And I think the great disservice that the society of science, the society of research more broadly, has done is that most of the procedures that come canned for us on our computers are things that were invented to study, well, this: agricultural trials in a particular part of England. Right, and this is a hard problem, but it's vastly simpler than the kinds of problems that most of us actually try to address. So things like t-tests and analysis of variance were made to compare cases where you have randomization and experimental control, in a system in which the effects are large — the use of a pesticide versus no pesticide — and the effect is big enough to be statistically discernible. You can control lots of measurement issues. We understand a lot of the relevant biology, and so on. I submit that most of us at the cutting edge of research have none of these luxuries. We work on systems that are very, very different from this, where the correlation structure among the replicate units, even when we have experimental control, is radically different than it is in the sort of systems that Ronald Fisher studied.
So most of you, I assume, will have taken an introductory stats class at some point, and you will have the scars, the psychological scars, of that experience, as do I. And in those courses you may have memorized some decision tree, something like this one. This is one from the first chapter of my book, but you could search the internet for "statistical decision tree" and find any number of flowcharts similar to this. And what I want to say about these charts is that this is madness. This is pure madness. It's a great way to test undergraduates on something you can teach them, but it's no way to prepare people to do research. There are no scientific principles in this. It's just stuff about the data table you have. It's a real disservice to think that these things are sufficient to understand, for example, the border between Pakistan and India.

These things are little robots. So to stick with that metaphor for a second, you can think of the Mann-Whitney U test down here at the bottom — something I know the biologists at least will have heard of — think of it as a little robot. It takes particular inputs, it does something with them, and it gives you an output. It's like a Roomba, if you know Roombas. Right, little robots that move around and sweep your floor. Yeah, they're great actually. You can reprogram them to do all kinds of things, like chase the cat and so on. All of these tests are like little robots, and I want to encourage you to think of them that way, because then you see them less as logical or rational entities that are revealing something true about your data, and more as little tools, like a Roomba. And if you use one in the wrong way — like to chase your cat, which is bad, by the way — then it'll misbehave and produce bad outputs.

And it's important to be clear that robots are not particularly clever. They do very clever things, given expected inputs and outputs, in narrow ranges of circumstances. So it is possible today, of course, to make robots like this one, which can literally read the New York Times. But there are lots of other things, which you find trivial, which it cannot do. It's very recent that we managed to build a robot that can walk on two legs. There's this thing about computers — and robots, as a special kind of computer — that they're terrible at things that children are good at, and they're really good at things that we're terrible at, like playing Go. Computers are really good at playing the game of Go; humans are bad at it, so we look at them in awe. Meanwhile, almost no robot can climb stairs. This turns out to be the cutting edge of artificial intelligence research: getting a robot to be able to climb stairs. And that's OK in some sense — there's this complementarity between us and the robots. We want to engineer them to do things for us that we're bad at. But with that comes a responsibility: their behavior is very fragile, in the sense that it's tightly tuned to particular contexts. We have to use them with a great deal of responsibility and circumspection. We cannot trust that they are rational entities that will police their own boundaries of behavior.

So instead of robots, let's think about another mythology here: a Golem. The Golem is, I think, the first mythological robot. It comes from Jewish legend: clay robots — you fashion clay into a humanoid shape.
And then you inscribe some words on its forehead, do some kabbalistic magic, and it comes to life. And there are a number of Golem legends, the most famous of which is the Golem of Prague. In the book, I have a longer section about this, and I'll give you some background to it. I think it's a great legend, and those of you who've been to Prague — we're not very far from Prague here in Leipzig, right? You can take a bus and get there today, if you want. There'll be lots of tourist stuff you can buy about the little Golems; it's all over the city. I have a t-shirt with a Golem on it. And let me skip to the relevant bit, which is that this legendary rabbi actually did exist — Rabbi Judah Loew ben Bezalel, who in legend built the Golem. He actually existed; he was a real person, from birth records and such. He may have built the Golem — I don't know, I'll suspend disbelief for a second. But in the legend, he built it for a very good cause: to defend the Jews of Prague from persecution, blood libel accusations, and other such things. But then, because the Golem takes instructions extremely literally, it does a lot of damage, accidentally kills people, wreaks havoc, and does harm, and then it has to be decommissioned. And the rabbi ends up, well, destroying it. And this is a moral lesson, in the Hebrew tradition, about taking God's power of creation into your own hands. Even if it's for good purposes, it's just too dangerous. Mortals cannot have this power of creation.

In this class, we're going to use the power of creation. We're going to make statistical models. But I want you to remember the Golem, because you're making little Golems. And their havoc will be isolated to your laptop, for the most part, I think. But nevertheless, you will feel their destructive power. And what I hope is that, by the end of the course, you have enough skill and wisdom that you can use them responsibly and usefully, or restrain their power and be skeptical enough of them to use them well.

So let me try to summarize this analogy. The Golem: made of clay, animated by truth — that is, the rabbi had to write the Hebrew word for truth, emet, on its forehead to bring it to life. They're very, very powerful; they can do things that people cannot. But they're blind to their creator's intent, and they take instructions incredibly literally. So you have to be very careful what you ask for with these things. So it's easy to misuse. And of course, it's fictional — I don't think the Golem ever existed. Luckily. Well, we are going to work with our statistical models, which have an analogous structure to the Golem legend, and to robots in general. They're made of — well, they're virtual entities, instantiated in silicon. They're made of bits of information, of electrons. They're animated by truth in the sense that most researchers, myself included, are interested in some concept of truth. We're trying to discover the true state of nature, because then we can do things to make nature better, right? They're hopefully powerful when they're right. Again, models are blind to our intent. Your analysis of variance does not know what you want. It's just an algorithm. It's just a Roomba. It's just a Golem. And it can wreck Prague. So you have to be very, very careful about how you use it. And of course, all models — I say at the bottom here — are not even false. There's this famous saying in statistics, that all models are false but some are useful. That's George Box; you may have heard it before.
And I think it's a great aphorism. I think the mistake there — what causes unnecessary debate — is that it doesn't make sense to talk about models as being false or true, right? They're just engines. They're just Golems, just robots. A robot can't be false or true; it's a robot, it's obviously there. The question is, how does it behave, and in what context is that behavior useful to us? Right? Does that make sense? Statistical models are not true or false things. They're not rational entities. Look, their behavior does have a logic and does follow a logic, and that lets us design them and analyze why they behave the way they do. But they're just tools. So they can be neither false nor true, any more than a hammer can be false or true. There may be better tools for making a table than the hammer you have at hand, but the hammer isn't false. It's just worse than the ideal hammer you'd like to have.

So, let me talk a little bit about what you should expect from this course. I have prepared 10 weeks of material. That's 20 lectures, two hours a week. There's also a book that goes with it and some software. So this is week one. We're going to be doing the logical foundations of Bayesian inference. This covers the material in the first three chapters of the book. I'll give you some links in a moment; you can grab the PDF of the book. And then in the weeks to come, we're going to march all the way from basic regression. I'm going to assume you know nothing about regression. Now I know that's not true — most of you in here know a lot more about regression than you think you do. Absolutely the case, right? If you know t-tests, that's a regression. A t-test is a regression. And we're going to march all the way from there through multivariate models; we'll do some in week three and week four. We're going to do some causal inference. And we're going to do overfitting and cross-validation in week four — information criteria like AIC and WAIC. I postpone Markov chain Monte Carlo until the middle of the course, so that you don't have to fight with that while you get the principles out of the way. And also because, look, Bayes isn't about how you get the posterior distribution. It's just about getting the posterior distribution, and Markov chains are just one way to do that. So if that doesn't make sense to you yet, that's fine — you're in the right course. The material is structured to be as kind as possible, and I think the kindest thing to do is wait until week six for Markov chains. But then once you get them, you will feel very powerful indeed, and you will begin to wreck Prague. And we will be here to clean up the rubble and make it work, as we move into more powerful models: generalized linear models and then multilevel models. And really, this is a course about giving you a practical set of tools to make multilevel models and use them responsibly.

So there's a GitHub repository where I have this calendar, and I'll be posting links to all the slides and the recorded lectures. So here are the goals I set for myself, the goals you should hold me responsible to. The goal is to deliver practical model-building skills and model-criticizing skills. I assume most of you don't want to be statisticians, literally, right? You have some scientific preoccupation already, and that's what you want to pursue. And this is a place to come and learn some particular skills and use them. I'm totally on board with that, and that's what this course is about.
It's not about making statisticians; it's about making statistically competent researchers. There's enough philosophy in here to ground you so that you don't wreck Prague too routinely, or so that when you do, you can figure out what happened and clean up the rubble. And I want you to develop enough confidence through doing exercises. This class involves a lot of code exercises — not so much mathematics, lots of coding, because that's what researchers need to do. None of the problems that are relevant to any of you, or to me, are solvable analytically, with analytical mathematics. We have to use code to do it. So those are the skills we emphasize. And the exercises are structured so that, I hope, you get enough confidence to be comfortable being confused for the rest of your life. Because that's what it's like when you're working on hard problems nobody knows the answer to: you feel constantly confused. I can testify to this myself — this is what I do for a living, feel confused. But you need to have enough confidence that you can work through that confusion and produce little bits of wisdom that you can share with your colleagues. And that's what we do for a living. Again, whether you're a scientist like me working for the public or whether you're a private researcher, it's the same game, right? There's a lot of confusion. If you're studying a problem no one knows the answer to, you should expect to be confused and frustrated on most days of the week. It's perfectly normal, and I want you to get used to that in this class. I will give you homework assignments that make you feel a little bit confused when you start, but then they are solvable. And I know they're solvable because hundreds and hundreds of students have now solved them, right? Well, there'll be some new ones, and we'll see how those go. But that's always my goal.

So, last practical slide before we get back into things. You should spend this week installing the software and starting to read the chapters to catch up. The first week is not going to be highly computational, but there'll be a problem set assigned on Friday that you'll want to work with. So the first thing you need to do is install the Stan R package — the link is up there. And then my R package — these slides will be online, by the way, so you don't have to copy all this down, Natalie. And you'll be working with the experimental branch, which I know sounds very exciting, right? Because there are new features in there that I'm building into the second edition of the book, and that I think are very good. They're cool features, and I want to try them out on all of you and see how you use them. And also with this, there's the second edition of the book, which is up on my website. You need a password. The password is my favorite 1980s TV character, Blossom. One person knows this joke. Okay.

So what's going on with the second edition? Some of you are taking this course for the second time — I can see you. Welcome back. Or the fifth time? Yeah, something like that. So things keep evolving here. And I think the problem with publishing a book is that it kind of freezes an evolutionary process at a particular point in time. But every time I teach the course, I figure out how to do something better, I think, and the second edition is helping me do that. And so you get the second edition for free as a PDF, and what do you get for this? Well, you're going to get a large number of typos, but in addition to that, you're going to get some really useful stuff, like prior predictive simulation.
And you won't know what that means right now, but it will help you to understand priors way better than previous versions of the book would have allowed. We're going to do a lot more causal inference, with these magical things called DAGs. You don't need to know what those are right now, but we'll do that, and fun things that arise from them, like colliders and instrumental variables. And then there's some software bookkeeping, which I think adds some really nice features; we'll get to that when we bring those tools up. And then a number of new examples, which I think let us see some new, interesting statistical phenomena.

Okay, let me get back into the stream of building up the foundations of Bayesian inference. For the rest of lecture today — we've got another 40 minutes here today and then another hour on Friday this week — I want to go from you knowing nothing about Bayes (I know you know more than nothing, but let's assume you knew nothing) to a basic philosophical understanding of the nuts and bolts of what it's doing, in the least glamorous terms that I can possibly deliver it. That's always my goal: to take the glamour and magic out of everything. So come along with me.

So this is where, before I got into course mechanics, I was talking about how these charts for picking statistical tests are a form of madness, in the sense that they don't really help you a lot, because each of these little robots has very specific operating criteria, and it's difficult to deploy them in real research contexts. You need more freedom than that. It's a big problem, in the sense that what you need to be able to do is build your own robots. And sometimes the modifications you want to make to these robots are small, but you still need to know how to do that. So you need engineering skills. Luckily, engineering statistical models is a lot easier than engineering actual robots. So these things that we typically call statistical tests are little Golems, little specialized procedures — in some stats software they're actually called procedures. Most of them were developed in the early 20th century, thinking only about randomized experiments with large effects, where measurement problems had been solved. And they're discussed as if they're somehow rational entities that use principles of truth to discover true things about data. But that's not the case; they're just little robots. Sometimes they work and sometimes they don't. And they just produce inferences; they don't really produce decisions.

So the thing about them that I want to spend a few slides trying to talk you out of is the idea that the purpose of a statistical procedure is to falsify a hypothesis. I think it's fine to say that the purpose of a research paradigm is to falsify a hypothesis, that's okay. But the purpose of a statistical test is definitely not. Or at least, if that's the purpose you put it to, you're going to be disappointed. So here's why. Let me give you an example from my own area of research, or rather my own broader field, but I imagine you can think of your own. The basic problem is that statistical models are not hypotheses, and they do not embody them. It's much more complicated than that. So, last century there was a big fight in population genetics over whether evolution was neutral. And what that would mean is — of course everyone believed that natural selection mattered, but neutral evolution would mean that most of the molecules in your DNA had nothing to do with selection.
And most of the molecular variation in DNA had nothing to do with selection. This is the so-called neutral theory. So you can imagine, on the left we've got some vague verbal hypothesis — which, let's face it, is what hypotheses are like, right? It starts out over a few beers in the pub and you come up with an idea like, eh, what if feeling war makes you happy? Right, it's like that. And then a whole research paradigm comes from that. So here it's "evolution is neutral", a squishy blob on the left. This leads to some mathematical instantiation, a particular mathematical model — a dynamical systems model, in this case, about how nucleotide frequencies change in a population over time: the so-called neutral equilibrium model. Equilibrium meaning there's a constant population size, and there are mutations which bring in different alleles over time. To test this process model with real data, you need another model, a statistical version. And this is much more specialized than the process model, because the process model makes a lot of predictions depending upon how you summarize it. You can think about different time scales in the process model, you can take cross-sections through it, you can view frequency histograms — all kinds of different decisions you make about the data that spit out of an actual simulation. And that would be the statistical model on the far right. And so people did early tests of neutral equilibrium, and it turns out to be very hard to reject the hypothesis that evolution is neutral. The data are compatible, when they're looked at in a particular way. Some of you know this story, and you know I'm skipping lots of fun details, like people getting kicked out of international conferences and other things like that. But this was all last century, and people have forgotten a lot of it.

There's another class of fun hypotheses, called "selection matters". What's fun about this is that when you say "selection matters", you immediately think: but of course, there are lots of different ways that selection could matter. It could matter in this way or that way, it could affect chromosomes — all kinds of stuff it could do. And so you have to make multiple process models. Well, it turned out early on, a fellow named John Gillespie showed that there's a particular version of "selection matters", the fluctuating selection model at the very bottom, which produces exactly the same statistical predictions as the neutral model. This is not an unusual situation in the sciences, for anything more complicated than a trial of one barley variety versus another: multiple process models make exactly the same statistical predictions. So not all hope is lost — you have to look at the data a different way — but you don't realize you're going to get into trouble unless you have more than one model. You can't falsify either of these things if you're naive and only using one model. It's much more complicated than that. And I think this is true of almost any mature research context: there'll be multiple non-null models on the table, and our job is to tell them apart, not to falsify some particular standing model. To make this even more fun, of course, there's more than one neutral model. There's no unique null model in population genetics. So trying to falsify some model as if it were the standing null makes no sense at all. For example, what if you had a non-equilibrium neutral model? Well, then you get yet another, right?
What if population sizes aren't constant through time? That affects the frequencies of mutations that you see. So it's just a big mess. In the book, I have a box also about Neanderthal inbreeding — I know the people in this building are interested in that and know a lot about it — and it follows the same problem. There's no null model to be falsified there, right?

Why do neutral equilibrium and fluctuating selection make the same predictions? That was the question from the audience. That's a very long answer, and if you read the book there's a citation and you can go read the paper. I'm sorry, that's a really hasty reply, but I would need two hours to work up to the sampling distribution and all of that. It's a bit of a mess. The short version of the answer — something we'll talk about later in the course — is maximum entropy. The reason is maximum entropy. There are a bunch of processes which aggregate to the same statistical distributions because of entropy concentration. That just sounds like a lot of science fiction talk right now — you've suddenly entered a Michael Crichton novel — but trust me, we'll get to information theory and then I can give you a better answer, okay?

All right, so Popper. For most scientists, I think, Karl Popper is the only philosopher of science you've ever heard of, right? And he's kind of famous. And Karl Popper was a very good force for science, I believe, and he deserves to be famous. But most of his work is not very famous. What he's known for is just one particular idea: the idea of demarcating what is science and what isn't science by whether it's falsifiable or not. But he didn't think that's how science works, right? It's clear that the falsification criterion is about demarcation, what's in and what's out. But confirmation — he wrote about it quite a lot. There are lots of things about evidence that require having more than one model and seeing which is consistent with the predictions you see. It's not enough to just falsify things. You have to build a substantive theory at some point, not just knock down null models. So, that aside, in Popper's view of falsification, you're falsifying the explanatory model, not some null model of zero effect. In statistical procedures, like in that diagram I showed you, this has all been reversed during the 20th century, and now what scientists try to falsify with statistical tests is not their research hypothesis, but some non-hypothesis that they don't like, that nothing's going on. What we should be doing is making predictions about what will be happening and trying to falsify those. That's the Popperian view. So if you like Popper, like I do — Papa Karl, that's what we call him — if you like Papa Karl, then you follow Papa Karl's program and build a substantive research hypothesis with point predictions about what should be happening, and try to falsify those. Don't falsify the silly idea that nothing's happening, because something's happening. This is nature, right? Things are correlated everywhere in nature. Correlation is not rare; you should expect it. The question is to predict its structure. Okay, all of chapter one of the book is me endlessly waving my hands about this stuff. I hope you will read it and find it rewarding. There are lots of citations in there as well to back some of this up. Let's get a little bit more practical and give you some expectations about what we're going to learn.
So as I said, I want you to become Golem engineers — entry-level Golem engineers. We need some framework and some set of principles with which we can build our own statistical models. We're going to go into this not with the idea that we're selecting from some toolbox of pre-made robots; we're going to build our own. And you're going to learn the principles by which they're constructed, and the principles by which to criticize them and refine them. There are lots of ways you can do this, actually — lots of different philosophies and principles. I've put together one that's built on these three legs: Bayesian data analysis, multilevel modeling, and model comparison. So I just want to spend a few minutes telling you in summary what each of these is, and then you can patiently await the unfolding of bits of them throughout the next 19 lectures.

What is Bayesian data analysis? Bayesian data analysis is the old-fashioned way of doing statistics. It's the original statistics, before the imperialism of the English — the frequentist English, Ronald Fisher and those. This was the continental way of doing statistics, the original probability theory, but it was underdeveloped for centuries, largely because we didn't have good computers. And it stands upon a very basic conjecture: the idea of using probability as a language to describe uncertainty. It's an extension of ordinary binary logic — true and false statements — to continuous plausibilities. That's the opening gambit. It's computationally difficult outside of simple toy problems, and so really we waited until the 1980s, when we could put Markov chains on the desktop, and this has led to a revolution in the use of Bayesian methods. Since then, there was this project called the BUGS project, which some of you may have heard of. Bayesian inference Using Gibbs Sampling is what BUGS stands for. BUGS is now kind of deceased — it's old, rickety software, and it's been replaced by newer things — but it was a heroic project, and it led to a revolution in stats departments, in applied stats programs, and in private research, of using Bayesian methods to study complex problems. It used to be really controversial, and in this course I'm going to ignore that controversy except for this: in stats departments, this controversy is basically over. Stats departments the world over are essentially Bayesian now. Most of their research is Bayesian. Stats taught to undergrads is mainly non-Bayesian, and this is a weird friction; in time this might change. And there are still scientific fields in which it's controversial to be Bayesian, but I tell you, in the stats community it's not at all. It's a very weird friction between things. I'm not going to talk about this controversy, because I think it's history and we don't need to worry about it, and I'd rather teach you how to do stuff than talk about the history of what happened last century. This is last century's problem, not your problem. You can make your own new problems. I give you some citations to the history too. Here are just three of the most famous developers of Bayesian inference: of course there's Laplace, who did the most of any single person to develop the Bayesian approach, and then Harold Jeffreys and Bertha Swirles, a pair of English physicists — geophysicists.
Actually, Harold Jeffreys was a geophysicist and Bertha Swirles was a quantum theorist, and they were early proponents of the resurrection of the Laplacian approach to inference, at a time when Ronald Fisher was telling them that they were all wrong about it.

Okay, let's talk about what it really is. I've got 30 more minutes, so I think I can do a lot, and you can leave here feeling like you learned something. That's the goal. So if I were going to boil Bayesian data analysis down into one statement, this would be it. And this leaves out a lot of stuff that you will learn in coming lectures, but I think this is the core thing you want to get. If you want to know what Bayesian data analysis is — not how it's different from other things, but what it essentially is — it is just this: you count up all the ways the data can happen according to your assumptions (or your robot does), and the assumptions with more ways that are consistent with the data — the things you've actually seen — are more plausible. If this doesn't make sense right now, that's okay; I'm going to have a really extended example. You make some assumptions about what the world could be like, what causal process is going on. Then you see some observations, which are presumably consequences of that process. And you then say: okay, I've got alternative sets of assumptions, and under one of these sets of assumptions it is much more plausible to see this data than under the others. And that's all Bayesian data analysis does, but it does it in a very specific counting way. You have to actually count stuff up — or rather, your Golem does. Your computer will do the counting, but you will program it to do so. And we're going to do that today as we go through.

Before we do that, let me tell you what multilevel models and model comparison are, and then we'll do some work. Multilevel models — we'll wait until the second half of the course for these — are models within models. So if you like models, good news: we put models in your models. And what's nice about this is it lets you deal in a very transparent way with lots of really routine scientific problems, like measurement error, missing data, and heterogeneity of effects. All of those come from embedding models within models. So, as a set of examples: you want to deal with repeat sampling or imbalanced sampling, variation across studies or between subjects within studies. You want to avoid pre-averaging and instead let the model do the averaging, right? Say you take multiple measurements of the stride length of a person: you could pre-average those, or you could just put all the individual measurements into your model. You should do the latter, right? Let the model do it, because the sample size matters — how many measurements you take matters. And then there are lots of sorts of problems that we deal with in the biological and social sciences — phylogenetic inference, factor and path analysis, network and spatial models — all of these are varieties of multilevel models. They share a real blood relation with one another. And there's a natural Bayesian strategy for building them.

Model comparison means having more than one model. This is essential for a number of reasons, but there are two that I want to emphasize today, and then we'll fill this in in later lectures. The first is, of course, that we've got to compare multiple models so we know what's going on, right? Non-null models. We can't falsify a null.
We have to falsify an explanatory hypothesis, and we're going to have different explanatory hypotheses, all of which are credible. And the first problem you encounter when you do this is a phenomenon called overfitting, which is the most important phenomenon in statistics. Overfitting is the fact that most ways of training a model on data lead the model to really love your sample but not love the world, right? It gets really, really good at describing the sample you feed it, but it won't necessarily be good at making predictions. That is the phenomenon of overfitting. And so we have to work in a way that acknowledges that overfitting is always happening, and try to guard against it. And I think this is a place, actually, where people who do research in private industry are much more serious about overfitting, in my experience, than scientists are. Scientists overfit for a living — in my experience, speaking as a scientist. It's like professional overfitting. Later in the course maybe I'll show you a Nature paper which fit a four-parameter model to a four-data-point data set and got a perfect R-squared. And that'll get you a Nature paper, right? In industry, that'll get you fired, right? So it's a different set of criteria. I tell anecdotes like this because scientists are often down on private research, but there are lots of really good statisticians in private industry doing really good stuff; we just have different cultures, right? The other thing is causal inference. You need more than one model because you're trying to figure out some network of causes and effects — think mediation analysis. The psychologists here know what I'm talking about, because you all learned something about path analysis, right? You need more than one model to figure out the network, right? Or, given a network, to test whether it's true or not. You need more than one model. So we're going to work towards tools to deal with those things.

Okay, so what I want you to do when you think about Bayesian inference is keep in mind that inside of the model there is this perfectly logical world in which all of its assumptions are met, and we count up the possibilities of things. This is what we're building up to as our definition of Bayesian inference. But that's just what I like to call the small world. So here's an analogy to help you remember this. This is the opening story of chapter two of the book. Christopher Columbus, a fairly famous and fateful person in European history, rediscovered the Americas in 1492. Famous story. He was very proud of himself. And I think the most interesting thing about this story is that he charted his course using a globe — this globe — which makes the earth much smaller than it actually is. This is ironic, of course, because the ancient Greeks knew how big the earth was, right? They used shadows and wells to figure out the circumference of the earth very accurately. But Colombo used a globe by the Austrian geometer Behaim — and we have someone in here with the same last name, probably a descendant, I don't know — who, for his own reasons that are lost in time, decided to make the earth smaller. And so the Atlantic and Pacific oceans ended up being joined on this map into one fairly small ocean. And Colombo looks at this ocean, and he's like: I can do that. And so on this map — this is Behaim's globe, which, I think — is this in a Berlin museum? I think it's in Berlin. They have this in a museum in Berlin, the actual globe.
You can imagine sailing from Spain, and you go past — you know, you're going over here and you're trying to get to the West Indies — and there's Japan; Cipangu, that big island that's kind of in the middle, is Japan. The Americas aren't there, right? It's just a big Atlantic ocean. You can store enough potable water and food to make this journey — if you're Christopher Columbus, then you figure you'll be great. Now, ironically, of course, there are some continents in between on the real planet, and this is where he ends up. He ends up in what he called the East Indies, but which is actually the Caribbean. And this is lucky for him, because otherwise he would have starved in the ocean. Right, he never would have made it. The Pacific ocean is really big, right? Really — it's like half the earth. And so he had this globe, which was his small world, right, to use the analogy. And he was mistaken about how big it was. And the real world was the big world, the large world, and it interfered with his predictions. He ended up in a very different place than he expected to. He was lucky he lived, right? I mean, the expected result is that he just would have died at sea, like most sailors did.

So, I want you to consider this bizarre analogy and keep it in your mind: when you make a model, you're like Christopher Columbus planning with this Austrian globe, right? And you're betting your life on it. Well, you won't be betting your life on it — but you're betting some prediction on it. And so your responsibility is to remember, after you get an answer out of your procedure, to think that you might be wrong about this. There might be a couple of continents there, and the earth might be bigger than your globe says. And we're going to learn ways to study the output of our models so that we can discover these effects and reconsider them.

And this is what is called, in Bayesian inference, the small world, large world distinction. It comes from an influential Bayesian statistician from the middle of last century, Leonard Jimmy Savage — L.J. Savage — who had a book in 1954 which was very influential in the development of Bayesian inference in the second half of the 20th century, and in which he makes a distinction between the small world of the model, where Bayesian inference is arguably optimal, and the large world. There's no other mathematical procedure, given the truth of a set of assumptions, which can use information more optimally than a Bayesian procedure; that's provably true in a small-world context. However, we don't live in that world. We live in the real world, and in the real world there are no optimal procedures. Using information optimally, conditional on a set of assumptions, is no guarantee that it will guide our behavior optimally. Jimmy Savage makes this distinction very clear in his book. So I want you to keep this in mind: our responsibility as researchers is to use both of these worlds, to bounce back and forth between them and remember the distinction.

So let's get back to this statement. What is Bayesian data analysis, again? In the small world, what is going on is we count all the ways the data can happen according to some set of assumptions, and those sets of assumptions with more ways that are consistent with the data are more plausible. Let me draw a cartoon for you to help you understand what this means. I'm sure some of you will know a certain Borges short story, The Garden of Forking Paths.
Yeah, if you don't know it, read it. It's very good — one of the best short stories ever written in the European canon, I think. And we're going to use it as a way to think about probability theory as well. In probability theory, we have branching paths, alternative events that could happen, and we need to count those up to see how plausible the thing that actually did happen is. So let me take you through this in a set of slides where we build it up in pictures, to think it through.

So consider a garden of forking data now. I'm going to tell you a parable here, in which I have created a mystery bag, and I tell you it has four marbles in it, and marbles come in only two colors, blue and white. Why? Because I'm making up the story here. This is for the sake of understanding, and that's why it's simple. So there are two kinds of marbles, blue and white, that's it. There's a bag, and it has four marbles in it. For sure — I guarantee you, I'm not tricking you. There's not five, there's not six, there's four. Your job is to tell me, based upon a few draws from this bag, what are the contents of the bag? How many blue marbles and how many white marbles are in the bag? And we're going to use Bayesian inference to do this. Since I told you there are four marbles, and there are only two kinds of marbles, the first thing you can do is list all the possibilities. There are five different possibilities for what's in this bag: either, number one, they're all white; two, there's one blue; three, there are two blue; four, there are three blue; and finally, five, they're all blue. Does that make sense? So we've enumerated all the possibilities. This is always step one in a Bayesian analysis. You're asked a question, and then: what are all the possibilities? Enumerate them. Or rather, your computer will enumerate them for you, in later examples. Here we're not using the computer yet; we can do all this ourselves.

Now, I draw from the bag, with replacement, three marbles. What does this mean? I reach into the bag, I pull out a marble, I hold it up for all to see: blue. Right, an official court-appointed observer records "blue marble", I put it back in, I shake the bag, I draw another marble: white. I put it back in, and I draw a third: blue. Put it back. So we get a sequence — blue, white, blue — and then I ask you, what's in the bag? How many blue marbles and how many white marbles? So let me show you how a Bayesian robot would do this. You might have a better way to do it, you may. But let me show you how a Bayesian robot would do this. We're going to take each conjecture, one at a time — each of the possible contents of the bag — and we're going to count up all the ways we could see the actual data if that conjecture were the truth. So let's begin with the conjecture blue, white, white, white. Yeah, three white marbles, one blue marble. We're going to make a garden of forking data at the bottom here. To begin, there are four paths. When we draw that first marble, you could get the blue marble or you could get one of the white marbles. Now, the white marbles look alike to you, but they're actually different marbles, right? The counting cares about the difference. So this means that you're three times more likely to pull out a white marble, because there are three of them. Even though they all look the same to you and you can't tell them apart, it's still three different ways, not one way. Does that make sense? There are three ways to do it. So there are four paths to begin.
There are four things that could happen on the first draw. Then on the second draw, depending upon what happened, there are again four things that could happen, and our garden branches out into all these different timelines of things that could have happened. So on the left, if we drew a blue marble, again we've got all the same paths branching out from that. Why? Because I put the marble back in and I shook the bag. Yeah. And the same for the others. So we get four paths out of each of the four, and there are now 16 different data sets that could have arisen from this process. And then we do it a third time, because there were three draws from the bag, and now we've got four times four times four different paths radiating out — all these different data sets that could have arisen. Yeah. And now what we're going to do is prune this garden down and only look at the paths that are consistent with the data. And how many is that? There are exactly three paths that are consistent with the data, where you've got blue, white, blue. So there are three ways. If the bag contained blue, white, white, white, there are only three ways that we could have seen these data. And now you're thinking: so what? What is three? Is that big? Is that small? You have to compare it to another model. We have to compare it to another conjecture.

So let's do another conjecture. We're going to make a table here where we count up all the ways to produce the observed data. We've done possibility number two: one blue marble and three white marbles has three ways to produce these data. What about the others? Well, I hope it's obvious to you that number one and number five are impossible. Yeah? But you could use the garden to figure this out; it'll be zero ways consistent. Does that make sense? We're not usually this lucky, that we can eliminate things that easily. But now we've got two still standing, and we've got to figure those out. So bear with me; we're going to do the same thing, but I'll do it much faster this time. What I've done up here is I've shown you, in this arc, the thing you've already seen. These are the four times four times four possible data sets that could arise assuming one blue marble and three white marbles, and the three paths consistent with the data. Yeah? And now we can do the same for the conjecture where there are two blue marbles and two white marbles. See that, if you look in the center ring, there's two blue, two white. Now there are eight paths consistent. There are eight ways to see the data if the bag is half blue marbles and half white marbles. And then finally, for the case where there are three blue marbles and one white marble, there are nine ways to make it happen. So now you can compare, and you've got these relative counts, right? And this is Bayesian inference, and that is it. And so every time someone says they've done a Bayesian analysis, I want you to think about the bag of marbles, because that's a Bayesian model — no matter how complicated, it's just counting marbles, right? There is a set of assumptions, and it counts all the ways the marbles could have appeared according to those assumptions, and that's what it reports. And that report is called the posterior probability distribution. This is what you have done: you have just calculated a posterior probability distribution. I know — it's magical, isn't it? Now you can do really amazing stuff with this, because it's logic, and logic is nice.
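In R, that pruning of the garden amounts to a few lines of counting. A minimal sketch, not the course's own code — the variable names are just illustrative:

```r
# Each conjecture is a bag with 0, 1, 2, 3, or 4 blue marbles out of 4 total.
# Count the paths through the garden consistent with the draws: blue, white, blue.
blue  <- 0:4                  # conjectured number of blue marbles
white <- 4 - blue             # the rest are white
ways  <- blue * white * blue  # paths producing blue, then white, then blue
ways
#> 0 3 8 9 0
```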
Remember, computers are good at things that we're bad at. Computers are good at counting. We're bad at counting — I'm bad at counting. Computers are bad at being ethical. Maybe I shouldn't say that; people are better at being ethical than computers are. And so you have to bring the responsibility and ethics to these counts, because the computer will not. So now we've filled out our table. We've got zero, three, eight, nine. These are the relative plausibilities of the different conjectures. And you might think at this point: well, there's just not enough evidence to say, but it's probably either two blue marbles or three. That's where the weight of evidence is. It's not a big difference, though, right? If we had some more evidence, we might feel more confident about it.

So say we draw another marble — one more marble from the bag. Again, I pull it out. It's blue; the court observer records the blue marble; I put it back in. You could start all over again and make the trees again. But the wonderful thing about Bayesian inference is that you can just take your previous counts and update them with the new counts. This is called Bayesian updating. It's nothing more than taking the previous counts and multiplying by the new counts — and I'll show you why it's multiplication later. So, first thing, we just do some counting again. We've got one blue marble: how many ways can each conjecture produce an observation of one blue marble? Well, if they're all white, zero, yeah. If there's one blue, there's one way you can get one blue. If there are two blue, two — you see the pattern. Yeah, it's not hard. And then we've got our previous counts. And it turns out that if you want to integrate this information, you just multiply the counts. Why? Because that's the way the garden forks. Every time it forks out, it's like multiplication. You may remember from elementary school that multiplication is just addition compressed, right? It's just a way of doing addition fast. And that's all this is. You're still just counting. The multiplication is there because you're doing the branching — it's four times four times four times four, right? Each of those counts. So it's just compressed counting. So zero times zero is still zero. Three times one is still three. But eight times two is 16, nine times three is 27, and zero times four is zero. So now it looks pretty implausible that there's only one blue marble — much more plausible that it's two or three, and the distance between those two is growing. Yeah, you could draw another marble, and so on, until you reach some level of confidence — if you think it's safe to recommend... I don't know, this is a silly example. I don't know what you'd recommend from this. You tell me what you think, and then I give you 20 euros or something if you're right.

The same thing holds if you have other information of a completely different type: you can incorporate it the same way. One of the strengths of Bayesian inference is that it's easy to use different kinds of data to address the same inference. It works rather transparently. You can do this in other statistical paradigms as well, I want to say; it just looks very different, and it's very straightforward in Bayesian inference.
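Before adding any other kind of information, here is the updating step so far, again as a rough R sketch with illustrative names — one more blue draw just multiplies the old counts of ways by the new ones:

```r
# Counts of ways (for 0..4 blue marbles) after seeing blue, white, blue
ways_old <- c(0, 3, 8, 9, 0)
# Ways each conjectured bag can produce one more blue marble
ways_new <- 0:4
ways_old * ways_new   # Bayesian updating: multiply previous counts by new counts
#> 0 3 16 27 0
```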
So say you have a friend who works at the marble factory, and they testify that blue marbles are quite rare, actually — but every bag contains at least one. They have a manufacturing process at the factory where they always put in at least one blue marble, because when they sell these bags of marbles and someone buys one and there isn't any blue marble, they get angry. So they ensure that there's always one blue marble. And your friend tells you, in particular, that in the factory manufacturing process there are no bags that are all white and no bags that are all blue, but for every one bag that has three blue marbles, there are two that have two and three that have one. Blue marbles are relatively rare in the bags, but they're always present. So these are the ratios, right? Does this make sense? How do we use this? Well, you guessed it: you multiply. It's the same thing, because these are ways that bags can happen, right? There are three ways you can get a bag with one blue marble, for every two ways you can get a bag with two blue marbles, for every one way you can get a bag with three blue marbles. So these are just ways, and we just fold them in with our previous ways. We had three prior ways, from our previous four marbles drawn from the bag, for the conjecture that there's one blue marble in the bag. We multiply this by the factory count of three, and now this conjecture seems more plausible, because those bags are common at the factory, right? So that's a reason to think that maybe we have a bag like that; the other bags are rare. And notice now that 16 gets multiplied by two: we've got 32 ways to get a bag that's half blue. That's now on top, because of the factory information.

Now, this is a silly example, but it's silly because it's easy to learn from, right? Unlike real science. The goal is to help you understand mechanically what's going on inside every Bayesian model. You're not going to literally hand-count possibilities in your model; of course, we're going to use the computer to do that. And what's even more amazing is that your computer is going to consider an infinite number of different conjectures. It's going to rank them all and count up all the different ways they could produce the data. And this is easy thanks to a magical invention called calculus, right? Invented near here, actually, by a fellow named Leibniz. And there's also that Newton guy — we won't talk about him.

So what we end up with is that these counts get converted to plausibilities, because the counts are going to get big very, very fast. If you have a reasonably sized data set, there are going to be a really, really large number of different possible paths through the garden of forking data that could produce any particular data set. That's how combinatorics works: combinatoric counts get big really, really fast. You don't want to carry those sums around. So instead, we normalize the sums so that they are numbers between zero and one, and now they are probabilities. And that's what probability theory is: it's normalized counting. And it's wonderful. All of the operations of probability theory are just ways of dealing with counts that have been normalized to be between zero and one, and you can derive it all that way. This is not mysterious — mathematicians know this, right? But then the operations take on their own flavor and they look more mysterious. But this makes things convenient, actually. It's nicer to work with these sums that are between zero and one. So these plausibilities end up getting converted, in the right column, so that all of them add up to one. How do you do that?
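In R, folding in the factory counts and then normalizing is again just a multiplication and a division. A minimal sketch, with made-up variable names, using the numbers from the example:

```r
ways_draws   <- c(0, 3, 16, 27, 0)  # ways after the four marbles drawn (0..4 blue)
ways_factory <- c(0, 3, 2, 1, 0)    # relative counts of bag types at the factory
ways <- ways_draws * ways_factory   # multiply, because these are also just ways
ways
#> 0 9 32 27 0
ways / sum(ways)                    # divide by the total so the plausibilities sum to one
#> 0.000 0.132 0.471 0.397 0.000
```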
You just sum up all those numbers in the middle column, and then divide each by that sum, and you get a plausibility on the right. Usually when we work with statistical models, we assign some parameter to index the different possibilities of what's in the bag. So here that will be called p, the proportion of blue marbles in the bag, and we can describe the contents of the bag with that. And we usually speak as if we're trying to estimate p: we get a posterior probability distribution over p, the frequency of blue marbles in the bag. And we can consider an infinite set of those possibilities, right? Now, when I told you there are only four marbles in the bag, it has to be one of those five possibilities; it's discrete. But what if I told you there were a hundred marbles in the bag, right? Then you'd worry about 101 different combinations. In R code, this is really easy to do. We just add up all the ways: set up a vector with the numbers in the ways column — three, eight, nine; I'm dropping the zeros, right? Things that can't happen can't happen, so just ignore them. And if you divide each of those by the sum of three plus eight plus nine, you get the probabilities on the right. And that's all a probability is: the relative weight of evidence for that possibility. And that's Bayesian inference. So plausibility is probability, in this view of it. It's just a set of non-negative real numbers that sum to one. And those are the relative numbers of ways that each of these conjectures could be true, conditional on the evidence. I will say phrases like that a thousand times in this course. Yeah, it will become second nature to you to think of it that way. And probability theory is just shortcuts for counting. So Bayesian inference is just counting. It's counting weird things, but it's counting.

Okay, I've got one more minute, but I want to prep where we're going to start next time a bit, so that I can start here again and you feel like there's some continuity. Where do we go from here? From here, we need engineering skills. We need some procedure by which we build up a model. So when you come in next time, I'm going to start with a data-generating example. Literally, I'll generate some data for you. And then we're going to start building a model — a Bayesian model — of how to explain that data, from what we know about where the data come from. And you'll see, in every modeling exercise in this course, there's this recursive cycle of three steps that we go through. The first is, we design the model. We design the model given the scientific information we have about the data. What do we know, in general, about this process before we see the exact values that have appeared? If we blind ourselves to the exact numbers in the spreadsheet, only knowing about the generative process, the nature of the scientific background, what model can we design? That's what we want to do first. Then we condition on the data, using Bayesian inference. We update: we've got all the conjectures that were nominated from the scientific information, and then we count ways. And then we get critical. We step out of the small world into the large world, and we worry about the Pacific Ocean — the fact that it's probably there. What is wrong? What is missing in our story that could be messing up our inference? And this will lead, recursively perhaps, to changing the model.
And we try to do these steps transparently, publicly, so that we can get our peers to help us. And that's what statistical modeling is like. Okay, so I'm gonna stop there. When you return, I will have an inflatable globe in my hand and we will generate some data from it. Okay, thank you for your indulgence. I'll see you on Friday.