Ten weeks approximately, if I'm willing to slow down where necessary, for the entire book, which is an introduction to applied Bayesian statistics. The hope this week is to do chapters one, two, and three. That sounds ambitious, but I don't think it is, because chapters one, two, and three are mostly words, very light on code and mathematics. They're conceptual delivery, foundation building. So we'll do two lectures each week, an hour each, so the lectures will not be exhaustive of the book content, but I'll hit the conceptually most difficult points and do an example of each of the applied tasks that we're interested in learning as we go through. You're welcome to interact with questions, and the microphone might pick up your questions, but I'll restate them in some hopefully correct interpretation for the microphone anyway, and that will be an opportunity to tell me whether I've understood your questions as well. But questions are great. People get a lot from them, so please don't be shy about it.

Okay, so with that, I'll get started. The first chapter is not about statistics itself, but more about philosophy of science. So let me take a step back and forget about statistics, if you will, for a second, if you can. I think, like most of you, I got into this line of work not to do statistics. There's nothing about statistics that I enjoy. It's just something I have to do to get to work, right? Like lots of things. I don't like riding the bus, but if I want to get to work, sometimes I have to ride the bus. Statistics is like a bus: you need to ride it to get from data to inference. There's no leaving it aside. There's no way to process even one page of data without some sort of statistical framework.

And probably like most of you, I was inspired by natural phenomena. I'm an evolutionary scientist, so the range of questions that interest me could be summarized perhaps as: where does nature come from, and why does it take the forms it does, the natural world? In particular I'm interested in human evolution, but I'm also interested in things like the origins of species, the diversity of birds, the Pacific Islands, and all sorts of things like that. All sorts of questions in evolutionary ecology. But even if you're not an evolutionary ecologist — say you're a social scientist — there is a similar range of questions. Why do human societies take the forms they do? What are their histories and dynamics? What are the consequences of modifications in the environment? These are rice fields in contemporary China shot from above. I think this is a fantastic photo, with the sky reflecting off of them. This sort of terraforming is a hallmark of our species; the building we're in is a form of terraforming. These are natural processes that we also study. And these are the phenomena that interest us.

Scientific theories take the form of collections of models that help us explain and predict phenomena of various kinds. So we come to these subjects with an interest in developing theories, and some of those theories are instantiated in mathematical forms, some of them aren't. And then we get evidence that we use to threaten these theories, and that's where the statistical pain arises; you can't dodge it. And the awkwardness — well, really of late 20th century statistics, and what the statistics profession is now trying to address in the early 21st century — is that introductory statistics focuses on a very narrow range of statistical tools, which are really not up to the job of doing cutting-edge scientific work.
And most of those tools revolve around the analysis of agricultural experiments. Sir Ronald Fisher was a giant in evolutionary theory, but also a giant in statistics; often those two communities aren't aware of one another in those terms. And his statistical work mainly did not focus on population genetics, but on the analysis of agricultural trials at Rothamsted Experimental Station, which is still going strong, by the way. The photo on the right is a fairly recent aerial photo of a Rothamsted experimental field. On the left is a stained glass window that commemorates Ronald Fisher's contributions, and the color pattern is there to represent one of his randomization schemes for plots; that's one of the things he's famous for. Fisher's contributions to the analysis of agricultural experiments are fantastic. I'm not trying to malign them; they're great. This is where analysis of variance comes from. Most of you — the psychologists in the room — will know this, right? It's drilled in deep: you think about ANOVA, you black out, right? When you wake up, there are some sums of squares on your page or something like that. So ANOVA is a workhorse for analyzing factorial experiments, and it's great. And there are other procedures that Fisher and his contemporaries developed, which you can think of as — they're often taught as tests, as little procedures. In some stats software they're actually called procedures, right? Like in SAS.

These procedures can do specific things, but the problem is that when you're learning these procedures you're not simultaneously learning a framework to develop, or to connect, any arbitrary theory or model to evidence. And so what you end up with — what many of us end up with — is, well, flow charts like this. You get a data set, you've got some set of questions, and then the suggestion is, well, literally a flow chart like this. I drew this, but I drew it from somebody else's, so this is an actual flow chart; I've only redrawn it to protect the innocent. And the idea is that the way you interrogate the data has only to do with a few small things which are features of the data itself and not the theory. This is terrifyingly bad. This is an awful thing. I'm going to assert, and hopefully convince you over the next 10 weeks, that we can do a lot better. And the statistics profession is, of course, in unison about that. These tools, like one-way ANOVA, chi-square tests and so on, have their place, absolutely. But learning these procedures and having flow charts to choose among them is not a robust way to interrogate theories with data.

The biggest problem, if I can focus in on the thing that will consume us today, is that all of these procedures are taught in a way that's focused on testing null hypotheses. The goal of all these procedures is to reject some null hypothesis, typically of no effect. That gets rather subtle when you have matrix data — what "no effect" means — but still, whatever the analogy might be to that in a matrix. So to try and explain a little of what I mean, let's think metaphorically about these things. There are lots of metaphors in the course, and I hope that's okay with you. So I want you to think about statistical models as robots of a certain kind. The problem with the word robot is that people think of robots as being precise. Those of you who work with robots in labs may have different impressions of them; pipetting robots sometimes go wild.
So I want to use a slightly different metaphor: the golem. We live relatively close to Prague here in Leipzig, so the local audience may be familiar with this. The golem is a legendary robot — you can think of it as the first robot of folklore — and the golem comes from Jewish legend, from the Kabbalah. It's a clay figure animated to life with Kabbalistic magic. It really is an automaton: it's constructed, it doesn't have consciousness or will of its own, but it carries out orders dutifully, and it's much, much stronger and can withstand more punishment than any human creator. And the golem is best known through the legend of the Golem of Prague, in which there's a legend about how it's constructed as well. So here's my modern, sort of internet, version of how you make a golem. This is almost a complete instruction set: you get a ton of clay, form it into a humanoid shape, inscribe on the brow, in Hebrew letters in your finest calligraphy, emet, which means truth in Hebrew, and then you can give it commands — but very carefully, because, like with your computer — your computer is also a robot, you know — sometimes you give it instructions and the problem is it carries them out. You haven't thought very carefully about the instruction set, and it does exactly what you say, and that's the problem. Unlike with people: when you give people orders, there's context and background interpretation, and communication between humans works because we share background knowledge that gives words meaning. The problem with robots is they don't share our background knowledge, and so communication is much more hazardous, so to speak.

So in the legend of the Golem of Prague, it's constructed by an actual historical figure, Rabbi Judah Loew ben Bezalel. I'm probably saying that with the worst accent possible, but he was an actual figure. This person existed; he was born in 1512, died in 1609, and was a rabbi in Prague. He probably didn't make a golem, but I'm willing to believe lots of things, maybe. Maybe he actually built the thing; but it was a legend. In the legend, the Jews of Prague, like many Jews in Europe at the time, were persecuted, and the rabbi built the golem to defend the Jews against dangerous mobs and blood libel and such things. But it ended up taking orders too literally and killed innocent people, and so he was forced to decommission the golem at the end of the story. And this is a story that's told in various forms as a caution against assuming God's power. The problem with the golem is that mortals can't handle that power; it's just too dangerous to create life. That's the story.

That is not the lesson I want you to take about statistics — mortals cannot have the power, and so on. What I want you to get from this is that statistical models, well, they're logical, and that's where their power comes from. But it's also the danger of them: they will carry out your instructions to the letter, and only exactly logically. And so you have to understand their internal functioning in order to be a responsible user of them. So in this course we're going to spend a lot of time thinking about the construction, from the bottom up, of statistical inferences, and look into the guts of the golem, of the model, so that you can understand its behavior. Because eventually you will make a statistical model which will misbehave. But it only misbehaves relative to your expectations; it's behaving exactly according to its design. Understanding that is what I want to teach you to do. So think of the rabbi.
So here's my final joke slide on this metaphor, which I hope is evocative for you, sticks with you. This is my comparison between golems and statistical models. Golems are made of clay; models are made of silicon, at least for now, since they live in computers. The golem in legend is animated by truth; models are animated, in a sense, by truth as well, because we're trying to discover some philosophical concept of truth. We won't have that discussion now, what that means. Give me some beers and I'll talk about it with you. Golems are powerful. That's the reason in legend that they're created, and it's the same reason we construct models, or robots: because they can do things that are very difficult for us. The thing about computers and robots and models is that we design them to be good at the things people are bad at, and they're fantastically bad at the things people find easy. Computers can play Go, but they can't recognize birds in photos. There's this complementarity between them. Models are, hopefully, very powerful; they can do things that are very difficult for people and make it seem easy. But they're blind to our intentions, so we have to be very careful in how we design them. Both of these — golems and models — are very easy to misuse as a consequence. Golems, finally, are fictional. I think they're just a legend. Models, I'd say, are not even false. That is, they're intellectual constructs which are meant to process information; it's a category error to talk about whether they're true or false. Does that make sense? Again, give me some beers and I'll give you an impromptu lecture on that sometime. I've skipped over a few things, but it's not necessary right now. I want to keep moving.

Returning to tests and this terrible flow chart: again, every statistical procedure in a flow chart of this kind has its place, and I don't mean to malign them in general. What I mean to malign is the idea that a mature way to do cutting-edge scientific research is to choose among these little isolated procedures and reject null hypotheses. I think that is a very bad idea. Part of the reason is that these tools were developed for fairly simple factorial designs and things under experimental control. It's a fact that many of us start with a model of the system already. A factorial design model is just irrelevant, because we have some dynamical system representation. So in ecology, you don't study the population dynamics of lynx and hare with a factorial design. It doesn't make any sense. There's an underlying dynamical model of the population density that you want to fit to data. But if you don't do science like that, that's fine; your science is still awesome. There are lots of other things about your situation which may be special. Cognitive science has all kinds of issues like this with time series. If you have multiple samples from the same individuals in a session, then the classic factorial designs don't apply and you're wasting information. I know most of you know this, right? Because now random effects are a common thing, and you're probably here to learn about random effects, and you will. There will be random effects all over the place. But choosing from these flow charts is not going to get you to some reasonable inference about it. What you want is some framework so that you can go from the model of the system to evidence and ask how different models predict things. So let me spend a little bit of time now on this issue of the null model.
I want to convince you that the biggest problem with these procedures is not necessarily the underlying model in each, but how they're used as procedures for rejecting null hypotheses. And maybe the only good thing I could say about that tradition is that it's useful at the very beginning of a field, when we know almost nothing about what's going on. Then discovering that something is going on maybe makes some sense. But we can nearly always do better, and it's not hard to. So I want to convince you that falsifying null models is not sufficient to learn how the world works. I'm going to use an example from my area of expertise, of course, because I know it well, but I'll try to present it in a way that your understanding of it doesn't depend upon the details too much.

So this is an example from population genetics. There's a classic debate that started back in the late 70s in population genetics about the extent to which natural selection is important for structuring DNA; it's called the molecular neutrality debate. Summarized in some crude way: is evolution neutral or not at the molecular level? Now, nobody thought that natural selection wasn't necessary to explain the design of organisms. There's been a consensus about that forever in biology, ever since Darwin, right? But there's a lot of debate about the exact structure of genetic sequences and the extent to which those sequences depend upon selection, or instead upon mutation, and the relative balance of those forces. So you can think about a null hypothesis that evolution is neutral, and there were vigorous debates in the 70s and 80s about this, with camps arguing that the goal here, what we should do, is reject the null hypothesis that evolution is neutral in order to show that selection is important. And you'll recognize that, I think; it's a kind of standard scientific trope.

The thing about using statistics to interrogate data is that there's not a one-to-one correspondence between what I'm going to call hypotheses, process models, and statistical models. Let me say a little about what I mean here. A hypothesis is often some vague mass of concepts and path diagrams and other things that give you some expectations about natural systems. The blob shape on the left of this slide is meant to represent that: it's a bit squishy. So this is something like "evolution is neutral." A statement like that is consistent with a bunch of different detailed process models, and at some point you have to teach a machine to compute with it. There's this famous saying, I think it's from Don Knuth, that science is everything you can teach a computer, and everything else is commentary, or something like that. I don't quite believe that. But the nice thing about it is that when you have to teach the computer your hypothesis, you have to answer a bunch of questions that you didn't realize were necessary, and that's useful. That's really useful. So one process model — and the famous one in this debate in population genetics over neutral diversity — is the so-called neutral equilibrium model, where there's just no selection in the population. Alleles are appearing randomly, at some low rate, and mutations accumulate in the population, and you're looking at the frequencies of different specific mutations, and you characterize the population in terms of the frequency spectrum of different alleles.
Those of you who aren't familiar with pop-gen don't need to understand that detail to get the point that I'm driving at, I promise. And then from this — so that process model can generate data in many different representations; however, the thing that I just described to you, the spectrum of alleles, is the aspect we interrogate with a statistical model. You take some aspect of the process model, but only some aspect, and you cast that out as a statistical expectation, and that's what you look at in the data. But there are other ways to look at the model, like a time series, that would make the model look different, and we're going to come to that in a bit. Does this make sense for now? Just the linking? Okay. There's not a one-to-one correspondence.

So what happened in the history of this debate is — well, I should say for a second, if you do this, evolution looks really neutral. It's very hard to reject the null hypothesis. Very, very difficult. Lots of people were bothered by this, most famously a population geneticist named John Gillespie, who's now retired but was at my former university, the University of California, Davis. And John made a bunch of selection models — this was like a hobby for him — he made selection models that imitated neutrality, and he could make them like it was a lunch hobby for him. I'm being a bit flippant, but this was a big fight in the literature. Here's the basic intuition. We have some other range of hypotheses: that selection matters. Now, you can probably see that there's a bunch of different ways selection could matter, and selection can take many different forms. Gillespie's whole research career before he got involved in the neutrality debate was spent arguing with his colleagues that selection fluctuates in natural systems. Directional selection only holds for a little while. That's the way he thought about the evolution of beak length or anything else: organisms are getting shoved around from season to season and year to year and epoch to epoch, and it's not some steady march of increasing body form or anything like that. And that had consequences at the molecular level: genomes become this jigsaw puzzle of different sweeps, partial sweeps.
So he would have mathematical models of different ways that selection could work. The one that the proponents of the neutral theory — most famously Motoo Kimura, who was a fantastic population geneticist — were thinking of was constant selection, the kind of classic, everybody's-first-Darwin model: birds with bigger beaks survive more, right? But eventually that has to stop, right? Otherwise the world would be a beak expanding at the speed of light; it wouldn't work after a while. And you can reject that. That gives you some set of predictions — M3 I'm calling it here — a particular set of statistical expectations that looks quite different from M2, the neutral model. But there's another selection model that Gillespie derived, where there's fluctuating selection, which makes frequency distributions that are indistinguishable from neutrality, if you only look at the data that way. And when population geneticists realized this, they were like, oh wow, okay, let's sit down for a second: rejecting the null model couldn't possibly tell them what was going on in the real world. And this was a growing-up moment for the field.

It's even more interesting — there's lots of cool stuff here. You can think of other ways that evolution could be neutral, too. What if it's neutral but the population size is not constant? This is what I call neutral non-equilibrium, and then you get another representation of the system, which I'm calling M1. So there's not even a single null model. In any reasonably interesting natural system there's no unique null model even to inspect. And I think this is also true of experiments; we can come to this maybe when we go into examples of experimental data, but there is trivially more than one null hypothesis to worry about in most simple factorial designs. Bug me to come back to that when we're going through a factorial design in a later week, maybe, and I'll put some meat on that bone. Does this make some sense to you guys? All I want to get across right now is that rejecting null hypotheses can't get you very far. It can get you into a subject, so that's cool, but as soon as there's maybe something going on there of interest, you're going to have to have multiple models, and interrogate them, and look for a representation of the evidence that can tell them apart — because if they make the same kinds of predictions, then you've designed the wrong experiment.

The general point here is that philosophers of science are in unison that the way scientists deploy null hypothesis significance testing is a bizarre inversion of the falsificationist philosophy of science. Which is ironic, because I think many scientists think of null hypothesis testing as following in the footsteps of Papa Karl, as I call him — Karl Popper. You don't have to call him Papa if you don't want to. So let me go through the summary of this material for you. The first thing I want to convince you of is that null models are not unique. So even if you reject a null model, you may have rejected the wrong one; there's some other null model going on. What we should do instead is have multiple explanatory models that we want to try and threaten. And of course, for Karl Popper the whole point was to reject your explanatory model, not some null model, not some straw-man model that you don't think is explaining your system, but the thing you think is explaining your system. You're supposed to make risky predictions from the explanatory model and see if they're falsified or not. That was Papa Karl's program.
Now of course the caveat here is that falsification — so, scientists do falsify hypotheses, we do it all the time, but it doesn't happen simply through a statistical procedure. It happens through all the, well, all the drama of scientific life, right? The storm and drama of everything we do. And this is what philosophers of science say: that falsification is consensual. It arises through some consensus, a debate process of interpreting the evidence in that way, because sometimes the data are wrong, right? Now, we hate it when we criticize someone's hypothesis and then they tell us our data are wrong, but sometimes our data are wrong, so we have to allow that window to be open. So falsification isn't some logical procedure; it needs meta-theoretic debate outside of the statistical processing. Besides, falsifiability for Popper — apologies, I know a lot of you know this, having taken courses on this stuff — was about demarcation. It was about drawing a line between what is science and what isn't. It wasn't a model of how science has to work. He allowed all kinds of confirmation-like processes to be productive in figuring out how nature worked, but in principle things had to be falsifiable to be in the arena. It was a demarcation, it was drawing a boundary; it wasn't a process, it wasn't an argument for a unique process.

Okay, let's move on. What I want to teach you guys to do is some engineering. We're going to have some clay — metaphorical clay, in the form of computer code — and we're going to build little cute golems, little tiny ones like that one in the picture, which is from a fantastic graphic novel called Breath of Bones, which retells the golem legend, set in a later time than the original story. What we all want as scientists is a framework for developing statistical procedures that address the problems that we are interested in. We don't need a menu of procedures, or a menu in SPSS, god forbid, that we choose things from. Apologies to those who use SPSS; I pity you, and I will free you from your shackles. I'm here to help you. So we need some framework, some way to think about, in general, the connection between evidence and theories of interest, whatever those theories are. And of course there are several options. This course is going to be about Bayesian statistics, but it wouldn't have to be. I could teach this course entirely without mentioning Bayes, and much of the content would be the same, because there are some isomorphisms between the different frameworks. I happen to think that the Bayesian approach is the easiest to teach, and it has a unified coherence to problem solving that makes it easier to teach for that reason. But you could solve all the problems I'm going to solve for you in this course in a pure likelihood framework. I have nothing against that. The thing I'm against is null hypothesis testing, and you can do that in Bayes just as much as you can do it outside of Bayes. So this is not a Bayes-versus-frequentist fight at all; that's not what it is. The fact is that often the easiest way to fit a model is with some Bayesian machinery, so it actually solves problems. It has a reputation for being fancy, a way to show off — he's a Bayesian, drinks cappuccino, does Bayesian statistics — but it's not fancy, as I want to convince you, and I'll start on this today. It's brute-force simple and logical. And the Bayesian approach actually lays bare — this is how it feels to me — lays bare all the imperfections, makes it clear that it can't ultimately tell us what's true or not. A lot depends upon assumptions, and all it is is an engine for processing assumptions.
So we're going to use Bayesian data analysis as the general framework in this course for building golems, and we're aiming at multilevel modeling, because I think that's the tool that most scientists know that they need. This is the thing that, in training programs everywhere, students know is the thing you've got to learn: random effects, or whatever you want to call it — hierarchical models, random effects, multilevel models. I tend to call them multilevel, but I use the other terms too. So I'm going to teach you to do that, because you need to understand those models. They're incredibly useful in all sorts of contexts, whether you do experiments or you analyze observational systems. But I'm also going to use them as a way to trick you into learning even more stuff, because multilevel models are a gateway drug into solving all kinds of problems, like measurement error issues. Measurement error is a kind of multilevel model, and other things too: factor analysis turns out to be a kind of multilevel model. Everything's a kind of multilevel model, and I want to help you see things that way, so you see the unity among these different things — instead of factor analysis being an island over here, and a great terrifying ocean of dragons, and then random effects models over there — they're all kind of the same kind of model, and I want to help you see that so you can design your own. You don't need my approval or anybody's approval to make a custom model. You need some care in the engineering, make sure it doesn't wreck Prague, but you should feel some freedom to design and code your own models.

We're also going to need some tools within the course to do model comparison, because if we're not going to reject null hypotheses, what are we going to do? We're going to try to make multiple non-null models of the system and compare them. The framework that we're going to do this in is information criteria, but there are other frameworks too. I think the important thing is that you worry about the problem of overfitting, which is what I'm going to focus on, and find some way to measure your overfitting risk. Information criteria are a natural tool for doing that — they're kind of a cross-validation metric — but we can punt on that.

Okay, so let me spend the rest of the time today getting into the actual meat of introducing Bayesian stats. We're going to build the foundations of this machinery. Bayesian data analysis is old — it's hundreds of years old now. It's older than most of the statistical tools that you learned in your first statistics courses, which were developed in the early 20th century — or late, yeah, most of them in the early 20th century — by Fisher and Neyman and Pearson and people like that. Bayesian statistics goes way back. I would say there are lots of people who were responsible for it. You could say that probability theory was developed by French gamblers so they could win money; that's actually not false. We know this from their correspondence — they were really interested in dice games. So, what did French people with mathematical education do in the past?
They did other things too, but lots of probability theory was built up from people arguing about dice throws and the probabilities of things. But if I was going to pick a single individual who compressed all this into a framework for general inference, it would be Laplace; you can think of him as the father of applied probability theory in the Bayesian sense. So in this framework we think of probability as a way to describe uncertainty. It's epistemic rather than ontological: randomness is a property of your knowledge — the fact that you don't know something — not of the world. The philosophical conceit is that anything that appears random to you appears that way because you can't predict the outcomes, because you don't know stuff; the actual physics is deterministic, and it's just that at the level of abstraction you're looking at, you can't tell. And there's this interesting irony when we talk about random numbers. You may know that when computers generate random numbers, people sometimes call them pseudo-random. In the Bayesian position, everything's pseudo-random. There are no true random numbers, because there's a deterministic process that produces the distributions, and we've just taught computers ways to mimic those distributions. And the fact that we can teach them deterministic algorithms that produce random-looking patterns is pretty good evidence, I think, that this philosophical view is at least consistent with nature. But you don't have to buy into that necessarily at the level we're going to think about data; we won't get into quantum mechanics and stuff like that.

Here's maybe the most useful way to think about it. Most of you know what is called propositional logic — truth tables, things about true and false. I can't think of one off the top of my head, but there are these various games you do in introductory philosophy courses: Linda's a bank teller, she's also a democrat, and stuff like that, and then you ask other questions about Linda. So there are these questions of what is a valid deduction, and this has been a long-running interest in philosophy. And the Laplacian tradition extends propositional logic to continuous plausibilities. It's not that something is only true or false; it can be kind of true. What that means — I'll put some meat on that as we go through today, and when we come back on Friday as well, we'll finish it up. So that's the goal: we want to characterize our knowledge of the system. What is the information in the data, and how plausible does it make the different explanations of the data? That's our interest in it. There are other things you can do with it — sometimes Bayesian inference is taught as a model of rational belief. We're not going to do that. I'm not going to use the word belief again in the course. We're talking about the golem's beliefs, not yours, the scientist's. Your machine is developing beliefs in the form of probabilities, and you're going to inspect your machine.

Aside from fairly simple models, in what are called conjugate families, you need some other way to do the calculations. Now this is quite easy on desktop computers. Starting in the 1990s, a family of algorithms called Markov chain Monte Carlo were written for desktop microcomputers, and this led to a skyrocketing rate of use of Bayesian statistics in the statistics profession. It had been dormant for a long time, because you needed other kinds of approximations. But now, with sufficient care — we'll focus on this later in the book — you can draw inferences from arbitrary Bayesian models, for many data sets, on your desktop in reasonable time.
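Just to give a flavor of what that machinery looks like, here is a minimal sketch of a toy Metropolis sampler, one of the simplest MCMC algorithms, written in Python. Everything in it — the data (6 successes in 9 trials), the flat prior on the proportion, the proposal width, and the names — is made up purely for illustration; it's only meant to show the shape of the algorithm, not anything we'll use later.

```python
import math
import random

# Toy Metropolis sampler: estimate a proportion p from 6 successes in 9 trials,
# with a flat prior on p. All numbers are illustrative only.
def log_posterior(p, successes=6, trials=9):
    if p <= 0 or p >= 1:
        return float("-inf")          # outside the parameter space
    # binomial log-likelihood; the binomial coefficient cancels in the
    # acceptance ratio, and the flat prior contributes nothing extra
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

random.seed(1)
p_current, samples = 0.5, []
for _ in range(10_000):
    p_proposal = p_current + random.gauss(0, 0.1)        # propose a small jump
    log_ratio = log_posterior(p_proposal) - log_posterior(p_current)
    if random.random() < math.exp(min(0.0, log_ratio)):  # accept with prob min(1, ratio)
        p_current = p_proposal
    samples.append(p_current)

print(sum(samples) / len(samples))    # posterior mean, roughly 0.64 (exact: 7/11)
```

The chain wanders through parameter values, visiting each in proportion to its posterior plausibility, which is why a simple average of the samples approximates the posterior mean.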
So this is what I'll teach you to do. I should say, Bayesian statistics used to be controversial. It was suppressed in the first half of the 20th century by Ronald Fisher and his contemporaries. And Fisher — maybe some other time; again, this is one of those buy-me-some-beers-and-I'll-tell-you-the-whole-story kind of things — Fisher was really against the use of Bayesian methods, which at the time weren't called Bayesian; they were called inverse probability. I think Fisher was probably the one responsible for them getting the name Bayesian. They should be called Laplacian; we should name it after Laplace. But regardless, in his 1925 handbook, Statistical Methods for Research Workers, which was extremely influential in both biology and psychology for teaching ANOVA and factorial designs, all he says in the introduction is that inverse probability must be wholly rejected. And he argues this elsewhere with no citation — though you can figure out what he's leaning on if you read around in his stuff. Let's just say that his objections have been suitably answered in the meantime.

Harold Jeffreys and Bertha Swirles were also two major early contributors, contemporaries of Fisher — physicists, geophysicists. Harold Jeffreys, on the right, was a famous geophysicist; he's credited as the discoverer of the earth's internal structure, the way to use seismic waves to figure out what the core of the earth is like, stuff like that. And his spouse was an early and very important quantum physicist — Lady Jeffreys, as she's officially known, but her name is Bertha Swirles. And they were early proponents who carried the torch of Laplace through the dark ages of the early 20th century, while no one else in the British Isles was allowed to touch it, because Fisher would fire them. Sorry, I'm joking a little bit; it wasn't quite that bad, but there's some truth to my jokes.

So let me boil this down. I like Bayesian statistics because, as I said, its inadequacies are transparent. It lays the assumptions bare. All the other statistical frameworks are equally inadequate, but sometimes they seem more powerful because the assumptions aren't as easy to read. That doesn't mean they're less useful; it just means that I think some sorts of mistakes are more likely. Here's what I want to boil it down to, and it's modest: Bayesian inference is nothing more than counting the ways that data can happen according to your assumptions. That means the model, the golem, the assumptions you plugged into the golem. The assumptions with more ways to cause the data are more plausible, and that's all the statistical machinery can tell you; Bayesian inference is just that. And probability theory is a way of counting ways to produce things. That's all it is: probability theory is renormalized counts. So I'm going to start by teaching you probability theory over again, just counting. We're not going to talk about probabilities; we're just going to count stuff, and then I'll transition to the probability representation, hopefully in a way that is seamless enough that it makes some sense.

So let me contrast this view of probability with the dominant frequentist view — and this is probably the last time I'll say anything about this difference. In frequentist probability, probability is ontological. It's defined by realized frequency distributions in the world, on repeat samples. These repeat samples, at least for Fisher, were purely imaginary; they weren't things you could ever observe. Statistical populations, in the Fisherian view, are just ideational, things you can't ever observe.
For evolutionary biologists this is quite obvious. It's maybe not obvious for psychologists, because you can imagine an endless line of students going through your experiment, maybe. But for biologists — say you're interested in the diversification of some birds in the Andes, as one is, and it's a very interesting question actually — you don't have replicates. There are no replicate Earths out there where we're going to rerun history and regrow the Andes. It doesn't make any sense at all. Fisher was well aware of that; for Fisher it was just ideational. But sometimes it's taught as if there's an actual empirical population and we're taking samples from it. The key thing to keep in mind — and I tell you this just so you don't make this mistake in interpreting Bayes — is that if you're in the frequentist framework, that's the right way to think, but if you're in the Bayesian framework, it'll lead you to misinterpret the model output. The probability doesn't come from repeat sampling or sampling variation. It's just epistemological: it's the number of ways things could happen that are consistent with the assumptions, given the data. It's not this repeat-sampling thing. Often this doesn't matter — for simple things like factorial designs, you get numerically the same answer with the Bayesian analysis and the frequentist analysis — but sometimes it really does matter, because you get conceptual roadblocks.

The example I use in the text is astronomy. On the right-hand side of this slide is my imitation of Saturn as Galileo saw it. Galileo, with an early telescope, looked at Saturn and saw what we now know are the rings of Saturn. There's this cute drawing in his notebooks: three little circles — one big circle and two little circles on the sides. This is probably what he saw, and you get this just by blurring an image of Saturn. So the question is: what's the true image? This is like a crime-scene-investigation sort of problem, an image analysis problem. No matter how many times you look at Saturn, you're going to get the same image, so your uncertainty doesn't arise from sampling variation. What does it arise from? It arises from the natural process by which light scatters. And the Bayesian approach leads directly into this: you can use probability to talk about the different scenes that could have produced this image. The frequentist approach stutters a bit — you can solve image analysis with a frequentist approach, but there's this conceptual stuttering in the middle.

Okay, so the randomness is only in the golem; it's not in the world. Keep that in mind; you have to retrain your thoughts a little bit about that. So maybe it'll help to think about coin tosses. Coins are not random, right? It's our inability to predict which side will land up that makes them random. We can use them as a randomization device because, if you flip them fairly, it's a chaotic system, and they're so sensitive to initial conditions that they're essentially unpredictable, and that's why we call them random. But the physics is deterministic, at least at the scale of coins; everybody agrees about that. Disagreement exists at smaller levels — at the level of protons, people disagree about that. So coins are not random; randomness is a property of us and our knowledge. If you spin a coin, I guarantee you it's not random: you can predict very easily which side will come up — the lighter side will come up more often. Euro coins are great for this, with the eagle on one side of the coin. Wait, sorry, that joke only works in Germany. Sorry, people who are watching from other countries won't know what I'm talking about. That'll happen a lot.
So, transitioning to chapter 2 now, past the introduction: we're going to start building up Bayesian inference using counting. Let me step into a metaphor here, so we can think about what our job is and where probability as a concept lives. This is a drawing of a fantastic globe, one of the first globes manufactured in Europe. It was produced in 1492 by a geographer named Behaim — Austrian, I believe. I think this globe is in a museum in Berlin now; I don't know, you can google it and find out. Anyway, there are 3D scans of it online; go and look at it, and people have drawn it. And Columbus used this globe to plan his voyage, which is an interesting bit of the history, because the interesting thing about this globe is that, unlike most geographers of the time, Behaim thought the earth was smaller than it actually is. Many other geographers, contemporaries of Behaim, knew how big it was — I mean, the ancients knew this; there were geographers in Egypt who measured approximately the true size of the earth using shadows and wells. We can talk about that story some other time, maybe. But Behaim was like, nah, smaller. I'm being a bit flippant; I don't know why he thought this, but it's a very consequential error, because it compresses things — the whole Pacific Ocean is missing from this globe, and so what you're looking at is the combined Atlantic and Pacific. This is Asia over here; this thing called Cipangu is Japan; and that's Europe over there — you can see Hispania and Gallia up there, the British Isles, and then North Africa, the Canary Islands, I'm sorry, over here. I love this map; you can stare at it, there's all kinds of cool stuff on it.

So Columbus plans his voyage like this, and he stocks enough food and water to get him across that distance, because he's going to the East Indies and he's going to get some spices; it's going to be great. This could have been a deadly error. Imagine the Americas weren't there. There would have been a lot of ocean, and what would have happened? They all would have starved to death or died of thirst in the ocean. Well, maybe it would have rained enough on them and they could have lived, but it would have been a big mistake. I want to use this as a hopefully memorable metaphor for something that in Bayesian statistics is referred to as the contrast between the small world of the model and the large world that we're actually going to make predictions in. We study the large world as scientists. We're interested in explaining the actual world that we live in. This is difficult. The thing about the world we live in is that we don't know the boundaries of the processes; the categories aren't nominated for us in advance. In the small conceptual worlds we use to interrogate evidence — in statistical models — all the possible events are pre-nominated in the model, and you have to do that in order to use the machinery. It's like when you make a globe: you've got to define how big it is and put all the land masses on it, and then you plan your voyage. But then it turns out, well, okay, the earth is a little bit bigger — about twice as big as this Austrian fellow thought — and there are more continents in the way, and by the way, there are people living there as well. So if I superimpose the real continents on this, it doesn't fit, right? It just doesn't fit. Think about it: in this representation, if the earth were this small, well, Baja California and Japan would be merged; they'd be stuck together.
So, a very well-known Bayesian statistician from the mid 20th century, L.J. Savage — called Jimmie Savage in the literature — is responsible, at least as far as the first reference I could find, for this distinction, and he lays it out in a famous textbook published in 1954: the distinction between the small world and the large world. Our work in probability theory is small-world work, a logical world, and in the small world Bayesian inference is optimal. There is no other way of using the available information which could produce more correct inferences. So all these proofs about Bayesian optimality are proofs about small-world assumptions: given some set of assumptions and given some set of evidence, it is optimal to process it the Bayesian way. But if you're wrong about what events are possible, or such, then all the optimality proofs are off. This is what leaves room for what I call the heuristic view of analysis: squirrels do very well in their environments without being Bayesians, because they're not solving the problem that scientists have to solve when they fit statistical models to data; they're solving a more general problem of how to survive to their next meal. That's not usually our job as scientists. So this alternation between an interest in the large world and the use of small-world tools to interrogate it is part of our business as scientists, and it's where the rubber meets the road for statistical analysis. So we're going to worry about both of these things as we go, and I'll have a lot to say about how we do applied work in the small world and fit models. That's often the hardest part to learn, because it's so awkward — it's the thing that computers are good at and we're not, myself included; I am human. But then we need to always think about the large world, and ask if the model's results make sense, or what other possibilities are missing here, or do we trust the data.

So again, let me remind you what probability theory is: counting all the ways data can happen. Or: Bayesian inference is counting all the ways data can happen according to assumptions, and assumptions with more ways to cause the data are more plausible. So let me build this up exactly as a counting framework. It'll be a bit of a silly example, but my experience is that this works; it builds things up for people. Some of you will know the famous short story from Borges, "The Garden of Forking Paths." I'm going to talk about this in terms of a garden of forking data. There are lots of possible data sets that could have happened. You run an experiment, you walk a transect in a forest to count birds, you watch baboons groom one another — whatever it is you do — and lots of data sets could have happened. And our job is to say, given the data set that did happen, which processes most plausibly could have produced it. That's our goal, and we can just count them up.

So let's start with an example. We're going to have some data. There are many possible different events, but each observation eliminates some of those possibilities. I'm going to ask you to think about a bag — I apologize, this is not the most exciting example, but that will hopefully mean the details don't distract you. There's a bag, and it contains four marbles, and the only thing you know is that the marbles come in only two colors. I'm going to make this easy: it's supposed to be blue — it's a bit dark on the slide — white and blue. Since there are four marbles, there are five possible contents of the bag.
They could, number one, all be white; or there could be one blue, two blue, three blue, or all blue. Agreed? Those are the only possibilities, conditional on our assumptions. Now, the data: a friend of yours reaches into the bag, draws out one marble, puts it back, the bag is shaken, pulls out a second marble, pulls out a third, and they are blue, white, blue. This is sampling with replacement, three marbles from the bag. And now the question is: given that data, what are the most plausible contents of the bag? A classic annoying party game — at least at parties I go to. Alright, no, I literally have been at a party where a bunch of drunk applied mathematicians challenged one another with probability games. Yep, good times. So yeah, try doing backwards induction drunk; it's a little bit harder. But okay, let's map this out, and I'm going to get this established — I have just enough time to get this established by the end of today's session.

So again, to remind you: at the top here, upper right, we've got data, and it's blue, white, blue. We're going to consider just the first marble to start. How many ways could we get a blue marble? And to do this, we've got to take a particular conjecture, and start with that, and start counting — because the goal is, for a given assumption about the process that generates the data, how many ways could we get the data we actually observed? And then we're going to compare the different sets of assumptions. So let's start with the conjecture that there's one blue and three white marbles. You with me? So there are four possible events that could happen on the first draw — bear with me, right — four possible events, because there are four marbles in the bag, right? The fact that three of them are white is irrelevant; they're still different, right? They all look the same to you only because you're only interested in color, but that comes from the way you're representing the data. They're actually different events; they're different ways you could get a white marble. So there's one way to get a blue marble and three ways to get a white marble. It makes sense, because there are three white marbles: they all look the same to you, but they're still unique snowflakes, yeah? Each of them is special to their parents. Okay, so there's one way to get the first data point. Makes sense?

Now the garden branches — this is Borges' garden. Whatever happened on the first draw, four things can happen after it. This is the garden of forking paths, and forking data. So if you drew the blue marble on the first go, you could draw the blue marble on the second go, or any of the three white marbles; and for any of the three white marbles, likewise, four things could happen. So now we've got 16 possibilities, right? And we're interested in getting a white marble on the second go, right? So how many ways could that happen? Three, right? So conditional on getting a blue on the first go, there are three ways to get a white on the second go, so there are three total ways — I'll summarize this as, don't worry, there won't be a quiz — three ways to get the data so far. And then finally the last marble gets drawn. Now we've got a really big garden, and we've populated it all the way out, and our job now is to count, of all of these terminal points, how many of them trace out blue, white, blue — because that's the data, in sequence. You with me? This is what Bayesian inference does for you, but you don't have to make the garden every time. I'll teach you how to teach your computer to do it for you automatically; that's the wonder of probability theory: it builds these gardens for you. This is what it's doing.
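To make that concrete, here is a minimal sketch in Python — just an illustration, with made-up names, not code we'll use later — of what teaching your computer to build the garden by brute force could look like: it enumerates every path of three draws with replacement from a conjectured bag and counts how many paths match the observed sequence blue, white, blue.

```python
from itertools import product

def ways_to_produce(bag, observed):
    """Brute-force the garden of forking data: enumerate every path of
    len(observed) draws with replacement from `bag` and count the paths
    that reproduce the observed color sequence exactly."""
    return sum(1 for path in product(bag, repeat=len(observed))
               if list(path) == observed)

observed = ["blue", "white", "blue"]                 # the data we saw
conjecture = ["blue", "white", "white", "white"]     # bag with 1 blue, 3 white

print(ways_to_produce(conjecture, observed))         # prints 3 (out of 4**3 = 64 paths)
```

With four marbles and three draws that's only 64 paths to check, so brute force is fine here; for longer data you'd use the multiplication shortcut that comes up next.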
So I can fade out all the paths that are not consistent with the data, and there are exactly three. You can see you've got to get the blue on the first draw, then there are three ways to get a white, and there's only one way to get a blue on the last branch. So there are only three ways consistent with the data set. So if we assume that the bag contains one blue and three white, there are only three ways to get the data. Make sense? So it's like, big deal. Okay, now what? Well, now this is useful, because we're going to compare it to the other possible contents of the bag. So we're going to make a table here. Here in the left-hand column are the possible contents; these are our models, our possible conjectures about what generated the data, and we're asking: how many ways are there to get the data we actually saw? So far all we've got is the three there, for the second row; we've got to fill in the other rows now. I assert — but I hope you will also intuit — that the first and last conjectures have zero ways to generate the data, because the data contain at least one white and at least one blue marble. But you could make the garden and count, if you feel so inclined.

Let's do the other ones. This arc on the slide is the thing you've already seen: if you look in the middle part, here's our conjecture, one blue and three white, and there are three ways to get the data set. Let's consider another one, the bottom arc here: there are two blue and two white. Now there are eight ways. It's the same kind of garden, but it gets constructed differently, because there are two blue and two white marbles. So there are two ways to get blue on the first go, then for each of those, two ways to get white on the second go — four paths so far — and then you've got two blues again, so you get a total of eight different ways to realize the data set if the bag is half blue and half white. And then finally, if there are three blue and one white: same way to construct the garden. Three ways to get blue on the first go, one way to get white for each of those — so three ways to get white on the second go — and then, for each of those, three more ways to get blue. So nine ways in total. Now, you can imagine that as your data set grows in length, this garden gets really big. Combinatorics is really mean, and this will grow super fast. This is what computers are good at, so this is the job we give the computer. But probability theory compresses this counting down into a continuous number between zero and one; that's why it's so useful. But it's just renormalized counting — literally, it is.

So let's populate our table. First conjecture: no ways to produce it. I assert you've got zero ways to get a blue marble on the first go. You can think about this as each of the steps multiplying. This is where the multiplication rule in probability theory comes from, actually: things that have to happen together end up being multiplications, because the paths branch multiplicatively in the garden. So zero times four times zero is zero; there are zero ways to get the data set for the first conjecture. For the second conjecture, there's one way to get a blue marble on the first go, three ways to get a white marble on the second go, and one way to get blue again: that's a total of three ways. And then two times two times two is eight: two ways to get blue, two ways to get white, two ways to get blue — again, the multiplication rule, it's just the branching of the paths; this is where the multiplication rule in probability theory comes from, it's just the garden. And then three times one times three gives you nine ways. And then four times zero times four: if there's ever a zero, it's all zero. Zero is the trump card; it's still zero.
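Here's the same idea as a minimal sketch in Python — again only an illustration, with hypothetical names — using the multiplication shortcut to fill in the whole table and then renormalize the counts into plausibilities:

```python
# The multiplication shortcut: for each conjectured bag, the number of garden
# paths is the product, draw by draw, of how many marbles match the observed color.
observed = ["blue", "white", "blue"]
conjectures = {                                  # keyed by number of blue marbles
    0: ["white"] * 4,
    1: ["blue"] * 1 + ["white"] * 3,
    2: ["blue"] * 2 + ["white"] * 2,
    3: ["blue"] * 3 + ["white"] * 1,
    4: ["blue"] * 4,
}

ways = {}
for n_blue, bag in conjectures.items():
    count = 1
    for color in observed:
        count *= bag.count(color)                # ways to get this color on this draw
    ways[n_blue] = count

total = sum(ways.values())
plausibility = {n: w / total for n, w in ways.items()}

print(ways)           # {0: 0, 1: 3, 2: 8, 3: 9, 4: 0}
print(plausibility)   # {0: 0.0, 1: 0.15, 2: 0.4, 3: 0.45, 4: 0.0}
```

Dividing each count by the total (20) is the renormalization: the counts 0, 3, 8, 9, 0 become the plausibilities 0, 0.15, 0.4, 0.45, 0.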
So these counts give you relative plausibilities, and when you return on Friday, I will take this and we will slide gracefully into probability theory as you normally think about it, and fit a more interesting model, and do Bayesian updating. Alright, thank you guys, and I'll see you on Friday.