All right, I think it's time that we start our third session. Yes, I'm Rasmus Bååth and I'm super happy that we have Richard McElreath with us today. I first encountered Richard McElreath's name when I found his book in one of my colleagues' bookcases, and I thought it was great. And now, a couple of years later, I just checked, your book is the second best-selling Bayesian statistics book on Amazon, second only to Gelman's Bayesian Data Analysis, which is amazing. So I'm super happy that you're here. Welcome, big applause. Thank you, Rasmus. I'm very happy to be here. This is my favorite kind of conference, because it's cozy. So in thinking about the kind of talk to give, I thought most of the talks would be a bit technical, so I decided to give a fairly non-technical talk aimed at kind of a confluence of my interests in Bayesian statistics. And so let me give you some background. I'm not really a statistician, at least not originally. I'm an anthropologist, and I come to statistics with a very definite topical interest. I study human evolution, and in particular the evolution of human behavior. And one of the methods we use in anthropology to study that is ethnography. And the ethnographic method, to the extent that it has one at all, is that you go places and you live with people so that you can get an inside view of their society and how it works. You want that inside view because it helps you develop a better outside, scientific understanding of variation among societies. And so this is me during my PhD work. I spent a couple of years in Tanzania getting an inside view of the culture there. When I came to study statistics, I applied the same ethnographic method to learning Bayesian statistics. I took the inside view, and I got that perspective from some of the famous writers in the field. My commitment to Bayesian statistics grew out of the strengths I found in that inside perspective. It makes particular things easy.
And it fits in particular with the kinds of difficulties of the data we collect in anthropology. And so what I want to do today is use this as a launching-off point to talk about what I think are systematic problems in teaching Bayesian statistics, because most people encounter another paradigm first. And then they are taught Bayesian statistics from the outside view rather than the inside view. And so I think we've got a lot of work to do as a Bayesian community in better developing an inside view that we can agree upon, because I don't think there is one, actually. So I'd like to start this conversation by proposing some elements of an inside view and illustrating them, in the sense that they provide pragmatic solutions to common data modeling problems, at least of the kind that I study. So first, a little bit about what the outside view is. The outside view is fine. I'm not going to say bad things about it, at least not today. It's just the outside view. So this is the honorable Sir Ronald Fisher, who is certainly not the only person associated with the outside view, but he defines likelihood in the way that most people use the term likelihood. And maybe you've all read this by now; it's been up here on the slide for a little while already. You don't need to read it all, just to say that likelihood is defined in a very odd way in statistics. It means a very special thing. It's a function. It's not a probability. It's a function of parameters. It's not actually conditioned on the data. There's a semicolon thing, because you can't marginalize over it. Weird stuff like that. And that's all fine within that paradigm. But then when people encounter Bayesian statistics, they are taught that we use this likelihood and then we add priors to it, and now you're Bayesian. And that is incorrect, of course. And I think there's mental friction that's created from that. So very quickly, the outside view has a bunch of elements.
And the outside view is actually a lot of different views. But some of the common elements that people come across before they learn Bayesian statistics include things like: the data have distributions, the parameters don't. There's a very important distinction between parameters and statistics, at least in the Fisherian view. The likelihood is not a probability distribution. I remember being screamed at for calling it a probability once in a math stats class. And there's this imaginary population that is a device for creating uncertainty in statistics. Now, this is the frequentist sampling theory view. And then, after you've learned all this and passed some exams on it, you learn that Bayes is all this stuff, plus we add some priors. This lets us do Bayesian updating. And these priors, well, they're very subjective, sort of a problem. I'm not going to spend any time arguing against this outside view. Although, if you judge from this art I have on the slide, you might get some idea of how I feel about it. So let's say that teaching Bayesian statistics from the outside view is like the British going to Egypt. They disrupt the society quite severely. It's a colonial view of the statistical paradigm. And it's a failure to take the inside view. And so it gives up some of the strengths of the perspective. The full strength of the Bayesian perspective is unleashed by taking an insider view of what goes on: not deriving it as sampling theory plus priors, but rather taking it on more fundamental terms. Of course, I'm not the first person to say this. Dennis Lindley put a complaint of this kind in probably every one of his papers. So here's probably the most succinct quotation from him. He says: what most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia. They look at the parody, see how absurd it is, and thus dismiss the coherent approach as well.
Lindley has some very colorful papers, by the way, if you've never looked through them. They're full of things like this. So the conceptual friction, in my experience teaching statistics, that arises from the outside view plus priors, rather than an inside view on Bayesian inference, includes things like students coming to believe that the data must look like the likelihood function, or at least that the residuals need to look like the likelihood function. On the outside view, maybe that's true. On the inside view, it's definitely not true. This concept of degrees of freedom is something people are taught in an introductory stats course, and then they encounter Bayesian models where you have, like, a thousand parameters and ten data points, and they say, you can't fit that, and I say, watch me. Now, it's not that you're going to get much updating from that, but you can definitely fit it. And a whole bunch of other concepts, like identifiability, are really non-Bayesian concepts. When we use those words to describe Bayesian models, we cause problems with understanding, among our students and ourselves. Sampling as the source of uncertainty: that's true in non-Bayesian approaches, definitely not true in the Bayesian approach. You can have uncertainty that isn't stochastic at all, purely epistemic. Defining random effects via the sampling design: I'll have a little bit more to say about this later in the talk. And often, although it's not a necessary feature of the outside view as I'm calling it, a neglect of data uncertainty. When there's measurement error, people wave their hands a bit and say, yeah, I worry about that, and then they fit a model that ignores it. And in the insider view, I'm going to try to convince you, there are obvious solutions to common problems like uncertainty in measurement. And all of you probably have your own conceptual confusions that you first encountered when you started learning Bayesian statistics.
So now I have to admit my book perpetuates this problem. I just started trying to do the second edition of it now, hacking away at bits of it. And I had to fully engage with my guilt, because I feel bad about many of the choices in the book, as all authors do. And foremost among those regrets is that it uses the outsider vocabulary. I use terms like likelihood and parameter and estimate in ways that really only have coherent definitions outside of Bayes. And people tell me they learn things from my book, so I guess it's not awful, but I think we can do better. I think I can do better on a second pass. And I want to start thinking about that in this talk. So one of the problems is that this generates friction because, using my colonialist metaphor, this is like explaining Indian politics using British political parties. Well, there are these things called castes, and there's the Hindu system, and all this other stuff that matters, and none of that exists in Hogwarts or whatever. So inevitably there are things that just cannot be explained in terms of the other framework. And this perpetuates lasting confusion, people thinking, for example, that tilde means sample. Who was I talking about this with recently? Yeah, with Rasmus. We might say that perhaps it's a historical necessity to use terms like likelihood, because people still encounter non-Bayesian statistics first, but I'm at least willing to try, with all of you, to do better. So let me try to outline another path in the remainder of my time today. The claim I want to entertain, and I'm not sure I'm convinced of it myself yet, is that Bayes is easier and more powerful when we understand it from the insider perspective. Now, the key problem with this claim, first of all, is that there are lots of insider views on Bayes, right? Bayesians argue amongst themselves as well. There's a classic 1971 paper by I. J. Good called "46,656 Varieties of Bayesians." Anyone else know this paper? It's a two-page paper.
He has eleven criteria, and he goes through all the combinations of them, and it's really nice; you can learn a lot about the epistemic possibilities from it. He had artificially made some facets discrete; my title would have been "On the Infinite Variety of Bayesians." So I'm going to pick a particular insider view that is useful to me and solves some particular problems, but I don't think it's unique in being an insider view on Bayesian statistics. So here's the insider perspective that I use most of the time. The key thing about the Bayesian approach that engages me as a scientist is that it's a joint generative model of all the variables. What do I mean by variables? I mean data and parameters, because they're the same kind of thing in Bayesian statistics. This perspective has two key unifying ideas, where things that are distinct and must be treated differently in the outsider view are indistinct and treated the same, much of the time, in the insider view. So variables: what we usually call data and parameters are, in the Bayesian view, fundamentally the same thing. They're just variables. Sometimes we get to observe them and sometimes we don't, but they have distributions, and the calculations are done the same way on them, and so on. Distributions likewise: there's no fundamental distinction between likelihoods and priors, as there is in the outside view. I want to say again that there's nothing necessarily wrong with Fisher's definition of likelihood; that's the outside view, and this is the inside view of it. So I want to give you some examples to try and back this up and give you an intuition about why I think breaking down these distinctions can be useful. And then hopefully we can have some conversations about exactly what terms we might want to use to refine this. So here's a kind of typical line from a mathematical statistical model.
And I've used some Nordic runes in honor of the workshop, instead of Roman or Greek characters, to obscure whatever convention you would normally use to decide whether something was data or parameter. You probably don't have stereotypes about whether a futhark rune is a parameter or data. So: something tilde Normal something something. And now I might ask you, is the symbol on the far left data, something that was measured? Is this a likelihood in that case? Or is it instead a parameter, which would imply this is a prior? And you can't tell, right? There's absolutely nothing about this statement which reveals which of those two cases it is. And that's because in a Bayesian model, it's the same kind of epistemic statement. They're fundamentally the same thing. And in a common data-generating model, a joint model of data and parameters, from one study to the next, B might be observed, or it might not be. When it's observed, we treat it as data, and we call this a likelihood. When it's not observed, we treat it as a parameter, and we call this a prior. But it's the same statement about the underlying science. Does that make some sense? So I want to show you three kinds of models today, which are simple toy examples, but they're real working statistical models, where this collapsing of definitions between data, parameters, likelihoods and priors reveals some of the unity of the Bayesian approach and why it behaves the way it does. I think the cases I'm using are not necessarily the most common kinds of statistical modeling problems people come across, say, in the experimental sciences. In the experimental sciences, you're lucky to have clean data. You can set up your factorial experiment and fill all your cells, recruit more students, make it work, grow more yeast, whatever it is you need to do. I'm an anthropologist, and in anthropology we go to war with the data we have, not the data we wish we had.
So we deal with lots of inconvenient sorts of models, and I'm going to show you those. You might think of these as corner cases in your fields, and that's fine. In these corner cases, the distinction between data and parameters is often very hard to make. This will include things like generalized linear mixed models, missing data models, and measurement error models. There are many, many kinds of strange machines, like occupancy models and joint species distribution models, that have these features as well. Okay, so let me introduce the toy example, and then I'll go through three varieties of it in which this collapsing of definitions can teach something useful, whether you're just learning Bayesian statistics or you've practiced it for a long time. So let's imagine a simple kind of observational experiment. There's a room in which there's a bird and a cat. The bird likes to sing, and when the cat is present, it scares the bird a bit, and it tends to sing less. When the cat is absent, or sleeping, say, the bird tends to sing more. There are four variables in this study that we're interested in, because we're estimating the rates, the effect of the cat, in psychology terms. What's the effect of the cat on the bird's singing? So there are four variables. There's the count of notes in some interval. There's the presence or absence of the cat. And then there are these two unobserved variables, which are rates, estimated from those things: the rate of singing when the cat is present, and the rate of singing when the cat is absent. I hope there are people here who like cats. That's why I chose cats, because people tend to like cats, right? Put cats on slides. So just to summarize, two of these variables are observed, and two of them are unobserved, in this simplest model. You would typically call the ones on the left data, and the ones on the right parameters.
As we move through the examples, I would like to make you question those distinctions a bit. But for the sake of it, let's start with the initial joint model of these four variables. Again, the thing about the Bayesian insider view, to me, is that the model is a joint probability distribution of all the variables, all of them. And so what does it mean? It means this thing. I should have just put a P there. We were talking at lunch about how, in statistics, it's frustrating that every function is called P. So you get P of notes, cat, rate conditional on cat, rate conditional on no cat. And I don't know about you, but I have a problem visualizing a four-dimensional probability distribution. I struggled to try and put one on a slide, and didn't come up with anything that looked great. So I'm just going to skip straight to saying: how would we define this in a conventional statistical framework? Here's a simple version that we can start working with, just to think about. The notes at time t are distributed as a Poisson variable with a rate lambda t. Lambda t just switches between two rates, alpha and beta: one for when the cat is absent, that's alpha, and one for when the cat is present, beta. So we've got our two observed variables, notes and cat, in here. You see how they're data, and they're affecting the rate. And now we've got our two unobserved things, and we need priors for them. We have to say what the distributions of these things are, unconditional on the data, before we see it. And I'm going to assert these for the moment, and justify them in a couple of slides: they are exponential with a mean of 10. Okay, with me so far? I'm assuming, and I apologize if it's not true, that everybody's reasonably familiar with this way of writing stats models. If not, you'll become familiar with it, and you'll learn to love it. It's like, what's the phrase, Stockholm syndrome, right? Okay, Lund syndrome. That seems appropriate.
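As a side note, the switching model just described is easy to simulate. Here's a minimal Python sketch; the values alpha = 10, beta = 2 and a 50% cat-presence rate are invented numbers for illustration, not anything from the talk:

```python
import math
import random

random.seed(1)

# Hypothetical simulation of the bird/cat model above.
# alpha: singing rate when the cat is absent; beta: rate when present.
# All parameter values are invented for illustration.
alpha, beta, kappa = 10.0, 2.0, 0.5

def rpois(lam):
    # Knuth's multiplicative Poisson sampler (fine for small rates)
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

data = []
for t in range(1000):
    cat = 1 if random.random() < kappa else 0
    lam = beta if cat else alpha        # lambda_t switches on the cat
    data.append((rpois(lam), cat))

# Empirical singing rates should sit near alpha and beta
rate_no_cat = sum(n for n, c in data if c == 0) / sum(1 for _, c in data if c == 0)
rate_cat = sum(n for n, c in data if c == 1) / sum(1 for _, c in data if c == 1)
print(round(rate_no_cat, 1), round(rate_cat, 1))
```

The empirical rates recovered at the end are just a sanity check that the generative story matches the model as written.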
So, how is prior formed? You might ask on the internet. Ah, that's someone who understood that joke. So, there are many ways that Bayesians go about forming priors. I tend to come from a school where we don't talk about beliefs, ever. In fact, it's almost a taboo in anthropology to talk about anybody's beliefs. But we talk about other things. So you can ask: what pre-data information do you have about the unobserved variables in this case? And so let me walk you through what I think of as the worst-case scenario for determining priors. This is leading up to a pivot, so bear with me for a second. So, what do we know about these parameters before we've got any data with which to inform them? Well, we know that they're positive real values. Why? Because they're rates, and rates are by definition positive real values. You with me? So that's got to be true. And we assert that all we're interested in is the average. We want the expected rate when the cat is absent and when the cat is present. So we're going to track one thing about them. If those are the two things we know prior to being able to get information about the rate, then there's this fun argument called maximum entropy, which gives you the most conservative distribution that embodies that information and no other information. And the solution to this maximum entropy problem is that you use an exponential distribution for the priors. You still have to pick that mean, so you need some additional information, but it leads inexorably to the exponential. You can use something else if you feel motivated to do so, but this is a maximally conservative approach that spreads probability as evenly as possible while being consistent with the things you've said. So that's all we're assuming: it's positive, and it has a mean. Those are the two things. Then it's exponential.
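If you'd rather check the maximum entropy claim numerically than take it on faith, here is a small Python sketch. It compares the differential entropy of an exponential with mean 10 against a gamma distribution with the same mean; the gamma is an arbitrary choice of competitor, not something from the talk:

```python
import math
import random

random.seed(2)

# Numerical check of the maximum-entropy claim: among positive
# distributions with mean 10, the exponential has the highest entropy.
N, mean = 200_000, 10.0

# Exponential(mean 10): differential entropy is 1 + ln(mean), analytically
h_exp = 1 + math.log(mean)

# Gamma(shape=2, scale=5), also mean 10: estimate its entropy by Monte Carlo
shape, scale = 2.0, 5.0
def gamma_logpdf(x):
    return ((shape - 1) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))

h_gamma = -sum(gamma_logpdf(random.gammavariate(shape, scale))
               for _ in range(N)) / N

print(round(h_exp, 3), round(h_gamma, 3))  # the exponential's entropy is larger
```

Any other positive distribution with mean 10 that you try will similarly come in below 1 + ln(10), about 3.30.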
So, the fun thing about this argument, whether you like it for priors or not, is that this argument, when applied to likelihoods, gives you GLMs. This is the quickest and most conservative route to specifying all the families of likelihoods that Fisher would have used, and did use, in his lifetime. The same argument. So what do I mean? If you know the metadata on the outcome variable, before you've seen the values, and you apply the maximum entropy argument, you end up with the familiar exponential-family likelihoods. This doesn't mean you have to do it this way, but this is why I'm showing it: in the Bayesian perspective, or at least in my Bayesian perspective, the way we derive likelihoods, or choose them, can be justified by exactly the same maximum entropy argument as picking priors. And what that gives you is very conservative, flat distributions, which blanket as much of the space of possibilities as they can, prior to the data. So in this case, like the priors, the likelihoods are pre-data distributions. Likelihoods don't tell you how the data have to look. They just give pre-data expectations about the blanket of possibilities the data will appear in. The data are free not to look like the likelihood, because this is a prior distribution. The residuals don't have to look like it, because you're going to update. Nobody thinks that the posterior has to look like the prior, but lots of people think that the residuals have to look like the likelihood, right? But it's not true. You will still estimate the mean, even if you use some other distribution. Now, p-values do depend upon the residuals having a particular shape, but the posterior being calibrated does not. So in this case, we think about what we know about the notes before we actually know the values. We know that they're zero or positive integers. It's a count variable. And we know that all we're going to keep track of is the expected value.
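That claim, that you still estimate the mean even when the data don't look like the likelihood, is easy to demonstrate. In this Python sketch the data are geometric, so badly overdispersed relative to a Poisson; the choice of a geometric with mean 10 is invented for the demonstration. The Poisson likelihood's maximum still lands on the sample mean:

```python
import math
import random

random.seed(3)

# "Data" that do NOT look Poisson: geometric on {0,1,2,...} with mean 10,
# so the variance (110) far exceeds the mean. An invented example.
p = 1 / 11.0
data = [int(math.log(random.random()) / math.log(1 - p)) for _ in range(5000)]

def poisson_loglik(lam):
    # Factorial terms are constant in lam, so they're dropped here
    return sum(n * math.log(lam) - lam for n in data)

# Maximize the Poisson log-likelihood over a coarse grid of rates
grid = [0.1 * k for k in range(1, 301)]
mle = max(grid, key=poisson_loglik)
print(round(mle, 1), round(sum(data) / len(data), 1))  # both near 10
```

The fitted rate matches the sample mean, as it must: the Poisson score equation only involves the mean, so the mean is estimated correctly even under the "wrong" likelihood.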
Again, maximum entropy leads to a unique solution, and that's the Poisson distribution, which is the maximum entropy distribution in this case. So again, if you have other information, you could put it in and end up with some other kind of likelihood, but this kind of argument gives you all the conventional likelihoods of non-Bayesian analysis as well. And these are maximally conservative distributions. So the point of that is that there's a unity to the interpretation and derivation of likelihoods and priors, even in the simplest kind of what we call regression model, the generalized linear mixed model. There's not even any mix here, just a generalized linear model. And that unity of interpretation and construction is incredibly useful for heading off misunderstandings, like thinking that the residuals have to look like the likelihood. Here's how you'd implement this model. Just to prep you: this is not a complicated model, but for all the models in this talk, I'm going to show you a slide like this. I don't intend to walk through the code, but I've put a gist of all the code examples in the talk up on GitHub, if you want to pull it up, go through it, and run it later. I'll fly through these slides a bit, just pointing out some key features of how you implement the models. The key reason to do this is that, whatever conceptual unity and harmony I may lead you to believe in from my other slides, I want to convince you from the implementation slides that there are real challenges, computational challenges, always, in getting this stuff to work. And I don't want you to walk away thinking, oh, they solved everything. We solved some things. The insider view doesn't make your code work. It might help you understand and build code, but there are some real challenges to making this happen. So on the left is the statistical model. There's your Stan code. I displayed the full Stan code partly to show you that Stan commits the same sin: data, parameters.
We could relabel those observed variables and unobserved variables; that would make me happy. And then we have a model where we define the distributions for the unobserved variables. We compute this lambda thing from the other variables, and then we define the distribution for notes in terms of it. In the tool that comes with my book, map2stan, that's what the model would look like. That's how you specify it. And map2stan basically guesses all the other things that Stan would need to build the Stan model. You with me? Yeah, okay. So, first example, building on that. Let's think about a generalized linear mixed model of birds, in this case. These are toy examples, but they're all chosen to teach one little bit about the unification of data and parameters, likelihoods and priors. So now we're going to imagine that every bird is a unique snowflake. There are a bunch of different birds in different rooms. Some birds are more fearful than others, and they react differently to cats than other birds do. And we've got some repeat structure in the data, so we're going to take advantage of it. This is a conventional hierarchical model of that, except I've made it as simple as possible by just using exponential distributions for all the random effects. So, very quickly: same model up top as before, except now we've got notes for bird i at time t, and the lambda for bird i at time t, which switches with cat for bird i at time t. And there's a unique alpha and beta for each bird i. Now we have to define distributions for the new unobserved variables, alpha i and beta i. And the means for these groups, these vectors of alphas and betas, are alpha bar and beta bar. So, new unobserved variables: alphas and betas for each bird now, analogous to the previous ones, and now unobserved means of the population of birds. A typical hierarchical model. Good times. And the same justifications on down.
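To see what those population-level alpha bar and beta bar terms buy you, here is a deliberately simplified Python sketch. It swaps the talk's exponential priors for a conjugate gamma so the per-bird update has a closed form; the gamma parameters and the counts are invented:

```python
# Simplified stand-in for the hierarchical model above: each bird's singing
# rate gets a shared Gamma population prior (shape=2, rate=0.2, mean 10).
# These values, and the counts below, are invented for illustration.
shape, rate = 2.0, 0.2

def posterior_mean(counts):
    # Gamma-Poisson conjugate update: the posterior mean combines the
    # bird's own data with the population prior mean (shape / rate = 10)
    return (shape + sum(counts)) / (rate + len(counts))

# A bird observed once with an extreme count is pulled towards 10...
pm_sparse = posterior_mean([30])
# ...while a bird with lots of consistent data keeps its own rate.
pm_rich = posterior_mean([30] * 50)
print(round(pm_sparse, 1), round(pm_rich, 1))  # prints: 26.7 29.9
```

One extreme observation is pulled noticeably towards the population mean; fifty consistent observations overwhelm the prior. That's the shrinkage behavior the full model exhibits, in miniature.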
So very quickly, before I pull the lesson about unification out of this one, I want to draw your attention to this great paper by Andrew Gelman. This is a paper that I think isn't read as often as it might be, because it has a really boring title. It's something like "Analysis of variance: why it is more important than ever." I don't know about you, but when I see the words analysis and variance anywhere near one another, I kind of run in the other direction. I had a really traumatic math stats class in graduate school that was nothing but sums of squares. Endless sums of squares. I blacked out and woke up a semester later, vowing never to do analysis of variance again. But this is a really good paper. And in particular, in the second half of it, there's this great list of all the definitions of random effects that you might come across in the literature. And it's just maddening. So, just very quickly, and I'm not going to read these verbatim: what's the distinction between fixed and random effects? Fixed effects are constant across individuals; random effects vary. For example, blah, blah. Effects are fixed if they are interesting in themselves, or random if there is interest in the underlying population (cite, cite). When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small or negligible part of the population, the corresponding variable is random. So these are all incompatible with one another, right? It goes on: if an effect is assumed to be a realized value of a random variable, it is called a random effect. I don't even understand that. What does that mean? Fixed effects are estimated using least squares, and random effects are estimated with shrinkage. So now this is an algorithmic definition. There are other possibilities, right? Anyway, what's my point? I sympathize with the student who is frustrated encountering random effects and wondering what they are.
Because from paper to paper, even within the same person, they can be defined in incompatible ways. So what I want to say is that, for me, the key thing about what we usually call random effects is just that they exhibit shrinkage. There's shrinkage. What does that mean? There's some mean of a group of parameters that share a family resemblance. They're the same kind of cluster of things. In this case, they are birds, and there's replication of the parameters across birds. There's a family of them. And we model the mean of that family, and that results in shrinkage of the different birds towards the mean. If there's a bird that has a really extreme observed singing rate, and there's not a lot of data, that estimate will be shrunk towards the population mean. And that'll give you a better estimate. This is a famous argument that I think is familiar to most of you. Non-Bayesian statisticians use the same shrinkage estimators (well, not the same estimator, but the same shrinkage phenomenon) all the time. This is not a Bayesian versus non-Bayesian thing. Shrinkage happens everywhere you've got a distribution that's a function of parameters. Every time. There's nothing about random effects, in a Bayesian model or a non-Bayesian model, that uniquely creates shrinkage. Ordinary likelihoods create shrinkage; it's just that in that case you call it regression to the mean. Right? There's this whole famous argument, actually from an anthropologist named Francis Galton, about regression to the mean, which produces the same phenomenon as shrinkage. So, two quick examples. Just to remind you, here's the empirical Bayes version of shrinkage estimators, the James-Stein estimators. This is from Efron's, I think, great paper on these estimators, using American baseball players, where the best estimates of their batting averages are shrunk towards a common mean because of sampling variation. You can also think about this in a time series.
You're trying to predict the player's performance in the next season, and you want to shrink extreme values towards the mean. The same phenomenon, of course, happens in Galton's famous attempt to predict children's heights from their parents' heights. There it's called regression to the mean. There's no random effect or hierarchical structure to the model, but you get shrinkage anyway; it just wasn't called that at the time. It's the same statistical phenomenon, and it arises from exactly the same mechanism, inside a Bayesian model or even inside a non-Bayesian model. It's because there are distributions, and those distributions are functions of parameters, and those parameters create gravity that attracts the family of things, whether they're residuals, in this case, or random effects in the hierarchical model case, towards a mean. It's the same fundamental phenomenon. So I have found that this helps a lot in explaining to students what random effects are about. It's just regression to the mean, and they already understand that. At least the ones I used to teach in California did. They were like, oh yeah, they understood regression to the mean. This is just regression to the mean, but now among parameters rather than among data points. So there's some conceptual delivery from the unity, I hope. Here's how you implement this model. I think it will be familiar to a lot of you to see how it goes. Again, we've got the naughty words, data and parameters, and we add in the varying effects, the vectors alpha and beta, and the bar terms. The model doesn't change very much. Right. Okay. Plowing forward. Now, let's get into a couple more examples that have the flavor of the kinds of data problems I work with in my own research. These are cases where we don't have full measurement control over the things we normally call data. Sometimes that's just because of the nature of the phenomenon.
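The batting-average story above is easy to reproduce in simulation. This Python sketch invents 5,000 players with a latent skill plus season noise; none of the numbers come from Efron's paper:

```python
import random

random.seed(4)

# Regression to the mean with no hierarchical model anywhere, in the spirit
# of the batting-average example. All numbers here are invented.
players = []
for _ in range(5000):
    skill = random.gauss(0.260, 0.015)       # latent ability
    s1 = skill + random.gauss(0, 0.030)      # noisy season-1 average
    s2 = skill + random.gauss(0, 0.030)      # noisy season-2 average
    players.append((s1, s2))

# Players who looked extreme in season 1 fall back towards the mean next year
hot = [(s1, s2) for s1, s2 in players if s1 > 0.320]
mean_s1 = sum(s1 for s1, _ in hot) / len(hot)
mean_s2 = sum(s2 for _, s2 in hot) / len(hot)
print(round(mean_s1, 3), round(mean_s2, 3))  # season 2 is closer to 0.260
```

Selecting on an extreme first season guarantees the second season regresses towards the population mean, which is the same mechanism shrinkage exploits.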
Sometimes there's irreducible uncertainty, and sometimes it's because, well, studies could have been better, but weren't, and we just want to get what is there. So before I get into that, let's revisit the previous model and add a couple of lines to it, to say that at the same time we model the bird's behavior, we're going to jointly model the cat's. Remember, the birds are singing. What are the cats doing? Well, the cats are entering and leaving the room. This is a great life for the cat. I told you it was a toy example. And the cat is entering and leaving the room, and we want to estimate the cat's behavior: how often is the cat around? And we can write down a distribution for that, too. So now the cat at time t is distributed as a Bernoulli variable with some probability kappa, and we give kappa a Beta(4, 4) prior, which has some regularization. The 4, 4 makes it so that the endpoints have probability zero; it's kind of shaped like that. Oh, I've got it on the slide. There it is. So far, this is just a bigger joint model. Now we're jointly modeling both animals. But if we observe all the variables, this is just simultaneously running two regressions. Nothing too special about that. You can do it. The real value of doing something like this comes when you don't always observe the cat. So let's start with the simplest example. Say, sometimes there are missing values for the cat. Blame it on the cat in this case: say the cat steps on the keyboard occasionally and you get NAs in your data set. Or it's your research assistant; blame whoever you like. In the sort of data I work on, this happens for a whole variety of reasons. Sometimes it's because one of the people collecting the data forgot to record a variable, and so you get a whole day where a variable is missing. It happens. And then you can't send them back to the field, because the field is Singapore, and it's fairly expensive to just send them back. And that's how it goes.
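Because the Beta(4, 4) prior on kappa is conjugate to the Bernoulli, the update for how often the cat is around has a closed form, which this Python sketch works through with an invented presence/absence record:

```python
# Beta-Bernoulli conjugate update for kappa, the probability the cat is in
# the room. The Beta(4, 4) prior is from the talk; the observation record
# below is invented for illustration.
a, b = 4, 4                                 # Beta(4, 4): regularizes away from 0 and 1
prior_mean = a / (a + b)

cat_obs = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1]    # invented presence/absence data
post_a = a + sum(cat_obs)                   # add the ones
post_b = b + len(cat_obs) - sum(cat_obs)    # add the zeros
post_mean = post_a / (post_a + post_b)

print(prior_mean, round(post_mean, 3))      # prints: 0.5 0.611
```

Notice the posterior mean (11/18) sits between the prior mean (0.5) and the raw frequency (0.7): the regularization is doing a little shrinkage of its own.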
So now the nice thing is this model we've defined, it's a joint distribution of all the variables, automatically lets us handle the missingness. But here's the thing: cat is now data or parameter, depending upon whether the value is missing at that spot. It is both things. And the distribution, the Bernoulli distribution on cat sub t, is both a likelihood and a prior in the same model. And it solves the problem for us. Now it doesn't mean it actually tells us exactly whether the cat was there or not. That depends upon the data. But it gives us a statistical solution to the missing data problem. So here's how you define this in Stan to deal with the missingness. I'm not going to walk through this code in detail, just to say it's in the gist and you can go through it. In Stan, I marginalize over the discrete missing variables. These are like the interaction indicator variables in the morning keynote. You've got to ask if the cat value is minus one. That's just the internal code for missing. Then we do a mixture over the two possibilities. Otherwise, we observed it, and then we just do the two regression distributions. And also, in map2stan, if you use the experimental branch that's up on GitHub, it will take this model definition and build that Stan code for you. It recognizes the binary missing variable and builds the mixture model from it. But you've got to use the experimental branch, and I make no promises that there are not bugs in the experimental branch. That's why it's called experimental, but I use it every day. So that's all I can tell you. You can get the posterior probabilities that the cat is present or not, in the cases where the cat data is missing, by using this generated quantities trick in Stan as well. But stepping past the computational challenge, you can get, depending upon the data, sometimes reliable information about whether the cat is present or not. 
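[What that mixture is doing can be written out in a few lines of plain Python. This is a sketch of the marginalization, not the Stan code from the gist: average the likelihood over both possible cat states, weighted by kappa, then recover the posterior probability that the cat was present. The Poisson likelihood for the bird's notes and all parameter values are invented for illustration.]

```python
import math

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def cat_present_prob(notes, kappa, lam_cat, lam_no_cat):
    """Posterior P(cat = 1 | bird note count) for one missing time point."""
    lp1 = math.log(kappa) + poisson_logpmf(notes, lam_cat)         # cat branch
    lp0 = math.log(1 - kappa) + poisson_logpmf(notes, lam_no_cat)  # no-cat branch
    # Stan's log_mix is exactly this two-term log-sum-exp
    m = max(lp0, lp1)
    log_marginal = m + math.log(math.exp(lp0 - m) + math.exp(lp1 - m))
    return math.exp(lp1 - log_marginal)

# Cat suppresses singing, so silence leans toward the cat being present,
# and lots of singing leans toward the cat being gone.
print(cat_present_prob(notes=0, kappa=0.4, lam_cat=1.2, lam_no_cat=2.7))
print(cat_present_prob(notes=6, kappa=0.4, lam_cat=1.2, lam_no_cat=2.7))
```

[The `log_marginal` term is what gets added to the model's log probability; the last line is the generated-quantities step that recovers the imputed cat.]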
So now I've got imputed cat values 1 and 4, the cases where, in my toy example, the data on the cat was missing. We don't know if the cat was actually there. And in one of those cases, the cat is probably present, but we're not sure. The amount of singing by the bird leans in the direction of the cat being present, but it's not a slam dunk. And in the other case, the cat is almost certainly absent. Still with me? Yeah? There will be a summary at the end. So we're getting there. So, final example. Now let's think about an example that has elements of all the things we've done so far. Same joint model, but now let's consider the case where the cat has not stepped on the keyboard. So there are no actual NAs in the data set, no missing values. But what is true of every cat observation is that you can't necessarily trust it, because cats are good at hiding. Sometimes the cat was in the room, but the person who had the data sheet couldn't find the cat. The cat was waiting to jump out at the bird or something like that. So let's imagine, again, toy example, let's imagine the bird always knows the cat is there. Because birds are smarter than people. But the person doing the data logging doesn't always know if the cat is really there or not. So you can trust a one. When the cat variable says one, the cat was observed. There's no phantom cat that was observed. But when the cat is zero, you can't believe it. Now the zeros aren't data. They're data of a kind, but it's not the variable you're interested in. There's a latent variable that you actually want to observe, but can't, and that's the true state of the cat. So you can make Schrödinger's cat jokes now. Is the cat in the room or not? And so we're going to do the statistical version of Schrödinger's cat. And so this is the statistical version of that argument. Now let cat sub t be the true state of the cat. It's in the box. We can't see the true state of the cat. 
What we get to see is cat observed at time t, and it has a Bernoulli distribution where the probability that the cat is observed is the product of the true state of the cat, whether the cat's there or not, which is a zero-one indicator, times the detection probability. So when the cat's absent, it's always zero, and you never see the cat. When the cat's present, you only see the cat delta of the time. Yeah, you with me? Those of you who have cats, does this resonate with you? You don't always know. And then the rest of the model is the same as before, except we have a prior, or I should say a distribution on an unobserved variable, for delta at the bottom. In ecology, we call these occupancy models. They do a lot of heavy lifting in field ecology, and they've become really important in endangered species studies as well. So the implementation of this, before I get to the key thing about it, is even more complicated. I'm just showing you the model block here. There's lots of commentary if you want to read this later to understand how it works, but like the previous one, this is a mixture. There are multiple likelihoods, for each of the possible missingness states in it, and you have to average over them inside Stan. But it works great, and you can get estimates for both the frequency with which cats are present and the detection probability of cats out of things like this. This is useful stuff. So at my institute, we use... I mean, these are toy examples that I've given you today, but at my institute, we use models exactly like these, almost exactly like these, in real research, not on cats and birds, but on chimpanzees. And there's this big project called the Pan African Programme based at my institute, which has almost 1,000 camera traps across equatorial Africa, taking photographs of anything that walks in front of them, actually videos of anything that walks in front of them. And so there are hundreds of thousands of videos. 
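[The core of that occupancy-model mixture can be sketched in a few lines, again in Python rather than the Stan model block on the slide. psi is the probability the cat is truly present and delta the detection probability; both values below are made up for illustration.]

```python
# Marginal likelihood of one observation, summing out the true cat state.
def obs_likelihood(obs, psi, delta):
    if obs == 1:
        return psi * delta                   # present AND detected
    return (1 - psi) + psi * (1 - delta)     # absent, OR present but hiding

# Posterior probability the cat is there even though nobody saw it.
def present_given_zero(psi, delta):
    return psi * (1 - delta) / obs_likelihood(0, psi, delta)

# With a sneaky cat (low detection), a recorded zero is weak evidence of absence.
print(present_given_zero(psi=0.5, delta=0.3))
```

[A one is fully trusted, as in the talk; a zero is split between genuine absence and a hidden cat, which is exactly why ignoring delta undercounts.]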
Most of them are not apes, but then there are thousands of videos of apes doing things, and we're interested in the distributions of behaviors among the apes, but we simultaneously need to estimate the population densities of the apes. And both the behaviors and the population counts are subject to the same observation uncertainty as Schrödinger's cat. So we use these models in modeling this camera trap data as well. And you need to do it, because ignoring the detection probability is a disaster. You get the wrong answer. You undercount things. That's the problem with it, right? Okay, so let me try to summarize here. So the general argument is: there is virtue in taking the insider view on Bayes, in unifying concepts that are split in the outsider view of Bayes, and these are the distinction between data and parameters and the distinction between likelihoods and priors. Of course, there are times when it is useful to distinguish these things, absolutely, but there's also a lot of conceptual value in seeing them as fundamentally the same thing inside of a Bayesian model. So in the first example I gave you today, the point I wanted to get across was that both likelihoods and priors are distributional assumptions on variables, observed or unobserved, respectively. And these distributions can be derived from the same informational perspective. There's an information state. What's the metadata on the variable before we've seen its value? And of course inside the computer, when you run calculations, you treat them the same way. In the outsider view, a likelihood is not a probability distribution, but of course it's calculated exactly as if it were one. And we write it in the mathematical model as if it were one. And that's because in Bayes it is a distribution, but over data, not over the parameters. The second point: both likelihoods and priors induce the same inferential force, that is, they cause shrinkage. 
And when it's a likelihood, in the outsider view, you call it regression to the mean, and when it's random effects, you call it shrinkage, but it's the same basic phenomenon: the distributional assumption induces skepticism in inference about extreme values. And it takes more evidence to overcome that skepticism, and that's what causes the shrinkage. And it's good, it gives you better estimates. Regression to the mean is a good thing statistically. It improves your predictions, just as shrinkage on random effects improves your predictions. The third example, what I wanted to get across is that distributions do double duty in models. They can be, inside the same model, simultaneously both a likelihood and a prior. And the same variable can be both a parameter and an observed variable inside the same model, depending upon the details. You start with a joint model, a generative model of the system, and then things happen. And the Bayesian framework takes care of the conceptual difficulties of trying to sort out data from parameters and such. That said, it doesn't take care of the computational challenges, which are substantial at times. Absolutely substantial. So again, and the fourth point I think I already said: even inside the same analysis, the same symbol can be either data or a parameter, and the same distribution can be either a likelihood or a prior. I want to say before I move past this slide that there are of course cases where it's very important to distinguish data from parameters, or rather observed from unobserved variables. Absolutely. And if you write your own Markov chain, you know of course you have to make proposals for one of these things and not for the others. So it's a very important difference, with all the bookkeeping that goes with that. But all I'm arguing is that in teaching this, and in understanding model construction and interpretation, the unifying perspective is very important. 
So this is the kind of slide that I put up at the ends of talks to serve as a summary. It's got way more text than I would normally put on a slide, so my apologies. But I've learned over time that people like summary slides that have a lot of stuff on them. So let me try to summarize very quickly the benefits of the insider view, and I'll read this quickly, but you'll have access to these slides later if you want to study them with a glass of wine sometime. We're going to take the insider view. So, the insider view is not necessary. The philosophy in general is not necessary, but it is useful. What I find useful about it is that it helps me to think scientifically, not statistically. It makes me think about the joint model of the system and how the data is produced, and I can engage with that, and I get a model that will work for all kinds of combinations of missingness and uncertainty. I can build off of that and make a model that isn't statistically ad hoc. You can get data sets having already built a model, and you can see what you can infer given the data you have at hand. Many solutions to common science problems arise directly from this approach. What I call measurement problems are true of all the projects I work on. They're measurement error problems, because it's often field data collection. And it's very important for me, and for all of us, of course, to propagate uncertainty in the analysis, to not shed the noise around an estimate as we move through a project, and the Bayesian approach makes that a lot easier to do. But, yeah, the computational challenges are very real. Sometimes it's difficult or impossible to fit the model we'd like to fit, and we have to make compromises. But it's good to get your philosophy organized first and make wise choices about those things. A unified approach to the construction of both likelihoods and priors, as I said before. What I like about all of this is that I personally find it demystifying and deflationary. 
Statistics is over-hyped, right? Not least by statisticians, but mostly by people outside of statistics. And the deflationary view says, look, this is a garbage-in-garbage-out project. You define the joint model of all the variables. All Bayesian inference can do is tell you what the data say about that joint model. And that's all. And then there are no guarantees. Guarantees will not be offered. And I like that. I really like that it's a humble perspective that is nevertheless extremely powerful at the same time. Okay, so this is the final slide, which I mean to be a conversation starter. Say I want to make a modest proposal, with all the literary implications of that title. There are a bunch of conventional terms that we use in teaching Bayesian statistics, and they appear even in my own book. And I suspect that it's a mistake to continue using them, at least without qualification. And so it would be nice for us, as a community, those of us who are interested in Bayesian statistics, applied or theoretical, to think about alternatives, or families of alternatives, to better teach this material, because the population of people who want to use Bayesian methods is growing very fast. This is the time to get in front of this problem and think about developing new teaching materials that can make this better. So, very quickly, and then I'll end my talk: data is just an observed variable. A parameter is just an unobserved variable. A likelihood is just a distribution. I'm least sure about "a prior is just a distribution." I'm not happy with that necessarily, because sometimes you do want to make a distinction, but I'm not happy with the term distribution even, because people have mythical ideas about what that means. I think people think distributions are sampled from. That's what makes me nervous. State of information would be an alternative, but no one knows what information is either. And that includes me. So, posterior: can we call this a conditional distribution? 
If we're going to get rid of prior, then there's nothing to be posterior to. But this doesn't roll off the tongue, so we need some solution here, and conditional state of information would be even worse. Then we have terms like estimate and random, which I would like to vote for banishing, voting them off the island. These are terms that seem to do nothing useful for us, except cause problems. We don't have estimates in Bayesian inference. We have posterior distributions. Then there are things we can do with those posterior distributions, which lead to behavior changes given some decision model, but we don't have estimators in the traditional sense. The word random just causes a lot of problems. Anyway, thank you for your indulgence, and I hope that was useful.