This week you're going to learn a new way to sample from posterior distributions that will let us go into the more complex model forms in the second half of the course. And if I fulfill my ambition, we'll start maximum entropy this week too, which will be a mainly conceptual introduction, and which will be the foundation for the generalized linear models next week. So at the start of the course I already pitched this historical retrospective view that contemporary scientific inference is bizarre in the long view, in that we bet on chance: we study randomness as a way to learn about the deterministic nature of the world. And this seems unremarkable to us because we live in an age where we study stochastic processes and we have all of these formal apparatuses for doing so. But you don't have to go very far back in Western intellectual history for this to look like sheer madness. So as I mentioned before, the Romans and the Greeks both had a goddess who was the personification of chance, Fortuna. And there are various philosophers, some of whom I quote here, trashing Fortuna just endlessly through history: you don't take chances. This is the Western rationalist tradition. And St. Augustine in particular gets quite nasty about this: "How therefore is she good, who without discernment comes to both the good and to the bad? It profits one nothing to worship her, if she is truly fortune; let the bad worship her," this supposed deity. And Fortuna, as the Romans personified her, turned the wheel of fortune, actually the wheel of fortune, the namesake of the game show, right up there. And there's that man who's been riding up the wheel, because Fortuna is trolling him. She's letting him ride up the wheel and he's feeling great, and you know what's happening next, right? Now of course what we think of as stochastic processes in the modern day is not really the same as this personification that the Greeks and Romans had.
But that's the point. We think of randomness... well, I'm not sure exactly what ordinary people think, but analysts, professional statisticians, use it as a description of incomplete information. And the mathematical procedures we use to process that have been very successful in lots of areas. So now we have these great historical artifacts, and I believe I've shown you this before as well: books of random numbers. Now you can just do this with your computer and you don't need them, but in the early 20th century publishers started producing these things. They're just books of random numbers. You can imagine future archaeologists discovering these books and thinking, wow, these people were crazy; this is just a bunch of numbers. A whole dissertation could be written about just this one page, decoding it, trying to figure out what the message was; surely there's some information we have to get out of it. And of course these were designed for agricultural experiments, for doing randomized plot experiments. So what we're going to do this week is get a conceptual introduction to Markov chain Monte Carlo, which is another one of these workhorse procedures of Fortuna, a way for us to use chance to learn about a deterministic thing: the shape of a posterior distribution, which, remember, is a logical implication of your assumptions. Posterior distributions are not random. They are distributions, but they're a determined thing, determined entirely by the assumptions, the model and the data. But we can use chance, it turns out, to learn the shape of these things. And that's what Markov chain Monte Carlo is for. It's a way to draw samples from an unknown distribution that we have, in principle, defined all the information needed to compute, but we're just not smart enough to calculate any other way. Markov chain Monte Carlo gives us a really indirect and seemingly insane way to get samples from that distribution.
So my goal is just to help you understand it, not to teach you the mathematics of it. In other courses you can learn about the mathematics of Markov chain Monte Carlo. And there was a time when you really had to do that, because you had to write these things yourself. They're not that hard to write; I'm going to write a really simple one for you today, the most basic kind. But these days we all use packages, which do a way better job of defining the sampling problem than you'd want to do on your own. So I'm going to introduce you instead to using one of these packages, the most powerful and convenient desktop package, in my opinion, called Stan. And the Rethinking library gives you a convenient interface to it. You can use the same kinds of model definitions you've been using so far to define Markov chains, and then as we go forward you can define new model types in it as well. I'm going to give you some heuristic advice, as usual, on diagnosing when things go wrong with this new kind of machine, because it's got its own foibles. You've become familiar by now, I guess, with MAP estimation and the things that go wrong with it, like a non-finite value appearing somewhere during optimization. MAP estimation is a hill-climbing procedure; it has its own issues and foibles, and you get accustomed to those. Markov chain Monte Carlo has another set. Markov chain Monte Carlo is not hill climbing. It's trying to sample from the whole distribution. It's not trying to find any particular point; instead, it gives you the whole distribution. And that will help you understand some of its foibles in ways that let you deal with them. And again, my objective is that on Thursday I'll finish this Markov chain content in comfortable enough time that I can start the conceptual introduction to maximum entropy, which we're going to need to understand generalized linear models starting next week. Okay. So I had flashed this guy up before.
I want to tell you about Markov chain Monte Carlo through a parable, a thought experiment. So forget about computers for a second and imagine that you're the autocrat of an island kingdom. Your name is Markov. You're a hereditary ruler and your rule is not contested, but unless you want to lose your head, there are certain contractual obligations to your people. In particular, you rule over the Metropolis Archipelago. And this archipelago has a number of islands, only 1, 2, 3, 4, 5, 6, 7 of them pictured here. And your obligation is that you should visit your people in proportion to the population size of each island. That is, the people love you and they would like to see you, and they would all like to see you equally much. And so the contract you have with your people is that if there's an island that has twice as many people as the next island, you'll spend twice as much time on the island with twice as many people. That's your contractual obligation: you visit these islands in proportion to their population sizes. You need a royal tour schedule to move among these islands, is the idea. Now, here's the thing. King Markov, however, is illiterate and innumerate. He's a king, so he's not much interested in book learning. And he doesn't like calendars either; he likes to live as a free spirit. So he asks his advisors: come up with a way for me to honor this obligation, this contract with my people, without keeping a calendar. So they go off for a while and they figure something out, because they're clever. And here's what they come up with. At any particular point, say every week, King Markov is on some particular island. At the end of the week he flips a coin to choose an island on the left or the right of himself in this archipelago. They're ordered, so he can move along the archipelago this way. In the book I made the archipelago circular, so he can loop around, which helps with the definition.
And so half of the time the coin will indicate the island to the left, and half the time it will indicate the island to the right. Let's call this island the proposal island; it's the island he's proposing moving to. And step two, he finds out the population of that proposal island. He can either send his minions across in a small boat to take a quick census, or someone might just tell him the name of the island nearby and he remembers its population, whichever. He gets some quick assessment of the population of the neighboring island, the proposal island. Let's call that P5, P for population of the fifth island. Then, step three, he needs the population of the current island. He's on island four, so he gets this number P4. So now he's got the populations of each of these two islands, the one he's considering moving to, the proposal island, and the island he's currently on. So what does he do? And by this point you're wondering why we're out here in the weeds, right? Hang with me, because this is how we sample from posterior distributions. This is what it does. Step four: he moves to the proposal island with probability P5 over P4. And you're like, why? Well, because it works. That's why. No, really, people derive this stuff, and it actually works. In the book, I describe a way you can do this with seashells and pebbles, by putting them in a bag; you can actually sample with this probability that way. But, you know, you can use a spinner or any number of things that a nonliterate king can use. And so you might move. Say you do move: you move to island five, and then you spend a week, you know, kissing babies and shaking hands and receiving gifts from your minions. And then time begins again, and at the end of the week we go back to step one and make a new proposal. Sometimes he'll stay still.
So if he finds himself on an island with a particularly high population, then chances are that this ratio, P5 to P4, will be less than one, right? Because the island he's on, the current island, is in the denominator, and it'll be a bigger number. So he'll stay on big islands. And on small islands, he'll kind of just skip right across them: stay a week, say bye, I'll kiss all your babies, and move on to the next island. And it's that particular property that makes this work. And this does work. It seems like madness, but this is a working instantiation of an algorithm, one of the most famous Markov chain Monte Carlo algorithms, called the Metropolis algorithm. It's named for a person, who I'll introduce you to in a few slides. And here is a complete R script for running King Markov's wild tour that you saw in the previous slides. It's just this. I'm not going to walk through this code with you in any detail, but I do want to quickly go through and give you an idea, functionally, of what it's doing. And there are comments in here to help you understand it. This is intermediate-level scripting; there's actually some programming going on finally in this course, right, as opposed to just poking my package with commands. And... these metaphors sometimes go really wrong. I'm sorry, I'm doing this on the fly; I didn't know I'd said that out loud. Okay. So num.weeks is just how many weeks we're going to run the simulation for. And then we make a vector of positions, with length equal to the number of weeks, and just put zeros in there to initialize the whole thing. We're going to record the history in that vector. It's very important, though, and I wanted to say this.
In R, you always preallocate your vectors when you do stuff like this, because if you increase the length of a vector in R, it actually has to make a new, longer vector and then copy your old vector into part of it. And that's really slow. Anything that involves allocating a new variable in R is super slow. How slow? Two orders of magnitude slower than just starting with the vector as long as you need it to be. If you don't know how long it will need to be, make it twice as long as you think you'll ever need; that's still going to be faster. Your computers have gigabytes upon gigabytes of room for cat photos, right? So you can double the size of your vectors and be fine. current is where King Markov is right now; we just put him on island 10 to start. And in this simulation, I'm going to say that an island's index is also its relative population size. That's just convenient; it doesn't have to be the case. And then there's just a loop. We loop over the weeks in a for loop. First, we record where the king is right now; we put that in the i-th position of positions. We flip his coin and get a proposal island, which is the current island plus or minus one; that's what we're sampling, a minus one or a plus one. And then there's just some bookkeeping so that we loop around: there's no island 11; island 11 is island one. And then we just do an if-else to see if he moves. We construct the ratio, proposal over current, and we generate a random number. If the probability of moving is greater than the random number, he moves; otherwise, he stays still. And we just keep doing this. And this works. Let me show you what the simulations look like. Here's the chain, starting with week one on the far left, all the way up to week 1,000 over there. The island indices are on the vertical axis, and each blue dot is the position of the king at a particular point. This looks pretty random. It is. He's drifting around.
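The R script itself is on the slide rather than in this transcript, so here is a sketch of the same simulation in Python, following the description above. The setup is the one from the lecture: ten islands arranged in a circle, with an island's index doubling as its relative population size; variable names like num_weeks, positions, and current mirror the R names mentioned.

```python
# A Python sketch of the King Markov simulation described above.
# Ten circular islands; island index doubles as relative population size.
import random

random.seed(1)

num_weeks = 100_000          # how many weeks to run the simulation for
positions = [0] * num_weeks  # preallocate the full history, as advised
current = 10                 # start the king on island 10

for i in range(num_weeks):
    positions[i] = current                       # record where the king is now
    proposal = current + random.choice([-1, 1])  # coin flip: left or right
    # bookkeeping so we loop around: there is no island 11 or island 0
    if proposal > 10:
        proposal = 1
    if proposal < 1:
        proposal = 10
    # move with probability equal to the population ratio (capped at 1)
    prob_move = proposal / current
    if random.random() < prob_move:
        current = proposal
```

After the loop, counting how often each island appears in positions gives the histogram discussed below; island 10 should be visited roughly 10/55 of the time, island 1 roughly 1/55.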
But it turns out he's drifting around in a precise way, such that in the long run he spends exactly the right proportion of time on each island, in proportion to its population size. Now, the long run is really long. In fact, he will probably die before he gets a good sample with this; we'll talk about that later. The Metropolis algorithm is honest, and it works, but it takes a very long time. His heir will have to keep fulfilling this contract and pick up the schedule where it left off. But in the very long run, it'll work fine. From the perspective of any particular island, especially the small ones, it may be a long time before he comes back, before he drifts back in. It's easier to see this if we connect the points with lines. Now you can see the path a little bit better, and there's some looping around that goes on. It looks quite random. But if we take this collection of points and collapse them together like a histogram, just take every position in each week and treat them as a bunch of data from some distribution: there's a distribution of positions, where a position is an island, and we want to know the distribution of those positions. So we collapse them all together and make a histogram, and I'll show you that at different time steps in this sequence. On the left, after only 100 weeks, the histogram is still pretty jagged. You can see he is spending more time on island 10 than on any other island, but it hasn't quite settled down for the others. After 500 weeks, it's starting to look pretty good. After 2,000 weeks, it's a little better. And I told you it was really a long run. And then after 10,000 weeks, it's basically exact, just a little bit of error. And this is the guarantee. King Markov honors his contract to his people by using the Metropolis algorithm. In the long run, this is what you want Markov chains to do. You're drawing samples from some distribution.
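For this little chain, the long-run guarantee can actually be checked exactly, without simulating at all. The sketch below (Python, assuming the lecture's setup: ten circular islands with population equal to island index) writes down the one-week transition matrix implied by the coin flip plus the accept/reject rule, and confirms that the population distribution, normalized to sum to one, is left unchanged by a week of the king's movement, i.e. it is the stationary distribution.

```python
# Exact check of the long-run guarantee for the ten-island Metropolis chain.
n = 10
pop = [i + 1 for i in range(n)]   # population of island i+1 (circular layout)

# T[i][j] = probability of being on island j next week, given island i now
T = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in (i - 1) % n, (i + 1) % n:     # the two possible proposals
        accept = min(1.0, pop[j] / pop[i])
        T[i][j] += 0.5 * accept            # propose with prob 1/2, then accept
        T[i][i] += 0.5 * (1 - accept)      # rejected proposals stay put

# stationarity: pi T = pi, where pi is population normalized to sum to 1
total = sum(pop)
pi = [p / total for p in pop]
pi_next = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
```

pi_next comes out equal to pi (up to floating-point error), which is the precise sense in which King Markov's schedule honors the contract in the long run.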
And the guarantee, if you've defined it correctly, is that in the long run the individual values will be drawn in proportion to their probability in what's called the target distribution, the distribution you'd like to visualize but could not compute before. For us, this will be the posterior distribution of a Bayesian analysis. But it could be other things, too. Markov chain Monte Carlo is a general procedure for sampling from unanalyzable target distributions of some kind, and it's a great way to do that. It doesn't matter where the king starts, in fact. And this simplest algorithm works as long as the proposals are symmetric, meaning there's an equal chance of proposing up the island chain and down. There are more general methods, which I'll give you a conceptual introduction to in a moment, that don't require symmetry. In fact, not requiring symmetry is one of the ways to make these things more efficient. So this is the Metropolis algorithm. Just quickly, a little bit of history, because I think the intellectual history of these things is pretty interesting. It's named after the lead scientist of a team that worked on a bunch of stuff, including fusion bombs, and published this famous paper with a horrible title, "Equation of State Calculations by Fast Computing Machines," which used the Metropolis algorithm to simulate, I think it was neutron behavior in fusion reactions or something like that; it was an atomic physics problem. And these were the Manhattan Project people, Edward Teller and those folks working on blowing things up, right? We're not going to blow up anything with our Markov chains, but this is one of the first applications of it. And it's interesting to look at the paper's author list. It's Metropolis, that's Nick Metropolis, who led the group, then Rosenbluth, Rosenbluth, Teller, and Teller: that's Marshall Rosenbluth, who was a famous atomic physicist, and Edward Teller, often called the father of the hydrogen bomb.
And the other authors are their spouses, who were computer programmers, because in this period computer programming was almost entirely a female profession. The guys did the math here, and the women actually ran the machines and did the programming. So there's a cool history here, where I think it was Augusta Teller who took apart a bicycle, made magnetic tape, wound it around the wheel of this deconstructed bicycle, and fed it into the machine; she had etched the program code with a magnet onto the magnetic tape she had wrapped around the bicycle wheel. All computers were one-offs at this time, right? They had their own operating systems and machine languages, and I think it was Augusta who designed that whole system and had to feed this magnetic tape wrapped around a bicycle wheel into the machine. It's an incredible history; there's a whole book about it. Fascinating stuff on the cutting edge of, you know, blowing up islands, but really amazing achievements. Here's the computer they used, pictured in the background there. This was all vacuum tubes and things like that. It was called MANIAC. That's Nick Metropolis, the vaguely handsome fellow in the vest in front. Very Mad Men, isn't it? You expect him to be holding an old-fashioned; he's got the cigarette, that's one part of it, because everybody smoked constantly around computer hardware. I don't know who the fellow in the back is, unfortunately. I should find out. But that's MANIAC in the back, part of its racks and everything. Those are the machines they were feeding magnetic tape into to reprogram. It had no keyboard, right? You had to etch the program as bits on magnetic tape and then feed it in. MANIAC is a horrible acronym, but it was made as a joke by Nick Metropolis.
He later wrote that he was already sick of acronyms back then, of how everything had to have an acronym that made some cute word, and he hoped this one would kill the fashion. So they were already sick of acronyms then, and we are drowning in them now; now we have self-referential acronyms that loop back on themselves and all kinds of stuff. So the joke didn't work, but anyway, it stood for Mathematical Analyzer, Numerical Integrator, and Computer, kind of contrived. And it did simulate nuclear explosions, so there's some mania involved. This was a thousand-pound monster with five kilobytes of memory, which is such a small amount. Most of you probably have memory sticks in your bags, which are, what, eight gigabytes or more? Those are the cheap ones, eight gigabytes; you get those in cereal boxes, I think. And it could do 70,000 multiplications per second, which sounds pretty fast, but is actually really slow. A bottom-of-the-line laptop, a Chromebook or something for 200 bucks, will weigh a few pounds, have two to eight gigabytes of memory, and do billions of multiplications a second, without any optimization. Just an incredible difference. And now we use it to look at pictures of cats. But anyway, the point I want to make is that we can now do much fancier stuff than they could, given the struggles they went through just to get the basic control software to work. This was a huge breakthrough, because it provided a way to simulate values from a distribution that could not be calculated analytically. And there are lots of good problems that can be defined completely in terms of their assumptions, but where we can't close the integrals. Usually nobody can. You can still get famous by solving integrals, right? That's how hard integrals are. And it's not like taking derivatives is hard; there are rules. But the reverse process, the inverse process, can be really hard. So this is a way to do integral calculus.
I told you before, we use samples in this class as a way of tricking you into doing integral calculus. This is the same thing: your computer is going to be doing, in a very indirect fashion, lots of cool integral calculus. So a little bit about the definitions. Metropolis, named after Nick Metropolis, is a simple version of Markov chain Monte Carlo. The "chain" in Markov chain Monte Carlo refers to the sequence of draws. It's the path King Markov took among the islands; that's what we call a chain. It's a chain because what he does at any particular point depends upon where he is right now. But notice it does not depend upon any earlier positions; the algorithm only depends upon his current position. None of the history matters. And that's the "Markov" part, named after this fellow, Andrei Andreyevich Markov. And you have to be careful: his son was also a mathematician and had the same name, and they're both relatively famous. But this is the first one, the one who studied stochastic processes. And lots of things in stochastic processes have the word Markov attached to them after this fellow. So Markov means we're studying a stochastic process where history is irrelevant; all that matters is the current state of the system. And that's true in the Metropolis algorithm, because it doesn't matter if you visited island 3 two hundred rounds ago. All that matters is what island you're on right now, and that determines the probability you move to either of the neighboring islands or stay in place. Makes sense? That's the Markov part. History doesn't matter. Yeah? [A student asks how the move probability works.] Right: the probability he moves is the ratio of those two numbers. And then he uses something like a bag of pebbles or a spinner to sample with that probability, exactly. Did that answer your question? Yeah, it's a good question. I'm more precise in talking through the algorithm in the book.
So I'll lean on the notes a bit here. And then the "Monte Carlo" part of this refers to random simulation, named after the famous casino at Monte Carlo. Has anyone here been there? It's pretty wild, isn't it? Pretty strange; it's like Disneyland for absurdly rich people. Anyway, I think it was von Neumann, although people tell stories about this that you can never confirm, who proposed calling stochastic simulation "Monte Carlo" as a joke playing on the gambling. So what makes a simulation a Monte Carlo simulation? Anything that uses random numbers; any stochastic simulation is a Monte Carlo simulation. But if you use history, then it wouldn't be Markov, right? If more than the current state matters for determining the stochastic time-course evolution of the system, then it's not a Markov chain. That said, lots of things can be made into Markov processes if you're clever, because you can define a huge number of states, and some of those states can encode history. So there are tricks for doing things where you'd think history matters, but you can still make a Markov chain. [A student asks: going back to the island analogy, if there were two or more non-adjacent islands with populations approaching zero, wouldn't there be basically isolated islands that King Markov would never visit?] Yeah. If the population is exactly zero, those islands wouldn't even be in the state set, right? Those islands have a prior of zero, so they're not in the range of possible states of the system. But you mean approximately zero. Yeah, he'll hardly ever visit them, and it may take many, many lifetimes of kings before anyone visits those islands. There's like one hermit on a tiny desert island out there, yeah.
[The student continues: and if there's that one island with the one person on it, and beyond it an island with a very large population that you have to pass through the small one to reach, then...] Yeah, then this Metropolis system is going to be inefficient. It may have to loop around the other way, or it'll have trouble getting through that state. That's right. That's one of the issues, and it's why we don't usually use raw Metropolis. It works if you define it right, but it may take a really long time and be hugely inefficient, so we don't use it for the most part. Okay. That's just the vocabulary, to give you some idea what this MCMC thing is. Our use, as I said, is to draw samples from a posterior distribution. We don't have to do MAP estimation, and we don't have to make quadratic assumptions. Use any prior you like, any likelihood you like; as long as you can program a Markov chain with it, and you can (I'll show you how), you can draw samples from it. And the islands are now parameter values. They're particular states, positions of the system. We're drawing parameter values from this distribution, so the positions of the king are now the parameter values for some particular dimension of the model. And population size is proportional to posterior probability, which we get from the product of the likelihood and the prior. That's what goes into the ratio of the transition probabilities: the numerator is the product of the likelihood and prior for the parameter value you're proposing moving to, and the denominator is the product of the likelihood and prior for the parameter value you're at right now. And then you may move or you may stay in the same place.
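To make this mapping from islands to parameters concrete, here is a sketch in Python of a Metropolis sampler for one toy model: the mean of a Gaussian with known standard deviation 1, a Normal(0, 10) prior on the mean, and made-up data. None of these numbers are from the lecture; they're just for illustration. In practice you work on the log scale, so the ratio of (likelihood times prior) becomes a difference of log posteriors.

```python
# Metropolis for a posterior: the "population" of a parameter value is its
# unnormalized posterior probability, prior times likelihood.
# Toy model: y ~ Normal(mu, 1), mu ~ Normal(0, 10); the data are made up.
import math
import random

random.seed(3)
y = [1.2, 0.8, 1.5, 0.9, 1.1]   # hypothetical observations

def log_post(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2                    # Normal(0, 10) prior
    log_lik = sum(-0.5 * (yi - mu) ** 2 for yi in y)       # Normal(mu, 1) likelihood
    return log_prior + log_lik

mu = 0.0
trace = []
for _ in range(20_000):
    prop = mu + random.gauss(0.0, 0.5)    # symmetric proposal around current value
    # ratio of posteriors = exp of the difference of log posteriors
    if math.log(random.random()) < log_post(prop) - log_post(mu):
        mu = prop                          # accept: move to the proposal
    trace.append(mu)                       # rejected steps repeat the old value

post_mean = sum(trace) / len(trace)
```

For this conjugate toy case the posterior is known analytically, so you can check that post_mean lands near the analytic posterior mean, about 1.098 for these made-up data.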
And if the chain stays at some parameter value for a long time, that's because that value is high in posterior probability relative to the proposals. [A student asks: so history doesn't matter, but the king remembers how many people are on the islands; isn't that a form of history?] Maybe; I'm not sure it's productive to press on the definition of a Markov chain that hard, though you're right in a sense. What we need is to know the likelihood function, to be able to make calculations with it, at least conditional on one parameter at a time. Whether that counts as history or not, the reason to use this approach is that we can't calculate the whole posterior distribution analytically. This works because we can actually cut off pieces of it; we can do little bits of the formula one at a time. When you update any particular parameter dimension, and there's a little bit more about this in the notes, you don't necessarily have to consider all the other dimensions in the model at the same time, and that's what makes it computationally feasible. There's this thing called a Markov blanket, which I almost made a slide for, because it's very tempting, right? Markov blanket. But my art skills failed, so I cut the slide. That has to do with figuring out, for any particular update of any particular parameter, which other dimensions of the posterior you can ignore. If you read the mathier books about this stuff, they'll talk about Markov blankets and things like that. That comes up with Gibbs sampling, which we're talking about next. Okay. All right, so this works for any number of dimensions, any number of parameters. So it's not just the island chain laid out in a row; there might be, I don't know, a second dimension like elevation or something like that. In models, we have many different dimensions, and each dimension is a parameter.
And we need posterior probability for combinations of parameter values; that's what we're sampling from. The island example is discrete, but of course our parameters are typically continuous. It still works; the proposals don't have to be discrete, and it'll still work as long as they're symmetric. Okay. So why do we do this? We do this when we can't write an integrated likelihood function. For the models we've done so far, you can; you can do the optimization, the hill climbing, because you can get a full likelihood for all the parameters at once by averaging things together. There are models coming up soon where this is not possible, and when we get there I'll give you some vague idea why. No one can close the integrals, so you don't have a single function you can call which gives you, for any set of parameter values, the probability of the data. But you can do it piecewise. The usual case where we run into this, and the main use in this class, is multilevel models. In multilevel models, since there's more than one level, there are these dimensions in the model, the varying effects parameters, which look like data from one perspective and like parameters from another. They have an identity crisis. And this makes the integral calculus difficult, depending upon the other assumptions you make. In general, for multilevel models you can't close the integrals, and you're stuck doing something else, some kind of averaging. Markov chain Monte Carlo is not the only option, but it's very good in general. There are others you'll hear about, like expectation maximization, which works well for lots of model types. But MCMC is very general and useful, so I'm going to teach you it instead. It also comes up in network and phylogenetic models, and lots of spatial models. So there are a bunch of Markov chain Monte Carlo strategies, and they're all in some sense variants of a Metropolis process.
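Before moving on to those variants, here is the continuous version of plain Metropolis that the last few points describe: symmetric continuous proposals, and a target you only need to be able to evaluate up to a normalizing constant. The target density here, proportional to exp(-x^4), is just a made-up example, not anything from the lecture; the point is that the normalizing constant never appears, because it cancels in the ratio.

```python
# Plain Metropolis with symmetric Gaussian proposals, for a continuous,
# unnormalized target. exp(-x**4) is an arbitrary example density.
import math
import random

random.seed(2)

def target(x):
    # unnormalized density: we never need the normalizing constant,
    # because it cancels in the acceptance ratio
    return math.exp(-x**4)

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + random.gauss(0.0, 0.5)            # symmetric proposal
    if random.random() < target(proposal) / target(x):
        x = proposal                                  # accept the move
    samples.append(x)                                 # rejections repeat x
```

Because this target is symmetric around zero, the sample mean should sit near zero, and most of the mass should lie inside (-1, 1).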
That's true whether you look at them conceptually or analytically. The closest one to Metropolis is the so-called Metropolis-Hastings algorithm. It's the same kind of process, but you don't have to have symmetric proposals. That opens up a bunch of different ways to improve these algorithms, because there's this truism, which I quote in this chapter of the book, that if there's a random way of doing something, there's probably a non-random way of doing it that's better, but the non-random way is going to require more effort. So doing it randomly is not bad, but it costs you on the back end; with the non-random ways, you put in effort up front and you get a more efficient process. And the descendants of Metropolis all remove some of the randomness and make intelligent proposals of some kind instead of purely random ones. Think about King Markov: he knows that if he's on island 10, he's probably not going to move to island 1, so he shouldn't make that proposal, because it would just get rejected. He'd like to move every week, so he'd like to make proposals to islands which are plausible moves given where he currently is. And that gives you some of these descendant methods as we go down the list. Gibbs sampling is the most famous; it's an efficient version of Metropolis-Hastings, and I'll say a little more about it on the next couple of slides. And we're going to focus on a newer one called Hamiltonian Monte Carlo, or HMC, sometimes called hybrid Monte Carlo, with the same acronym. I'll say more about that as well. All of these are still in use; they're all useful. And this is an area of rapid research. It wasn't until the 1990s that fitting statistical models with Markov chains became a big deal on the desktop. There's a lot of active research in computational statistics on this, and I figure in a decade someone will come up with an even newer, whizzier way to do things. Just 10 years or so ago, around the year 2000, Hamiltonian Monte Carlo didn't really exist.
There were people working on systems of analysis like that, but as an application it didn't exist. Everybody thought Gibbs sampling was the thing. Now Gibbs sampling looks pretty awful in comparison, although it's still useful. So who knows, in another decade we'll have some more stuff. Remember, Bayesian statistics was outlawed in most stats departments for the second half of the 20th century. Now that it's not outlawed anymore, research is moving at pace, and there's lots of effort on these things. Also, more complex kinds of models are of interest now. Okay, so let me give you a conceptual version of Gibbs sampling. The Metropolis algorithm requires symmetrical proposals. Metropolis-Hastings doesn't. You can have asymmetric proposals left and right. You can give yourself some drift. In general, the proposal can come from a function that you draw from, and it can be a function of your current state: the kinds of proposals you make, where the moves are proposed. What you'd like to do is choose a proposal function that makes smart proposals, which are likely to be accepted. That means you'll keep moving and you'll sample efficiently. Every calculation gives you a new sample, instead of sitting on island 10 forever. And that makes the chain more efficient and helps it do something called mixing. We're going to want our chains to be well mixed, so that we're moving constantly and not wasting computations sitting still. Gibbs sampling achieves clever proposals by using analytical slices through the posterior distribution. I say more about this in the notes. I'm not going to say more about it here, because we're not going to use it, so I decided not to spend lecture time on it. The point is, though, it can only do this if you choose certain priors, so-called conjugate priors. Each likelihood function has its own conjugate prior, if it has any. What does that mean?
Conjugate pairs are pairs of formal probability distributions, a prior and a likelihood, for which you can close the integral and solve for the posterior distribution. It's not required that you be able to solve for the posterior distribution of the whole model; if you could, we'd just write down the solution. Instead, all Gibbs sampling requires is that, for each parameter, the likelihood and the family of priors you assign to it be conjugate, so that you can close the integral. There are lots of famous ones. For a Gaussian model of a mean, a normal prior on the mean gives you a normal posterior. And then for the standard deviation, the inverse gamma turns out to be the conjugate prior, which means now you're like, what? The inverse what? We'll talk about that later. The gamma distribution is actually really useful, just not in the case of using it as a conjugate prior. So this is a constraint, and I happen not to like it, because these conjugate priors often have inconvenient features near a boundary in parameter space, so they've fallen out of fashion. Gibbs sampling is the basis of software packages like BUGS and JAGS, the ones that end in GS; that GS stands for Gibbs sampling. JAGS is literally Just Another Gibbs Sampler, right? Got to have an acronym. And BUGS and JAGS are great, in particular JAGS. If you've got some problem you're going to do Gibbs sampling with, you can use JAGS. There are lots of great books which have ready-to-go BUGS and JAGS code for doing all kinds of useful model types. All that remains very useful, but Gibbs sampling can be frustrating. First let me show you what's great about it compared to regular old Metropolis-Hastings. Really, Metropolis-Hastings without intelligent proposals can take a really long time to get near the high-density region of the posterior distribution.
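The normal-normal conjugacy just mentioned can be written down in closed form: with a Normal prior on the mean and a known likelihood variance, the posterior precision is the sum of the precisions, and the posterior mean is a precision-weighted average. A hypothetical Python sketch for illustration (`normal_normal_update` is a made-up name; the formulas are the standard known-variance conjugate update):

```python
def normal_normal_update(prior_mean, prior_var, data, likelihood_var):
    """Conjugate update for the mean of a Normal likelihood with known
    variance, under a Normal prior: the posterior is again Normal."""
    n = len(data)
    prior_prec = 1.0 / prior_var        # precision = 1 / variance
    data_prec = n / likelihood_var      # precision contributed by the data
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean
                            + data_prec * (sum(data) / n))
    return post_mean, post_var

# with a vague prior, the posterior mean is pulled almost all the way
# to the sample mean (2.0 here)
m, v = normal_normal_update(0.0, 100.0, [1.0, 2.0, 3.0], 1.0)
```

This closed form is exactly what Gibbs sampling exploits: it can draw directly from each parameter's conditional posterior instead of proposing blindly.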
What I'm showing you on this slide is the top-down view of samples from a posterior distribution taken from two different Markov chains: on the left a Gibbs sampling chain, on the right an ordinary Metropolis chain. And this is a simple model where we're just estimating the mean and the standard deviation, using a Gaussian likelihood, so it's two-dimensional. We start the chains in the same place in the lower right in both cases, far from the actual target area, and what you see is that Metropolis, like King Markov, takes forever. It wanders and drifts very slowly, but it's wandering around, and eventually it gets into the hot spot and stays there. It doesn't drift back out again; it stays, like staying on island 10. So it gets there and does its job, spending all its time in the target zone after wandering in. And Gibbs sampling takes exactly two steps to get there. Exactly two. Boom, boom. Right there, it's in the target zone, and then it zigzags around and traverses, and you see it careens. Is that a technical term? It looks like a curveball, right? Because it's moving fast in straight line segments, in the bottom right. And this makes it way more efficient. You need a lot less time to get an equally good picture of the posterior distribution. Does that make sense? That's why it's popular. It was a big revolution in desktop computing for Markov chain Monte Carlo. The problem with Gibbs sampling, one of the problems other than the conjugate prior problem, is that when you have a high-dimensional model, and you will, it tends to get stuck, because some of the dimensions, some of the pairs or sets of parameters, will be highly correlated with one another in the posterior distribution. I'm going to show you what that looks like in a few slides. The trouble comes from the fact that Gibbs sampling is jumping in each dimension one at a time.
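That one-dimension-at-a-time behavior is easy to sketch for a target where the full conditionals are known, such as a bivariate standard normal with correlation rho, where each conditional is itself Normal. A hypothetical Python illustration (not from the lecture; `gibbs_bivariate_normal` is a made-up name):

```python
import random

def gibbs_bivariate_normal(rho, steps, seed=2):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    cond_sd = (1.0 - rho * rho) ** 0.5   # sd of each full conditional
    xs, ys = [], []
    for _ in range(steps):
        # alternate exact draws from the full conditionals,
        # one dimension at a time
        x = rng.gauss(rho * y, cond_sd)  # x | y ~ Normal(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)  # y | x ~ Normal(rho*x, 1 - rho^2)
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = gibbs_bivariate_normal(0.9, 50_000)
```

Every draw is accepted, which is what makes Gibbs efficient; but as rho approaches 1, these axis-aligned moves become tiny relative to the long diagonal ridge, which is exactly the getting-stuck-in-corners problem just described.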
It's doing these clever, intelligent jumps along particular dimensions in the model, and since it's only considering one parameter at a time, it can't see these funnels of high probability. I'll show you a picture of this later. So it tends to get stuck in little corners, and this is horrible. Sometimes you don't have a long enough lifespan to wait for the chain to converge, actually. For high-dimensional models it can be a big problem. What's high-dimensional? 20,000 parameters, which is not unusual these days. Fit models with 20,000 parameters. The other thing about it is that it's still pretty inefficient, because it's jumping around at random. All it remembers is where it was last time, and that makes it a Markov chain. And what's annoying about that is that what you'd really like your little king to do is sweep across the islands. Why can't King Markov just have a schedule where he stays on island 10 for 10 days, then moves to island 9 and stays there for 9 days, and then moves to island 8 for 8 days? The answer, of course, is that he's illiterate and innumerate. That would be a good schedule, because then you're going in a sequence and you're still honoring your contract. You want to sweep across the posterior distribution so that you get a picture of the tails, and then you don't want to hang out in the tail over here; you want to sweep back across. You want to spend more time in the middle, where there's more probability, but you want to sweep across these dimensions. Gibbs sampling doesn't do that; it just erratically jumps around, because it has no memory. That's what makes it a Markov chain. So there is a solution to this, a partial solution: Hamiltonian Monte Carlo. Now here's the wild thing about Hamiltonian Monte Carlo. It runs a physics simulation. It's named after Hamiltonian dynamics from physics. Anybody here a physics major? Paul's not here today; he was going to be my example, but he's not here.
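The "physics simulation" just mentioned can be sketched compactly: draw a random momentum, glide along the log-posterior surface with leapfrog steps, then accept or reject based on the change in total energy. A hypothetical one-dimensional Python sketch with a standard normal target (made-up names; for real code see the Radford Neal chapter cited later in the lecture):

```python
import math, random

def hmc_step(x, logp, grad_logp, eps, L, rng):
    """One Hamiltonian Monte Carlo transition (1-D sketch)."""
    p = rng.gauss(0.0, 1.0)              # whack the puck: random momentum
    x_new, p_new = x, p
    for _ in range(L):                   # leapfrog: frictionless gliding
        p_new += 0.5 * eps * grad_logp(x_new)
        x_new += eps * p_new
        p_new += 0.5 * eps * grad_logp(x_new)
    # total energy = potential (negative log posterior) + kinetic
    h_current = -logp(x) + 0.5 * p * p
    h_proposed = -logp(x_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_current - h_proposed)):
        return x_new                     # accept (almost always, here)
    return x

logp = lambda x: -0.5 * x * x            # standard normal log-density
grad = lambda x: -x
rng = random.Random(3)
x, samples = 0.0, []
for _ in range(20_000):
    x = hmc_step(x, logp, grad, eps=0.2, L=10, rng=rng)
    samples.append(x)
```

With exact integration the energy would be conserved and every proposal accepted; the accept step only corrects for the small numerical error introduced by the finite step size.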
Hamiltonian dynamics is equivalent to classical Newtonian dynamics, as well as to Lagrangian dynamics. They're just different mathematical systems for describing motion in multi-dimensional systems. The Hamiltonian is parameterized by a vector of positions and momenta, though. These are the things we're going to want, because what Hamiltonian Monte Carlo does is simulate a little particle moving in an n-dimensional space. What is this n-dimensional space? It's the parameter space of your model. What's the particle? That's King Markov; that's where the chain is at this particular time. But now King Markov is a hockey puck, and at the start of the simulation you whack this hockey puck really hard and impart momentum to it in some particular direction. And it glides on a frictionless surface in an n-dimensional space. What is this frictionless surface? It's the posterior. The log posterior. It's like a big, slightly curved hockey rink. Anybody here play hockey? Yeah? No? A bunch of humans. It's a noble sport. It's not for the squeamish. Anyway, there's not a lot of friction on a hockey rink; that's my point. So this thing is flying along, and what slows it down is that eventually it starts going uphill, because there's curvature in the log posterior, and that makes it turn around. And we take time samples of the location of this hockey puck as it moves around at different speeds, and those time samples are vectors of samples from the posterior distribution. If you define the simulation correctly, it seems like madness, but this is still a Markov chain, and it still gives you the right answer in the long run. It converges to the posterior distribution we seek, and it's really efficient, because it's making good proposals. The proposals are where the puck is heading, and the puck is heading, well, it's a physics simulation, so it's obeying the potential energy of the curvature of this thing. It's cruising in the direction of least energy resistance.
That's the way the physics works, and that makes the proposals likely to be accepted. In fact, they're guaranteed to be accepted in this case, because it's deterministic. So you glide across the posterior distribution instead of randomly jumping around in it. And there is a clever way to do this so that it's time reversible, so you can satisfy all the Markov chain properties. A feature of Hamiltonian dynamics is that they're time reversible. One of the things you would... I was going to reference Paul. Paul's not here, damn you Paul. So you don't need to understand the details of this. If you are interested, in the book I showed a few slides back there's pseudocode, and I think there's actually a page with some R code in it, so you can run a simple Hamiltonian simulation. If you're interested in the details, go there. I'll refer you to it; I cite it in the chapter as well. But you don't need to know the details. There's division of labor in this business: there are people who write the samplers, and you can define your model within that general structure, and that's what we're going to do. I do think it's good to have some sense of the basic algorithm, though, so that you can diagnose misbehavior. Let me give you the animated version of that story. Let's imagine that King Markov has a brother who lives on the mainland. His name is Monty, and he has a kingdom, but his kingdom is arrayed along a straight line. It's just a bunch of suburbs, a continuous city. It's like the Southlands, right? Los Angeles. It's a bunch of cities that grew like cancer into one another. Let's imagine we take all of the Southlands and array them in a straight line, so it's one dimension, all we have to think about. King Monty's got the same contract: he needs to visit his people in proportion to their population density, which is strewn about continuously in this urban landscape. So what does Monty do?
Well, he gets in his car and he's going to start driving, and let's see if my animation works. The way he needs to do this is to make his relative speed an inverse function of the population density, because he wants to move more slowly through the dense areas, so that he can shake more hands out of his car window, kiss more babies, and stuff like that, right? And then when he gets into the boondocks out at the end, he speeds up, puts the pedal to the metal, and just kind of waves as he goes by, and then he gets to the end and he turns back around. So the log population density gives us the relative speed; that's the idea, that's what that curve is there for. If I drew the population density as buildings, it would look like a parabola. So he starts driving, and let's see, watch this. He starts driving along, and he slows down, gets really slow in the middle, kisses lots of babies. Now, in the real version we're taking time samples at even time intervals as he goes; we read the population density back from his position at different time points. That's how Hamiltonian Monte Carlo works, but it's doing it in a bunch of dimensions, not just one, at the same time. So it's hard to visualize; imagine this in 22,000-dimensional space. Hamiltonian Monte Carlo does all the dimensions at once, all of them, and this is its biggest feature. Oh yeah, I think there at the end he turns back around. So here's the summary. The population density curve is our log posterior, and all we actually need is the log posterior at the point we're at; we need its curvature where we are, in order to predictively simulate where we're heading. The position of the car is a vector of parameter values. The speed of the car is the momentum of the parameter values. You want to go slow where the log posterior is high and speed up where it's low. The momentum governs how this works, and that guides you through this multidimensional surface, so
that you're moving through the area of high acceptance of proposals. And then, yeah, there's a step size, so to speak: there are time intervals where we check on the position of the car and record that position as a bunch of parameter values, a sample from the posterior distribution. And this works. It's pretty cool. Like I said, there's a great chapter by Radford Neal in that handbook with just a little bit of R code, one page long, to do this whole thing. It's not that bad. So imagine you've got a model where there are two unknown parameters, and in the posterior distribution the high-density region is shown by this ellipse. This happens a lot; you've already seen it in your models, say with an intercept and a slope in a simple linear regression. In multi-level models this happens constantly: varying effects are always highly correlated, nearly always. There's nothing wrong with this; it's just a challenge. You want to draw samples from this thing. Gibbs sampling takes slices through each dimension in turn, and what ends up happening is that it gets stuck in little corners. So if we start here, it's going to take a jump to the right in dimension one, and okay, that proposal will very likely get accepted, but then it's going to take a jump up in the other dimension, and then a jump left, and so on, and it can just stick around in a tiny region, stuck in a corner. If it could see the tilt to this thing, and see that this is the major dimension, that there's an axis that combines both parameters, then it would jump in that direction, but it can't, because that isn't what the algorithm does. Hamiltonian Monte Carlo finds those dimensions and zigzags along the low-energy paths in this thing. It really solves this problem extremely well. Now, it creates other problems. Remember, this is part of that truism: there's a random way to do something, and there's a better non-random way that requires more thought. Hamiltonian Monte Carlo is computationally very expensive. It really is running hundreds
of physics simulations. But you don't need as many samples; you don't have to run the chain for as many iterations, because the individual samples have very low correlation with one another. They are highly informative about the overall shape of the distribution. So it's great. Let me show you a quick movie before we do a little work with these things. Here's a great animation some computational statisticians made to compare the two. You're looking at Metropolis right now, crawling along posterior distributions in the shape of letters. You'll notice the only one that's doing well is the one on the O in the middle. On the thin posterior distributions of the other letters, it's just jiggling around; it can't move. Now Hamiltonian Monte Carlo on the same shapes. You'll notice it finds the ridges and crawls along them. Look at the S in particular: it's really grooving on the S, and it dances around, because that's its job. Wherever the high-density path is, that's where the puck goes. So this is cool. You can watch this in your own free time, so I'll leave it up here to see. I think the people who made this movie deserve a reward of some kind, maybe just a thank-you note or something, but it's a great teaching tool. In practice, the way this manifests for us is that if you use classic Metropolis, which is just a random walk, looking at iterations across the horizontal, it wanders around. It's just a random walk, it's Brownian motion, it's just wandering around. If the proposals are Gaussian, it really is Brownian motion, at least locally. And so it's very inefficient, because it takes a long time to cover the whole shape. Hamiltonian Monte Carlo, in contrast: the samples you pull from it are already defining the distribution. They're really well mixed, so there's very low correlation from one sample to the next, and it's designed so that it sufficiently explores the space. So you want the situation on the right, I assert, and not the situation on the left, and that's what we
get from Hamiltonian dynamics. So we're going to use Stan. I know some of you have fought with this, and usually the only thing you had to fight with was your C++ compiler, and there have been victories out there. If you haven't tried this yet, this is your mission. You must do this, because your homework coming up is going to require running these things, so there is no opting out of this. This is now a default PhD-level skill: you need to know multi-level models and how to run Markov chains. You can't opt out; if you want a job, don't opt out. And I don't mean that to scare you, but this is true. When I was in grad school, if you knew multiple regression you were a hotshot, but it isn't like that anymore; now you're expected to launch rockets. Luckily, this is not rocket science. We can get this installed in a nice safe environment where we are not ashamed of our computing errors, and I will usher you through this, and we'll get it going, and you'll love it. So go here and do it if you haven't yet. Stan implements Hamiltonian Monte Carlo, but it has its own algorithm called NUTS, the No-U-Turn Sampler. I know, what is it with these names? The No-U-Turn Sampler is Hamiltonian Monte Carlo, but it's adaptive, meaning it figures out the step size and other aspects of the physics simulations that make it more efficient. So when you run it, and I'll show you this in a few minutes, there's going to be this warm-up phase, or adaptation phase, where it's tuning the process. It's not taking samples from the posterior yet; it's figuring out how to do so efficiently, and when it does, that's the adaptive part. In the absence of this artificial intelligence doing the adaptation for you, you would have to tweak a bunch of knobs on Hamiltonian Monte Carlo to make it work well. This is also true of Metropolis: the step size of Metropolis is tunable as well, and often you would have to tune it. All these things require some hand tuning, unless you've got an artificial intelligence like Stan does. There are some
models that Stan doesn't do a good job with, but for all the typical ones we'll use in this class, it does the best job. There's no other tool on your desktop that's going to do as good a job. The best thing about Stan, and I'll show you this later, if not today then on Thursday: when it doesn't work, it's really obvious. This is a huge advantage. With the Gibbs sampling tools like BUGS and JAGS, when they don't work, the chains look basically the same as when they do work. They look like random walks, and that's very troubling: you're trying to figure out whether you've done something wrong, and you can't tell easily. With Stan, when you've done something wrong, it is blindingly obvious, and this is why I cherish it as a tool. It helps me see my idiocy really obviously on the screen, and that's half the battle, so we need tools that announce our errors. So what Stan does is take a model formula as input. Well, we're going to use map2stan, so you can keep using the same formula lists you've had before, but there's this translation phase. You provide the model formula, and map2stan builds that into a Stan model, which you have access to if you want. I talk about this in the chapter, and later on, if you want to do fancier models, you can program directly in Stan. It looks very similar to what you've done already. Stan then compiles that into C++ code, and that C++ code is passed to your C++ compiler. There's a whole chain of translations here. And that makes a reusable sampler, an executable. It's a little Markov chain with just your model, and you can feed new data into it and draw samples from the posterior. You can reuse the thing; it's a redeployable thing that's independent of R once you've built it. We're going to do it all in R, because it's convenient: we can call that little executable automatically and draw samples, and then they can be populated directly into R's memory, and we're going to process them like every other set of samples we've used so far, because they're just samples. This is where you now
see the wisdom in my madness earlier, making you work with samples since the first week. It seemed crazy back then, and it was, but now you're pros at working with samples. Yes? No? I see some skepticism out there, but you are. And none of that changes: you work with the output of these models exactly the same way. The models are different, so you still have to struggle to understand the new model types, but processing the inferences stays the same. Okay, let me give you some examples. We've got about half an hour here, so I think I can get through this without rushing and comfortably show you the part you're actually interested in, which is doing things. Let's go back to a dataset you already know, so that we can compare the results of sampling from the Markov chain against the MAP estimates we got before. And I'm going to do this next week as well; we're going to do comparisons. There are cases where they're the same, that is, where the quadratic approximation that map makes is okay, and sometimes it's not. I'm going to show you when that is, and you can start to figure out cases where you really need to use the Markov chain. We're going to look at the terrain ruggedness data again, because it's fresh in your mind, right? You're doing a homework problem with it. Yes, Seychelles. She sells, sea shells; I can't say it, but the Seychelles problem. And we're going to re-estimate it using Stan, just to see how things work, and to give you some idea about what a healthy, functioning Markov chain should look like. Here's the map code again. This is familiar, right? This is predicting log GDP per capita in the year 2000 using the interaction between terrain ruggedness and the dummy variable for the nation being in Africa. Essentially flat priors, very weakly regularizing, to the extent that there's basically no regularization here. And these are the estimates we saw before. There's nothing new here; it's just to refresh your memories about what's going on. This
uses the quadratic approximation: it finds the peak in the multivariate space, and then it finds the local curvature, assumes the whole thing is parabolic, that the log posterior is parabolic in every dimension all the way out, and defines a Gaussian distribution. That's the quadratic approximation; that's what that table at the bottom is, the means and the covariance. HMC is not going to use any approximation, aside from the fact that it's magically just drawing samples, and we can use the same formula list as in map. But with Stan, to keep it happy, you should pre-process all your variable transformations. So take the logs first: make a new variable that's log GDP. You get away with transformations inside map, because it's really just executing R code, but with Stan you can't; inside of Stan, there's no access to the R interpreter anymore. So you need to pre-process all your transformations. And then trim the data frame, just for your own sanity. You don't technically have to do this, but trim it down to only the variables you're using in this particular model, so you're not pushing extra data into it. This is also healthy because it will suppress some warnings. Stan doesn't care for character data, so if you've got categorical text variables in your data set, Stan will throw these ugly warnings about something being ignored. It's harmless, but it creeps people out, so I started just advising people to remove those variables from the data set, so that you don't send me emails about it. Now, I'm happy to receive your emails about warnings; they're easy, I just say, no, ignore that. So that's all I've done here, pre-process the data. And now the only thing that absolutely has to change in the code is that map at the top becomes map2stan. But one thing I'm going to introduce here, because it's a good place to do it: instead of a uniform prior for sigma, which would work fine, we're going to use a Cauchy distribution, a half-Cauchy actually, for sigma, which is a weakly regularizing prior
for scale parameters like standard deviations. So let's take a quick minute, before we look at the output of that Markov chain, to talk about who Cauchy is. This is Augustin-Louis Cauchy, and like many French mathematicians he was born with a silver spoon in his mouth, and he did some amazing math in his lifetime. His namesake is a distribution, among other things, which arises from all kinds of physical processes. In physics I think it's called the Lorentz distribution, but Cauchy analyzed it. Cauchy was one of the pioneers in proving that calculus works, and why. Calculus had been used for a long time and no one had actually shown why it was justified, and eventually mathematicians were like, okay, we've got this calculus thing, maybe we need to figure out why it's justified. And they did; he was one of the pioneers in that area. So some serious, important mathematics. The Cauchy distribution is actually the ratio of two Gaussian samples, but that definition is unhelpful. Our interest in it is just that it's a weakly regularizing prior distribution for scale parameters, when you use only the positive half of it, so we're going to use half-Cauchys. On the positive half there's a weak preference for small values, and there's a really thick tail, which is what our interest is in. It's effectively flat for a long time; that's why it's only weakly regularizing. But it's not perfectly flat. The important thing is that infinity is impossible for the Cauchy distribution, unlike a truly flat prior, and that's what you want. Cauchys have performed very well for regularization in simulation trials; lots of people recommend them for that reason, so I want to introduce you to them here. We're going to need them when we do multi-level models, where we'll have to do some regularization. And this is related, for those of you who study foraging or movement ecology, to Lévy flights, whose steps are sampled from these thick-tailed distributions. Oh, I should say, the mean and variance are undefined. Yes, this is a probability distribution
without a mean and without a variance. If that is intriguing to you, ask me afterwards and I'll explain it to you on the chalkboard, but not in lecture, because some of you are like, yeah, I don't want to hear about that, it's terrifying. There's an easy explanation for it, actually, but afterwards I can tell some people about it. Okay, when you run the map2stan code, very often you're going to get this message, minus the keep-calm part. I should say that this is a very frightening-looking message, but it's nearly always harmless. What's going on here is that there's some parameter which is bounded to be positive, and during warm-up Stan has tried out a negative value for it, and then this informational message appears, where it says it's rejecting something: the current Metropolis proposal, there's Metropolis again, is about to be rejected, which sounds harsh. And then the scariest part is the last line: but if this warning occurs often, then your model may be either severely ill-conditioned or mis-specified. Right? Which sounds like the worst thing possible. Nearly always this is harmless, especially if it happens during the adaptation or warm-up phase. This is just Stan feeling out the posterior distribution to find a good way to sample from it; there's no problem. If you get thousands of these filling up your screen, then by all means close the laptop and walk away very slowly, break the process, and then look for the problem: chances are you've defined the model incorrectly. Or email me, whichever comes first. But that message is nearly always harmless, so keep calm and carry on. Sometimes it is worth worrying about, though, and again, this is one of the things I like about Stan: when things are wrong, you get a good indication of it. It's overly cautious in its warning messages, and I like it to be like that. Let me compare the marginal posterior distributions from the HMC samples, from map2stan at the top, and the quadratic approximation, from map at the bottom. You can see that they're practically identical,
practically identical within Monte Carlo error. So this posterior distribution is actually almost entirely multivariate Gaussian, because the priors are Gaussian and the likelihood is Gaussian, so in theory it should be Gaussian, and yes, it is. Markov chain Monte Carlo didn't make any assumptions, but it comes out this way. So this might give you some confidence that map wasn't exactly lying to you all these weeks, and that's often true. Classical statistics also makes these Gaussian assumptions, and often it's a perfectly fine assumption. You get samples out of a map2stan-fit model the same way you get them out of a map model, by calling the same function, although R is actually realizing it's map2stan and calling a different bit of code. The samples were returned by the Stan sampler, and they're a list just like before; you use the dollar sign and each parameter name, and you access them as before. But the models can get more complicated, so we can have parameters that are actually matrices, and we'll have some of those in a couple of weeks. You don't have to worry about that right now. Okay, the most important thing to do is check your chain, because sometimes the model doesn't work right. This is what the chain looks like for this model, and this is a healthy chain. This is what you're looking for. This chain has low autocorrelation, which means it's exploring the space well, which means it's zigzagging a lot, right? It has good mixing. You can see it looks kind of like white noise, right? That means there's low correlation between subsequent steps, and that's what you want; that's what makes it efficient. Sorry, your face looked like you were unhappy with that. Okay, sorry. And the other thing is that it's stationary, meaning it's zigzagging around the same center of gravity as it moves along, right? Both of those things are what you want. If it were drifting in a direction, that
would be bad; that would mean it was non-stationary. And if there were a really strong correlation between subsequent samples, tracing out a line, that would also be bad; that would mean it was inefficient. You'll get this in your homeworks; it'll happen to you. I'll show you some examples, though, maybe, if not today then at the start of Thursday. I've just zoomed in here in the lower right to show you the first 100 samples after the warm-up phase. This gray region is when it was adapting, so those aren't part of the posterior; that's just where Stan was figuring out the step size. I ran it much longer than was necessary. Where the gray region ends, we get real samples from the posterior, and we see it's moving around, it's zigzagging. So this is zoomed in. This is like King Monty's car; it zooms around this high-dimensional space, and there are five dimensions in this posterior distribution, I think. Here they are: the intercept in the upper left, and then the three regression parameters, and then sigma down below. This is what you're looking for. This is a good chain. They don't always look this nice, and they may still be fine; sometimes you do get autocorrelation between samples. That doesn't mean it's not working; it just means you need more samples, and we're going to look at some diagnostic criteria for that in a few slides. You guys with me so far on these? Question: it seems like sometimes there's a sine curve going on in the chains; is that okay?
That's fine. There isn't actually a sine curve there, but it's interesting that you say sine curve, because in a two-dimensional Hamiltonian Monte Carlo simulation the deterministic path will trace out sinusoidal paths; that's how the physics works. Think of a two-dimensional bucket that you've spun a marble in; you've probably done this at museums. It's a sinusoidal path that gets traced out. So it's a great question; you have a good eye for trigonometry. It's not exactly doing that here, but notice that the parameters are responding to one another: when some of these parameters are sampled high, others are being sampled low, and that's in part because they're not independent.

So let's talk about warm-up for a second. What is warm-up? Warm-up is this adaptation phase on these trace plots. I should have said: these are called trace plots; they're just sequential samples, one after another. The gray region marks samples that are not going to be returned to you by extract.samples; they're automatically truncated out by map2stan, so you're not going to see them unless you really want them, and then you can ask me how to get them. They're in there, but you don't need them. They exist so that Stan can become an efficient sampler: it reduces the autocorrelation by tuning the step size and something called the mass matrix, which sounds pretty awesome, right? It's a good band name too. So these things are not samples from the posterior; they're a collection of experiments. Nevertheless, when the chain mixes well and looks stationary in the warm-up phase, that's a good sign. But as I'll show you in a few minutes with a bad example, the warm-up phase can look great, like it's mixing well, while things are still broken. So you want to look at the actual samples, not the warm-up, to diagnose things.

And, as it says at the bottom, this is not the same as burn-in in Gibbs sampling. In Gibbs sampling, all the samples, including the burn-in, are from the posterior distribution; the idea is just that you started way outside the target area, so you lop off the first part, the part where the chain crawled in to the target area. You don't need to do that here, because it's not the same thing. When Stan starts taking samples, it's in the target area already, automatically; it's always in the target zone if everything's working right. So there's no burn-in with Hamiltonian Monte Carlo.

All right, some horoscopic advice on running these things. There's lots of superstition about Markov chains, just like with everything else in statistics. To some extent this is unavoidable; it's like there's ritual pollution and you have to cleanse yourself of it. I want to give you what I think is some generally useful advice, though there's a lot of superstition in it. It's all right to be anxious about this and ask questions; just keep in mind the goal. The goal is to get samples from the posterior distribution, and there are going to be lots of effective ways to do that. But there are things to look out for, and the horoscopic advice I'm about to give you is a way to anticipate problems and notice them when they happen. So, the usual questions: how many samples do you need, and how many chains do you need to run, because you can run as many as you like within reason. I also want to give you some examples of chains that aren't working quite right; I'm definitely going to finish that last part on Thursday, because we've only got ten minutes here.

First thing to notice: when you use precis on a map2stan fit, you get the usual four columns, the difference being that now your interval boundaries are highest posterior density intervals, labeled lower and upper, instead of the percentile intervals from before. The reason is that now there's no guarantee these posterior distributions are symmetric, so you really don't want to use percentile intervals to describe their shape. You may not want to use these intervals either, but at least these intervals are guaranteed to include the MAP, whereas the percentile intervals are not. You may remember that from week one. So that's the first difference.

The next difference is you get two diagnostic columns, which help you diagnose how well the chain is working; these are measures of its efficiency and its convergence. n_eff is a measure of how well it's mixing, that is, how low the correlations are between subsequent points. n_eff is the effective number of samples: a crude estimate that takes into account the correlation between subsequent pairs of points in the Markov chain. When they're highly correlated, there's high autocorrelation, and the effective number of samples is low, because you're basically just staying on the same island over and over again. When there's very low correlation, or none, between subsequent samples, the effective number of samples is nearly the same as the actual number of samples. That'll happen sometimes with simple models, but usually it's lower, because there's some correlation. So n_eff gives you an idea of how many samples you've actually got. You might think 300 is very little, but if the posterior is multivariate Gaussian, that may be plenty, for reasons I'll explain on the next slide.

Rhat measures stationarity and mixing and convergence to the target, lots of things. It works best when you use multiple chains: you run multiple chains and you want them all to converge to the same stationary distribution. You need multiple chains to confirm that; with one chain you really can't tell. So that's a reason to run more than one chain, at least when you're testing things. For both of these criteria: they are both known to fail, that is, to give you false signals of security. So I encourage you to use them as alarms and not security blankets.
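To make these two diagnostics concrete, here is a deliberately simplified Python sketch of the ideas behind them. Stan's actual n_eff and Rhat estimators are more sophisticated (split chains, smarter autocorrelation truncation); this version only shows the logic described above: n_eff discounts samples by their autocorrelation, and Rhat compares between-chain to within-chain variance.

```python
import random
import statistics

random.seed(2)

def n_eff(chain, max_lag=200):
    """Crude effective sample size: n / (1 + 2 * sum of positive autocorrelations).
    Stan's estimator is fancier; this just shows the idea."""
    n = len(chain)
    m = statistics.fmean(chain)
    var = sum((c - m) ** 2 for c in chain) / n
    total = 0.0
    for lag in range(1, max_lag):
        rho = sum((chain[i] - m) * (chain[i + lag] - m)
                  for i in range(n - lag)) / (n * var)
        if rho < 0:  # truncate at the first negative autocorrelation estimate
            break
        total += rho
    return n / (1.0 + 2.0 * total)

def rhat(chains):
    """Simplified (non-split) potential scale reduction across chains."""
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = statistics.fmean([statistics.variance(c) for c in chains])  # within-chain
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

# Four independent chains targeting the same distribution: Rhat should sit
# near 1, and each chain's n_eff should be close to its actual length.
chains = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# A highly autocorrelated chain of the same length has far fewer effective samples.
ar, x = [], 0.0
for _ in range(1000):
    x = 0.95 * x + random.gauss(0.0, 1.0)
    ar.append(x)
```

The autocorrelated chain illustrates the "staying on the same island" point: it has 1000 draws but only a few dozen effective samples.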
When Rhat is one, that doesn't necessarily mean it's working; it just means the chain is consistent with working. When Rhat is much bigger than one, it's almost certainly broken. That said, even when it's much greater than one, the chain may be fine; you just need to run it longer so that it converges to the stationary distribution. So don't get overconfident.

How many samples do we need? When we talk about samples, focus on n_eff, the number of effective samples, not the actual number of samples you draw, because if they're highly autocorrelated you're getting the same information over and over again. You use two optional parameters to map2stan, called warmup and iter (for iterations), to control the number of samples you're going to draw; warmup is the length of the adaptation period. When you're just getting started, and these are the defaults in map2stan, you're just seeing if it's going to compile and run, and this will be your biggest battle to begin with; sometimes it won't, and you'll curse. So you want to do a short run at first and see if it looks insane: run 2000 iterations, 1000 of which are warmup, and get an idea of whether it's working. And that may be all you need, actually. For lots of cases, if the posterior distribution is multivariate Gaussian, 1000 effective samples is a lot, because there's only the variance-covariance matrix and the vector of means to estimate; only two moments define the whole Gaussian distribution in one dimension, so you don't need a lot of information to do it. Eventually, once you've checked that it works right, you may want to run more samples, so you increase iter, and you may want to run multiple chains to check convergence.

David, you have a question? [Student:] When you say samples, are we taking every iteration after the warmup, or is it a thinning thing? We're doing no thinning, ever. You do not need thinning. Thinning is a thing for those older engines like Gibbs sampling: when there's really high autocorrelation, thinning loses almost no information. Here the autocorrelation is typically going to be a lot lower, so you don't need to thin. In Gibbs sampling papers you'll see that they ran the chain for 500,000 steps and then thinned to every 50th sample; you will not be doing that kind of madness with this. And what's great about that is you need less hard drive space for your model, and less memory. You can make Stan thin, I think; I've never done it, but I believe there's a thinning parameter in there. You'll probably never need to use it. So we're not thinning. Was there a hand over here? Just autocorrelation, sorry, it's echoing in the back of my head.

Okay, let me justify really quickly the question of how many samples you need. This depends on your purpose. If you want to know where the posterior means are, you don't need very many samples at all. Why? Because you learned once upon a time, and if not let me teach you now, that the standard error of a sample mean is proportional to one over the square root of the sample size, and n_eff here is the sample size. So the error goes down really fast: if you have 200 samples, you have a fantastically accurate estimate of the posterior mean, you really do. The variance requires an order of magnitude more, but that's still not necessarily a lot. So if things are multivariate Gaussian, you don't need a ton of samples; you don't need to hammer it to death. A thousand is plenty; maybe run it once and get an idea of how it works. If you're interested in the tails, the thin little tails of the posterior distribution, well, you're going to be an unhappy person for a long time, because you're going to have to sample it to death. This is the Gilligan's Island of the Markov chain problem: it doesn't want to spend much time out there. So that's something Markov chains are just not efficient for, figuring out the shapes of the thin tails of probability distributions. Luckily, we don't usually care about the one-in-a-million percentile of something, and if you do, you need an analytical solution; this is not the way to do it. That's my advice. And again, always focus on the number of effective samples, not the actual number of samples.

The chains parameter controls how many independent chains you run. On the first run I really encourage you to do only one chain: see if it compiles, see if it looks like it's reasonably mixing. Then check the Rhat statistics; four chains is a good idea, and four chains started from different starting points would be even better, because then they start from different initial values and you want to see them all approach the same region of parameter space. Then when you do your final run, look, it's up to you. This is where the superstition is: people will say you've got to run four chains, and so on. Look, if any one of the chains is defined correctly and it's converging, any one chain is fine. If you've got multiple chains, that's great; if they've all converged, you can mix them together, and that's fine too. It depends on your exact setup: how many processors you have, how comfortable you are with multi-processing, and whether you like RStudio, because running multiple chains in RStudio may lead it to crash. As I've already told some of you, for multi-processor stuff you can't run it in a graphical interface on most of these systems; you've got to run it in a terminal, and then it works. Anyway, my usual heuristic, and again this is horoscopic advice, is one short chain to see if I've programmed it right, four medium ones to check convergence, then two really long ones from different starting positions, and I use those samples to make inferences. But in particular cases, for the simple models in this class, the one short run at the start is often plenty. I encourage you to play around and try things; your computer will never explode from running my code, I promise. So you're not going to break anything, and you want to play around and actually try to break stuff, because that's how you figure out the operating limits of these things.

Okay, we've got a couple of minutes here, so I think I should probably stop right here. When you come back on Thursday, we'll work with a chain that has been defined badly, so you can see what a bad chain looks like, and then I'll show you how to fix the bad chain; I'll show you a couple of examples of that. And then we'll do some maximum entropy. All right, I'll see you on Thursday.
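As a footnote to the "how many samples do you need" discussion above: the claim that the standard error of the mean shrinks like one over the square root of the sample size is easy to verify by simulation. A minimal Python sketch (again, not the course's R code), repeatedly averaging batches of unit-normal draws:

```python
import random
import statistics

random.seed(3)

def se_of_mean(n_samples, n_reps=2000):
    """Empirical standard deviation of the sample mean of n_samples
    independent unit-normal draws, estimated over n_reps repetitions."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n_samples))
             for _ in range(n_reps)]
    return statistics.stdev(means)

# Theory predicts 1/sqrt(50) ~ 0.141 and 1/sqrt(200) ~ 0.071.
se_50 = se_of_mean(50)
se_200 = se_of_mean(200)
```

Quadrupling the (effective) sample size halves the error in the posterior mean, which is why a couple hundred effective samples already pin the mean down well, while tail quantiles need far more.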