Welcome back everybody. We're going to hit the major transition point in the course today where we switch algorithms, our algorithms for approximating the posterior distribution. Before we get into the details, let me remind you what the heart of Bayesian inference really is. Bayesian inference is not about how you compute the posterior distribution. It's just about the posterior distribution, however you get it. There are lots of valid ways to get an approximation. Remember the globe tossing from the beginning? In that case, you can calculate the exact posterior a huge number of ways, and they're all valid. That's fine. I emphasize this because often in applied statistics, especially in biology, Bayes is thought of as synonymous with using Markov chains, as if they're the same thing, as if using a Markov chain makes you Bayesian. Actually, they're not the same thing. You can use a Markov chain and not have a Bayesian model. In fact, the Markov chain strategy I'm going to show you is essentially frequentist, which is this weird thing about it, but it's used to calculate a posterior distribution. You can use a Markov chain to do lots of different things. The reason Markov chains come so late in this course is partly to emphasize to you that there is an essential thing about Bayesian inference, but it's not how you compute the posterior. It's just that you're after the posterior, however you want to get it. The second reason was to make it easier. There's a bunch of additional machinery you need to fool around with once you start running Markov chains. Today, I'm going to introduce you to that machinery and start to train you to be expert users, so that you can make good decisions and recognize when the machine malfunctions. To remind you, in this course we've discussed, so far, three of four different ways to compute a posterior distribution. There are others.
First, there's the analytical approach, which I did not show you, but which I used to make the graph on the previous slide. For very simple models, you can just get a closed-form analytical expression for the posterior distribution. It's great. If you take math stats courses, you'll learn that sort of thing. But for interesting models in applied stats, this is nearly always impossible, so we don't do it. So you need some other numerical way to do the integration. This is all about doing numerical integration. And I taught you grid approximation in the beginning, partly because I wanted it to reinforce the concept of what the posterior distribution is. It's just a thing that's proportional to the product of the probability of the data times the prior probability. That's all it is. That's the posterior. And then you've been using quadratic approximation with this funny tool named quap for months now. Yeah, you've been loving it. Natalia's eyes rolled very significantly there. Sorry, I didn't mean to single you out. Lots of eyes rolled. But quadratic approximation is in this course instead of starting with Markov chains because, well, first of all, it's very effective. It's unreasonably effective for a large number of very simple models. And it's also numerically the same thing as the way lots of non-Bayesian models are fit, estimated. If you're doing maximum likelihood estimation and getting some sort of symmetric standard error around each estimate, then you're doing the same mathematical steps. Even though people tell you it's philosophically completely different for some reason, the numbers will be the same. Yeah. And I want you to see that connection, that there's a connection between lots of standard tools and very basic Bayesian models. But now we're going to get into things where that connection gets a lot blurrier. So before getting into the details of what Markov chains are, let's have a parable, okay?
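To make the grid approximation idea concrete, here is a minimal sketch in Python (the course itself uses R; the data here, 6 waters in 9 globe tosses, are just an assumed example): evaluate the probability of the data times the prior at each grid point, then normalize.

```python
from math import comb

# hypothetical data: 6 waters in 9 globe tosses
W, N = 6, 9

grid_size = 1000
p_grid = [i / (grid_size - 1) for i in range(grid_size)]   # candidate proportions
prior = [1.0] * grid_size                                  # flat prior
likelihood = [comb(N, W) * p**W * (1 - p)**(N - W) for p in p_grid]
unstd = [lk * pr for lk, pr in zip(likelihood, prior)]     # likelihood times prior
total = sum(unstd)
posterior = [u / total for u in unstd]                     # normalize to sum to 1

# the posterior mode sits at the sample proportion W/N = 2/3 under a flat prior
mode = p_grid[max(range(grid_size), key=lambda i: posterior[i])]
print(round(mode, 3))  # prints 0.667
```

The same recipe works with any prior: just replace the flat `prior` vector with the prior density evaluated at each grid point.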
I want you to stop thinking about Bayesian models for a second and instead imagine you're a Polynesian king, King Markov, and you rule an island kingdom like all Polynesian kings did. And it's good to be king, but it also comes with obligations. There are certain things you must do to avoid being killed by your people. It's like all kings. Royalty is fragile as an institution, yeah. And so you rule over a kingdom called the Metropolis Archipelago, and it has ten islands. There aren't ten here, but there'll be ten in a moment. And the islands are numbered and in order, and they have different population sizes on them, different population densities. And the contract you have with your people is that they'll let you continue to tax them mercilessly as long as you visit them. They love you. Well, you're a descendant of a god, right? Like the Hawaiian kings. And so as long as you visit them and they can bask in your godly glow, then they'll be happy. So you must visit them in proportion to their population density. So this is your goal. You're not numerate, not highly educated. You're king, after all. No, sorry. Now my politics are showing. But you're a Polynesian king. And so you need some simple algorithm to do this that doesn't involve having a reliable census and a schedule and calendar for the whole year. So your advisors come up with a simple rule you can use to wander around the islands and fulfill your contract. And here it is. The first thing you do, so you find yourself at any point in time on an island, you've just finished visiting some particular island, you, you know, waved and kissed some babies and did all the other things that politicians do. And now you're ready to move to another island. And so the first thing you do is you flip a coin. And this coin determines whether you go left or right. It doesn't matter which side means left and which means right. This generates what's called a proposal.
The proposal island: you're going to consider going to one of the islands, either to the left or to the right. You get this randomly from an unbiased coin toss or a shell toss or whatever you want to think of it as. Then step two, you send a servant in a boat across to the proposal island, and they take a rough census of the population density on that island. You have another servant do the same for the island you're currently on. So you get the population of the proposal island from this. And we're going to call that number, the population of the proposal island, little p sub five, for the fifth island. You're also going to get the population of the current island. So this is p sub four. You want to compare these two numbers in a particular way. What you want to do is construct the ratio of them, p five over p four. And this should be your probability of accepting the proposal. The probability of moving from island four to island five in this case is p five over p four. Now you're thinking, p five over p four can be greater than one. Yes, it can. In that case, you'll move. Right, for sure. But if p five is less than p four, then you might stay. You're not guaranteed to stay. So imagine that p five is half of p four. Then you've got a good chance of moving and a good chance of staying. So you move, that is, you move to the proposal island, with that probability. And then finally, either you stay, and then you do another tour of the island you're currently on, or you move, and then you kiss some babies on the new island, whatever. And then you just repeat this, starting over with stage one. And you keep doing this your whole life. This is the life of the politician king. The thing about this is that this is a valid way to fulfill the contract.
This algorithm, and this is provable mathematically, guarantees that in the long run you will visit each island in proportion to its relative population size, fulfill your royal contract, and not be assassinated by your people. At least not for this reason. This is an example of Markov chain Monte Carlo. In fact, as I'll explain in a second, it's the most famous and most primitive algorithm of Markov chain Monte Carlo. And what I want to do today in this lecture is help you understand the general strategies of these algorithms. Why would we use them? Because it seems mad, right? I hope you felt that the parable on the previous slide was crazy. It's just sort of madness. You can think of lots of better ways to do something like that. But it has a huge advantage if you don't know the distribution. So even if you don't know the distribution of population sizes on the islands, you can still visit each of the islands in proportion to its population size. And those are the kinds of problems that we have when we're trying to find posterior distributions. We don't know the posterior distribution, and yet we can visit each part of it in proportion to its relative probability using an algorithm like this. And that's the magic. We can sample from a distribution that we don't know. That's what we're doing. And this is just a tool for doing numerical integration. So I also want to introduce you to the interfaces to Markov chain Monte Carlo that I'm going to use in the rest of the course. The engine is something called Stan. I'll tell you what it is when we get there. Stan, it's not an acronym, it doesn't stand for anything. It's not in all caps. It's a person's name, Stanislaw Ulam. I'll show you a picture of him later. And there's the tool that I wrote in rethinking called ulam, which is a simplified input. It uses formulas exactly like the ones you've been using so far, but it'll make a custom Markov chain out of them for you. So I'm going to show you how that works too.
Then I need to show you how to sample responsibly from this and diagnose when things go wrong. And you'll continue to practice this and develop lifelong skills and a love for Markov chains. Okay, here is the R script version of King Markov's royal tour. This is a very short and heavily commented chain. It's called the Metropolis algorithm. Metropolis was also a person. I'll show you a picture of him in a minute. I don't usually step through R code in this course, but if you'll indulge me, I'd like to stay on this slide and walk through each line for you, so you have a sense of how simple this algorithm is, and that this code instantiates that weird story I gave you on the previous slides. You're not going to write code like this yourself to run Markov chains, but you could. It really is a super simple sort of thing to do. Okay, at the top, we define how many weeks King Markov is going to tour. In this case it's 10 to the fifth, which is past his lifespan, probably. Then we do some record keeping. Positions is an empty vector full of zeros, with length equal to the number of weeks. We're just going to store which island the king is on in this vector. In R, you always want to initialize all of your storage space at the beginning instead of growing it as you go, because it's much faster this way. And then I'm going to just put him on island 10 to start. You could randomize that if you want. That's what current is. It's just the current position. And then we loop over the number of weeks. And the first step is we just record, in the record-keeping positions vector, where the king is now. That's that first line: positions at i gets current. And now we're going to flip the coin to generate a proposal. The proposal is the current position plus a sample from minus one or one, size one. So this is just a clever R way of adding minus one or one to your current position. That gives us a proposal island to go to. And now we've got to handle the looping.
So in the book I've got a picture of this archipelago, a badly drawn picture of this archipelago. And it's in a ring. Imagine it's a collapsed volcano or something like that. A supervolcano. And so it's in a ring, and so you can loop from island 10 around to island 1 the other way. So these two lines here are not strictly essential, but they just make it a loop, a donut of islands. And then finally there's the action, the magic part. We construct this ratio of the proposal to the current island, and I've got the islands numbered such that their relative population sizes are their numbers. That's what I'm asserting here. When you do this with a stat model, you have probabilities of data in there instead. That's what changes. And then I draw a random number, and if that random number is less than the probability of moving, then we go to the proposal; otherwise we stay at current. And this gives you a valid chain. This is what it looks like. You should run that code in the safety of your own home. Run that code and plot it out. You get this sort of thing, and what we're looking at is time moving from left to right and the king's position on the vertical, and you'll see it zigzags around and it loops from top to bottom in various places. There are places where it stays on the same island for a while, because this is possible too. You get stuck on very densely populated islands like island 10. You need to stay there for weeks and weeks and weeks and kiss the same babies over and over again. There's lots of babies, right? Dense islands. You need to stay there longer. That's the whole point. But then you can take a tour. He'll skip over the small islands and only spend a week there, those sorts of things. And in the long run, when we construct a density out of these data points, it's in the right proportions. So the way we usually draw chains is to connect these points with lines.
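The script just described can be sketched like this (translated to Python for illustration; the lecture's version is in R, and the variable names here echo it):

```python
import random

random.seed(1)
num_weeks = 100_000
positions = [0] * num_weeks       # record of which island the king is on each week
current = 10                      # start on island 10

for i in range(num_weeks):
    positions[i] = current
    # flip a coin: propose the island to the left or to the right
    proposal = current + random.choice([-1, 1])
    # the archipelago is a ring, so island 10 connects back to island 1
    if proposal < 1:
        proposal = 10
    if proposal > 10:
        proposal = 1
    # island number stands in for its population size;
    # move with probability p_proposal / p_current
    prob_move = proposal / current
    if random.random() < prob_move:
        current = proposal

# in the long run, weeks spent on island k are proportional to k
counts = [positions.count(k) for k in range(1, 11)]
```

In a real statistical model, the ratio `proposal / current` is replaced by a ratio of posterior probabilities; the structure of the loop stays exactly the same.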
So it's not just a bunch of points, and you get this zigzag sort of pattern of movement in space. So in the long run, let me give you an idea about the evolution of the frequency properties here. These bar graphs, these histograms, show you different points moving left to right on this graph. On the left, it's only after 100 weeks. You can see it's starting to emerge. He is visiting island 10 more than the others. But it's not there yet. What we want is a perfect staircase. That would be right. That's the royal contract. So in the middle, after 500 weeks, it's looking better. After 2000 weeks, we're almost there. That's a lot of weeks, though. I don't know how many weeks are in a year. Not that many. This is a lot of years. So by now King Markov's great-grandson has already taken over the throne and picked up the royal contract again. But it's still a valid Markov chain. It keeps going. This is an important thing about Markov chains. They're guaranteed to work, but only in the long run. And what the long run means is a very sensitive issue. And this will be the reason we're not going to use this algorithm to get estimates of the posterior distribution. We're going to use a better one that is much more efficient than this. You can do a lot better than this. But this works. Guaranteed. It's just that you might be dead by the time it delivers on its guarantee. So there's my summary slide of that. This algorithm, the Metropolis algorithm, will converge to the correct proportions in the long run. But that's a kind of vague promise. That's the kind of promise you get from math. It's like, yeah, I guarantee it, eventually it'll happen before the universe implodes. It doesn't matter where you start. It's not sensitive to initial conditions. And for this exact algorithm, you need symmetric proposals. Your probability of proposing one direction needs to be the same as proposing the other. Well, the proposals specifically.
Your probability of actually going one direction versus going the other is not the same, because of the relative population sizes. But the proposal step needs to be fair. There are other algorithms where that's not required. You can extend this to something called Metropolis-Hastings. I mention this in the book notes. There the proposals can have any distribution you like. And that leads to a bunch of improvements you can make. You can make more intelligent proposals. So this is called the Metropolis algorithm. And we're not going to actually use it, as you might imagine, to tour archipelagos. In our applications, the islands are parameter values. And there can be an infinite number of them. They don't even have to be discrete. You can just have a continuous smear of islands, and you can jump among them. And the population size is the posterior probability. That's what we want to sample from. Because there's this distribution, which is the posterior, and we need to visit parts of it in proportion to their height, the local height, the posterior probability in that location. This works no matter how many parameters you have, in the long run. This is going to be a big snag in a moment. If you get a lot of parameters, the long run is really long. But we have a solution. And it works for continuous and discrete parameters. So the reason it's called Metropolis is because Nick Metropolis was a Greek-American mathematician who headed a research group at Los Alamos that developed the first computer application of this algorithm, on a computer called MANIAC. Here's the famous paper where they describe it. A really terrible title, right? Equation of State Calculations by Fast Computing Machines. Fast computing has got some irony if you read this now, right? Every one of you has a phone in your pocket that is vastly more powerful than this computer was. But this was really an extraordinary achievement at the time.
And some famous people here, including famous physicists: Arianna and Marshall Rosenbluth, Augusta Teller and Edward Teller. I think it was Arianna Rosenbluth who did most of the programming. And I think there's a story where Augusta Teller figured out the memory. They got a bicycle wheel and a bunch of magnetic tape, and they spun the bicycle wheel to feed the tape in and out of the machine. There was no, like, store where you go buy a disc or a memory stick or something, right? They did this all from scratch. And so there's a great book, I forget who wrote it right now, I'm sorry, where they tell all these stories about how they invented all this stuff and made this thing work. It's really an amazing thing. But I should say, though, this all came out of the Manhattan Project. It was post-war, but this was for making fusion bombs, right? They invented these algorithms so they could simulate fusion, you know, a bomb explosion. That's what it was for. This is Teller, Ed Teller. Some of you know this history. So there's some monstrous stuff here. We're not going to do that. We're not going to do simulated fusion explosions. But you could if you wanted. But please don't. All right. Here's Nick Metropolis in the foreground, smoking, because everybody smoked, everywhere. And that's MANIAC in the background, part of it. It was distributed across rooms, and there are all these vacuum tubes and everything. Let me give you a sense of it. MANIAC is an acronym that Nick Metropolis came up with because acronyms had already gotten ridiculous by then, and he was like, okay, I've got to put an end to this. Let's make a ridiculous acronym. And it didn't do any good, did it? Acronyms are still bad. So it's the Mathematical Analyzer, Numerical Integrator, and Computer. And I think there's also an irony here, because many of the members of the Manhattan Project had moral doubts about what they were doing. Right?
So there's this conflicted history wrapped up in things like this. Anyway, MANIAC was a thousand pounds and had five kilobytes of memory, which is basically nothing. That's not even one meme, right? You can't fit one cat meme in five kilobytes of memory. Right? And it did 70,000 multiplications per second, which sounds fast compared to what I could do with a pencil. But compare it to your laptop or your mobile phone: a really cheap, bargain-basement laptop would be four to seven pounds, with two to eight million kilobytes of storage, and could do billions of multiplications per second. That's super cheap. And your phone as well could do that easily. So we've come a long way in computing power. But the fundamental algorithms, and the mathematics that justifies those algorithms and guarantees they work, are the same. They were developed during this time when computing was really difficult. So what are Markov chains? The Metropolis algorithm is the simplest version of what's called Markov chain Monte Carlo. What's a chain? A chain is the sequence of moves. That's why it's called a chain. There's a sequential simulation being run, and we call that a chain. In particular, this is a Markov chain, which is named after this fellow, Andrey Andreyevich Markov, who was a very famous Russian mathematician. He did a lot of foundational work in probability theory. He didn't call these things Markov chains himself. Other people did. But he did all this work on probability systems that have no memory. So what makes something a Markov chain is that it's a sequential simulation where the probability you go to any new place in the chain depends only on where you are now. It doesn't depend upon where you've been in the past. That's why we say they're memoryless systems. What matters is the current state, not past states. Does that make sense? And so in King Markov's tour, all that matters is where he is now. It doesn't matter where he was last week. And that's why it's a Markov chain.
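To make "memoryless" concrete, here is a toy two-state weather chain in Python (a hypothetical example, not from the lecture): the next state is computed from the current state alone, and the long-run visit frequencies still settle to fixed proportions.

```python
import random

random.seed(0)
# transition probabilities: P[state] gives the chance the next day is sunny
P = {"sunny": 0.9,   # a sunny day is followed by sun 90% of the time
     "rainy": 0.5}   # a rainy day is followed by sun 50% of the time

def step(state):
    # the next state depends only on the current state, never on the history
    return "sunny" if random.random() < P[state] else "rainy"

state, sunny_days, n = "sunny", 0, 100_000
for _ in range(n):
    state = step(state)
    sunny_days += state == "sunny"

# the long-run fraction of sunny days approaches the stationary value 5/6
frac = sunny_days / n
```

Because the system is memoryless, you never store the trajectory to decide the next move; that is exactly the property that makes long Markov chain simulations cheap.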
Markov chains are great for computing because you don't have to store a bunch of numbers. So if you can translate a system into a memoryless version of itself, then you can call it a Markov chain, and then there are a bunch of great tricks you can use to calculate things about it. And this fellow, Andrey, did the most foundational work on this. And then the Monte Carlo part is named after the city where people gamble. You may have heard of it. It's because the algorithm is random, it's stochastic. There wasn't a person named Monte Carlo. Well, really, there probably was at some point, but they have nothing to do with this story. Okay. So why do we use these crazy things? I said a little bit about this before. Sometimes you get a model that's too complicated to have an analytical expression for the integral of the posterior. Nobody can do it. Integrals are hard. Those of you who've done much integral calculus know that with integrals you're usually guessing. There are nice ones where there are techniques that work. You can integrate by parts or something like that. But there are a whole bunch of integrals that no one has ever solved. And if you could solve them, you'd be famous. It's not like derivatives. Derivatives, you just give to your computer. There are rules for derivatives. But going the other direction is a lot harder. And so our goal here: you get the posterior distribution by integrating, and integration is sometimes just not practical. So we do numerical integration instead. We figure out the integral through numerical techniques. And Markov chain Monte Carlo is just one of those. And it's a very, very capable one that works for a very large class of problems. So we're going to be interested in problems like multilevel models. If you're interested in networks, network models have intractable likelihoods as well. Phylogenies, too. Markov chain Monte Carlo is very big in phylogenetic inference for this reason as well.
And spatial models also have lots of really difficult things about them. Optimization, which is what quap does, is in general a non-starter in high dimensions. Once you've got a lot of parameters, which we're going to call dimensions, optimization actually targets the wrong part of the distribution. I say more about this in the text. I don't think I'll have time today to talk about it more. There's this phenomenon I describe in the textbook called concentration of measure, which guarantees that in a high-dimensional distribution the mode is very, very far from where all the probability mass is. And it's just a super weird thing. I have a whole section in the book about it. I might have time today to talk about it, but I don't think I do. You just have to realize this is why everybody gives up on optimization as soon as you get to 100 parameters, basically. Optimization is giving you the wrong answer. Actually, it's giving you the right answer; you just asked the wrong question. The mode is no longer relevant. Bottom line, the bold text on this slide: Markov chain Monte Carlo is not fancy. It's old. It's from the 1950s. This is middle 20th century technology. It's not some fancy thing. I say this because in some fields people say, oh, you're running a Markov chain, you're a fancy statistician now. No, this is from the 50s. This is old stuff. This is from before people knew that smoking caused cancer. This is old technology. It's robust. It's used for a ton of routine applications all the time. There's nothing fancy or show-offy about Markov chains. I want you to see that it's just bread and butter sort of stuff in applied statistics. And it is. So there's a lot of text on this slide. I apologize. I just want to give you an overview before I get into some more interesting visualizations. There are a bunch of different Markov chain Monte Carlo strategies. The thing I showed you is Metropolis. You can extend Metropolis very directly to something called Metropolis-Hastings.
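The concentration-of-measure claim is easy to check numerically. A sketch in Python (the dimension choices here are arbitrary): draw from a standard Gaussian in d dimensions and look at how far the draws sit from the mode at the origin.

```python
import math
import random

random.seed(0)

def radial_distances(dim, n=1000):
    # distance from the mode (the origin) for draws from a standard Gaussian
    return [math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(dim)))
            for _ in range(n)]

means = {}
for dim in (1, 10, 100):
    d = radial_distances(dim)
    means[dim] = sum(d) / len(d)
    # mean distance grows roughly like sqrt(dim): the mass lives in a shell
    print(dim, round(means[dim], 2))
```

In 100 dimensions the typical draw sits about 10 units from the mode, and essentially no draws land near the mode itself. That is the "hyper-donut" the optimization-based methods keep missing.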
Hastings was also a person. And Hastings published a very clear and famous paper that showed that you don't have to have symmetric proposals. You can choose more intelligent, asymmetric proposal distributions. And you can get more efficient tours that converge faster to the target distribution. And that's a big achievement. And Metropolis-Hastings has been customized into a bunch of really clever strategies, the most famous of which is Gibbs sampling. Gibbs was a physicist who didn't work on this, but it was named in his honor, posthumously. Gibbs worked on statistical mechanics. Gibbs sampling, in low-dimensional models at least, is extremely efficient because it uses a very clever proposal distribution. But it's basically a Metropolis algorithm. Same thing. What all these algorithms have in common is that they guess and check. They make a proposal of someplace to move to, they check the posterior probability at that location, and they compare it to where they are now. This is the guess-and-check strategy. And all of the Metropolis-descended algorithms do that. As a consequence, the quality of your proposals is the most important thing, because if you make dumb proposals, you'll reject them all. And then you'll just spend a bunch of time sitting still and wasting computer power, not touring. The goal is to accept every proposal. If you could constantly be moving, you would tour the space as fast as possible, if you could make really smart proposals. And the problem is, guessing and checking is going to make a bunch of bad guesses, because you don't know the distribution. So it's very hard to see how you're going to make smart proposals that all get accepted if you don't even know the distribution. Well, happy news. I'm going to show you a way today to do that. But you need a completely different strategy. And the strategy is known as Hamiltonian Monte Carlo. It does not guess and check.
It's super cool. And this is what we're going to use in this class. And this is the new hotness. Can I still use that phrase? That's sort of old now. It's not even new. But now everybody is switching to using Hamiltonian Monte Carlo because it's vastly more efficient, and you can do really high-dimensional models with it. As I've mentioned occasionally, I'm wrapping up a project where there are 27,000 parameters, and we just run it as a Hamiltonian chain and there's no problem. It just samples them all. No problem. I mean, it takes two days to run the chain, but that's no big deal, right? It took 15 years to collect the data. So it's no big deal. Okay. And I should say there are lots of new methods being developed all the time. But what all of them have in common with Hamiltonian Monte Carlo is that they use something called a gradient. They don't guess and check. The era of guess and check is kind of over. You can still use all those algorithms for simple problems. That's great. There's no problem. But you get a lot of efficiency out of avoiding guess and check. So let's see the problem with guess and check. I want to show you a quick movie here. This is going to be the Metropolis-Hastings algorithm on a two-dimensional Gaussian posterior hill that we're looking down on. And this is a simulation of the Markov chain running. So this is like King Markov, but now he's on a hill, and he needs to tour this hill in proportion to its height. He needs to spend more time in the middle, where it's high, and less time at the bottom. So we're in the middle there. The arrows are proposals: green arrows are acceptances, red arrows are rejections. So when there's a red arrow, it stays. When there's a green arrow, it moves. We tour around, and we record our position at every point, and the positions stack up on these histograms, and you see it's building the distribution. It's a two-dimensional Gaussian. You see this? Yeah. So these are up on my blog, by the way.
If you want to play with these, you can change the animations and play around with them. I got them from this library. I'll give you the link at the bottom. There's a fellow who's written a giant library of these interactive animations of Markov chains. They're fun. So this is the Metropolis algorithm working. Works really well, right? It's still an amazing achievement in mathematics that this sort of thing functions at all. The problem with the Metropolis-Hastings algorithm comes when the distribution's not so nice. So here's a distribution which is not a nice Gaussian hill. It's a donut. And you're thinking, like, when am I going to have a donut distribution? Oh, you will. Oh, you will. You won't realize it. But in high-dimensional spaces, basically, as I said, the probability gets concentrated into a thin shell, into a thin region. It's very far from the middle, from the mode. And so it becomes this hyper-donut, if you will. That's probably the first time that word has ever been used. But picture a hyper-donut. A high-dimensional donut. A 27,000-dimensional donut. And you must sample from that 27,000-dimensional donut. The Metropolis algorithm is really bad at this, because it makes proposals outside the high-probability region everywhere. So let me show you how this animation goes. So you see Metropolis. It's making proposals. Most of them are rejected. It's very hard for it to move because it's making proposals into dead space all the time. And it doesn't know there's a donut here. It's got to map it out, right? You see the donut, but it doesn't, right? And in your real problems, you don't know what the shape looks like either. So in the long run, this will still work. But the long run now is very long indeed, right? You need to finish your PhD, and this algorithm is going to waste some years, yeah? So I'll let this run for a little bit. I think at some point I sped this up. Yeah, sped it up.
And you see, eventually it's going to work, but it hasn't even been to the south side yet. It's getting there. All right, yay, it visited the south side. And it just gets stuck in a little narrow region because it's making bad proposals all the time. The proposals don't know anything about the target distribution. And that's the basic problem with Metropolis-Hastings. Gibbs sampling has the same issues. In a high-dimensional space, Gibbs sampling makes better proposals, but it still gets stuck, because there's this concentration of measure in the distribution. So let me show you the contrast here. You can tune the Metropolis-family algorithms so they do better in a particular case, but there's this really tight trade-off. You're always giving up on something. So I'll show you. This is the Metropolis algorithm. This is figure 9.3 in the book. And on the left, I'm showing you a distribution. This is a two-dimensional Gaussian, but now there's a correlation in the posterior. The correlation between these two parameters is negative 0.9. And we start in the upper left with the Markov chain. Filled circles are accepted proposals. Open circles are rejected ones. And I made the step size really small. What's the step size? The step size is how wide the proposal distribution is. So if you have a very small proposal distribution, then you're only going to consider points near you. That means that a lot of them will be accepted, because if you're in valid space, and you are, then you're going to make proposals in valid space. But as a consequence, you move very, very slowly through the posterior distribution. So the acceptance rate is about 60%. We're rejecting 40%. We're going to do a lot better than this in a moment. Hang on. So we're moving. We're accepting most of them, at least, but it takes us a very long time to get all the way down this valley, this pickle, whatever you want to call that shape.
And on the right, it's the same posterior distribution, the same algorithm, but I increased the step size because we want to take a faster tour. And we do, in a sense, but now we're rejecting the majority of proposals because they're outside the narrow space that we need to stay in. This is the fundamental trade-off in Metropolis, and really, you can't win. You've got to pick your poison. And again, Gibbs can do a little bit better, but as soon as the dimension of the posterior gets high, this problem always appears again. The problem is guessing and checking. So what do we do instead? Hamiltonian Monte Carlo has a different approach entirely. It does not guess and check. Instead, it runs a physics simulation. This is crazy, but bear with me. It really does. I'm not joking; this is not a metaphor. It actually runs a physics simulation. And what we're going to do is represent our parameter state, the vector of parameter values in your model, as a coordinate in some high-dimensional space. So in a two-dimensional posterior distribution, it's just your x and y coordinates. In a three-dimensional one, it's your x, y, and z. And in more dimensions, you've got some hyperspace that you have a coordinate position in. And now you're going to say you're a particle in this space and you have that position. And then we're going to come in and flick you. We're going to play God. We're going to flick this particle with some random momentum in a random direction in this high-dimensional space, and it's going to cruise on a surface. What's the surface? The surface is the posterior distribution. And it's going to cruise around on the posterior distribution. We're going to let the simulation run for a little while, and then we're going to stop the particle and record where it is. Then we're going to flick it in a new random direction from where it is.
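Before the physics, the guess-and-check trade-off from the figure is easy to see in code. The book's examples are in R; here is a minimal Python sketch (function names are mine) of random-walk Metropolis on a correlated two-dimensional Gaussian, comparing a small and a large step size:

```python
# A minimal random-walk Metropolis sampler on a 2-D Gaussian with
# correlation -0.9, to illustrate the step-size trade-off: small steps
# get accepted often but tour slowly; big steps get rejected often.
import numpy as np

def log_target(q, rho=-0.9):
    # Unnormalized log density of a bivariate Gaussian, unit variances.
    x, y = q
    return -(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))

def metropolis(n_samples, step_size, seed=0):
    rng = np.random.default_rng(seed)
    q = np.array([-1.0, 1.0])
    accepted = 0
    for _ in range(n_samples):
        proposal = q + rng.normal(0.0, step_size, size=2)  # blind guess
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(q):
            q = proposal
            accepted += 1
    return accepted / n_samples

rate_small = metropolis(5000, step_size=0.1)
rate_large = metropolis(5000, step_size=1.0)
print(rate_small, rate_large)  # small steps accept more, but move slowly
```

Either way you lose something: the high acceptance rate of the small-step chain comes at the cost of a very slow tour through the pickle-shaped valley.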
And because it's following the curvature, and I'm going to show you movies of this, so hang on, because it's following the curvature, it always makes good proposals, because it doesn't go into bad areas. Why not? Because the shape of the surface you're on won't let you go into those areas. So there's no more guessing and checking. All proposals are good proposals. It's like dogs: all dogs are good dogs. And we make really good proposals. Let me show you what this looks like. One more allegory. Not King Markov this time, but King Markov's brother on the mainland, King Monty. He has a kingdom, but it's in a valley, a narrow valley. This is the north-south dimension, and there are mountains on the north end and the south end. Everybody just lives in this valley, strung out along a straight line. It's a continuous urban smear, and there are more people living at the bottom than on the edges. Same royal contract: you need to visit them in proportion to their population density. How can you do this? We can use the Hamiltonian Monte Carlo strategy to do it. How? He gets in the royal vehicle and starts driving randomly north or south at some random speed. The car obeys physics. It has to. If it starts going uphill, it slows down. You've got your foot at some constant pressure on the throttle; that's the idea. So you start to slow down when you go uphill. Eventually you stop, and then you turn around and go downhill, at that same constant throttle. So you speed up when you're going downhill, you slow down when you're going uphill, and you drive for some pre-specified duration. Then you stop, you get out, you kiss some babies, get back in the car, repeat. This is the Hamiltonian Monte Carlo algorithm in silly story form. And if you do all this right, the stopping positions will be proportional to the population density at each point in the valley. In the book, I show you the simulation of this. This is what it looks like.
We have time on the horizontal; on the vertical is the position. This is just a one-dimensional distribution: a long narrow valley with mountains on each end. So it's Gaussian. You're saying, how is it Gaussian? A Gaussian is not a valley. If you take the negative log of a Gaussian, it is. It's a parabola. So the minus log probability of a Gaussian is a valley, a parabolic valley. And that's what you end up with from a Gaussian. And so it starts in the middle, randomly drives south, starts to go uphill. He slows down in the car, he stops, and he turns around. He goes down. Then he stops; his timer goes off. Kisses some babies. Again, randomly drives uphill, comes back down, stops in the middle. Goes to the next point, then the next point. It gets a lot of momentum after that third point and drives uphill really far, but eventually turns around because it gets really steep. So he turns around and comes back to sample the next point. Goes downhill, and he speeds up as he goes downhill. The width of the line segment here is the kinetic energy. It's momentum, actually, in this physics simulation. And over and over again. At each point, you randomize the momentum again, so that the path stays stochastic. But because you're following the contour of the shape that you want to map out, you're making good proposals. You're not making silly proposals. But you need a lot more information to make this work. You need to know, well, the contour. So how does this work? Outside of the silly allegory, here's what's really going on. Your location is your parameter values. We really are doing a physics simulation of a frictionless particle on a frictionless surface. The surface is the minus log posterior, which makes a bowl. So look at the bottom. If you've got a Gaussian posterior in one dimension, on the left, and you take the logarithm of that and multiply it by minus one, you get a parabolic bowl.
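That "the negative log of a Gaussian is a parabola" claim is easy to verify directly. Here's a tiny Python check (a sketch, not from the book) that subtracting the normalizing constant from the negative log density of a standard Gaussian leaves exactly x²/2:

```python
# The "valley" in the allegory: the negative log density of a Gaussian
# is a parabola plus a constant, so minus-log-probability makes a
# parabolic bowl.
import math

def neg_log_gaussian(x, mu=0.0, sigma=1.0):
    # -log N(x | mu, sigma): quadratic term plus normalizing constant.
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

const = neg_log_gaussian(0.0)  # the constant: value at the bottom of the bowl
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    # What's left after removing the constant is exactly x^2 / 2.
    assert abs((neg_log_gaussian(x) - const) - x**2 / 2) < 1e-12
print("parabolic bowl confirmed")
```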
And we're going to roll our marble in this parabolic bowl and periodically stop it and record its position. And in the long run, you will get position samples which are proportional to this shape: more recordings near the bottom of the bowl, just as there's more probability in the middle of the distribution on the left. You get position recordings in exactly the right proportions. This is what's cool about it. You're not guessing and checking anymore. Let me show you what this looks like in a movie. So again, a two-dimensional Gaussian hill. We're looking down on it from the top. But now we're going to run a Hamiltonian simulation on it. There's going to be a particle, and we're going to flick it. This is a bowl now, so you want to see the dark part in the middle as the bottom of a parabolic bowl. Nobody has parabolic bowls at home, I know. But imagine you had a parabolic bowl. We're going to flick our particle here and watch the path. Again, the green arrow is the transition; it's where you move to. The gray arrow is your initial momentum at the flick point. So we flick it and it rolls in the bowl, and it ends up somewhere inside the bowl. You're never going to get out of the good part of the bowl, because of the physics that's governing it. When you start to go uphill, it makes you turn around. That's what stops you from getting into the silly parts of the parameter space that you don't want to be in. Now, to do this, though, you need to know the local curvature at each point. So the physics simulation has to calculate what's called the gradient, which is the derivative in each direction of the minus log posterior. So you have to do more math. The computer is doing all the work, though, don't worry. But it's much more computationally intensive than the Metropolis algorithm. You need more information. But you accept almost every proposal. You can get 100% acceptance rates routinely in Hamiltonian Monte Carlo. And so, yeah, speed it up.
Good times, right? So this is better living through physics. This works really, really well. In practice, since nearly all the proposals are accepted, you need many, many fewer steps to get a good picture of the posterior distribution than with Metropolis. Each step, each proposal you consider, takes a lot more computer cycles, but you need many, many fewer such steps, such proposals, to get a good picture of the posterior distribution. So you get a lot of efficiency gain out of this. Hamiltonian Monte Carlo for a two-dimensional Gaussian, so what? It just seems like a lot of bother. The reason we use it is because life isn't so benign. Instead, we get things like these high posterior correlations. It's actually the same problem I showed you before, where Metropolis gets stuck. Now, using the code that I have in the chapter, a simple R script to do this, and it isn't that much code to run a physics simulation, actually, it's just a loop, you get these tours. And so what you're seeing in these line segments is: we start at the X in the upper left here, and it gets flicked sort of up and to the right, but then it curves back down and rolls down the valley, right? Now, there's no friction, so it's never going to stop moving. It just keeps rolling around. And eventually we stop the clock, we record a position, and then we flick it again. So it rolls from the X to number one. You see the number one there, one point. And then from one, it gets flicked again and it rolls around a bit, curls back around to number two, and then we flick it again and it eventually gets over to number three and then to number four. So not only do we have a high acceptance rate, but the sequential samples are very far from one another as well. What we call the sequential autocorrelation is very low. So you're getting a lot of information about the whole shape very quickly with Hamiltonian Monte Carlo. That's what's great about it.
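The chapter's working example is an R script; here is an equivalent minimal Python sketch (all names mine) of a single Hamiltonian Monte Carlo step, with the momentum "flick", the leapfrog physics simulation, and the energy-based accept step:

```python
# A bare-bones HMC step on a standard 2-D Gaussian bowl. U is the
# minus-log-posterior "surface"; grad_U its gradient. Because energy is
# nearly conserved by the leapfrog integrator, almost every proposal
# is accepted.
import numpy as np

def U(q):
    return 0.5 * np.sum(q**2)       # minus log density, up to a constant

def grad_U(q):
    return q                        # its gradient

def hmc_step(q, step_size, n_leapfrog, rng):
    p = rng.normal(size=q.shape)    # the random "flick" (momentum)
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog integration of the frictionless particle:
    p_new = p_new - 0.5 * step_size * grad_U(q_new)
    for _ in range(n_leapfrog - 1):
        q_new = q_new + step_size * p_new
        p_new = p_new - step_size * grad_U(q_new)
    q_new = q_new + step_size * p_new
    p_new = p_new - 0.5 * step_size * grad_U(q_new)
    # Accept with probability exp(H_old - H_new); H = potential + kinetic.
    H_old = U(q) + 0.5 * np.sum(p**2)
    H_new = U(q_new) + 0.5 * np.sum(p_new**2)
    accept = np.log(rng.uniform()) < H_old - H_new
    return (q_new if accept else q), accept

rng = np.random.default_rng(1)
q = np.array([-1.0, 1.0])
accepts = 0
for _ in range(1000):
    q, ok = hmc_step(q, step_size=0.1, n_leapfrog=15, rng=rng)
    accepts += ok
print(accepts / 1000)  # near 1: almost every proposal is accepted
```

That is really the whole trick: simulate the particle for a while, then flick it again with fresh random momentum.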
And so then on the right, I'm showing you the same thing, but now for 50 trajectories and 50 samples. There's one open point, which is one rejection. You can sometimes reject things in Hamiltonian Monte Carlo, and we'll talk more later about why that happens and what we can do about it. Okay, another movie. Hamiltonian Monte Carlo does really well with the donut. That's why we like it. So we're back in the donut, yeah? Let's run Hamiltonian Monte Carlo on the donut. It's like it knows the donut's there. Of course it doesn't. All it knows is the curvature where it is. But it still obeys the donut. So now this donut is like a narrow valley, a donut valley all the way around, and you're rolling this little marble and it's zigzagging and oscillating like a pendulum inside donut valley. And it just tours the whole thing, so you don't get stuck in any particular area. So now imagine you have 27,000 dimensions in your model. This is great. This is really what you want, right? Something like this. So guessing and checking is dead to me. I am forever now going to run physics simulations. So there we go, full speed. We can tour the donut and get a really nice picture of it. At this point Metropolis would still be stuck in a corner, right? So this is a really, really important innovation in the applied stats community. Okay. [Student] Richard, can I ask a side question on this real quick? How do you visualize the posterior? Are you trying to show the whole donut? [Richard] No, of course you can't. You can't show anything in more than three dimensions, actually. There's this joke in stats. It comes from Geoff Hinton, who's worked a lot on these things. He says, if you want to picture something in, say, 14 dimensions, draw a three-dimensional shape, and then stare at it intensely and say "fourteen." So nobody can do it. You don't.
You can work with high-dimensional models the same way you've been working with the ones in the first part of the course. You look at the prediction scale. You don't interpret the parameters anyway, right? These are vastly complicated machines with a bunch of little bits. It's a tide engine. Have I used that analogy already? The tide prediction engine. You've got a bunch of little bits in the machine, and they're combining in highly nonlinear ways to produce predictions. But that's what you care about: the prediction scale, where it can make sense. There's some low-dimensional compression at the level of predictions that you can make sense of. And that's true for these as well. Yeah, no one can visualize a hypersphere. It's really tough. This is a big problem. In the text, though, there's a short section where I try to explain this phenomenon called concentration of measure. And I've got a graph for you: if you consider the radial distance of your coordinate from the mode, from the point of highest posterior probability, then as the dimensionality of the distribution goes up, most of the samples get further and further in radial distance from the middle. So there's a graph where I show you this. That doesn't mean it's going to be intuitive, though. In my experience, this is all just like, what, radial distance? It's bizarre, but yeah. High-dimensional spaces are very confusing things. But I counsel you to just work with these models the way you work with other models and not worry about plotting the whole thing. You can do pairs plots for interesting parts of it, stuff like that. Okay, let me try to summarize. So why does HMC work so well? It doesn't get stuck. It follows the gradient. The gradient is the local curvature at each point. We've got to calculate that. There's extra stuff you have to put into this, including momentum variables now.
So this little particle: there really is a real physics simulation going on here. It has mass. You have to choose that mass, and you have to give it momentum, and all these other decisions. And it has energy. The system has energy, which is actually nice, because this is how we check if it's working. Energy must be conserved. You may have heard this, right? So if energy is not conserved, you can check that criterion and you know the simulation is broken. This is something you can do with Hamiltonian Monte Carlo that you can't do with Metropolis. You cannot easily know when Metropolis is not working. When Hamiltonian Monte Carlo breaks, it's dramatic and obvious. This is really nice. It gives you a sense of security. There'll be all kinds of warnings that come from these extra, seemingly nuisance parameters. They're not nuisances at all. They help you feel secure that the thing is working. But you have to put in more. You need gradients, as I said, so we have to calculate those. What does that mean? It means taking derivatives of the log posterior at any given point. You don't need the whole posterior here, but at any particular point you can evaluate it, because you have your model definition. But then you need derivatives of that. For the example in the text, I just did the derivatives by hand. There's a box where I show you that. Gaussian derivatives are super easy. The Gaussian distribution is always easy. In general, this is annoying, super annoying. This is why the tool we're going to use does it for you automatically. It is an aggressive user of the chain rule. Your computer can do all the derivatives for you. There is tuning that needs to be done, though, and I'm going to tell you a little bit about that. We're not going to stay on this slide. In the book, if you're interested, there's a working R script that will run a Hamiltonian simulation. I used it to make those figures. It's not that complicated.
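To make "Gaussian derivatives are super easy" concrete, here is a short Python sketch (not from the book) that writes the derivative of a Gaussian log density by hand and checks it against a finite difference. Stan arrives at the same gradient automatically, via the chain rule:

```python
# The gradient of a Gaussian log density with respect to mu, done by
# hand, then checked numerically with a central finite difference.
import math

def log_gaussian(x, mu, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def dlog_dmu(x, mu, sigma=1.0):
    return (x - mu) / sigma**2      # the hand-derived derivative

x, mu, eps = 1.3, 0.4, 1e-6
numeric = (log_gaussian(x, mu + eps) - log_gaussian(x, mu - eps)) / (2 * eps)
assert abs(numeric - dlog_dmu(x, mu)) < 1e-6
print("hand gradient matches finite difference")
```

For a real model with many parameters and link functions, doing this by hand for every parameter is exactly the annoyance that automatic differentiation removes.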
If you want to tour through it, you'll really understand how dead simple this is. There's a particle. We know the curvature at its position, so we can simulate its movement for a little while. That's all it's doing. This is not rocket science. The problem we often encounter with Hamiltonian Monte Carlo is that there's stuff to pick. The two things we need to pick to tune it are the step size and the number of steps. The step size determines the length of time we run each little segment of the simulation. If you're looking on the left, this is the two-dimensional Gaussian bowl. We start at the X and we cruise to one. You see those little dots I've plotted on there. Each of those is a step: a time interval that the thing runs. You've got to choose that. This determines the coarseness of the simulation. Those of you who've done much numerical integration are used to this, right? You choose the step size of your numerical integrator when you do ODEs and things. This is like that. You want basically the biggest step size you can get away with, because it makes things more efficient, but if you make it too big, then you overshoot the curvature of the shape. You've got to choose that. Then you have to choose how many of those steps to take in each so-called trajectory, between the points where you stop. Those are the two things you need to choose. If you choose bad values for those, then you're going to have a bad time. In fact, you could have a maximally bad time with this algorithm. Here's an example. It's only maximally bad in the two-dimensional Gaussian; in general it's not this bad. But something happens called the U-turn phenomenon, where you can revisit where you started almost exactly every time, if you pick exactly the right combination of step size and number of what are called leapfrog steps. Each of those little steps is called a leapfrog step, for reasons I cannot tell you. It's a good name. There's a little frog, I guess, jumping on the surface.
The frog is counting one, two, three, four, and it always jumps a certain distance. Since it's a parabolic bowl, you get these parabolic orbits in the bowl. You start at the X, you go up and then down and loop all around and come right back to where you started. Then you flick it again, and then you're right back where you started. In the long run, this will still work, because you're not coming back exactly where you were, but it's super inefficient, because you're getting a bunch of samples from the same local space. This is called the U-turn problem. How do you fix this? You've got to have good values for these tuning parameters, and that's annoying. Let me show you, in the animated version, what the U-turn problem looks like. Again, here's our Gaussian bowl. Since I'm evil, I've set the step size and number of leapfrog steps so that it eats its own tail. This is Ouroboros, that snake that eats itself forever. Is that Mayan? Someone will know. This is the Ouroboros of statistics, the Markov chain that eats its own tail. It gets stuck in this little region. It is wandering, but it's wandering very slowly. In the long run this does still work; it doesn't ruin the chain. But we could do a lot better if we had better values for the step size. See, it's working in the long run, but it's not working as well as it did before. It's stuck in particular regions for a while at a time. Since you don't know the distribution, though, it's hard to say before you start sampling what a good step size and number of steps are. So how do you deal with this? Well, we're going to use an engine that deals with this for you. It uses a primitive form of artificial intelligence to find good step sizes, and then it runs the chain. So we're going to use a library called Stan. Stan is a C++ library. You will run it from within R, though. You just have to install rstan, the package. You already have, if you've been using my package. Yeah?
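Before moving on to Stan: the U-turn effect I just animated can be reproduced in a few lines. On a Gaussian bowl, the exact dynamics is a harmonic oscillator with period 2π, so a leapfrog trajectory whose total length is close to 2π loops right back to its starting point. A Python sketch (my own toy code, one-dimensional for simplicity):

```python
# The U-turn problem in miniature: on the bowl U(q) = q^2 / 2 the
# particle orbits with period 2*pi. A trajectory of total length near
# 2*pi ends back where it started, so every "proposal" revisits the
# starting point.
import math

def leapfrog(q, p, step_size, n_steps):
    grad = lambda q: q              # gradient of U(q) = q^2 / 2
    p -= 0.5 * step_size * grad(q)
    for _ in range(n_steps - 1):
        q += step_size * p
        p -= step_size * grad(q)
    q += step_size * p
    p -= 0.5 * step_size * grad(q)
    return q, p

q0, p0 = 1.0, 0.5
step = 0.01
full_loop = round(2 * math.pi / step)          # ~628 leapfrog steps: a U-turn
q_loop, _ = leapfrog(q0, p0, step, full_loop)
q_half, _ = leapfrog(q0, p0, step, full_loop // 2)
print(q_loop, q_half)  # q_loop is back near q0; q_half is far away
```

With the half-length trajectory, the particle is on the far side of the bowl, which is exactly what you want; with the full loop, every flick wastes its effort.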
And my package will interface with Stan for you and return samples. Stan does two things. First, it has something called a warm-up phase, where it figures out a good step size. It figures that out by trying a bunch of random simulations and finding a step size that maximizes the acceptance rate of proposals. Right? Good step sizes. And then it uses a really clever thing to determine the number of leapfrog steps, called the No-U-Turn Sampler, or NUTS. This is super clever, and I'm not going to explain the details of it, but I do want to show you an animation of what it looks like. The key thing about the No-U-Turn Sampler is that it runs the physics simulation in both directions. What do I mean, directions? I mean directions in time. So it imagines a simulation that loops back on itself, and it runs that one backwards from your starting point, and it goes forward at the same time, so it can guess when the turning around is happening. And when it sees itself turning around, it stops and takes a sample. Let me show you what this looks like. It sounds like madness, and maybe it is. Yes, here it goes. See, it goes in both directions. Yeah. A nice thing about Hamiltonian systems is that they're time-reversible, so you can run them in both directions. You can go backwards in time. You can figure out where the particle must have been if it's on the trajectory. So this is why it works. And so it guesses when it's turning around, and it picks smart stopping positions by running in both directions. I know it looks super weird, but this means you don't have to pick the number of leapfrog steps. It figures out a good number of leapfrog steps for every individual trajectory, custom to the location it's in. And this gives you good uncorrelated samples from the distribution. Now it starts to speed up. Yeah. So this means you don't have to pick that. You don't have to make a bunch of decisions.
A smart artificial intelligence will do it for you. Okay. So I wanted to say: why is this thing called Stan? Stan is not an acronym. It's the name of a person. Stanislaw Ulam, pictured here, was one of the mathematicians working with Nick Metropolis and that whole group, directed by John von Neumann, that was part of the Manhattan Project. And Ulam is really the father of Markov chain Monte Carlo. He figured out a bunch of the basic things and built mechanical Markov chain Monte Carlo simulators. You can find these online: these wooden things, mechanical tools that will take random samples, and you can demonstrate them. And he also did a lot of really important work in biology; that's him standing next to the molecule model there. And so Stan is named after Stanislaw Ulam. And you can run Stan in anything you like. If you go to the Stan webpage you'll see this, but we're going to run it in R. If you want to use Python, I know some of you prefer Python. Python is an inherently better language than R. I'm sorry, I've got to say it. You can run Stan in Python too. It's fine. You can also run it just on the command line. When I run models on the computing cluster in my department, I just upload a C++ executable to the server and put it on a million cores. Not a million. 80. And then let it go. R has a lot of overhead that you don't necessarily need, right? You just want samples. Okay. Here's my favorite picture I have to show you. This is Stanislaw Ulam with his daughter Claire at MANIAC, the computer from the same project. That was a computer, right? It couldn't even hold one cat GIF, right? I love that phone too. Phones used to look like that. Okay. There are still problems with Hamiltonian Monte Carlo.
If you've got a posterior distribution like this one, which has multiple separated hills, and this does happen, it happens a lot in latent variable models like factor-analytic models, then it can be very hard to transition between the hills, and your waiting time between hills can be long indeed. So you get stuck on a hill for a while, and again, in the long run you're guaranteed to explore this space, but the long run now can be pretty long. People are working on this. In my experience, you handle models like this by changing the geometry of the model. That sounds crazy. It's not cheating; you're not changing the answer. But you collapse hills together by dealing with symmetries in the model. You've got to realize there's a problem first, though, before you do that. In my department we get problems like this when we run item response theory models, which have this kind of labeling exchangeability to them. Okay. Let me give you, in the last five minutes here, a little bit of advice on what this looks like when you run it. But you're going to go home to the chapter, and you're going to run all these examples yourself. You've got to get in there and do it and play around with it and walk through it. I give you two examples at the end of this chapter of stuff that goes wrong and how to fix it, because I want you to recognize a bad chain when you see one, and sometimes they're bad. They're usually bad because the model is bad, not because the algorithm is bad. So here's the sound of one hand quapping. It's a joke I've been waiting all term to make. So when we ran quap, this is a model you ran before, the terrain ruggedness example from Monday, and these are the quap results. Now let's do this model with a Markov chain, and I'll show you what it looks like.
We're going to use ulam, which is in the rethinking package, and it looks exactly the same as using quap: same formula. But I'm going to encourage you to make a little slim data list to pass in, which contains only the variables of interest. And then you need to decide how many chains to run, how many cores to put them on, and how long to run each chain, down at the bottom. So here we're going to do four chains, and we're going to put each one on a separate core. I don't know how many cores your laptop has, but probably four. Most laptops have two true cores and four virtual cores these days. If you don't have that many cores, you can still put in four; your computer won't explode. Nothing bad will happen. And when you run this, what does ulam do? First it translates this style of model formula into raw Stan code. There's a whole box in the chapter where I show you what that looks like. It doesn't look too different, but there's a bunch of formal variable definitions that will be awkward if you haven't done C programming before. Stan code is basically C code; it's like a weird dialect of C++. And then Stan builds a custom NUTS sampler, a no-U-turn sampler, feeds your data in, and it runs, and you get samples back. And then it's just samples. It's just like you've been working with samples the whole term. Just samples, and you analyze them the same way.
So we run this. What happens? You can see, if you print it, it'll report for each chain how long warm-up and sampling took. What is warm-up? Warm-up is when it's picking the step size. It's not producing valid samples yet; it's doing little experiments to figure out how long the numerical integration segments should be. It figures that out, and then the sampling period is typically shorter, and then there's the total. So this goes pretty fast. This model did it in a second; it's so fast. The total number of samples you're going to get back is the length of each chain minus the warm-up, and by default warm-up is half. I encourage you not to change that. You don't need more than a couple hundred samples from the posterior for a Hamiltonian chain, if it's efficient. Remember the central limit theorem: you get a really good estimate of the posterior mean very rapidly. Okay, and then precis works the same way, but you get these extra diagnostic things on the end, called n_eff and Rhat. I say more about these in the chapter. n_eff is the number of effective samples. What I want you to see here is that the number of effective samples is greater than the actual number of samples you took from the Markov chain. This is not wrong. How can that be true? It can be true because Hamiltonian Monte Carlo takes dispersed samples. So what does "effective number of samples" mean? It's the number of samples you would get if there were no autocorrelation between sequential samples. You can do even better than the true number of samples in Hamiltonian Monte Carlo, because your samples are really far apart; they're exploring different regions. Sequential samples land in completely different regions of the space; they're negatively autocorrelated. When your sequential samples are negatively autocorrelated, you're doing better than best, and you can actually have a Markov chain that's better than totally random. I know it's crazy, but this is how good Stan is. It's really amazing.
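The book demonstrates these diagnostics in R with precis; here is a Python sketch (all names mine, and both estimators deliberately crude compared to Stan's) of the ideas behind n_eff and Rhat:

```python
# Two diagnostics in miniature. n_eff: negatively autocorrelated draws
# carry more information per draw than independent ones, so the
# effective sample size can exceed the actual number of draws. Rhat:
# a ratio comparing between-chain and within-chain variance, near 1
# when chains agree.
import numpy as np

rng = np.random.default_rng(0)

# n_eff: a toy AR(1) chain with a negative coefficient (antithetic draws).
N, phi = 10_000, -0.7
chain = np.empty(N)
chain[0] = rng.normal()
for t in range(1, N):
    chain[t] = phi * chain[t - 1] + rng.normal() * np.sqrt(1 - phi**2)
x = chain - chain.mean()
rho1 = (x[:-1] @ x[1:]) / (x @ x)      # lag-1 autocorrelation, near -0.7
n_eff = N * (1 - rho1) / (1 + rho1)    # AR(1) approximation: exceeds N
print(n_eff > N)                        # better than independent draws

# Rhat: between-chain vs within-chain variance.
def rhat(chains):                       # chains: (n_chains, n_draws)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    return np.sqrt(((n - 1) / n * W + B / n) / W)

good = rng.normal(size=(4, 1000))       # four chains, same target
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain off target
print(rhat(good), rhat(bad))            # near 1.0 vs well above 1
```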
Version one of Stan was not this good, but they've gotten so good at tuning it that now it does this routinely for simple models. I just wanted to show you this because it's so cool. Okay. And then Rhat is the Gelman-Rubin convergence diagnostic. It starts out above one and converges to one. It's a ratio of variances, an F statistic essentially, and you want it to converge to one. That tells you the chains aren't different. It's a ratio of the variance between chains to the variance within chains, right? If the chains are all the same, then that should collapse. They should all be exchangeable chains; they should all look the same. The information in any one chain is the same as the information in the other chains. Okay. I give you a little bit more detail in the chapter, things like pairs plots, which I don't tend to use a lot, but you can look at the whole posterior that way, just as before. And then at the end of the chapter I give you two examples which I really want you to go home and run yourself on your computer, and fight with, and go through patiently in the text. They're simple things that happen all the time: ways to check your chains to make sure they work, and examples of bad chains. I'm hurrying because I don't want to keep you, but there are cases where what you want to see is this. I call it the hairy caterpillar ocular inspection test. You want your chains to look like hairy caterpillars. That doesn't mean they're good, but if they don't look like this, they're probably not good. It's one of those kinds of things. What you might get instead, and again, you can walk through this yourself in the chapter, is something like this. This is bad news, yeah? This patient is not going to make it. And I show you how to fix this. There are lots of common things you can do to fix it, so you can get the hairy caterpillar back. All right. Now, I give you two examples, and the thing I want to drive home before I let you go is that typically, when
you're having a problem getting your chain to work, it's because there's something wrong with your model definition. This is my experience. Andrew Gelman calls this the folk theorem of statistical computing. He learned it the hard way, just like all of us who do applied stats. You'll fight with the thing, you'll try tuning parameters and all kinds of stuff, and then you'll look at your model and you'll realize: oh, I don't have a prior for three of the parameters. That's why it's not working. So what you want to do first is check your model. It's probably not the Markov chain; it's probably you. That's nearly always it. The problem, as we say in computer science, sits in front of the keyboard. All right, so remember the folk theorem. There's one more example with bad chains. I have a new homework problem up for you, in which you're going to model the Wines2012 data. So this combines doing your first Markov chains with working with some simple interactions at the same time. I hope you find the data interesting; I think they are. And I want to remind you: if you haven't updated the book PDF or the rethinking package lately, you should do so. There have been a couple of minor bug fixes for chapters yet to come in the rethinking package, and like every week, I've uploaded a new version of the book, where there's stuff we're still going to get to that I'm editing as we go. Okay. So next week, I promise you, we'll do some more exciting stuff. On Monday I want to show you the value of maximizing entropy. It's really nice to have maximum entropy. I know it sounds bad, but actually it's the state you want to be in. And I'm going to use that to teach you about generalized linear models, so we can do much more powerful modeling of different kinds of variables. Okay, thank you, have a good weekend.